Page 222 - Touchpad AI
Merging or Joining DataFrames
The following functions are used to merge or join DataFrames:

| Function | Description |
| --- | --- |
| pd.merge(df1, df2, on='common_column') | Merges two DataFrames on a common column (similar to SQL joins; an inner join by default). |
| df1.join(df2, on='common_column', how='left') | Joins df2 to df1 by matching the 'common_column' of df1 against the index of df2; how='left' keeps all records from the left DataFrame. |
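
The two calls in the table can be sketched as follows. The DataFrames, column names, and values here are hypothetical and chosen only for illustration; note that join looks up the 'on' column in the index of the other DataFrame, so the sketch sets that index first.

```python
import pandas as pd

# Hypothetical sample data, not from the text
students = pd.DataFrame({"student_id": [1, 2, 3],
                         "name": ["Asha", "Ravi", "Meera"]})
marks = pd.DataFrame({"student_id": [1, 2, 4],
                      "score": [88, 75, 91]})

# Inner merge (the default): keeps only student_ids found in both frames
merged = pd.merge(students, marks, on="student_id")

# Left join: students['student_id'] is matched against the index of the
# other frame, so marks is re-indexed by student_id first
joined = students.join(marks.set_index("student_id"),
                       on="student_id", how="left")

print(merged)
print(joined)  # student 3 has no marks, so its score is NaN
```

With this data, the merge keeps only the students who have marks, while the left join keeps every student and leaves the missing score as NaN.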
Data Cleaning and Preprocessing
Data cleaning ensures that your dataset is accurate, consistent, and ready for analysis. It involves handling missing
data, duplicates, incorrect formats, and outliers.
Handling Missing Values
The following functions help to handle missing values in a dataset:

| Function | Description |
| --- | --- |
| df.isnull().sum() | Counts the missing (NaN) values in each column. |
| df.dropna() | Returns a copy with every row that contains a missing value removed. |
| df.fillna(value) | Fills missing values with a specified value (mean, median, mode, etc.). |
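
A minimal sketch of these three functions, assuming a small hypothetical DataFrame with one gap in each column. Filling with the column mean is just one of the options listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column
df = pd.DataFrame({"age": [16, np.nan, 17, 18],
                   "city": ["Delhi", "Pune", None, "Agra"]})

# Count missing values in each column
missing_per_column = df.isnull().sum()

# Keep only the rows that have no missing values
rows_without_gaps = df.dropna()

# Fill the missing age with the mean of the known ages (16, 17, 18)
age_filled = df["age"].fillna(df["age"].mean())

print(missing_per_column)
print(rows_without_gaps)
print(age_filled)
```

Neither dropna nor fillna changes df itself here; each returns a new object, so the result has to be assigned if it is to be kept.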
Handling Duplicates
The following functions help to handle duplicate rows in a dataset:

| Function | Description |
| --- | --- |
| df.duplicated().sum() | Counts the rows that repeat an earlier row. |
| df.drop_duplicates() | Removes duplicate rows from the DataFrame, keeping the first occurrence of each by default. |
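
As a short illustration, assuming a hypothetical DataFrame in which one row exactly repeats an earlier one:

```python
import pandas as pd

# Hypothetical sample data: row 2 repeats row 0 exactly,
# row 3 shares the name but has a different score
df = pd.DataFrame({"name": ["Asha", "Ravi", "Asha", "Asha"],
                   "score": [88, 75, 88, 90]})

# A row counts as a duplicate only if every column matches an earlier row
duplicate_count = df.duplicated().sum()

# Drop the repeats, keeping the first occurrence of each row
deduplicated = df.drop_duplicates()

print(duplicate_count)
print(deduplicated)
```

Only the exact repeat is dropped; the row with the same name but a different score is kept because the comparison uses all columns.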
Data Type Conversion
The following functions help to ensure that data has the correct type for analysis:

| Function | Description |
| --- | --- |
| pd.to_numeric() | Converts values to a numeric type. |
| pd.to_datetime() | Converts values to a date/time type. |
| pd.Categorical() | Converts values to a categorical type. |
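
The sketch below applies each conversion to one column of a hypothetical DataFrame in which everything starts out as text. Using errors="coerce" is an assumption made for this example: it turns values that cannot be parsed into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data: every column is stored as text
df = pd.DataFrame({
    "price": ["100", "250", "n/a"],
    "joined": ["2024-01-15", "2024-02-20", "2024-03-05"],
    "grade": ["A", "B", "A"],
})

# Text to numbers; the unparseable "n/a" becomes NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Text to date/time values
df["joined"] = pd.to_datetime(df["joined"])

# Text to a categorical column with a fixed set of categories
df["grade"] = pd.Categorical(df["grade"])

print(df.dtypes)
```

Checking df.dtypes before and after is a quick way to confirm that each conversion took effect.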
String Manipulation
The following functions are used for cleaning and transforming text data:

| Function | Description |
| --- | --- |
| df['column'].str.lower() | Converts text to lowercase. |
| df['column'].str.strip() | Removes leading and trailing whitespace from strings. |
| df['column'].str.replace(old, new) | Replaces occurrences of old with new within strings. |
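
These string methods are often chained. The following is a minimal sketch on a hypothetical column of inconsistently typed city names; regex=False is passed so that the hyphen is treated as plain text.

```python
import pandas as pd

# Hypothetical sample data with stray spaces, mixed case, and a hyphen
df = pd.DataFrame({"city": ["  New Delhi ", "MUMBAI", " new-delhi"]})

cleaned = (
    df["city"]
    .str.strip()                          # drop leading/trailing whitespace
    .str.lower()                          # normalise case
    .str.replace("-", " ", regex=False)   # treat "-" as a literal character
)

print(cleaned.tolist())  # ['new delhi', 'mumbai', 'new delhi']
```

After cleaning, the two spellings of the same city become identical, which matters for later steps such as grouping or removing duplicates.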
Outlier Handling
Outliers are extreme values that can distort analysis. They can be identified and managed as follows:
• Identify them using box plots, Z-scores, or the Interquartile Range (IQR).
• Remove extreme outliers or cap them at threshold values.
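
One way to apply the IQR approach is sketched below. The scores are hypothetical, and the 1.5 × IQR fences are a common convention assumed for this example rather than a rule stated in the text.

```python
import pandas as pd

# Hypothetical sample data with one extreme value
scores = pd.Series([52, 55, 58, 60, 61, 63, 65, 150])

# Interquartile range: spread of the middle 50% of the values
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the quartiles are outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (scores < lower) | (scores > upper)

# Option 1: remove the outliers
removed = scores[~is_outlier]

# Option 2: cap the outliers at the threshold values
capped = scores.clip(lower=lower, upper=upper)

print(scores[is_outlier])
print(removed.tolist())
print(capped.tolist())
```

Removing discards the row entirely, while capping keeps it with a less extreme value; which is appropriate depends on whether the outlier is an error or a genuine observation.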

