Page 218 - Touchpad AI
Case Study
A company has collected customer data from their online shopping website. However, the data contains issues
such as:
• Duplicated customer entries
• Missing phone numbers
• Extra spaces in names and cities
• Inconsistent formats in the "Date of Purchase"
• Incorrect category names
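Before cleaning, it can help to confirm that these issues are really present. The following is a minimal sketch, assuming a small hypothetical sample (not the case-study dataset) with the same column names; it counts duplicate rows, missing phone numbers and names with extra spaces, and lists the distinct category spellings.

```python
import pandas as pd

# Hypothetical sample (not the case-study data) with the same kinds of issues
sample = pd.DataFrame({
    'Customer_Name': ['Asha', ' Ravi ', 'Asha'],
    'Phone_Number': ['9000000001', None, '9000000001'],
    'City': ['pune', ' Pune ', 'pune'],
    'Product_Category': ['Electronics', 'Clths', 'Electronics'],
})

# Count rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())

# Count missing values in the phone column
missing_phones = int(sample['Phone_Number'].isna().sum())

# Count names that have leading or trailing spaces
names = sample['Customer_Name']
padded_names = int((names != names.str.strip()).sum())

# List the distinct category spellings so that typing errors stand out
categories = sorted(sample['Product_Category'].unique())

print("Duplicate rows:", duplicate_rows)
print("Missing phone numbers:", missing_phones)
print("Names with extra spaces:", padded_names)
print("Category spellings:", categories)
```

Checks like these are optional, but running them again after cleaning is a simple way to verify that each issue has been fixed.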
Let us clean this data using Pandas step by step.
# Step 1: Import pandas
import pandas as pd
# Step 2: Declare the messy dataset
data = {
    'Customer_Name': ['Meera', 'Raj', 'Amit', 'Meera', 'Diya', 'Rajat'],
    'Phone_Number': ['9876543210', None, '9123456780', '9876543210',
                     '9988776655', None],
    'City': ['delhi', 'Delhi', ' mumbai ', 'delhi', 'Mumbai', 'delhi'],
    'Product_Category': ['Electronics', 'Clothes', 'Electronic', 'Electronics',
                         'Clothes', 'Clths'],
    'Date_of_Purchase': ['2023-01-15', '15/01/2023', '2023.01.15', '2023-01-15',
                         '2023/01/15', '15-01-2023']
}
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
# Step 3: Remove duplicate rows
df = df.drop_duplicates()
# Step 4: Strip extra spaces from 'Customer_Name' and 'City'
df['Customer_Name'] = df['Customer_Name'].str.strip()
df['City'] = df['City'].str.strip().str.title() # Capitalize city names
# Step 5: Fill missing phone numbers with placeholder "Not Available"
df['Phone_Number'] = df['Phone_Number'].fillna("Not Available")
# Step 6: Fix incorrect product category names
df['Product_Category'] = df['Product_Category'].replace({

