Page 218 - Touchpad AI

Case Study
                A company has collected customer data from their online shopping website. However, the data contains issues such as:
                • Duplicated customer entries
                • Missing phone numbers
                • Extra spaces in names and cities
                • Inconsistent formats in the "Date of Purchase"
                • Incorrect category names
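
Before changing anything, it can help to measure how often each issue occurs. The following is a minimal sketch, not part of the case study itself: the DataFrame `sample` and its values are hypothetical and were made up purely for illustration. It counts duplicate rows, missing phone numbers, and names with extra spaces, and lists the different spellings of a city.

```python
import pandas as pd

# Hypothetical sample for illustration only; this is NOT the case-study dataset.
sample = pd.DataFrame({
    'Customer_Name': ['Asha', ' Ravi ', 'Asha'],
    'Phone_Number': ['9000000001', None, '9000000001'],
    'City': ['pune', ' Pune ', 'pune'],
})

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

# Count missing values in each column.
missing_per_column = sample.isna().sum()

# Count names that have leading or trailing spaces.
names = sample['Customer_Name']
padded_names = int((names != names.str.strip()).sum())

# List the distinct spellings of each city after trimming spaces.
city_spellings = sorted(sample['City'].str.strip().unique())

print("Duplicate rows:", duplicate_rows)
print("Missing phone numbers:", int(missing_per_column['Phone_Number']))
print("Names with extra spaces:", padded_names)
print("City spellings:", city_spellings)
```

Running the same checks again after cleaning is one simple way to confirm that each step had the intended effect.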
                Let us clean this data using Pandas step by step.

# Step 1: Import pandas
import pandas as pd

# Step 2: Declare the messy dataset
data = {
    'Customer_Name': ['Meera', 'Raj', 'Amit', 'Meera', 'Diya', 'Rajat'],
    'Phone_Number': ['9876543210', None, '9123456780', '9876543210', '9988776655', None],
    'City': ['delhi', 'Delhi', ' mumbai ', 'delhi', 'Mumbai', 'delhi'],
    'Product_Category': ['Electronics', 'Clothes', 'Electronic', 'Electronics', 'Clothes', 'Clths'],
    'Date_of_Purchase': ['2023-01-15', '15/01/2023', '2023.01.15', '2023-01-15', '2023/01/15', '15-01-2023']
}
df = pd.DataFrame(data)

print("Original Dataset:")
print(df)

# Step 3: Remove duplicate rows
df = df.drop_duplicates()

# Step 4: Strip extra spaces from 'Customer_Name' and 'City'
df['Customer_Name'] = df['Customer_Name'].str.strip()
df['City'] = df['City'].str.strip().str.title()  # Capitalize city names

# Step 5: Fill missing phone numbers with placeholder "Not Available"
df['Phone_Number'] = df['Phone_Number'].fillna("Not Available")

# Step 6: Fix incorrect product category names
df['Product_Category'] = df['Product_Category'].replace({


                 216    Touchpad Artificial Intelligence - XI