Page 213 - Touhpad Ai

Pandas provides built-in functions to:
                 •  Remove duplicate rows
                 •  Fill or drop missing values
                 •  Fix wrong data formats
                 •  Trim extra spaces
                 •  Replace incorrect values
                 Using these techniques, data scientists prepare data that is accurate and organised, which helps in building better
                 AI models.
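
                 The last three items in the list above can be sketched as follows. This is a minimal illustration, not a
                 complete cleaning routine: the sample names, cities and ages are made up, and it assumes the standard
                 pandas calls str.strip(), replace() and pd.to_numeric().

```python
import pandas as pd

# Hypothetical sample data: extra spaces, a misspelt city and ages stored as text
data = {
    'Name': ['  Aman', 'Riya  ', ' Karan '],
    'City': ['Delhi', 'Dehli', 'Mumbai'],
    'Age': ['17', '18', 'sixteen']
}
df = pd.DataFrame(data)

# Trim extra spaces from the start and end of each name
df['Name'] = df['Name'].str.strip()

# Replace an incorrect value with the correct one
df['City'] = df['City'].replace('Dehli', 'Delhi')

# Fix the data format: convert text to numbers.
# Values that cannot be converted (such as 'sixteen') become NaN.
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

print(df)
```

                 Using errors='coerce' turns unreadable entries into missing values instead of stopping the program, so
                 they can be filled or dropped afterwards.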

                 Removing Duplicate Values
                 Duplicate rows can cause confusion in analysis. Pandas makes it easy to remove them.

                    Program 21: To remove duplicate values from a DataFrame
                   import pandas as pd
                   data = {'Name': ['Aman', 'Riya', 'Aman'], 'Age': [17, 18, 17]}
                   df = pd.DataFrame(data)
                   print("Original Data")
                   print(df)
                   df_cleaned = df.drop_duplicates()
                   print("Cleaned Data (after removing duplicate values)")
                   print(df_cleaned)
                   Output:
                   Original Data
                          Name        Age
                   0      Aman         17
                   1      Riya         18
                   2      Aman         17
                   Cleaned Data (after removing duplicate values)
                          Name        Age
                   0      Aman         17
                   1      Riya         18
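
                 By default, drop_duplicates() treats two rows as duplicates only when they match in every column, and
                 it keeps the first of them. The sketch below uses made-up data to show how the subset and keep
                 parameters change this behaviour. It is an illustration only, not part of Program 21.

```python
import pandas as pd

# Hypothetical data: 'Aman' appears twice, but with different ages
data = {'Name': ['Aman', 'Riya', 'Aman'], 'Age': [17, 18, 16]}
df = pd.DataFrame(data)

# No row is removed here, because no two rows match in every column
df_all_columns = df.drop_duplicates()
print(df_all_columns)

# Compare only the 'Name' column and keep the last matching row
df_last = df.drop_duplicates(subset='Name', keep='last')
print(df_last)
```

                 Note that the remaining rows keep their original index labels.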

                 Handling Missing Values
                 Missing values usually appear as NaN (Not a Number) in a dataset. We can do the following to work with
                 missing values:
                 •  Check for missing values: Use the df.isnull() function
                 •  Remove rows with missing values: Use the df.dropna() function

                    Program 22: To handle missing values in a DataFrame

                   import pandas as pd
                   # Create sample data with missing values
                   data = {
                       'Name': ['Aman', 'Riya', 'Karan', 'Sia', None],
                       'Age': [17, None, 16, 18, 17]
                   }
                   df = pd.DataFrame(data)


                                                                      Theoretical and Practical Aspects of Data Processing  211