Page 237 - Touchpad AI

9.  What is the role of StandardScaler() in data standardisation?
                        Ans.  StandardScaler(), from scikit-learn, applies Z-score normalization: it rescales each feature so that its
                             mean is 0 and its standard deviation is 1. This keeps features with large values from dominating ML
                             models that are sensitive to scale.
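Illustration: a minimal sketch of the calculation described above, written with NumPy so that it runs without scikit-learn. The score values are made up. StandardScaler divides by the population standard deviation, which is why ddof=0 is used here.

```python
import numpy as np

# Hypothetical feature column (exam scores); the values are made up.
scores = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

# Z-score: subtract the mean, then divide by the standard deviation.
# ddof=0 gives the population standard deviation.
mean = scores.mean()
std = scores.std(ddof=0)
z_scores = (scores - mean) / std

print(z_scores)
print(z_scores.mean())  # approximately 0
print(z_scores.std())   # approximately 1
```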
                    10.  Why is it important to clean and standardise data before combining datasets from different sources?
                        Ans.  Without cleaning and standardisation, different formats, spellings, or missing values can cause errors
                             when merging datasets. Clean and consistent data ensures smooth integration, correct comparisons,
                             and more accurate results in AI models and analysis.
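Illustration: a small sketch of how inconsistent spellings can break a merge. The two tables, their column names, and their values are hypothetical.

```python
import pandas as pd

# Hypothetical data from two sources; names and values are made up.
marks = pd.DataFrame({"city": ["Delhi", " mumbai ", "PUNE"],
                      "marks": [82, 75, 91]})
fees = pd.DataFrame({"city": ["delhi", "Mumbai", "Pune"],
                     "fees": [1200, 1500, 1100]})

# Merging the raw tables matches nothing, because the city
# spellings, spacing, and letter case differ between sources.
raw_merge = marks.merge(fees, on="city")
print(len(raw_merge))  # 0

# Standardise the key column in both tables, then merge again.
for df in (marks, fees):
    df["city"] = df["city"].str.strip().str.title()

clean_merge = marks.merge(fees, on="city")
print(len(clean_merge))  # 3
```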

                 B.  Long answer type questions.
                    1.  What is data cleaning? Describe the steps involved in cleaning data with examples.
                        Ans.  Data cleaning means fixing problems in raw data such as missing values, duplicates, or spelling mistakes.
                             The steps include identifying issues, fixing or removing wrong data, formatting it properly, and validating
                             the final output. For example, we may remove rows with missing names, correct misspelled cities like
                             "Delhii", and ensure dates follow the same format.
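Illustration: the four steps above, sketched in Pandas on a made-up table. The column names and the two date formats are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical raw records; the values are made up.
raw = pd.DataFrame({
    "name": ["Asha", None, "Ravi", "Meena"],
    "city": ["Delhii", "Mumbai", "Delhi", "Mumbai"],
    "joined": ["15/01/2024", "2024-01-20", "03/02/2024", "2024-02-11"],
})

# Step 1: identify issues (count missing values per column).
print(raw.isnull().sum())

# Step 2: fix or remove wrong data.
clean = raw.dropna(subset=["name"]).copy()
clean["city"] = clean["city"].replace({"Delhii": "Delhi"})

# Step 3: format. Try each known date format in turn; values that
# do not match a format become NaT and are filled by the next one.
day_first = pd.to_datetime(clean["joined"], format="%d/%m/%Y", errors="coerce")
iso = pd.to_datetime(clean["joined"], format="%Y-%m-%d", errors="coerce")
clean["joined"] = day_first.fillna(iso)

# Step 4: validate the final output.
assert clean["name"].notna().all()
assert clean["joined"].notna().all()
assert set(clean["city"]) == {"Delhi", "Mumbai"}
print(clean)
```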
                    2.  Describe the steps to work with a dataset from Kaggle using Pandas.
                        Ans.  First, the dataset is downloaded from Kaggle as a CSV file. Then it is loaded into a DataFrame using
                             pd.read_csv(). We can use head(), info(), and describe() to explore the data. Cleaning is done using
                             dropna() and fillna(), and values are standardised using methods like .str.strip() and pd.to_datetime().
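Illustration: a sketch of this workflow. An in-memory string stands in for the downloaded CSV so that the example runs on its own; in practice the path of the downloaded file would be passed to pd.read_csv(). The column names and values are made up.

```python
import io
import pandas as pd

# Stand-in for a CSV file downloaded from Kaggle (contents are made up).
csv_file = io.StringIO(
    "name,city,score,exam_date\n"
    "Asha, Delhi ,88,2024-03-01\n"
    "Ravi,Mumbai,,2024-03-01\n"
    ",Pune,67,2024-03-02\n"
)

df = pd.read_csv(csv_file)

# Explore the data.
print(df.head())
df.info()
print(df.describe())

# Clean: drop rows with no name, fill missing scores with the mean.
df = df.dropna(subset=["name"]).copy()
df["score"] = df["score"].fillna(df["score"].mean())

# Standardise: trim stray spaces and parse dates.
df["city"] = df["city"].str.strip()
df["exam_date"] = pd.to_datetime(df["exam_date"])
print(df)
```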
                    3.  What is data transformation? Explain it with two examples.

                        Ans.  Data transformation changes data into a suitable format for analysis. For example, converting “Yes” to 1
                             and “No” to 0 is a transformation. Another example is converting weight from pounds to kilograms using
                             a formula in Pandas.
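Illustration: both transformations sketched in Pandas. The column names (smoker, weight_lb) and values are hypothetical; the conversion factor is 1 lb = 0.45359237 kg.

```python
import pandas as pd

# Hypothetical survey data; names and values are made up.
df = pd.DataFrame({
    "smoker": ["Yes", "No", "Yes"],
    "weight_lb": [110.0, 154.0, 200.0],
})

# Example 1: convert "Yes"/"No" to 1/0.
df["smoker_flag"] = df["smoker"].map({"Yes": 1, "No": 0})

# Example 2: convert pounds to kilograms with a formula.
LB_TO_KG = 0.45359237
df["weight_kg"] = df["weight_lb"] * LB_TO_KG

print(df)
```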

                    4.  Explain the difference between data cleaning and data transformation.
                        Ans.  Data cleaning fixes wrong or missing data, like removing duplicates or correcting typos. Data
                             transformation changes the format or structure of the data, such as converting units or reformatting
                             dates. Both steps are essential before analysis or modelling.
                    5.  Explain Z-score normalization and Min-Max scaling. When is each used?
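                        Ans.  Z-score normalization rescales data to a mean of 0 and a standard deviation of 1, using
                             z = (x - mean) / standard deviation. It is useful for machine learning models that are sensitive to
                             feature scale. Min-Max scaling maps values to a fixed range (usually 0–1), using
                             (x - min) / (max - min). It is useful for comparing features with different scales on one bounded range.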
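Illustration: both methods applied column by column to a made-up table with two features on very different scales. The column names and values are hypothetical, and the population standard deviation (ddof=0) is an assumption made for this sketch.

```python
import pandas as pd

# Hypothetical features on very different scales; values are made up.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "income": [20000.0, 35000.0, 50000.0, 95000.0],
})

# Z-score normalization: each column ends up with mean 0 and std 1.
z = (df - df.mean()) / df.std(ddof=0)

# Min-Max scaling: each column ends up in the range 0 to 1.
minmax = (df - df.min()) / (df.max() - df.min())

print(z)
print(minmax)
```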

                 C.  Competency-based questions:     HOTS, 21st Century Skills, #Experiential Learning
                    1.   You are given a dataset of students’ exam scores. Some rows have blank values in the “City” column and
                        duplicate entries. Describe the steps you would take to clean this dataset using Pandas.
                        Ans.   First, I would use drop_duplicates() to remove repeated entries. Then, I would check for blanks using
                             isnull() and fill them using fillna() with a default city name or remove the rows using dropna(). This would
                             make the data cleaner for analysis.
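Illustration: a sketch of these steps on a made-up score table. It assumes the blank cities were read in as missing values (NaN), and "Unknown" is a hypothetical default city name.

```python
import pandas as pd

# Hypothetical exam scores; one duplicate row and one blank city.
scores = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena"],
    "City": ["Delhi", None, None, "Pune"],
    "score": [88, 75, 75, 91],
})

# Remove repeated entries.
scores = scores.drop_duplicates()

# Check how many City values are blank.
print(scores["City"].isnull().sum())  # 1

# Option A: fill blanks with a default city name.
filled = scores.fillna({"City": "Unknown"})

# Option B: remove the rows that have no city.
dropped = scores.dropna(subset=["City"])

print(filled)
print(dropped)
```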
                    2.   A school wants to analyse student attendance data from three branches. The formats of roll numbers, names,
                        and dates differ across files. How will you make this data usable?
                        Ans.   I would standardise roll numbers to a format like “STU2025_001”, use .str.title() to format names,
                             and apply pd.to_datetime() for dates. This ensures that all three datasets follow a common structure,
                             enabling smooth merging and comparison.
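Illustration: a sketch of this approach. The three branch tables, their roll-number styles, and their date formats are all made up, and standardise() is a hypothetical helper written for this example.

```python
import pandas as pd

# Hypothetical attendance files from three branches (values made up).
branch_a = pd.DataFrame({"roll": ["1", "2"],
                         "name": ["asha verma", "RAVI KUMAR"],
                         "date": ["01/04/2025", "02/04/2025"]})
branch_b = pd.DataFrame({"roll": ["stu-003", "stu-004"],
                         "name": [" meena das", "Imran Ali "],
                         "date": ["2025-04-01", "2025-04-02"]})
branch_c = pd.DataFrame({"roll": ["R/5", "R/6"],
                         "name": ["TARA SINGH", "john roy"],
                         "date": ["01-Apr-2025", "02-Apr-2025"]})


def standardise(df, date_format):
    """Return a copy of df with common roll, name, and date formats."""
    out = df.copy()
    # Keep the trailing digits of the roll number and pad to 3 digits.
    digits = out["roll"].astype(str).str.extract(r"(\d+)$")[0]
    out["roll"] = "STU2025_" + digits.str.zfill(3)
    # Trim spaces and use title case for names.
    out["name"] = out["name"].str.strip().str.title()
    # Parse dates using the format this branch is known to use.
    out["date"] = pd.to_datetime(out["date"], format=date_format)
    return out


combined = pd.concat(
    [
        standardise(branch_a, "%d/%m/%Y"),
        standardise(branch_b, "%Y-%m-%d"),
        standardise(branch_c, "%d-%b-%Y"),
    ],
    ignore_index=True,
)
print(combined)
```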






                                                                      Theoretical and Practical Aspects of Data Processing  235