Page 61 - Informatics_Practices_Fliipbook_Class12

            1.  Consider the groceryDF DataFrame and retrieve the details of the products (rows) that belong to
                the Clothes category.
            2.  Consider the employeeDF DataFrame and retrieve the following data:
                •   Details of employees earning more than 90000
                •   Details of employees working in the Accounts department



            2.7 Descriptive Statistics
            Oftentimes, we are interested in summary information about the data stored in a DataFrame. This information includes
            statistics such as the number of values in the DataFrame, the mean, the median, and the standard deviation. Pandas
            provides a powerful method, describe(), which returns a DataFrame comprising the following statistics for each
            column of the DataFrame that stores numerical data:
            1.   Count: It denotes the number of non-null values. Null values often provide a clue to the data analyst for taking a
                suitable action. For example, the null values may simply be ignored or replaced by the average value of the column.
            2.   Mean: It denotes the average value of the numbers in a column. As the average represents the centre point of the
                data, it is called a measure of central tendency of the data.

            3.   Standard Deviation: It measures the spread or dispersion of the values around the mean. A higher standard
                deviation indicates greater variability in the data. A low value of the standard deviation indicates that most of the
                values are close to the mean.
            4.   Minimum and Maximum: As indicated by the terms minimum and maximum, they denote the minimum and
                maximum values in a column, respectively. Together, these minimum and maximum values indicate the interval
                from which the values in a column are drawn.
            5.   Quartiles: 25%, 50%, and 75% represent the first quartile (25th percentile), the median (50th percentile), and the
                third quartile (75th percentile), respectively. These quartile values are often denoted by Q1, Q2, and Q3,
                respectively. Q1 is the value that 25% of the values do not exceed. Similarly, Q2 is the value that 50% of the
                values do not exceed. Therefore, Q2 is also called the median. Finally, Q3 is the value that 75% of the values do
                not exceed. These values help us understand the data distribution and identify potential outliers.
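
The statistics listed above can also be computed one at a time. The following is a minimal sketch, assuming the same Price and Quantity values as the groceryDF example in this section, typed in directly so that it runs without Grocery.csv. The names df, price, with_gap, and filled are introduced here only for illustration.

```python
import pandas as pd

# Same Price and Quantity values as the groceryDF example (typed in here
# so that the sketch does not depend on Grocery.csv).
df = pd.DataFrame({
    'Price':    [20, 60, 20, 70, 40, 30, 80, 30],
    'Quantity': [2, 5, 2, 1, 4, 2, 1, 5],
})

price = df['Price']
count = price.count()                 # number of non-null values
mean = price.mean()                   # average value
std = price.std()                     # sample standard deviation (divides by n - 1)
minimum, maximum = price.min(), price.max()
q1, q2, q3 = price.quantile([0.25, 0.50, 0.75])   # Q1, Q2 (median), Q3

print(count, mean, round(std, 6), minimum, maximum)
print(q1, q2, q3)
print(q2 == price.median())           # Q2 and the median are the same value

# A null value is left out of the count. Here it is replaced by the
# mean of the remaining values, one of the actions mentioned above.
with_gap = pd.Series([20, 60, None, 70])
print(with_gap.count())
filled = with_gap.fillna(with_gap.mean())
print(filled.tolist())
```

Note that std() in pandas divides by n - 1 rather than n, which is why it matches the std row reported by describe().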

             >>> import pandas as pd
             >>> groceryDF = pd.read_csv('Grocery.csv')
             >>> print(groceryDF)
                        Product   Category  Price  Quantity
                  0       Bread       Food     20         2
                  1        Milk       Food     60         5
                  2     Biscuit       Food     20         2
                  3  Bourn-Vita       Food     70         1
                  4        Soap    Hygiene     40         4
                  5       Brush    Hygiene     30         2
                  6   Detergent  Household     80         1
                  7     Tissues    Hygiene     30         5
             >>> summary = groceryDF.describe()
             >>> print('type(summary):', type(summary))
                  type(summary): <class 'pandas.core.frame.DataFrame'>
             >>> print(summary)
            output:
                               Price          Quantity
                  count       8.000000        8.000000
                  mean       43.750000        2.750000
                  std        23.260942        1.669046
                  min        20.000000        1.000000
                  25%        27.500000        1.750000
                  50%        35.000000        2.000000
                  75%        62.500000        4.250000
                  max        80.000000        5.000000
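
Because describe() returns a DataFrame, individual statistics can be looked up by row label and column name. The sketch below assumes the same Price and Quantity values as the example above, typed in directly. The names summary, iqr, lower, upper, and outliers are illustrative, and the 1.5 * IQR rule used to flag potential outliers is a common convention rather than something this section prescribes.

```python
import pandas as pd

# Same numeric columns as the groceryDF example, typed in directly.
df = pd.DataFrame({
    'Price':    [20, 60, 20, 70, 40, 30, 80, 30],
    'Quantity': [2, 5, 2, 1, 4, 2, 1, 5],
})

summary = df.describe()

# Rows of summary are labelled by statistic, columns by the numeric column name.
mean_price = summary.loc['mean', 'Price']
median_qty = summary.loc['50%', 'Quantity']
print(mean_price, median_qty)

# Interquartile range (Q3 - Q1) for Price and the 1.5 * IQR limits.
q1, q3 = summary.loc['25%', 'Price'], summary.loc['75%', 'Price']
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Prices outside [lower, upper] would be flagged as potential outliers.
outliers = df[(df['Price'] < lower) | (df['Price'] > upper)]
print(iqr, lower, upper, len(outliers))
```

For this data every price lies within the limits, so no row is flagged.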


                                                                             Data Handling using Pandas DataFrame  47