Page 62 - Informatics_Practices_Fliipbook_Class12

Sometimes, we need summary statistics only for a specific column. For example, we may obtain the summary
        statistics on Price by selecting the Price column, as follows:

         >>> groceryDF.describe()['Price']
        output:
              count     8.000000
              mean     43.750000
              std      23.260942
              min      20.000000
              25%      27.500000
              50%      35.000000
              75%      62.500000
              max      80.000000
              Name: Price, dtype: float64
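        The same summary can also be obtained by selecting the column first and then calling describe() on the
        resulting Series. The sketch below checks that both orders agree. It uses a small hypothetical DataFrame:
        the name sampleDF and its values are assumptions made for illustration, not the data in groceryDF.

```python
import pandas as pd

# Hypothetical data for illustration only; not the contents of groceryDF.
sampleDF = pd.DataFrame({
    'Product': ['Pen', 'Soap', 'Rice', 'Milk'],
    'Price': [10, 30, 60, 20],
    'Quantity': [5, 2, 1, 4],
})

# Describe the whole DataFrame, then pick the Price column.
fromFrame = sampleDF.describe()['Price']

# Select the Price column first, then describe only that Series.
fromSeries = sampleDF['Price'].describe()

# Compare the two results statistic by statistic.
print((fromFrame == fromSeries).all())
```

        Selecting the column first avoids computing statistics for columns that are not needed.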
        By default, the describe() method summarises only the columns that hold numerical values. However,
        we can also include summary information about the columns comprising arbitrary object type values by specifying
        include = 'all' as the keyword argument. For each non-numeric column, the summary information includes:
        1.   Count (count): It denotes the number of non-null values in the column. As mentioned above, null values serve as pointers
           to missing data for the data analyst.
        2.   Unique (unique): It indicates the number of unique values in the column, thus indicating the level of diversity in the
           categorical data.
        3.   Top (top): It denotes the most frequent value (called mode) in the column, i.e., the value that occurs most often in the
           column.
        4.   Frequency (freq): It denotes the number of times the most frequent value occurs in the column.
         >>> groceryDF.describe(include='all')
        output:
                      Product  Category      Price  Quantity
              count         8         8   8.000000  8.000000
              unique        8         3        NaN       NaN
              top       Bread      Food        NaN       NaN
              freq          1         4        NaN       NaN
              mean        NaN       NaN  43.750000  2.750000
              std         NaN       NaN  23.260942  1.669046
              min         NaN       NaN  20.000000  1.000000
              25%         NaN       NaN  27.500000  1.750000
              50%         NaN       NaN  35.000000  2.000000
              75%         NaN       NaN  62.500000  4.250000
              max         NaN       NaN  80.000000  5.000000
        Note that in the column Category, Food appears most frequently (4 times). So, it is shown as the top value in the
        column. However, in the column Product, each value appears only once. So, the first value Bread is shown as the top
        value in the column. Further, note that the counts of items in the columns Price and Quantity are shown as floating
        point numbers. Indeed, as the mean, std, and quartile values are floating point values, the type of each entire column
        in the DataFrame groceryDF.describe() is set to float64, as shown below:

         >>> groceryDF.describe().dtypes
        output:
              Price       float64
              Quantity    float64
              dtype: object
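
        Because every column of the summary is float64, a count appears as 8.000000 rather than 8. When integer
        counts are needed, the row labelled count can be selected and converted explicitly. A minimal sketch,
        again using a hypothetical sampleDF (an assumption for illustration) rather than groceryDF:

```python
import pandas as pd

# Hypothetical data for illustration only; not the contents of groceryDF.
sampleDF = pd.DataFrame({
    'Price': [10, 30, 60, 20],
    'Quantity': [5, 2, 1, 4],
})

summary = sampleDF.describe()

# Every column of the summary shares the same floating point type.
print(summary.dtypes)

# Select the row labelled 'count' and convert its values to integers.
counts = summary.loc['count'].astype(int)
print(counts)
```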

               Pandas provides a powerful method, describe(), which returns a DataFrame comprising the statistics (count,
               mean, standard deviation, minimum, maximum, and quartiles) for each column in the DataFrame that stores
               numerical data. When include = 'all' is specified as the keyword argument, the summary information also includes
               the count, unique, top, and frequency statistics for each non-numeric column.
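
        The four statistics reported for a non-numeric column can be cross-checked by computing them directly
        with other Series methods. The sketch below is one way to do so. The DataFrame sampleDF and its values are
        hypothetical, chosen only for illustration, and are not the data in groceryDF.

```python
import pandas as pd

# Hypothetical data for illustration only; not the contents of groceryDF.
sampleDF = pd.DataFrame({
    'Product': ['Pen', 'Soap', 'Rice', 'Milk'],
    'Category': ['Stationery', 'Hygiene', 'Food', 'Food'],
})

category = sampleDF['Category']
valueCounts = category.value_counts()   # occurrences of each distinct value

count = category.count()       # number of non-null values
unique = category.nunique()    # number of distinct values
top = valueCounts.idxmax()     # most frequent value (the mode)
freq = valueCounts.max()       # number of times the top value occurs

print(count, unique, top, freq)

# The same statistics as reported by describe() for this column.
described = sampleDF.describe(include='all')['Category']
print(described)
```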


          48   Touchpad Informatics Practices-XII