Page 125 - Data Science class 10

1.5.2. Median

            The "median" is another measure of central tendency. It is the "middle" value in a list of numbers. To find the
            median, the numbers must first be arranged in ascending or descending order, so you have to sort your list
            before you can find the median. Once the dataset is sorted from the smallest value to the largest, the value
            exactly in the middle of the set is the median.
            Example 1.6: Consider the dataset below, which has 5 values.
            Array = [13, 33, 57, 92, 32]
            First, sort the dataset.

            Sorted array = [13, 32, 33, 57, 92]
            The value at the 3rd position is the middle point of the sorted list, so 33 is the median of the array.

            In the above example, the dataset had an odd number of values, so we could easily find the middle point. But
            what if the dataset has an even number of records? In that situation, there are two middle points, and we take
            the average of the two to get the median.
            Let us take an example to illustrate how to calculate the median from an even number of records. Suppose one
            more record, with a value of 47, is added to the above example. Our sorted array is now:
            Sorted array = [13, 32, 33, 47, 57, 92]
            In this case, there are two numbers in the middle, 33 and 47, so the median is the average of these two values:
            Median = (33 + 47) / 2 = 40.
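
            The two cases above (odd and even number of values) can be sketched in Python as follows. This is only a
            minimal illustration: the function name median and the variable names are chosen here for the example and
            do not come from the text.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    if len(values) == 0:
        raise ValueError("median of an empty list is undefined")

    ordered = sorted(values)   # step 1: arrange the values in ascending order
    n = len(ordered)
    mid = n // 2               # index of the middle (or upper-middle) value

    if n % 2 == 1:
        # Odd number of values: there is exactly one middle value.
        return ordered[mid]

    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [13, 33, 57, 92, 32]   # dataset from Example 1.6
even_data = odd_data + [47]       # the same dataset with the record 47 added

print(median(odd_data))    # 33
print(median(even_data))   # 40.0
```

            Sorting inside the function means the caller can pass the values in any order, just as the list in
            Example 1.6 is given unsorted.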
            Mean vs Median
            Both the mean and the median represent the central tendency of a dataset. So when should we use the median
            instead of the mean? The median is the more reliable measure of central tendency when the dataset contains
            irregular values, also known as outliers.
            For example, consider the following situation. Your uncle gets his blood pressure checked every week. Due to
            an error in the device, the reading for one week was far too high:
            140, 142, 145, 220, 147

            Here, 220 is an erroneous reading from the instrument and is considered an outlier. In this case, the mean is
            158.8 and the median is 145. Because of this single device error, the mean deviates greatly from the regular
            blood pressure values, whereas the median still correctly represents the central point of the dataset. Thus,
            when there are outliers in the dataset, the median is a more effective measure of central tendency.
            The median is not immune to bad data, however: in the worst scenario, when erroneous values make up a large
            share of the dataset, the median itself may be one of them and give the wrong signal.
            When a value is known to be a recording error, as 220 is in the above data, the outlier should be discarded
            before the data is analysed.
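
            The mean and median quoted for the blood pressure readings can be checked with a short Python sketch. It
            assumes Python's standard statistics module is available; the variable names are illustrative only.

```python
import statistics

# Weekly blood pressure readings from the example; 220 is the faulty reading.
readings = [140, 142, 145, 220, 147]

mean_bp = statistics.mean(readings)      # sum of the values divided by their count
median_bp = statistics.median(readings)  # middle value after sorting

print(mean_bp)     # 158.8
print(median_bp)   # 145
```

            The single faulty reading pulls the mean well above every normal reading, while the median stays among
            the regular values.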
            The median is also preferred when the data distribution contains some extremely low or extremely high values.
            For instance, the median is typically preferred over the mean when determining compensation, for the simple
            reason that the median is far less affected by outliers (abnormally low or high numbers) than the mean.

            The median offers a helpful measure of the centre of a dataset. By comparing the median to the mean, you can
            get an idea of the distribution of a dataset. When the mean and the median are similar, the values are spread
            more or less evenly on either side of the centre; when they differ widely, as in the blood pressure example,
            the dataset is likely to be skewed or to contain outliers.
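
            As a rough illustration of comparing the two measures, the sketch below computes the gap between the mean
            and the median for the blood pressure readings, with and without the faulty value of 220. The helper name
            mean_median_gap is hypothetical and introduced only for this example. A small gap is merely a hint that
            the data is balanced, not a proof.

```python
import statistics


def mean_median_gap(values):
    """Return mean minus median; a result near 0 hints at a balanced dataset."""
    return statistics.mean(values) - statistics.median(values)


with_outlier = [140, 142, 145, 220, 147]   # readings including the faulty 220
without_outlier = [140, 142, 145, 147]     # the same readings with 220 discarded

# A large gap: the mean is pulled upwards by the outlier (about 13.8 here).
print(mean_median_gap(with_outlier))

# No gap: the mean and the median are both 143.5.
print(mean_median_gap(without_outlier))    # 0.0
```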








