Page 117 - Data Science class 10

               • Testing, development and training environments: Developers frequently want to work with real data, but
              providing a copy of your entire production database would be too costly in time and disk space, especially if you
              need to provide a copy for each developer. Subsetting lets you work with data that contains all the necessary
              links between tables for your programs to function, but for a fraction of the cost.
               • Multiple organisations or departments: You may have a database that comprises data for a number of
              different organisations or departments, and providing the whole database would expose data to individuals
              without the correct permissions. Instead, you can provide a subset containing just the data that is relevant to a
              particular organisation or department.
               • Old data: You may want to remove a part of your data based on some criteria. For example, you might want
              to clear out data older than a certain date, or data which you can no longer keep in order to comply with
              regulations like the GDPR.
            Subsetting may be advantageous for the following reasons:
               • To restrict or divide the time range
               • To select cross sections of data
               • To select particular kinds of time series
               • To exclude particular observations
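
            Three of the reasons listed above can be illustrated with a minimal Python sketch. The records, the field
            names (date, region, sales) and the chosen year are hypothetical and serve only as an illustration; they do not
            come from any real dataset.

```python
from datetime import date

# Hypothetical daily records; field names and values are assumptions for illustration.
records = [
    {"date": date(2021, 1, 5), "region": "North", "sales": 120},
    {"date": date(2021, 6, 17), "region": "South", "sales": 95},
    {"date": date(2022, 3, 2), "region": "North", "sales": 140},
    {"date": date(2022, 9, 21), "region": "South", "sales": None},  # missing value
    {"date": date(2023, 2, 11), "region": "North", "sales": 160},
]

# Restrict the time range: keep only the records dated in 2022.
start, end = date(2022, 1, 1), date(2022, 12, 31)
in_2022 = [r for r in records if start <= r["date"] <= end]

# Select a cross section: keep only the records for one region.
north_only = [r for r in records if r["region"] == "North"]

# Exclude particular observations: drop records with a missing sales value.
complete = [r for r in records if r["sales"] is not None]

print(len(in_2022), len(north_only), len(complete))
```

            Each subset is built as a new list, so the original records remain unchanged and can be subset again in a
            different way.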


            1.2. SOME METHODS OF SUBSETTING

            Subsetting is a very important component of data management. The appropriate method of subsetting
            may differ depending upon how the data is collected. There are many methods of subsetting data,
            three of which are discussed below:
               • Row-based subsetting
               • Column-based subsetting
               • Data-based subsetting

            1.2.1. Row-based Subsetting
            In this method of subsetting, we take some rows from the top or bottom of the table. For example, you may need
            to subset the rows of a data frame because you are interested in understanding a subpopulation in a given
            sample. A small example of data collected over all states of India is given below, but you may be required to
            analyse only the rows that relate to participants from the state of UP.


                                             ID         State      Height      Weight
                                             001         UP          71          190
                                             002         UP          69          176
                                             003         MP          64          130
                                             004         MP          65          154
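
            Row-based subsetting of the small table above can be sketched in plain Python as follows. Storing the table
            as a list of dictionaries and the variable names used here are assumptions made for illustration; a data
            frame library would offer equivalent operations.

```python
# The small table from the text, stored as a list of rows (one dictionary per row).
participants = [
    {"ID": "001", "State": "UP", "Height": 71, "Weight": 190},
    {"ID": "002", "State": "UP", "Height": 69, "Weight": 176},
    {"ID": "003", "State": "MP", "Height": 64, "Weight": 130},
    {"ID": "004", "State": "MP", "Height": 65, "Weight": 154},
]

# Row-based subsetting by position: the first two and the last two rows.
top_rows = participants[:2]
bottom_rows = participants[-2:]

# Row-based subsetting by condition: only the participants from UP.
up_rows = [row for row in participants if row["State"] == "UP"]

for row in up_rows:
    print(row["ID"], row["State"], row["Height"], row["Weight"])
```

            In this table the UP participants happen to be the first two rows, so selecting by position and selecting by
            condition give the same result. In general only the condition guarantees that every UP row is included.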

            1.2.2. Column-based Subsetting

            Occasionally the original dataset may contain a large number of columns, not all of which are
            essential to perform the analysis. You may then choose specific columns from the dataset. This process of
            subsetting is known as column-based subsetting. In the example below, we have a dataset of 5 columns and 8
            rows. You may require only subset A, which uses 3 columns and all 8 rows, or subset B, which has 3
            columns and 4 rows.
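
            The idea of subsets A and B can be sketched in Python as shown here. The column names, the values and the
            choice of which columns and rows go into each subset are all hypothetical; only the shapes (5 columns and 8
            rows, reduced to 3 columns with 8 or 4 rows) follow the text. The helper select_columns is written for this
            illustration and is not a library function.

```python
# Hypothetical dataset with 5 columns and 8 rows; all names and values are invented.
columns = ["ID", "State", "Age", "Height", "Weight"]
rows = [
    ("001", "UP", 34, 71, 190),
    ("002", "UP", 28, 69, 176),
    ("003", "MP", 41, 64, 130),
    ("004", "MP", 36, 65, 154),
    ("005", "UP", 25, 68, 160),
    ("006", "MP", 52, 66, 148),
    ("007", "UP", 30, 70, 182),
    ("008", "MP", 47, 63, 139),
]
dataset = [dict(zip(columns, r)) for r in rows]


def select_columns(table, names):
    """Return a new table that keeps only the named columns of each row."""
    return [{name: row[name] for name in names} for row in table]


# Subset A: 3 of the 5 columns, all 8 rows.
subset_a = select_columns(dataset, ["ID", "Height", "Weight"])

# Subset B: 3 columns and 4 rows (taking the first 4 rows is an assumption).
subset_b = select_columns(dataset[:4], ["ID", "State", "Age"])

print(len(subset_a), len(subset_a[0]))
print(len(subset_b), len(subset_b[0]))
```

            Subset B combines column-based and row-based subsetting: the rows are reduced first and the columns are
            selected afterwards.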




                                                                               Use of Statistics in Data Science  115