Page 223 - AI Ver 1.0 Class 10
• Online mode: Open-source Govt. portals, WHO websites.
• Offline mode: Surveys, questionnaires, experiments, personal interviews.
While handling data online or offline, the following points should always be remembered:
• The source of data should be authentic and reliable, because an arbitrary data source may provide wrong or unusable
data.
• Authentic data is a must for training an AI model properly.
• Privacy of data sources should always be kept in mind, as it is a fundamental right of everyone.
• Consent from the owner of the data should be taken before using their personal dataset.
• Data present in the public domain should preferably be used, if available.
Types of Data
Storing a dataset in the form of tables is usually the most suitable approach, because tabular data is the easiest to
maintain and analyse. The following are some of the popular tabular formats for storing data:
• Spreadsheets: A spreadsheet is a file in which data is stored in rows and columns under a filename. A spreadsheet
application is a powerful tool for analysis, visual representation, calculations and accounting purposes. Some popular
spreadsheet applications are MS Excel, OpenOffice Calc, etc.
• Comma Separated Values (CSV): These are files with the extension .csv that contain records in which the values are
separated by commas. Every line is a single record. These files can be created using Excel, Google Sheets, and also
simple text editors like Notepad.
• Structured Query Language (SQL): A domain-specific query language used to store, manage and retrieve structured
data in a database management system (DBMS). SQL itself is a language rather than a file format; the data it works on
is held in the tables of a database.
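As a minimal sketch of how the CSV and SQL formats described above relate, the following Python example parses a small piece of CSV text and loads its records into an in-memory SQLite table, then retrieves some of them with an SQL query. The student names, the marks, the table name `students` and the helper `load_csv_into_db` are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content: the first line is the header,
# every following line is one record with comma-separated values.
CSV_TEXT = """name,marks
Asha,82.5
Ravi,67.0
Meena,91.0
"""


def load_csv_into_db(csv_text):
    """Parse CSV text and store its records in an in-memory SQLite table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (name TEXT, marks REAL)")
    conn.executemany(
        "INSERT INTO students (name, marks) VALUES (?, ?)",
        [(row["name"], float(row["marks"])) for row in rows],
    )
    conn.commit()
    return conn


conn = load_csv_into_db(CSV_TEXT)

# An SQL query retrieves only the records that match a condition.
top_scorers = conn.execute(
    "SELECT name FROM students WHERE marks > 80 ORDER BY name"
).fetchall()
print(top_scorers)  # [('Asha',), ('Meena',)]
```

The same records could be opened in a spreadsheet application as well; the CSV file is only a plain-text way of writing down the table.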
Issues Related to Data
While collecting the data needed for Data Science, we might face issues such as:
• Erroneous Data: It means that the values in a dataset are not what is expected in that position. There
are two ways in which the data can be erroneous:
✶ Incorrect Values: The values at random places in the dataset are not correct. Either the data is mismatched
or it is not relevant to that position. For example, a Marks column has no decimal values where they are
expected, a phone number column holds an eight-digit landline number instead of a ten-digit mobile number,
or a Name column holds only the first name instead of the full name.
✶ Invalid or Null Values: It means that a value is either corrupted or has no meaning. When such values occur in
a dataset, they need to be removed, as they hold no value for data processing. For example, a phone number
that is not filled in properly, or an email address with nothing given alongside the @ sign.
• Missing Data: It means that data is not present at the desired location in a dataset. Missing data is not erroneous
data; a dataset with missing values is considered incomplete. For example, the email address or pin code may be
missing in a set of student details.
• Outlier Data: It means data that differs drastically from the rest of the data. Such unusual values need to be
removed from the dataset or replaced for accurate results. For example, a value of zero is entered as the marks
of a student who was absent, instead of recording an exemption. This will not give an accurate class average.
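The issues listed above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the student records are invented, the helper names `find_issues` and `average` are hypothetical, the email check is a deliberately rough pattern, and a missing value is represented by `None`.

```python
import re

# Hypothetical student records; None marks a value that was never filled in.
STUDENTS = [
    {"name": "Asha Verma", "phone": "9876543210", "email": "asha@example.com", "marks": 82.0},
    {"name": "Ravi", "phone": "26543210", "email": "ravi@example.com", "marks": 74.0},
    {"name": "Meena Iyer", "phone": "9123456780", "email": "@", "marks": 90.0},
    {"name": "Kabir Khan", "phone": "9988776655", "email": None, "marks": 0.0},  # absent
]


def find_issues(record):
    """Return labels for the data issues found in one record."""
    issues = []
    if record["email"] is None:
        issues.append("missing email")        # missing data
    elif not re.fullmatch(r"[^@\s]+@[^@\s]+", record["email"]):
        issues.append("invalid email")        # invalid value (rough check only)
    if not re.fullmatch(r"\d{10}", record["phone"]):
        issues.append("incorrect phone")      # not a ten-digit mobile number
    if len(record["name"].split()) < 2:
        issues.append("incorrect name")       # first name only
    return issues


def average(values):
    """Mean of the values, ignoring None entries."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


issues = {s["name"]: find_issues(s) for s in STUDENTS}

# Treating the absent student's zero as a real mark drags the average down.
marks_with_zero = [s["marks"] for s in STUDENTS]
# Recording the absence as an exemption (None) leaves it out of the average.
marks_exempt = [None if s["name"] == "Kabir Khan" else s["marks"] for s in STUDENTS]

print(issues)
print(average(marks_with_zero))  # 61.5
print(average(marks_exempt))     # 82.0
```

With the zero included, the class average falls from 82.0 to 61.5, which shows how a single outlier can distort the result.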

