Page 393 - AI Ver 3.0 class 10_Flipbook
Step 3: Create a Document Vector Table.

Document  Amit  and  Amita  are  twins  lives  with  his  grandparents  in  Shimla  her  parents  Delhi
Doc 1      1     1     1     1     1      0      0    0        0         0     0      0      0       0
Doc 2      1     0     0     0     0      1      1    1        1         1     1      0      0       0
Doc 3      0     0     1     0     0      1      1    0        0         1     0      1      1       1
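The table above can be reproduced with a short script. A minimal sketch, assuming the three documents are the sentences recoverable from the table's rows ("Amit and Amita are twins", "Amit lives with his grandparents in Shimla", "Amita lives with her parents in Delhi"):

```python
# Build a document vector (bag-of-words) table for a small corpus.
# The three sentences are assumed from the table's rows; the vocabulary
# order matches the table's column order.
documents = [
    "Amit and Amita are twins",
    "Amit lives with his grandparents in Shimla",
    "Amita lives with her parents in Delhi",
]
vocabulary = ["Amit", "and", "Amita", "are", "twins", "lives", "with",
              "his", "grandparents", "in", "Shimla", "her", "parents", "Delhi"]

def document_vector(doc, vocab):
    """Return a 1/0 vector marking which vocabulary words appear in doc."""
    words = doc.split()
    return [1 if w in words else 0 for w in vocab]

for i, doc in enumerate(documents, start=1):
    print(f"Doc {i}:", document_vector(doc, vocabulary))
```

Each printed row matches the corresponding row of the table.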
C. Competency-based/Application-based questions. (21st Century Skills: #Critical Thinking #Information Literacy)
1. You have a dataset with a variety of abbreviations and misspelled words. How would you use text normalisation to
standardise this dataset for further processing?
Ans. To standardise the dataset, text normalisation would involve:
• Correcting misspellings (e.g., "definately" → "definitely").
• Expanding abbreviations (e.g., "btw" → "by the way", "lol" → "laughing out loud").
• Converting to lowercase to ensure consistency.
• Expanding contractions (e.g., "I'm" → "I am").
These actions help reduce variation in the data and improve model understanding.
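The steps above can be sketched in Python. This is a minimal illustration only; the small lookup dictionaries are placeholders for the examples in the answer, not a real spell-checker or abbreviation library:

```python
import re

# Hypothetical lookup tables covering the examples above; a real system
# would use a spell-checker and a much fuller abbreviation list.
MISSPELLINGS  = {"definately": "definitely"}
ABBREVIATIONS = {"btw": "by the way", "lol": "laughing out loud"}
CONTRACTIONS  = {"i'm": "i am", "don't": "do not"}

def normalise(text):
    text = text.lower()                    # convert to lowercase for consistency
    words = re.findall(r"[a-z']+", text)   # simple tokenisation
    out = []
    for w in words:
        w = MISSPELLINGS.get(w, w)         # correct misspellings
        w = ABBREVIATIONS.get(w, w)        # expand abbreviations
        w = CONTRACTIONS.get(w, w)         # expand contractions
        out.append(w)
    return " ".join(out)

print(normalise("BTW I'm definately coming"))  # → "by the way i am definitely coming"
```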
2. Consider the following 4 documents in a corpus:
1. Document 1: "I love programming."
2. Document 2: "Programming is fun."
3. Document 3: "I love coding."
4. Document 4: "Coding is awesome."
Prepare the Term Frequency-Inverse Document Frequency (TF-IDF) for the given corpus.
Ans.
Term Frequency

Document  i  love  programming  is  fun  coding  awesome
Doc-1     1   1        1         0   0     0        0
Doc-2     0   0        1         1   1     0        0
Doc-3     1   1        0         0   0     1        0
Doc-4     0   0        0         1   0     1        1
Document Frequency

Term  i  love  programming  is  fun  coding  awesome
DF    2   2        2         2   1     2        1
Inverse Document Frequency

IDF(word) = Total Number of Documents / Document Frequency of the word

Term  i    love  programming  is   fun  coding  awesome
IDF   4/2  4/2      4/2       4/2  4/1   4/2      4/1
    = 2    2        2         2    4     2        4
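The tables above can be checked with a short script. A minimal sketch using the textbook's formula IDF = total documents / document frequency (some libraries additionally apply a logarithm, which is omitted here to match the tables):

```python
documents = [
    "I love programming",
    "Programming is fun",
    "I love coding",
    "Coding is awesome",
]
terms = ["i", "love", "programming", "is", "fun", "coding", "awesome"]

# Term frequency: count of each term in each (lowercased) document.
tf = [[doc.lower().split().count(t) for t in terms] for doc in documents]

# Document frequency: number of documents containing the term.
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(terms))]

# Inverse document frequency, per the formula above (no logarithm).
idf = [len(documents) / d for d in df]

print("TF :", tf)
print("DF :", df)    # → [2, 2, 2, 2, 1, 2, 1]
print("IDF:", idf)   # → [2.0, 2.0, 2.0, 2.0, 4.0, 2.0, 4.0]
```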

