Page 393 - AI Ver 3.0 class 10_Flipbook

Step 3: Create a Document Vector Table.

    Document  Amit  and  Amita  are  twins  lives  with  his  grandparents  in  Shimla  her  parents  Delhi
    Doc-1     1     1    1      1    1      0      0     0    0             0   0       0    0        0
    Doc-2     1     0    0      0    0      1      1     1    1             1   1       0    0        0
    Doc-3     0     0    1      0    0      1      1     0    0             1   0       1    1        1
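The table can be reproduced with a few lines of Python. This is a minimal sketch, not the book's own method: the three sentences are an assumption, reconstructed by reading off which words are marked 1 in each row, and the helper names `build_vocabulary` and `document_vector` are hypothetical.

```python
# Assumed sentences, reconstructed from the 1s in each row of the table.
documents = [
    "Amit and Amita are twins",
    "Amit lives with his grandparents in Shimla",
    "Amita lives with her parents in Delhi",
]

def build_vocabulary(docs):
    """Collect the unique words in order of first appearance."""
    vocabulary = []
    for doc in docs:
        for word in doc.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def document_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    words = doc.split()
    return [words.count(term) for term in vocabulary]

vocabulary = build_vocabulary(documents)
vectors = [document_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each printed list corresponds to one row of the table, with the columns in the same order as the vocabulary.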

                 C.  Competency-based/Application-based questions.
                     21st Century Skills: #Critical Thinking #Information Literacy
                    1.  You have a dataset with a variety of abbreviations and misspelled words. How would you use text normalisation to
                       standardise this dataset for further processing?
                   Ans.  To standardise the dataset, text normalisation would involve:

                       •  Correcting misspellings (e.g., "definately" → "definitely").
                       •  Expanding abbreviations (e.g., "btw" → "by the way", "lol" → "laughing out loud").
                       •  Converting to lowercase to ensure consistency.
                       •  Expanding contractions (e.g., "I'm" → "I am").
                       These actions reduce variation in the data, so the same word always appears in the same form.
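The four normalisation actions in the answer can be sketched in Python as follows. The lookup tables hold only the examples from the answer and are assumptions for illustration; a real dataset would need much larger dictionaries or a spell-checking library. The function name `normalise` is hypothetical.

```python
import re

# Hypothetical lookup tables, filled only with the examples from the answer.
CONTRACTIONS = {"i'm": "i am"}
ABBREVIATIONS = {"btw": "by the way", "lol": "laughing out loud"}
CORRECTIONS = {"definately": "definitely"}

def normalise(text):
    """Lowercase the text, then expand or correct each known word."""
    text = text.lower()

    def replace(match):
        word = match.group(0)
        for table in (CONTRACTIONS, ABBREVIATIONS, CORRECTIONS):
            if word in table:
                return table[word]
        return word  # unknown words pass through unchanged

    # A word is a run of letters, optionally with one internal apostrophe.
    return re.sub(r"[a-z]+(?:'[a-z]+)?", replace, text)

print(normalise("BTW I'm definately coming lol"))
# by the way i am definitely coming laughing out loud
```

Lowercasing first means each table needs only one lowercase entry per word, which is why the contraction key is written as "i'm".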

                    2.  Consider the following 4 documents in a corpus:
                       1.  Document 1: "I love programming."
                       2.  Document 2: "Programming is fun."
                       3.  Document 3: "I love coding."
                       4.  Document 4: "Coding is awesome."
                       Prepare the Term Frequency-Inverse Document Frequency (TF-IDF) for the given corpus.

                   Ans.
                                                              Term Frequency
                         Document         i        love     programming       is         fun       coding     awesome

                           Doc-1          1         1             1           0           0          0           0
                           Doc-2          0         0             1           1           1          0           0
                           Doc-3          1         1             0           0           0          1           0
                           Doc-4          0         0             0           1           0          1           1

                                                           Document Frequency

                                          i        love     programming       is         fun       coding     awesome
                                          2          2            2            2          1          2           1

                                                         Inverse Document Frequency

                                                     IDF = Total Number of Documents / Document Frequency

                                          i        love     programming       is         fun       coding     awesome
                                         4/2        4/2          4/2          4/2        4/1        4/2         4/1
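The three tables can be checked with a short Python sketch. It follows the IDF ratio exactly as defined above (total documents divided by document frequency). Many treatments go on to take the logarithm of this ratio; that step is not shown on this page and is left out here. The helper name `tokenize` is hypothetical.

```python
documents = [
    "I love programming.",
    "Programming is fun.",
    "I love coding.",
    "Coding is awesome.",
]

def tokenize(doc):
    """Lowercase the sentence, drop full stops, and split into words."""
    return doc.lower().replace(".", "").split()

tokenized = [tokenize(doc) for doc in documents]

# Vocabulary in order of first appearance, matching the table columns.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Term frequency: one row of counts per document.
tf = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]

# Document frequency: number of documents that contain each term.
df = [sum(1 for tokens in tokenized if term in tokens) for term in vocabulary]

# Inverse document frequency as the plain ratio used on this page.
idf = [len(documents) / count for count in df]

print(vocabulary)
print(tf)
print(df)
print(idf)
```

Terms that occur in only one document ("fun" and "awesome") get the largest ratio, 4/1, while terms shared by two documents get 4/2.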

                                                                          Natural Language Processing (Theory)  391