Question

2. (6 marks) Search and Information Retrieval

(a) (2 marks) A search engine returns 500 search results. There are 300 relevant documents in total, but only 200 of these have been retrieved by the system. Calculate the number of false positives and the recall of the system.

(b) (2 marks) Explain briefly what stop words are and how they are often used to improve the quality of the results of a search engine.

(c) (2 marks) Explain what is the purpose of cosine similarity within the context of the vector space model of information retrieval. Your answer does not need to include the formula of the cosine similarity, but it should explain the intuition behind the use of cosine similarity.

Explanation / Answer

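a) A false positive is a non-relevant document that the system retrieves.

Documents retrieved = 500

Relevant documents in total = 300

Relevant documents retrieved (true positives) = 200

False positives = 500 - 200 = 300

Recall = |A ∩ B| / |A|

where A = the set of relevant documents, so |A| = 300,

B = the set of retrieved documents, so |B| = 500 and |A ∩ B| = 200.

So Recall = 200/300

= 2/3 ≈ 0.67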
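
The arithmetic in part (a) can be checked with a short Python sketch. The variable names are illustrative and do not come from the original answer; the three counts are the ones given in the question.

```python
# Counts taken from part (a) of the question.
retrieved = 500            # documents returned by the search engine
relevant = 300             # relevant documents in the whole collection
relevant_retrieved = 200   # relevant documents among those returned

# A false positive is a retrieved document that is not relevant.
false_positives = retrieved - relevant_retrieved

# Recall is the fraction of all relevant documents that were retrieved.
recall = relevant_retrieved / relevant

print(false_positives)     # 300
print(round(recall, 3))    # 0.667
```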

                                                                    

b) Stop words are very common words that are normally filtered out before or after text is processed. They occur with very high frequency but carry little meaning on their own, for example: a, an, the, is, which, at. To avoid indexing these words, a text retrieval system often associates a stop list with a set of documents. A stop list is a set of words that are deemed irrelevant for retrieval. Removing stop words reduces the size of the index and improves efficiency. It also tends to improve result quality, because stop words match a huge number of documents and so do not help to separate relevant documents from non-relevant ones.
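
As a rough illustration of stop word filtering, the sketch below drops stop words from a query before it would be matched against an index. The stop list is just the handful of examples from the answer, and the function name `remove_stop_words` and the sample query are hypothetical; real systems use much longer lists and better tokenization.

```python
# A tiny illustrative stop list built from the examples above.
STOP_WORDS = {"a", "an", "the", "is", "which", "at"}


def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


query = "Which is the best restaurant at the airport"
terms = remove_stop_words(query)
print(terms)  # ['best', 'restaurant', 'airport']
```

Only the three content-bearing terms remain, so these are the only terms the index would need to store or look up.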

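c) In the vector space model, each document and each query is represented as a vector of term frequencies. Since similar documents are expected to have similar relative term frequencies, we can measure the similarity among a set of documents, or between a document and a query (often defined as a set of keywords), by comparing these vectors. Cosine similarity measures the cosine of the angle between two such vectors, so it compares their direction rather than their length, and a long document is not favored simply because it contains more words. For term-frequency vectors the value lies between 0 and 1. A value of 1 means the two vectors point in the same direction, that is, the documents have the same relative term frequencies. A value of 0 means the documents share no terms.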
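
The intuition in part (c) can be seen in a minimal sketch, assuming raw term counts as vector components and whitespace tokenization. The function name `cosine_similarity` and the sample texts are illustrative only; practical systems usually weight terms (for example with tf-idf) before comparing.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


doc = "search engine ranking"
same_direction = cosine_similarity(doc, doc + " " + doc)  # same text repeated twice
no_overlap = cosine_similarity(doc, "cooking recipes")    # no shared terms

print(same_direction)  # approximately 1.0
print(no_overlap)      # 0.0
```

Repeating a document doubles the length of its vector but not its direction, so the similarity stays at about 1, while two texts with no terms in common score 0.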
