Question
2. (6 marks) Search and Information Retrieval
(a) (2 marks) A search engine returns 500 search results. There are 300 relevant documents in total, but only 200 of these have been retrieved by the system. Calculate the number of false positives and the recall of the system.
(b) (2 marks) Explain briefly what stop words are and how they are often used to improve the quality of the results of a search engine.
(c) (2 marks) Explain what is the purpose of cosine similarity within the context of the vector space model of information retrieval. Your answer does not need to include the formula of the cosine similarity, but it should explain the intuition behind the use of cosine similarity.
Explanation / Answer
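a) A false positive is a non-relevant document that is retrieved.
Number of documents retrieved = 500
Relevant documents in total = 300
Relevant documents retrieved = 200
False positives = 500 - 200 = 300
Recall = |A ∩ B| / |A|
where A = set of relevant documents, |A| = 300
B = set of retrieved documents, |B| = 500
|A ∩ B| = relevant documents retrieved = 200
So Recall = 200/300
= 2/3 ≈ 0.67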
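The arithmetic in part a) can be checked with a few lines of Python. This is only a sketch using the counts given in the question; the variable names are illustrative and not part of the original answer.

```python
# Counts taken from the question; variable names are illustrative.
retrieved = 500           # documents returned by the search engine
relevant_total = 300      # relevant documents in the whole collection
relevant_retrieved = 200  # relevant documents among those returned

# False positives: documents that were retrieved but are not relevant.
false_positives = retrieved - relevant_retrieved

# Recall: fraction of all relevant documents that were retrieved.
recall = relevant_retrieved / relevant_total

print(false_positives)   # 300
print(round(recall, 3))  # 0.667
```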
b) Stop words are words that are normally filtered out before or after the processing of natural language text. They occur very frequently but carry very little meaning. For example: a, an, the, is, which, at, etc. To avoid indexing useless words, a text retrieval system often associates a stop list with a set of documents. A stop list is a set of words that are deemed irrelevant for search engine information retrieval. Removing stop words helps to reduce the size of the index (or data) file and improves efficiency. It also tends to improve result quality: stop words match a huge number of documents, so they do not help to distinguish relevant documents from non-relevant ones.
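As a rough illustration of the filtering described in part b), the sketch below drops stop words from a piece of text before it would be indexed. The stop list holds only the example words from the answer, and the function name `remove_stop_words` is hypothetical; real systems use much longer lists and more careful tokenization.

```python
# A tiny stop list built from the examples in the answer (an assumption;
# production stop lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "which", "at"}


def remove_stop_words(text):
    """Lowercase the text, split it on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


print(remove_stop_words("The cat is at the door"))  # ['cat', 'door']
```

Only the content-bearing terms `cat` and `door` remain, so these are the only terms that would be indexed or matched against a query.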
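c) Since similar documents are expected to have similar term frequencies, we can measure the similarity among a set of documents, or between a document and a query (often defined as a set of keywords), based on similar relative term occurrences in the frequency table. Cosine similarity measures the cosine of the angle between two document vectors. Because it depends on the angle and not on the length of the vectors, a long document and a short document with the same relative term frequencies still score as similar. With non-negative term frequencies, a cosine of 1 means the two vectors point in the same direction, so the documents have the same relative term frequencies; a cosine of 0 means the vectors are orthogonal, so the documents share no terms.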
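The intuition in part c) can be sketched with small term-frequency vectors. The function below is a minimal pure-Python version of cosine similarity, and the three example vectors are made up for illustration; returning 0.0 for an all-zero vector is a convention chosen here, not something stated in the answer.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


# Hypothetical term counts over a three-word vocabulary.
doc = [2, 1, 0]
longer_doc = [4, 2, 0]  # same proportions as doc, twice as long
unrelated = [0, 0, 3]   # shares no terms with doc

print(round(cosine_similarity(doc, longer_doc), 6))  # 1.0
print(cosine_similarity(doc, unrelated))             # 0.0
```

The first pair scores 1 even though one document is twice as long, which is the length-independence that makes the cosine a useful ranking measure in the vector space model.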