Question
1. Consider the following two simple documents. [50 points]
(A) precision is very very high
(B) high precision is very very very important
Assume the only stopwords are: "is", "am" and "are" in our system.
1) For each document, write down the normalized vector of term frequency of each term using the format term: value, term: value, term: value. Compute the cosine similarity of the two documents.
2) Consider the tf-idf weighting. For each document, write down the normalized vector of tf-idf weights. Compute the cosine similarity of the two documents.
3) Consider the query "precision is high". Transfer the query into vector space by using TF.IDF, then rank document A and B for given query using cosine similarity.
4) Now consider a vector space where each dimension is a word bigram instead of single term. For each document, write down the normalized vector of bigram frequency {bigram: value, bigram: value, ...}. Compute the cosine similarity between documents A and B.
5) For each document, write down the normalized vector of tf-idf weights where each dimension is a word bigram. Compute the cosine similarity between documents A and B.
Explanation / Answer
1) TF(Term Frequency) = (Number of times term t appears in a document) / (Total number of terms in the document).
For document 1 (A), the normalized term frequencies are as follows.
Stopwords are not included in the count,
so the total number of terms is 4 ("is" is a stopword, so it is not counted).
precision = 1/4 = 0.25
very = 2/4 = 0.5
high = 1/4 = 0.25
For document 2 (B), the normalized term frequencies are as follows.
The total number of terms is 6 (again, "is" is a stopword, so it is not counted).
high = 1/6 ≈ 0.167
precision = 1/6 ≈ 0.167
very = 3/6 = 0.5
important = 1/6 ≈ 0.167
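Part 1 also asks for the cosine similarity of the two term-frequency vectors. The following is a minimal Python sketch of that computation, assuming whitespace tokenization and lowercase matching; the helper names `term_frequencies` and `cosine` are illustrative and not part of the original answer.

```python
import math
from collections import Counter

# Stopword list taken from the question statement.
STOPWORDS = {"is", "am", "are"}

def term_frequencies(text):
    """Return {term: count / total terms} after dropping stopwords."""
    terms = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(terms)
    total = len(terms)
    return {t: c / total for t, c in counts.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero-length vectors
    return dot / (norm_u * norm_v)

tf_a = term_frequencies("precision is very very high")
tf_b = term_frequencies("high precision is very very very important")

print(tf_a)
print(tf_b)
print(round(cosine(tf_a, tf_b), 4))  # 0.9428
```

Dividing each count by the document length rescales the vector but does not change its direction, so the same cosine, 8 / (sqrt(6) * sqrt(12)) ≈ 0.943, results from the raw counts (1, 2, 1, 0) and (1, 3, 1, 1).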
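2) IDF (Inverse Document Frequency):
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
precision = log_e(2/2) = 0 ("precision" appears in both documents)
very = log_e(2/2) = 0 ("very" also appears in both documents; IDF counts the documents containing the term, not its total number of occurrences)
high = log_e(2/2) = 0
important = log_e(2/1) ≈ 0.693 ("important" appears only in document 2)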
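The IDF formula depends only on how many documents contain each term. A small Python sketch, assuming the unsmoothed natural-log definition given above and whitespace tokenization (the name `idf_table` is hypothetical):

```python
import math

# Stopword list and documents taken from the question statement.
STOPWORDS = {"is", "am", "are"}
DOCS = {
    "A": "precision is very very high",
    "B": "high precision is very very very important",
}

def idf_table(docs):
    """IDF(t) = ln(N / df(t)); df(t) is the number of documents containing t."""
    term_sets = [
        {w for w in text.lower().split() if w not in STOPWORDS}
        for text in docs.values()
    ]
    n_docs = len(term_sets)
    vocab = set().union(*term_sets)
    return {
        t: math.log(n_docs / sum(t in s for s in term_sets))
        for t in sorted(vocab)
    }

idf = idf_table(DOCS)
for term, value in idf.items():
    print(f"{term}: {value:.3f}")
```

Only "important" receives a nonzero weight, because every other term occurs in both documents.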
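For document 1
TF*IDF for precision = 0.25 * 0 = 0
very = 0.5 * 0 = 0 (the IDF of "very" is log_e(2/2) = 0, because it occurs in both documents)
high = 0.25 * 0 = 0
For document 2
TF*IDF for high = 0.167 * 0 = 0
precision = 0.167 * 0 = 0
very = 0.5 * 0 = 0
important = 0.167 * 0.693 ≈ 0.116 (the IDF of "important" is log_e(2/1) ≈ 0.693)
Every weight in document 1 is 0, so its tf-idf vector has zero length and the cosine similarity with document 2 is taken as 0 (strictly, it is undefined).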
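The tf-idf weights for both documents can be checked with a short Python sketch. It assumes TF = count / document length and IDF = ln(N / df) as defined in this answer, and it returns 0 for the cosine when either vector has zero length, which is a convention chosen here rather than something the question specifies. The function names are illustrative.

```python
import math
from collections import Counter

# Stopword list and documents taken from the question statement.
STOPWORDS = {"is", "am", "are"}
DOCS = {
    "A": "precision is very very high",
    "B": "high precision is very very very important",
}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tfidf_vectors(docs):
    """Return {doc name: {term: tf * idf}} with unsmoothed natural-log IDF."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for toks in tokens.values() for t in set(toks))
    vectors = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        vectors[name] = {
            t: (c / len(toks)) * math.log(n_docs / df[t])
            for t, c in counts.items()
        }
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # zero-length vector: similarity is undefined, reported as 0
    return dot / (norm_u * norm_v)

vecs = tfidf_vectors(DOCS)
print(vecs["A"])
print(vecs["B"])
print(cosine(vecs["A"], vecs["B"]))
```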
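3) For the query "precision is high", the stopword "is" is removed, so the query terms are "precision" and "high":
TF for precision = 1/2 = 0.5
high = 1/2 = 0.5
TF*IDF for
precision = 0.5 * 0 = 0
high = 0.5 * 0 = 0
The query vector is all zeros, so its length is sqrt(0 + 0) = 0 and its dot product with either document is 0.
So the cosine similarity with each document is taken as 0 (strictly, it is undefined for a zero-length vector), and documents A and B tie in the ranking.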
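The query ranking can be reproduced with the sketch below. It assumes the query is weighted with the IDF values derived from the two documents, that TF is count / length, and that a zero-length vector yields a similarity of 0; all helper names are hypothetical.

```python
import math
from collections import Counter

# Stopword list, documents, and query taken from the question statement.
STOPWORDS = {"is", "am", "are"}
DOCS = {
    "A": "precision is very very high",
    "B": "high precision is very very very important",
}
QUERY = "precision is high"

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tfidf(tokens, idf):
    """tf * idf per term; terms missing from the collection get weight 0."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined case, reported as 0 by convention
    return dot / (norm_u * norm_v)

doc_tokens = {name: tokenize(text) for name, text in DOCS.items()}
df = Counter(t for toks in doc_tokens.values() for t in set(toks))
idf = {t: math.log(len(doc_tokens) / n) for t, n in df.items()}

query_vec = tfidf(tokenize(QUERY), idf)
scores = {
    name: cosine(query_vec, tfidf(toks, idf))
    for name, toks in doc_tokens.items()
}
print(query_vec)
print(scores)
```

Both query terms occur in every document, so their IDF is 0 and neither document can be ranked above the other with this weighting.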
4) bigram frequency:
A word bigram is a pair of adjacent words. The bigram frequency counts how many times each such pair appears in a document.
{f1(p1,p2..),f2(p21,p22..)}
Keeping the stopword "is" in place, the two documents have three distinct bigrams in common:
{(precision, is), (is, very), (very, very)}
i.e. a count of 3.
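The bigram vectors and their cosine similarity can be sketched in Python as follows. This assumes, as in the bigram list above, that the stopword "is" is kept when forming bigrams; removing stopwords first would give different bigrams and a different result. The helper names are illustrative.

```python
import math
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs; stopwords are kept (an assumption)."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

bi_a = bigram_counts("precision is very very high")
bi_b = bigram_counts("high precision is very very very important")

shared = sorted(set(bi_a) & set(bi_b))
print(bi_a)
print(bi_b)
print(shared)
print(round(cosine(bi_a, bi_b), 4))  # 0.7071
```

Document A has four bigrams, each occurring once. Document B has five distinct bigrams, with (very, very) occurring twice. The dot product is 1 + 1 + 2 = 4, so the cosine is 4 / (2 * sqrt(8)) ≈ 0.707. Normalizing the counts by the number of bigrams in each document leaves this value unchanged.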