Question

1. Consider the following two simple documents. [50 points]

(A) precision is very very high

(B) high precision is very very very important

Assume the only stopwords are: "is", "am" and "are" in our system.

1) For each document, write down the normalized vector of term frequency of each term using the format term: value, term: value, term: value. Compute the cosine similarity of the two documents.

2) Consider the tf-idf weighting. For each document, write down the normalized vector of tf-idf weights. Compute the cosine similarity of the two documents.

3) Consider the query "precision is high". Transfer the query into vector space by using TF.IDF, then rank document A and B for the given query using cosine similarity.

4) Now consider a vector space where each dimension is a word bigram instead of a single term. For each document, write down the normalized vector of bigram frequency {bigram: value, bigram: value, ···}. Compute the cosine similarity between documents A and B.

5) For each document, write down the normalized vector of tf-idf weights where each dimension is a word bigram. Compute the cosine similarity between documents A and B.

Explanation / Answer

1) TF(Term Frequency) = (Number of times term t appears in a document) / (Total number of terms in the document).

For document 1 (A), the normalized term frequencies are as follows.

Stop words are not included in the count,

so the total number of terms is 4 ("is" is a stop word, so it is not counted).

Precision= 1/4= 0.25

very= 2/4= 0.5

high= 1/4= 0.25

For document 2 (B), the normalized term frequencies are as follows.

The total number of terms is 6 ("is" is a stop word, so it is not counted).

high = 1/6 ≈ 0.167

precision = 1/6 ≈ 0.167

very = 3/6 = 0.5

important = 1/6 ≈ 0.167
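
Part 1 also asks for the cosine similarity of the two term-frequency vectors. A minimal sketch of that calculation follows; the helper names normalized_tf and cosine are illustrative, not part of the original answer. Tokens are split on whitespace and the three stop words from the question are dropped.

```python
import math
from collections import Counter

STOPWORDS = {"is", "am", "are"}

def normalized_tf(text):
    """Return {term: count / total terms} after dropping stop words."""
    terms = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(terms)
    return {t: c / len(terms) for t, c in counts.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here; the cosine is undefined
    return dot / (norm_u * norm_v)

tf_a = normalized_tf("precision is very very high")
tf_b = normalized_tf("high precision is very very very important")
print(tf_a)
print(tf_b)
print(round(cosine(tf_a, tf_b), 4))  # 0.9428
```

By hand, using raw counts (the cosine does not change when a vector is rescaled): the dot product is 1*1 + 2*3 + 1*1 = 8, the lengths are sqrt(6) and sqrt(12), and 8 / sqrt(72) ≈ 0.943.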

IDF(Inverse Document Frequency):

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

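precision = ln(2/2) = 0 ("precision" appears in both of the two documents)

very = ln(2/2) = 0 ("very" also appears in both documents; IDF uses the number of documents containing the term, not the total number of occurrences)

high = ln(2/2) = 0

important = ln(2/1) ≈ 0.693

For document 1 (A)

TF * IDF for precision = 0.25 * 0 = 0

very = 0.5 * 0 = 0

high = 0.25 * 0 = 0

Every term of document 1 has weight 0, so its tf-idf vector is the zero vector.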
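
The tf-idf weights for both documents can be checked with a short script. This is a sketch under the definitions given above (natural-log IDF, with document frequency taken as the number of documents containing the term); the function names are illustrative.

```python
import math
from collections import Counter

STOPWORDS = {"is", "am", "are"}
DOCS = {
    "A": "precision is very very high",
    "B": "high precision is very very very important",
}

def tokens(text):
    """Lower-case, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tf(text):
    """Normalized term frequency: count / total terms."""
    terms = tokens(text)
    return {t: c / len(terms) for t, c in Counter(terms).items()}

def idf(docs):
    """IDF(t) = ln(N / number of documents containing t)."""
    n = len(docs)
    df = Counter(t for text in docs.values() for t in set(tokens(text)))
    return {t: math.log(n / d) for t, d in df.items()}

idf_weights = idf(DOCS)
tfidf = {
    name: {t: w * idf_weights[t] for t, w in tf(text).items()}
    for name, text in DOCS.items()
}
for name, vec in tfidf.items():
    print(name, {t: round(w, 3) for t, w in vec.items()})
```

With only two documents, every term that occurs in both gets IDF 0, so the only nonzero weight is "important" in document B (about 0.116). The cosine between the two tf-idf vectors then involves a zero-length vector and is undefined, which is commonly reported as 0.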

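3) For the query "precision is high", the cosine similarity is computed as follows.

After removing the stop word "is", the query terms are "precision" and "high".

For document 1 (A)

TF * IDF for

precision = 0.25 * 0 = 0

high = 0.25 * 0 = 0

The length of this weight vector is sqrt(0 + 0) = 0,

so the cosine similarity is taken to be 0 (strictly, the cosine is undefined for a zero-length vector). Both query terms have IDF 0, so the query vector itself is all zeros; document 2 (B) therefore also scores 0 and the two documents tie in the ranking.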
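
The ranking step can be sketched as follows. Assumptions made here for illustration: the query is weighted with the same TF * IDF scheme as the documents, terms not seen in the collection get weight 0, and a cosine involving a zero vector is reported as 0.

```python
import math
from collections import Counter

STOPWORDS = {"is", "am", "are"}
DOCS = {
    "A": "precision is very very high",
    "B": "high precision is very very very important",
}

def tokens(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tfidf(text, idf_weights):
    terms = tokens(text)
    # Terms never seen in the collection get weight 0 (an assumption).
    return {t: c / len(terms) * idf_weights.get(t, 0.0)
            for t, c in Counter(terms).items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here; the cosine is undefined
    return dot / (norm_u * norm_v)

# Document frequency counts each term once per document.
df = Counter(t for text in DOCS.values() for t in set(tokens(text)))
idf_weights = {t: math.log(len(DOCS) / d) for t, d in df.items()}

query_vec = tfidf("precision is high", idf_weights)
scores = {name: cosine(query_vec, tfidf(text, idf_weights))
          for name, text in DOCS.items()}
print(query_vec)
print(scores)
```

Both scores come out as 0.0, so this weighting cannot separate A from B for this query: in a two-document collection, any term shared by both documents carries no weight.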

4) bigram frequency:

A word bigram is a pair of adjacent words. The bigram frequency counts how many times each such pair of words appears in a document.

{f1(p1,p2..),f2(p21,p22..)}

Here, keeping the stop word "is" when forming bigrams, the number of bigrams the two documents have in common is 3:

{(precision, is), (is, very), (very, very)}

i.e. a count of 3.
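
The normalized bigram vectors and their cosine similarity can be sketched as below. This assumes stop words are kept when forming bigrams, as the bigrams listed above (which contain "is") suggest; removing stop words first would give different bigrams and a different score. The function names are illustrative.

```python
import math
from collections import Counter

def bigram_tf(text):
    """Return {(w1, w2): count / total bigrams} for adjacent word pairs."""
    words = text.lower().split()  # stop words are kept here (assumption)
    pairs = list(zip(words, words[1:]))
    return {p: c / len(pairs) for p, c in Counter(pairs).items()}

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here; the cosine is undefined
    return dot / (norm_u * norm_v)

bg_a = bigram_tf("precision is very very high")
bg_b = bigram_tf("high precision is very very very important")
shared = sorted(set(bg_a) & set(bg_b))
print(bg_a)
print(bg_b)
print(shared)
print(round(cosine(bg_a, bg_b), 4))  # 0.7071
```

Document A has 4 bigrams, each occurring once. Document B has 6 bigram occurrences, with (very, very) occurring twice. Using raw counts, the dot product is 1 + 1 + 2 = 4 and the lengths are 2 and sqrt(8), giving 4 / (2 * sqrt(8)) ≈ 0.707.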