
6.) Consider a document-term matrix, where tf ij is the frequency of the i th wo


Question

6.) Consider a document-term matrix, where $tf_{ij}$ is the frequency of the $i$th word (term) in the $j$th document and $m$ is the number of documents. Consider the variable transformation that is defined by

$$tf'_{ij} = tf_{ij} \times \log\frac{m}{df_i}$$

where $df_i$ is the number of documents in which the $i$th term appears and is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

(a) What is the effect of this transformation if a term occurs in one document?

In every document?

(b) What might be the purpose of this transformation?

Explanation / Answer

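a) If the term occurs in only one document, then $df_i = 1$ and the scaling factor is $\log(m/1) = \log m$, the largest value the factor can take. Every nonzero $tf_{ij}$ for that term is multiplied by $\log m$, so the term receives its maximum weight. (The factor is greater than 1, so the value is actually scaled up, only when $m$ is larger than the base of the logarithm.)

If the term appears in every document, then $df_i = m$ and the scaling factor is $\log(m/m) = \log 1 = 0$. The transformed value is 0 in every document, so the term is not merely scaled down but removed from consideration altogether.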
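As a worked illustration of these cases, assume a hypothetical collection of $m = 1000$ documents and a base-10 logarithm. The question does not fix the base, so this is only a convenient choice. The middle row is an extra case added for comparison:

```latex
\begin{align*}
df_i = 1:    &\quad tf'_{ij} = tf_{ij} \times \log_{10}\frac{1000}{1}    = 3\,tf_{ij} \\
df_i = 100:  &\quad tf'_{ij} = tf_{ij} \times \log_{10}\frac{1000}{100}  = 1\,tf_{ij} \\
df_i = 1000: &\quad tf'_{ij} = tf_{ij} \times \log_{10}\frac{1000}{1000} = 0
\end{align*}
```

The factor falls steadily from $\log m$ to 0 as the term spreads across more documents.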

Intuitively, a word that is rare and occurs in a single document is more likely to be significant to that document than a word that occurs in every document. A word that appears everywhere is probably a generic word with no significant relation to any particular document. The transformation therefore increases the weight of rare words and reduces the weight of generic words that are used in every document.

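b) The purpose of the transformation is to make the term weights reflect how well a term distinguishes one document from another, which raw term frequency alone does not capture. Terms concentrated in a few documents are emphasized, and terms common to all documents are suppressed.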
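The transformation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `idf_transform`, the small example matrix, and the choice of a base-2 logarithm are all assumptions made here. Rows are terms and columns are documents, matching the $tf_{ij}$ indexing in the question.

```python
import math

def idf_transform(tf):
    """Apply tf'_ij = tf_ij * log2(m / df_i) to a term-by-document matrix.

    tf is a list of rows, one row per term, one count per document.
    The base-2 logarithm is an assumption; the question leaves the base open.
    """
    m = len(tf[0])  # number of documents
    result = []
    for row in tf:
        # Document frequency: number of documents containing the term.
        df = sum(1 for count in row if count > 0)
        # A term that appears nowhere has no defined factor; keep it at zero.
        factor = math.log2(m / df) if df > 0 else 0.0
        result.append([count * factor for count in row])
    return result

# Hypothetical counts for three terms across four documents.
tf = [
    [3, 0, 0, 0],  # occurs in one document   -> factor log2(4/1) = 2
    [2, 1, 4, 1],  # occurs in every document -> factor log2(4/4) = 0
    [1, 0, 2, 0],  # occurs in two documents  -> factor log2(4/2) = 1
]

weighted = idf_transform(tf)
for row in weighted:
    print(row)
# [6.0, 0.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 0.0]
# [1.0, 0.0, 2.0, 0.0]
```

The second row shows the term that occurs in every document dropping to zero everywhere, while the term confined to one document gets the largest factor.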