Question

Write a function score_document(document, lang_counts=default_lang_counts) which takes as input a document name as a string and a dictionary of dictionaries containing normalised language counts called lang_counts. It should return a dictionary of scores for each language in lang_counts, as obtained by performing a 'dot product' of trigram counts from the document with the normalised language counts. That is, it should multiply the trigram counts from the document with the trigram counts in lang_counts and add the whole lot up. If a trigram from the document is not in the dictionary for a given language, assume the count for the language as zero.

We have provided a stub of code which trains the classifier for you. We have also provided train_classifier(training_set) in a hidden library.

    from hidden_lib import count_trigrams
    from hidden_lib import train_classifier

    # We train the classifier here
    default_lang_counts = train_classifier('train.csv')

    def score_document(document, lang_counts=default_lang_counts):
        # Your code here
        pass

There are also two files included, visible in the tabs at top right. These are en_163083.txt, written in English, and de_1231811.txt, written in German, and can be loaded and used to test your function, which should behave as follows:

    >>> test1 = 'en_163083.txt'
    >>> d = score_document(test1)
    >>> d['Vietnamese']
    9.427325768357315
    >>> max([(v, n) for (n, v) in d.items()])
    (21.428216914833023, 'English')
    >>> test2 = 'de_1231811.txt'
    >>> d = score_document(test2)
    >>> d['Vietnamese']
    7.710346556417009
    >>> max([(v, n) for (n, v) in d.items()])
    (53.12937809633241, 'German')
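
To make the 'dot product' concrete, here is a small sketch on made-up numbers. The trigram counts and the two language tables are invented for illustration and do not come from train.csv; the helper name dot is also hypothetical.

```python
# Toy data, invented for illustration only.
doc_counts = {"the": 2, "he ": 1, "sch": 3}
lang_counts = {
    "English": {"the": 0.5, "he ": 0.25},
    "German": {"sch": 0.5, "the": 0.125},
}

def dot(doc, lang):
    # Multiply matching counts and add them up; a trigram that is
    # absent from the language table contributes zero.
    return sum(n * lang.get(tri, 0) for tri, n in doc.items())

scores = {name: dot(doc_counts, table) for name, table in lang_counts.items()}
print(scores)  # {'English': 1.25, 'German': 1.75}
```

For English the sum is 2 * 0.5 + 1 * 0.25 + 3 * 0 = 1.25, because "sch" is missing from the English table. For German it is 2 * 0.125 + 1 * 0 + 3 * 0.5 = 1.75.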

Explanation / Answer

    from hidden_lib import count_trigrams
    from hidden_lib import train_classifier

    # Train the classifier once, when the module is loaded.
    default_lang_counts = train_classifier('train.csv')

    def score_document(document, lang_counts=default_lang_counts):
        """
        score_document takes the name of a document file and a dictionary
        of normalised trigram counts per language (lang_counts). It returns
        a dictionary mapping each language to the document's score.
        """
        # Read the file and count its trigrams.
        with open(document, 'r') as f:
            doc_text = f.read()
        doc_counts = count_trigrams(doc_text)

        score_dict = {}
        for lang in lang_counts:
            # Dot product of the document counts with this language's counts.
            # A trigram missing from the language's dictionary counts as zero.
            score = 0.0
            for trigram in doc_counts:
                score += doc_counts[trigram] * lang_counts[lang].get(trigram, 0)
            score_dict[lang] = score
        return score_dict
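
Because count_trigrams and train_classifier live in the hidden library, the function cannot be run outside the exercise environment. One possible way to check the scoring logic locally is sketched below. It assumes a trigram is any run of three consecutive characters; toy_count_trigrams, toy_lang_counts and score_text are hypothetical stand-ins, and the real helpers may normalise or tokenise text differently.

```python
from collections import Counter

def toy_count_trigrams(text):
    # Hypothetical stand-in for the hidden count_trigrams:
    # counts every run of three consecutive characters.
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Invented "normalised" tables, not derived from train.csv.
toy_lang_counts = {
    "English": {"the": 0.5, "ing": 0.25},
    "German": {"sch": 0.5, "ein": 0.25},
}

def score_text(text, lang_counts=toy_lang_counts):
    # Same scoring rule as score_document, applied to a string
    # instead of a file name so no files are needed.
    doc_counts = toy_count_trigrams(text)
    return {
        lang: sum(n * table.get(tri, 0) for tri, n in doc_counts.items())
        for lang, table in lang_counts.items()
    }

scores = score_text("the thing")
# Same idiom as in the question: take the largest (score, language) pair.
best = max((v, n) for n, v in scores.items())
print(best)  # (0.75, 'English')
```

Putting the score first in each tuple is what lets max pick the highest-scoring language; ties would be broken by language name.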