Question

This homework will refer to concepts that we discussed in class regarding how to use pre-tagged corpora.

1. Load the tagged words version of the treebank corpus (nltk.corpus.treebank). Write code to find the most common tags, first using the pre-defined treebank tagset, and then again with the universal tagset.

2. Again using the treebank tagged words corpus, use bigrams to determine the most common part of speech that occurs after the part of speech 'DET' (determiner, a universal-tagset tag).

3. Using a method similar to what was presented in class to find V-“to”-V sequences, search the treebank tagged corpus to find “to”-ADVERB-VERB sequences. You can do this using just the treebank tagged words corpus or (more similar to what we did in class) the treebank tagged sentence corpus.

4. This question does not involve tagging, but it does involve corpus searching with bigrams. Load the text of Moby Dick from the gutenberg corpus (nltk.corpus.gutenberg.words('melville-moby_dick.txt')). Using bigrams, find all words that precede 'whale' in the text.

Explanation / Answer

import nltk
from nltk.corpus import treebank

# 1. Most common tags, using the default Penn Treebank tagset
tree_tagged = treebank.tagged_words()
tag_fd = nltk.FreqDist(tag for (word, tag) in tree_tagged)
print(tag_fd.most_common())

# Same count again, using the universal tagset
tree_tagged = treebank.tagged_words(tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in tree_tagged)
print(tag_fd.most_common())

# 2. Most common part of speech after 'DET'
# (tree_tagged holds the universal-tagset words at this point)
word_tag_pairs = nltk.bigrams(tree_tagged)
det_suc = [b[1] for (a, b) in word_tag_pairs if a[1] == 'DET']
fdist = nltk.FreqDist(det_suc)
# Tags that follow DET, most frequent first; the first entry is the answer
print([tag for (tag, _) in fdist.most_common()])
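The same DET-successor count can be sketched without NLTK, which makes the logic easy to check on a small hand-made sample. The function name `most_common_follower` and the toy sentence are assumptions introduced for illustration, not part of the original answer; in practice the argument would be `treebank.tagged_words(tagset='universal')`.

```python
from collections import Counter

def most_common_follower(tagged_words, target_tag):
    """Return (tag, count) for the tag that most often follows target_tag, or None."""
    tags = [tag for (_, tag) in tagged_words]
    # zip(tags, tags[1:]) yields consecutive tag pairs, i.e. tag bigrams
    followers = Counter(nxt for (cur, nxt) in zip(tags, tags[1:]) if cur == target_tag)
    return followers.most_common(1)[0] if followers else None

# Hypothetical toy data in universal-tagset style
toy = [('the', 'DET'), ('old', 'ADJ'), ('man', 'NOUN'), ('saw', 'VERB'),
       ('a', 'DET'), ('whale', 'NOUN'), ('and', 'CONJ'),
       ('the', 'DET'), ('ship', 'NOUN')]
print(most_common_follower(toy, 'DET'))  # ('NOUN', 2)
```

Returning only the top entry answers the question directly, whereas the full ranked list shows how close the runners-up are.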

#"TO"-adverb-verb sequences using trigrams

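# treebank.tagged_sents() uses Penn Treebank tags by default, where "to" is TO,
# adverbs are RB/RBR/RBS, and verbs are VB/VBD/VBG/VBN/VBP/VBZ.
# (ADV is a universal-tagset tag and never appears alongside TO.)
def process(sentence):
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1 == 'TO' and t2.startswith('RB') and t3.startswith('VB'):
            print(w1, w2, w3)

for tagged_sent in treebank.tagged_sents():
    process(tagged_sent)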
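Question 3 also allows searching the flat tagged-words list instead of sentences. A minimal sketch of that variant follows, assuming Penn Treebank tags (TO for "to", RB* for adverbs, VB* for verbs). The function name `find_to_adv_verb` and the sample data are hypothetical, introduced only for illustration. A flat word list can also match sequences that span a sentence boundary, which the sentence-based version avoids.

```python
def find_to_adv_verb(tagged_words):
    """Return (to, adverb, verb) word triples from a flat sequence of (word, tag) pairs."""
    tagged = list(tagged_words)  # allow slicing even if a lazy corpus view is passed
    matches = []
    # Three offset slices zipped together give consecutive trigrams
    for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        if t1 == 'TO' and t2.startswith('RB') and t3.startswith('VB'):
            matches.append((w1, w2, w3))
    return matches

# Hypothetical toy data with Penn Treebank tags
toy = [('They', 'PRP'), ('plan', 'VBP'), ('to', 'TO'), ('quickly', 'RB'),
       ('sell', 'VB'), ('shares', 'NNS'), ('to', 'TO'), ('investors', 'NNS')]
print(find_to_adv_verb(toy))  # [('to', 'quickly', 'sell')]
```

With NLTK and the corpus installed, `find_to_adv_verb(treebank.tagged_words())` would run the same search over the treebank sample.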

# 4. Words that precede 'whale' in Moby Dick, using bigrams
moby_words = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
word_pairs = nltk.bigrams(moby_words)
whale_preceders = [a for (a, b) in word_pairs if b == 'whale']
print(whale_preceders)

# Optional: list the distinct preceding words, most frequent first
#fdist = nltk.FreqDist(whale_preceders)
#print([word for (word, _) in fdist.most_common()])
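The comparison above is case-sensitive, so 'Whale' and 'WHALE' are skipped. If those should count as well, one possible variant lowercases each token before comparing and tallies the preceding words. The function name `preceders` and the token list are assumptions for illustration; the real input would be the token list returned by `gutenberg.words(...)`.

```python
from collections import Counter

def preceders(tokens, target):
    """Count the words that immediately precede target, ignoring case."""
    tokens = [t.lower() for t in tokens]
    target = target.lower()
    # Pair each token with its successor and keep those followed by the target
    return Counter(prev for (prev, cur) in zip(tokens, tokens[1:]) if cur == target)

# Hypothetical toy token list
toy = ['The', 'white', 'whale', 'and', 'the', 'Sperm', 'Whale', ';',
       'a', 'white', 'WHALE', '.']
counts = preceders(toy, 'whale')
print(counts.most_common())  # [('white', 2), ('sperm', 1)]
```

Whether to fold case depends on the assignment's intent; the question asks about 'whale' literally, so the case-sensitive version may be what is expected.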
