Question
This homework will refer to concepts that we discussed in class regarding how to use pre-tagged corpora.
1. Load the tagged words version of the treebank corpus (nltk.corpus.treebank). Write code to find the most common tags, first using the pre-defined treebank tagset, and then again with the universal tagset.
2. Again using the treebank tagged words corpus, use bigrams to determine the most common part of speech that occurs after the part of speech ‘DET.’
3. Using a method similar to what was presented in class to find V-“to”-V sequences, search the treebank tagged corpus to find “to”-ADVERB-VERB sequences. You can do this using just the treebank tagged words corpus or (more similar to what we did in class) the treebank tagged sentence corpus.
4. This question does not involve tagging, but it does involve corpus searching with bigrams. Load the text of Moby Dick from the gutenberg corpus (nltk.corpus.gutenberg.words('melville-moby_dick.txt')). Using bigrams, find all words that precede ‘whale’ in the text.
Explanation / Answer
from nltk.corpus import treebank
import nltk
#find most common tags
tree_tagged = treebank.tagged_words()
tag_fd = nltk.FreqDist(tag for (word, tag) in tree_tagged)
print(tag_fd.most_common())
#using universal tagset
tree_tagged = treebank.tagged_words(tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in tree_tagged)
print(tag_fd.most_common())
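# Use bigrams to determine the most common part of speech that occurs after 'DET'.
# Note: 'DET' is a universal-tagset label, so tree_tagged must be the universal version loaded above.
word_tag_pairs = nltk.bigrams(tree_tagged)
det_suc = [b[1] for (a, b) in word_tag_pairs if a[1] == 'DET']
fdist = nltk.FreqDist(det_suc)
# Tags are listed from most to least frequent; the first entry is the most common successor.
print([tag for (tag, _) in fdist.most_common()])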
# "to"-ADVERB-VERB sequences using trigrams over the tagged sentences.
# tagged_sents() defaults to the Penn Treebank tagset, where adverbs are RB* and verbs are VB*.
def process(sentence):
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1 == 'TO' and t2.startswith('RB') and t3.startswith('VB'):
            print(w1, w2, w3)

for tagged_sent in treebank.tagged_sents():
    process(tagged_sent)
# Using bigrams, find all words that precede 'whale' in the text of Moby Dick.
fi = nltk.corpus.gutenberg.words('melville-moby_dick.txt')
word_pairs = nltk.bigrams(fi)
whale_preceders = [a for (a, b) in word_pairs if b == 'whale']
print(whale_preceders)
# Optional: rank the preceding words by frequency instead of printing every occurrence.
#fdist = nltk.FreqDist(whale_preceders)
#print([word for (word, _) in fdist.most_common()])
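
The bigram and trigram searches above can be sanity-checked without downloading any corpus. The following is a minimal sketch using only the standard library: `ngrams` is a hypothetical stand-in for nltk.bigrams and nltk.trigrams, the helper names are made up for illustration, and `SAMPLE` is a hand-tagged sentence with Penn-style tags, not treebank data.

```python
from collections import Counter


def ngrams(seq, n):
    """Yield consecutive n-tuples from seq (stand-in for nltk.bigrams / nltk.trigrams)."""
    seq = list(seq)
    return zip(*(seq[i:] for i in range(n)))


# Hand-made sample with Penn-style tags; an assumption for illustration only.
SAMPLE = [
    ("The", "DT"), ("board", "NN"), ("wants", "VBZ"), ("to", "TO"),
    ("quickly", "RB"), ("approve", "VB"), ("the", "DT"), ("new", "JJ"),
    ("plan", "NN"), (".", "."),
]


def successor_tags(tagged, target):
    """Count the tags that immediately follow a token tagged `target`."""
    return Counter(b[1] for a, b in ngrams(tagged, 2) if a[1] == target)


def to_adverb_verb(tagged):
    """Collect word triples whose tags match TO, RB*, VB*."""
    return [
        (w1, w2, w3)
        for (w1, t1), (w2, t2), (w3, t3) in ngrams(tagged, 3)
        if t1 == "TO" and t2.startswith("RB") and t3.startswith("VB")
    ]


def preceders(words, target):
    """List every word that directly precedes `target`, ignoring case."""
    return [a for a, b in ngrams(words, 2) if b.lower() == target]


print(successor_tags(SAMPLE, "DT").most_common())
print(to_adverb_verb(SAMPLE))
print(preceders(["The", "whale", "and", "a", "Whale"], "whale"))
```

Unlike the corpus code, `preceders` lowercases each word before comparing, so it also catches capitalized forms such as 'Whale'. Whether that is wanted depends on how the assignment defines a match.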