Python I am trying to build a bigram model and to calculate the probability of w
Question
Python
I am trying to build a bigram model and to calculate the probability of word occurrence. I should:
1. Select an appropriate data structure to store bigrams.
2. Increment counts for a combination of word and previous word. This means I need to keep track of what the previous word was.
3. Compute the probability of the current word based on the previous word count.
P(curr word | prev word) = count(prev word, curr word) / count(prev word)
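The steps and the formula above can be sketched roughly as follows. This is only an illustration under a few assumptions: the function name `bigram_probabilities`, the `start` marker, and the sample sentence are hypothetical, and the denominator counts how often a word appeared as a previous word, which equals its plain word count except for the very last token of the input.

```python
from collections import Counter

def bigram_probabilities(tokens, start="START"):
    """Return {(prev, curr): P(curr | prev)} for a flat list of tokens."""
    bigram_counts = Counter()  # (prev, curr) -> how often the pair was seen
    prev_counts = Counter()    # prev -> how often it served as the previous word
    prev = start
    for token in tokens:
        curr = token.lower()
        bigram_counts[(prev, curr)] += 1
        prev_counts[prev] += 1
        prev = curr  # remember the current word for the next step
    return {pair: n / prev_counts[pair[0]] for pair, n in bigram_counts.items()}

probs = bigram_probabilities("the cat sat on the mat".split())
print(probs[("the", "cat")])  # 0.5
```

Keying the counts by a `(prev, curr)` tuple avoids having to split a joined string later to recover the previous word.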
Suppose we observed the following word sequences:
finger remarked
finger on
finger on
finger in
finger .
Notice that "finger on" was observed twice. Also, notice that the period is treated as a separate word. Given the information in this data structure, count(finger, on) = 2 and count(finger) = 5, so we can compute the probability P(on | finger) as 2/5 = 0.4.
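The arithmetic in this example can be checked with a few lines of Python. The variable names here are only illustrative, and the five pairs are the ones listed above.

```python
from collections import Counter

# The five observed (previous word, current word) pairs from the example.
observed = [
    ("finger", "remarked"),
    ("finger", "on"),
    ("finger", "on"),
    ("finger", "in"),
    ("finger", "."),
]

pair_counts = Counter(observed)                      # count(prev, curr)
prev_counts = Counter(prev for prev, _ in observed)  # count(prev)

p_on = pair_counts[("finger", "on")] / prev_counts["finger"]
print(p_on)  # 0.4
```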
Here is what I got so far:
filename = 'blah-blah.txt'
bigrams = {}
unigrams = {}
prev_word = "START"
# opening the filename in read mode
fp = open(filename, "r")
for line in fp:
    words = line.split()
    for word in words:
        word = word.lower()
        bigram = prev_word + ' ' + word
        #print(bigram)
        if word in unigrams:
            unigrams[word] += 1
        else:
            unigrams[word] = 1
        #print(unigrams[word])
        if bigram in bigrams:
            bigrams[bigram] += 1
        else:
            bigrams[bigram] = 1
        prev_word = word
output_file = 'bigram_probs.txt'
with open(output_file, "w") as fs:
    for key, value in sorted(bigrams.items()):
        prob = value / unigrams[word]
        fs.write(key + ": " + str(prob) + " ")
My program works, but I am not sure if it does what it should do. I appreciate any help!
Explanation / Answer
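The denominator has to be the count of the previous word of each bigram. In the original output loop, unigrams[word] always refers to the last word read from the file, so every bigram is divided by the same number. The version below takes the previous word from the bigram key, counts START once so that the first bigram has a denominator, and writes one result per line. Because prev_word is not reset at the end of a line, bigrams such as "remarked finger" span line breaks.
Code:
filename = 'blah-blah.txt'
bigrams = {}
# START is counted once because it precedes the first word of the file
unigrams = {"START": 1}
prev_word = "START"
# open the file in read mode; the with block closes it afterwards
with open(filename, "r") as fp:
    for line in fp:
        words = line.split()
        for word in words:
            word = word.lower()
            bigram = prev_word + ' ' + word
            # count the current word
            if word in unigrams:
                unigrams[word] += 1
            else:
                unigrams[word] = 1
            # count the (previous word, current word) pair
            if bigram in bigrams:
                bigrams[bigram] += 1
            else:
                bigrams[bigram] = 1
            prev_word = word
output_file = 'bigram_probs.txt'
with open(output_file, "w") as fs:
    for key, value in sorted(bigrams.items()):
        # the first token of the key is the previous word
        prev = key.split()[0]
        prob = float(value) / unigrams[prev]
        fs.write(key + ": " + str(prob) + "\n")
Input File:
finger remarked
finger on
finger on
finger in
finger
Output File:
START finger: 1.0
finger in: 0.2
finger on: 0.4
finger remarked: 0.2
in finger: 1.0
on finger: 1.0
remarked finger: 1.0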
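As an alternative to the string-keyed dictionaries above, the counts could be kept in a nested mapping from each previous word to a Counter of the words that followed it. The sketch below is one possible layout, not the only one: the helper names `count_bigrams` and `probability` are hypothetical, and an in-memory `io.StringIO` stands in for the input file so the example runs without any file on disk. It uses the five sequences from the question, including the period.

```python
import io
from collections import Counter, defaultdict

def count_bigrams(lines, start="START"):
    """Map each previous word to a Counter of the words that followed it."""
    followers = defaultdict(Counter)
    prev = start
    for line in lines:
        for word in line.split():
            word = word.lower()
            followers[prev][word] += 1
            prev = word
    return followers

def probability(followers, prev, curr):
    """P(curr | prev), or 0.0 if prev was never seen as a previous word."""
    counts = followers.get(prev, {})
    total = sum(counts.values())
    return counts.get(curr, 0) / total if total else 0.0

sample = io.StringIO("finger remarked\nfinger on\nfinger on\nfinger in\nfinger .\n")
followers = count_bigrams(sample)
print(probability(followers, "finger", "on"))  # 0.4
```

With this layout the denominator is the sum of one inner Counter, so no separate unigram dictionary has to be kept in sync with the bigram counts.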