You will build your language model from a given set of example texts.


Question

You will build your language model from a given set of example texts. As the model is based on trigram counts, you must count how many times triples of consecutive words appear in each example text. Words should be treated case-sensitively, meaning "she" and "She" should be considered two different words. And, although the example texts may contain punctuation, you should not treat it specially. That is, if the file contains the phrase "he, she, I", then you can consider the first word as "he,", the second as "she," and the third as "I". Said another way, process your example files as if they contained no punctuation, and consider the two words "she" and "she," as two different words.
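One way to get this behaviour is to split the text on whitespace only and leave everything else untouched. The following is a minimal sketch, not part of the assignment; the helper name tokenize is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split text on whitespace only.
// Punctuation stays attached to its word and letter case is preserved,
// so "she", "She" and "she," come out as three different words.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    // operator>> skips any run of spaces, tabs, and newlines between words
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}
```

For the phrase "he, she, I" this yields the three words "he,", "she," and "I", which matches the treatment described above.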

You must write a C++ program which, when built, creates an executable file named hw7a that takes two command-line arguments. The first argument is the name of a text file containing a list of input filenames.

In order to treat the beginning and end of your example files meaningfully during Part B, you will include in the model you create in Part A the special words "<START_1>", "<START_2>" (to indicate the start of each document), and "<END_1>", "<END_2>" (to indicate the end of each document). In particular, suppose your example text begins with words a b and ends with words c d. Then you must add into your model the four trigrams

"<START_1>", "<START_2>", a
"<START_2>", a, b
c, d, "<END_1>"
d, "<END_1>", "<END_2>"

And you will need to add four similar trigrams for each example text that you process.
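A possible way to produce the boundary trigrams without special cases is to pad each document's word list with the marker words and then slide a window of three over the padded list. This is a sketch under assumptions: the names Trigram, TrigramCounts and addDocument are hypothetical, and the behaviour for files with fewer than two words is a choice made here, since the assignment text does not define it.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical type names for this sketch.
using Trigram = std::tuple<std::string, std::string, std::string>;
using TrigramCounts = std::map<Trigram, int>;

// Count every trigram of one document, including the four boundary trigrams.
// Assumption: very short documents are padded the same way as any other.
void addDocument(const std::vector<std::string>& words, TrigramCounts& counts) {
    std::vector<std::string> padded;
    padded.push_back("<START_1>");
    padded.push_back("<START_2>");
    padded.insert(padded.end(), words.begin(), words.end());
    padded.push_back("<END_1>");
    padded.push_back("<END_2>");

    // i + 2 < size() avoids unsigned underflow and includes the last window
    for (std::size_t i = 0; i + 2 < padded.size(); ++i) {
        // operator[] creates a missing entry with count 0 before the increment
        ++counts[Trigram(padded[i], padded[i + 1], padded[i + 2])];
    }
}
```

Calling addDocument once per example text accumulates counts across files, so two texts that start with the same word contribute 2 to the same "<START_1>", "<START_2>" trigram.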

Each time your program is run, it should build your trigram-based language model by processing each text file specified in the input filename list. What happens after that will depend on the second argument specified at the command line. The second argument is a single letter, and should be one of "a", "r", or "c". Your program should output to the C++ standard output stream (cout) the language model you created, ordering entries as specified by the argument letter as follows:

a - forward alphabetical order. This means that trigrams are output in alphabetical order by the first word in each trigram, using the alphabetical order of the second and then third word in each trigram to break ties.

r - reverse alphabetical order. This means that trigrams are output in descending alphabetical order by the first word in each trigram, using the descending alphabetical order of the second and then third word in each trigram to break ties.

c - count order. This means that trigrams are output in ascending order by frequency, using forward alphabetical ordering of first words and then second and then third words to break ties.
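The three orderings can be sketched as follows, assuming the counts are held in a std::map keyed by a tuple of three strings. The names Trigram, Entry and ordered are hypothetical. "Alphabetical" is taken here to mean the byte-wise comparison of std::string, which puts "<" and uppercase letters before lowercase letters, as the expected output below does.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical type names for this sketch.
using Trigram = std::tuple<std::string, std::string, std::string>;
using Entry = std::pair<Trigram, int>;

// Return the model's entries in the order requested by mode ('a', 'r' or 'c').
std::vector<Entry> ordered(const std::map<Trigram, int>& counts, char mode) {
    // A std::map iterates in ascending key order, which is already order 'a'.
    std::vector<Entry> entries(counts.begin(), counts.end());
    if (mode == 'r') {
        std::reverse(entries.begin(), entries.end());
    } else if (mode == 'c') {
        std::sort(entries.begin(), entries.end(),
                  [](const Entry& x, const Entry& y) {
                      if (x.second != y.second) {
                          return x.second < y.second;  // lower count first
                      }
                      return x.first < y.first;        // tie: word order
                  });
    }
    return entries;
}
```

The explicit tie-break matters because std::sort is not a stable sort: comparing by count alone leaves entries with equal counts in an unspecified order.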

Your output will consist of one trigram with associated count per line. On a given line, the 4 outputs (trigramWord1, trigramWord2, trigramWord3, and count) should be separated by single spaces.

Example

Suppose the list of training texts input for your program resides in a file named tiny_ex.txt, and the contents of the file are names of text files containing excerpts from Dr. Seuss books as follows:
sl.txt

ge.txt

For the command ./hw7a tiny_ex.txt a, the expected output is:

<START_1> <START_2> I 1
<START_1> <START_2> they’ve 1
<START_2> I do 1
<START_2> they’ve talked 1
Clause. <END_1> <END_2> 1
I do not 2
Santa Clause. <END_1> 1
a lot about 1
about flaws. they’ve 1
about gauze. they’ve 1
about laws and 1
about old Santa 1
about paws and 1
and they’ve talked 2
anywhere <END_1> <END_2> 1
do not like 2
flaws. they’ve talked 1
gauze. they’ve talked 1
here or there 1
laws and they’ve 1
like them anywhere 1
like them here 1
lot about old 1
not like them 2
old Santa Clause. 1
or there I 1
paws and they’ve 1
quite a lot 1
talked about flaws. 1
talked about gauze. 1
talked about laws 1
talked about paws 1
talked quite a 1
them anywhere <END_1> 1
them here or 1
there I do 1
they’ve talked about 4
they’ve talked quite 1

For the command ./hw7a tiny_ex.txt c, the expected output is:

<START_1> <START_2> I 1
<START_1> <START_2> they’ve 1
<START_2> I do 1
<START_2> they’ve talked 1
Clause. <END_1> <END_2> 1
Santa Clause. <END_1> 1
a lot about 1
about flaws. they’ve 1
about gauze. they’ve 1
about laws and 1
about old Santa 1
about paws and 1
anywhere <END_1> <END_2> 1
flaws. they’ve talked 1
gauze. they’ve talked 1
here or there 1
laws and they’ve 1
like them anywhere 1
like them here 1
lot about old 1
old Santa Clause. 1
or there I 1
paws and they’ve 1
quite a lot 1
talked about flaws. 1
talked about gauze. 1
talked about laws 1
talked about paws 1
talked quite a 1
them anywhere <END_1> 1
them here or 1
there I do 1
they’ve talked quite 1
I do not 2
and they’ve talked 2
do not like 2
not like them 2
they’ve talked about 4

Explanation / Answer

// Comments are placed at the appropriate points in the code.

// Sample output follows the code.

#include <iostream>
#include <tuple>
#include <string>
#include <fstream>
#include <sstream>
#include <map>
#include <vector>
#include <algorithm>
using namespace std;

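/*
* This comparison function is used for option "c".
* It orders entries by ascending count and breaks ties by forward
* alphabetical order of the first, then second, then third word.
*/
bool my_compare(const tuple<string,string,string,int>& t1, const tuple<string,string,string,int>& t2)
{
    int value1 = get<3>(t1);
    int value2 = get<3>(t2);
    if (value1 != value2)
        return value1 < value2;
    // equal counts: tuple comparison is lexicographic, so the words decide
    return t1 < t2;
}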

int main(int argc, char* argv[])
{
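    // accept inputs: the filename list and the ordering option (a, r, or c)
    if(argc != 3) {
        cerr << "usage: " << argv[0] << " <filename_list> <a|r|c>" << endl;
        return 1;
    }
    string filename = argv[1];
    string option = argv[2];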

    string line,fname;
    char delim = ' ';
    vector<string> v;

    map<tuple<string,string,string>, int> m;
    map<tuple<string,string,string>, int>::iterator m_itr;
    tuple<string,string,string> t;
    int value;

    string start_1 = "<START_1>";
    string start_2 = "<START_2>";
    string end_1 = "<END_1>";
    string end_2 = "<END_2>";

    vector<tuple<string,string,string,int> > freq_vec;
    tuple<string,string,string,int> freq_t;
    ifstream ip(filename);

    // read the input file line by line, each line contains the text file name
    while(getline(ip, line)) {
        fname = line;
        ifstream tfile(fname);
        // read the text file word by word and store the words in vector v;
        // operator>> splits on any whitespace, so repeated spaces and tabs
        // do not produce empty words
        string item;
        while(tfile >> item) {
            v.push_back(item);
        }
        // vector v contains all the strings in text file
        // construct map for trigrams

        // the assignment does not define trigrams for a file with fewer
        // than two words, so such a file is skipped here
        if(v.size() < 2) {
            v.clear();
            continue;
        }

        // first two trigrams; counts are incremented rather than set to 1
        // so that files sharing the same opening words are all counted
        ++m[make_tuple(start_1,start_2,v[0])];
        ++m[make_tuple(start_2, v[0], v[1])];

        // loop through the words and count every trigram; operator[] inserts
        // a new trigram with count 0 before the increment
        for(size_t i=0;i+2<v.size();i++) {
            ++m[make_tuple(v[i],v[i+1],v[i+2])];
        }
        // last two trigrams
        ++m[make_tuple(v[v.size()-2], v[v.size()-1], end_1)];
        ++m[make_tuple(v[v.size()-1], end_1, end_2)];
        v.clear();
    }
    // map m now contains the modelled language of trigrams and
    // the map is sorted based on key

    // Now output based on option provided
    if(option.compare("a") == 0) {
        // if option a is provided, then output map in ascending order
        for(auto itr=m.begin();itr!=m.end();itr++) {
            t = itr->first;
            value = itr->second;
            cout << get<0>(t) << " " << get<1>(t) << " " << get<2>(t) << " " << value << endl;
        }
    } else if(option.compare("r") == 0) {
        // if option r is provided, then output map in descending order
        // we use rbegin(), rend()
        for(auto itr=m.rbegin();itr!=m.rend();itr++) {
            t = itr->first;
            value = itr->second;
            cout << get<0>(t) << " " << get<1>(t) << " " << get<2>(t) << " " << value << endl;
        }
    } else if(option.compare("c") == 0) {
        // process option c for frequency based output
        // insert key and values into vector of tuples
        for(auto itr=m.begin();itr!=m.end();itr++) {
            t = itr->first;
            value = itr->second;
            freq_t = make_tuple(get<0>(t), get<1>(t), get<2>(t), value);
            freq_vec.push_back(freq_t);
        }
        // sort this vector based on value (in map)
        sort(freq_vec.begin(),freq_vec.end(),my_compare);
        // output vector in ascending order
        for(int i=0;i<freq_vec.size();i++) {
            freq_t = freq_vec[i];
            cout << get<0>(freq_t) << " " << get<1>(freq_t) << " " << get<2>(freq_t) << " " << get<3>(freq_t) << endl;
        }
    }
    return 0;
}

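Sample output, from a run on two small test files rather than the Dr. Seuss excerpts. This run prints the closing marker as END_2>, omits the last all-word trigram of each file ("for sl file." and "for ge files."), and its c ordering does not break ties alphabetically: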

kali@kali:~/coding_que$ vim tiny_ex.txt
kali@kali:~/coding_que$ vim sl.txt
kali@kali:~/coding_que$ vim ge.txt
kali@kali:~/coding_que$ g++ hw7a.cpp -std=c++11 -o hw7a

kali@kali:~/coding_que$ ./hw7a tiny_ex.txt a
<START_1> <START_2> Here 1
<START_1> <START_2> This 1
<START_2> Here is 1
<START_2> This is 1
Here is the 1
This is test 1
containt for ge 1
content for sl 1
file. <END_1> END_2> 1
files. <END_1> END_2> 1
ge files. <END_1> 1
is test content 1
is the test 1
sl file. <END_1> 1
test containt for 1
test content for 1
the test containt 1

kali@kali:~/coding_que$ ./hw7a tiny_ex.txt r
the test containt 1
test content for 1
test containt for 1
sl file. <END_1> 1
is the test 1
is test content 1
ge files. <END_1> 1
files. <END_1> END_2> 1
file. <END_1> END_2> 1
content for sl 1
containt for ge 1
This is test 1
Here is the 1
<START_2> This is 1
<START_2> Here is 1
<START_1> <START_2> This 1
<START_1> <START_2> Here 1

kali@kali:~/coding_que$ ./hw7a tiny_ex.txt c
file. <END_1> END_2> 1
the test containt 1
test content for 1
test containt for 1
sl file. <END_1> 1
is the test 1
is test content 1
ge files. <END_1> 1
files. <END_1> END_2> 1
<START_1> <START_2> Here 1
content for sl 1
containt for ge 1
This is test 1
Here is the 1
<START_2> This is 1
<START_2> Here is 1
<START_1> <START_2> This 1

kali@kali:~/coding_que$
