Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

Zipf’s Law is a curious observation about the statistical distribution of words

ID: 3600587 • Letter: Z

Question

Zipf’s Law is a curious observation about the statistical distribution of words in text: the frequency of any word is inversely proportional to its rank in the frequency table. Frequency is the number of times a word appears in the text. When words are ranked according to frequency, the most frequent word is given rank 1, the next most frequent word is given rank 2, etc. Ties are handled by assigning the middle value in a range of ranks. For example, in a two-way tie for most frequent word, each word is given the rank of 1.5.

Write a C++ program to investigate the validity of Zipf’s Law. Read in words from a text file, counting their frequency of occurrence. Print out the words sorted by frequency (highest frequency first). Within each frequency group, sort the words alphabetically. Then print a table that lists word rank, frequency, and rank * frequency to determine whether Zipf’s Law holds true. According to Zipf’s Law, the product of rank and frequency should be (roughly) constant.

Test your program on this sample textfile:

This is a pile of nonsense, of no particular importance other than that of demonstrating how this program is supposed to work. Tra la la la la, la la la la.

Implementation Details

The input text file consists of a stream of ASCII characters that you must parse into words. For this assignment, restrict words to strings of letters, possibly containing embedded single quotes (contractions). Words are separated from the surrounding text by anything that is not a letter. Make your program case-insensitive by converting all words to lower case. (Hint: the strtok() routine is useful for parsing text into words.)

When a great deal of searching is required, a hash table is an appropriate data structure. Hash table performance is O(1) average for find, insert and delete operations, allowing you to insert new words and increment word frequencies very efficiently. Using a hash table, write an application to test Zipf’s Law.

Implementation notes

Store the words and their frequencies in a hash table that you implement with a hash table class. (Implement your own hash table. Do not use any STL functionality.) Include a constructor, a destructor, and member functions for inserting, deleting, and finding items in the hash table. (I recognize that there are no deletions in this assignment. Humor me.) The hash function is up to you, but make it a good one for strings. Use open addressing with linear probing to resolve collisions.

Start with a dynamically-allocated hash table of about 1K entries (remember, the hash table size should generally be prime). Whenever the hash table gets over 75% full, rehash the contents into a new hash table, approximately twice as large as the previous one. Print a message to the screen, notifying the user when rehashing occurs.

Open addressing makes it easy to sort the hash table after counting the word frequencies. To sort the table, use the standard library routine qsort(). Words should be sorted first by frequency, and then alphabetically within frequency groups.

Output the results to two files, with the same name as the input text file, but different extensions. For example, given an input file named textfile.txt, print the word concordance to a file named textfile.wrd, and the rank vs. frequency table to a file named textfile.zpf. Output files should include headers with the input filename, the total number of words, and the total number of distinct words.

Program output

The output of your program should include the following:

Print the word frequencies to a file with the same name as the input file, but with the extension .wrd. For example, given an input text file file.txt, write the output to file.wrd. File header info should include the input filename, total number of words processed, and number of distinct words found.

Print words in frequency groups, sorted alphabetically within each group. Do not print empty frequency groups. Print multiple words per line, left justified within columns. For formatting purposes, you may assume that words will not exceed 15 characters in length.

Print rank and frequency information to a CSV (comma separated value) file, suitable for importing into an Excel spreadsheet. This file should have the same name as the input text file, with a .csv extension. Include the same file header info as in the .wrd file.

Print timing results to the console. This will give an indication of how efficiently your program executes.

Program execution

% zipf testfile.txt

Read 31 words from the file testfile.txt.

Inserted 20 distinct words into the hash table.

Elapsed time = 0.0 msec

Contents of input file: testfile.txt

This is a pile of nonsense, of no particular importance other than that of demonstrating how this program is supposed to work.

Tra la la la la, la la la la.

Contents of output file: testfile.wrd

Zipf's Law: word concordance

----------------------------

File:            testfile.txt

Total words:     31

Distinct words: 20

Word Frequencies                             Ranks     Avg Rank

----------------                             -----     --------

Words occurring 8 times:                         1          1.0 la             

Words occurring 3 times:                         2          2.0 of             

Words occurring 2 times:                       3-4          3.5 is            

this           

Words occurring once:                         5-20         12.5 a             

demonstrating how            importance     no              nonsense      

other          particular     pile           program         supposed      

than           that           to             tra             work           

Contents of output file: testfile.csv

Zipf's Law: rank * freq = const

-------------------------------

File:            testfile.txt

Total words:     31

Distinct words: 20

rank,      freq,        

r*f 1.0,         8,        

8.0 2.0,         3,        

6.0 3.5,         2,        

7.0

12.5,         1,        12.5

Explanation / Answer

Main.cpp
--------------------------------------------------------------------------
#include <iostream>
#include <iomanip>
#include <string>
#include <vector>
#include <fstream>
#include <algorithm>
#include <ctime>

#include "Tokenizer.h"
#include "Hashtable.h"

using namespace std;

typedef pair<int, std::string> tableEntry;


const char* VALID = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'";

int TableComparator( const void* a, const void* b );
void WriteFiles( char* origFile, tableEntry* freq, int nWords, int maxStrLen, int tabSize );

int main( int argc, char** argv )
{

    Hashtable table;
    vector<string> tokens;
    string temp;

    // track longest string count and total word count
    int maxStrLen = 0;
    int wordCount = 0;

    // must be two arguments
    if ( argc != 2 )
    {
        cout << "Wrong number of arguments Usage: zipf <file.txt>" << endl;
        return 1;
    }

    // make sure file opens
    ifstream infile( argv[1] );
    if ( !infile )
    {
        cout << "Unable to open the file " << argv[1] << endl;
        return 2;
    }

    // time process
    clock_t t = clock();

    while ( !infile.eof() )
    {
        // read in line, make all characters lowercase and tokenize into words
        getline( infile, temp );
        transform( temp.begin(), temp.end(), temp.begin(), ::tolower );
        Tokenize( temp, tokens, VALID );
    }

    for ( auto t: tokens )
    {
        // insert each token, count total words, and check string length
        table.Insert( t );
        wordCount++;
        if ( t.length() > (unsigned) maxStrLen )
            maxStrLen = t.length();
    }

    // retrieve total entry count and table size
    int count = table.GetEntryCount();
    int size = table.GetSize();

    // copy table data to array of pairs
    tableEntry* wordList = new pair<int, string> [count+1];
    int j = 0;
    for ( int i = 0; i < size; i++ )
    {
        int c = table.GetCount( i );
        if ( c != 0 )
        {
            wordList[j].first = c;
            wordList[j].second = table.GetKey( i );
            j++;
        }
    }

    // sort list and write data to file
    qsort( wordList, count, sizeof( tableEntry ), TableComparator );
    WriteFiles( argv[1],wordList, wordCount, maxStrLen, count );

    // display program runtime
    t = clock() - t;
    cout << "Time to read in, hash, sort and write out data: " << (float) t / CLOCKS_PER_SEC << " seconds" << endl;

    return 0;
}

int TableComparator( const void* a, const void* b )
{
    // if integers are same, compare strings
    int result = ( (tableEntry *) b )->first - ( (tableEntry *) a )->first;
    if ( !result )
        result = ( (tableEntry *) a )->second.compare( ( (tableEntry *) b )->second );
    return result;
}

void WriteFiles( char* origFile, tableEntry* freq, int nWords, int maxStrLen, int tabSize )
{
    //determine output filenames
    char* wrdFname = strdup( origFile );
    strcpy ( wrdFname + strlen( wrdFname ) - 3, "wrd" );
    char* csvFname = strdup( origFile );
    strcpy( csvFname + strlen( csvFname ) - 3, "csv" );
    ofstream wrdout( wrdFname );
    ofstream csvout( csvFname );

    //print header of wrd file
    wrdout << "Zipf's Law: word concordance" << endl;
    wrdout << "----------------------------" << endl;
    wrdout << "File:            " << right << origFile << endl;
    wrdout << "Total words:     " << right << nWords << endl;
    wrdout << "Distinct words: " << right << tabSize << endl;
    wrdout << endl;
    wrdout << "Word Frequencies                                               Ranks    Avg Rank" << endl;
    wrdout << "----------------                                               -----    --------" << endl;

    //print header of csv file
    csvout << "Zipf's Law: word concordance" << endl;
    csvout << "----------------------------" << endl;
    csvout << "File:            " << right << origFile << endl;
    csvout << "Total words:     " << right << nWords << endl;
    csvout << "Distinct words: " << right << tabSize << endl;
    csvout << endl;
    csvout << "rank, freq, r*f" << endl;

    auto freqit = &freq[tabSize+1];
    auto freqend = &freq[tabSize + 1];
    int rank = 1;
    int nOccs = freq[0].first;
    int rankCount = 0;
    auto rankBegin = freq;

    //iterate over frequency entries detecting changes in number of occurences
    for ( freqit = freq; freqit < freqend; ++freqit )
    {
        if ( nOccs != freqit->first )
        {
            int colIdx = 0;
            float avgRank = rank + double (rankCount - 1) / 2;

            //output line of csv data
            csvout << avgRank << ", " << nOccs << ", " << avgRank*nOccs << endl;

            //handle pluralization
            string wordHeader;
            if ( nOccs > 1 )
                wordHeader = "Words occurring " + to_string( nOccs ) + " times:";
            else
                wordHeader = "Words occurring once:";
            wrdout << wordHeader;

            //handle rank ranges
            if ( rankCount == 1 )
                wrdout << setw( 68 - wordHeader.length() ) << setfill( ' ' ) << right << rank;
            else
            {
                string rankstring = to_string( rank ) + "-" + to_string( rank + rankCount - 1 );
                wrdout << string( 68 - wordHeader.length() - rankstring.length(), ' ' ) << rankstring;
            }

            //finish the heading for words of a given rank
            wrdout << setw( 12 ) << right << avgRank << endl;
            int nCols = 80 / maxStrLen;

            //write the words for a given rank category
            for ( auto rankit = rankBegin; rankit < freqit; ++rankit )
            {
                wrdout << setw( maxStrLen + 1 ) << setfill( ' ' ) << left << rankit->second;
                if ( ++colIdx == nCols )
                {
                    colIdx = 0;
                    wrdout << endl;
                }
            }
            wrdout << endl;
            wrdout << endl;

            //update ranking information
            rank += rankCount;
            rankCount = 0;
            nOccs = freqit->first;
            rankBegin = freqit;
        }
        ++rankCount;
    }
    //clean heap-allocated memory
    delete[] freq;
    free( wrdFname );
    free( csvFname );
}
-----------------------------------------------------------------------------------------
HashTable.cpp
------------------------------------------------
#include "Hashtable.h"

Hashtable::Hashtable( int n )
{
    // create new table of size n, set size to n
    table = new tableEntry[n];
    size = n;
}

Hashtable::Hashtable( const Hashtable& hash )
{
    // create new table same size as hash
    size = hash.size;
    table = new tableEntry[size];

    // copy values from hash into new table
    for ( int i = 0; i < size; i++ )
    {
        if ( table[i].second != "" )
        {
            table[i].first = hash.table[i].first;
            table[i].second = hash.table[i].second;
        }
    }
}


Hashtable::~Hashtable()
{}

int Hashtable::GetCount( int i )
{
    // invalid index
    if ( ( i < 0 ) || ( i >= size ) )
        return -1;

    return table[i].first;
}

int Hashtable::GetCount( const string& k )
{
    int i = FindHash( k );
    if ( i > -1 )
        return table[i].first;

    // string not found in table
    return 0;
}


string Hashtable::GetKey( int i )
{
    if ( ( i < 0 ) || ( i >= size ) )
        return "*BAD INDEX*";

    return table[i].second;
}


int Hashtable::GetIndex( const string& k ){ return FindHash( k ); }


int Hashtable::GetSize() { return size; }

int Hashtable::GetEntryCount() { return entries; }


float Hashtable::GetLoadFactor() { return (float) entries / size; }

void Hashtable::Insert( const string& k )
{
    int i = InsertHash( k );

    // increase frequency count if key exists
    if ( table[i].second == k )
        table[i].first++;

    else
    {
        // insert and increment table entries
        table[i].first++;
        table[i].second = k;
        entries++;

        // if table is more than 75% full after insert, rehash it
        if ( ( (float) entries / size ) > 0.75 )
        {
            cout << "Table is at least 75% full! Rehashing now..."<< endl;
            Rehash();
        }
    }
}

void Hashtable::Remove( int i )
{
    // invalid index
    if ( ( i < 0 ) || ( i >= size ) )
        return;

    // reset count, mark as deleted and decrement entries
    table[i].first = 0;
    table[i].second = "*DELETED*";
    entries--;
}

void Hashtable::Remove( const string& k )
{
    int i = FindHash( k );

    // if key is there reset count, mark as deleted and decrement entries
    if ( table[i].second == k )
    {
        table[i].first = 0;
        table[i].second = "*DELETED*";
        entries--;
    }
}


void Hashtable::Decrease( int i )
{
    // invalid index
    if ( ( i < 0 ) || ( i >= size ) )
        return;

    table[i].first--;

    // if count hits 0, mark key as deleted
    if ( table[i].first == 0 )
        table[i].second = "*DELETED*";
}


void Hashtable::Decrease( const string& k )
{
    int i = FindHash( k );

    // if key is there, decrement count
    if ( table[i].second == k )
        table[i].first--;

    // if count hits 0, mark key as deleted
    if ( table[i].first == 0 )
        table[i].second = "*DELETED*";
}


unsigned long Hashtable::HashFunction( const string& k )
{
    unsigned long hash = 5381;

    for ( unsigned int i = 0; i < k.length(); i++ )
        hash = ( hash * 33 ) + k[i];

    return hash;
}


int Hashtable::InsertHash( const string& k )
{
    int i = ( HashFunction( k ) ) % size;

    // probe table until either empty/deleted spot or matching key is found
    while ( (table[i].first != 0 ) && ( table[i].second != k ) )
        i = ( i + 1 ) % size;

    return i;
}

int Hashtable::FindHash( const string& k )
{
    int i = ( HashFunction( k ) ) % size;

    // probe table until either empty string or matching key is found
    while ( ( table[i].second != "" ) && ( table[i].second != k ) )
        i = ( i + 1 ) % size;

    if ( table[i].second == k )
        return i;

        // not in table
    else
        return -1;
}

void Hashtable::Rehash()
{
    // store data values
    tableEntry* temp = table;
    int oldSize = size;

    // get new table size
    size = size*2;
    while ( !IsPrime( size ) )
        size++;

    // resize table
    table = new tableEntry[size];
    int index;

    // hash entries from orginal table into new table
    for ( int i = 0; i < oldSize; i++ )
    {
        // skip empty and deleted entries
        if ( ( temp[i].second != "" ) && ( temp[i].second != "*DELETED*" ) )
        {
            index = InsertHash( temp[i].second );
            table[index].first = temp[i].first;
            table[index].second = temp[i].second;
        }
    }
}


bool Hashtable::IsPrime( int n )
{
    // if n is divisible by 2 it's not prime
    if ( n % 2 == 0 )
        return false;

    // if n is divisible by i, it's not prime
    for ( int i = 3; ( i * i ) <= n; i += 2 )
        if ( n % i == 0 )
            return false;

    return true;
}
----------------------------------------------------------------------------------------------------
Hashtable.h
----------------------------------------------------------------
#ifndef _Hashtable_H_
#define _Hashtable_H_

#include <string>
#include <iostream>
#include <cstdlib>
#include <cstring>
#include <cmath>

using namespace std;

// pair ints default to 0
typedef pair<int, std::string> tableEntry;

class Hashtable
{

private:
    //table counts and keys
    tableEntry* table;

    // how big the table is
    int size;

    // how many entries are in the table
    int entries = 0;
    unsigned long HashFunction( const string& );
    int InsertHash( const string& );
    int FindHash( const string& );
    void Rehash();
    bool IsPrime( int );
public:

    Hashtable( int n = 1033);

    Hashtable( const Hashtable& );

    ~Hashtable();

    int GetCount( int );
    int GetCount( const string& );
    string GetKey( int );
    int GetIndex( const string& );
    int GetSize();
    int GetEntryCount();
    float GetLoadFactor();

    void Insert( const string& );
    void Remove( int );
    void Remove( const string& );
    void Decrease( int );
    void Decrease( const string& );

};
#endif
--------------------------------------------------------------------------------------------------------
Tokenizer.cpp
--------------------------------------------------------
#include "Tokenizer.h"

void Tokenize( const string& str, vector<string>& tokens, const string& valid = " " )
{
    // skip delimiters to start of first token
    int tokenStart = str.find_first_of( valid, 0 );

    // find next delimiter (i.e., end of first token)
    int tokenEnd = str.find_first_not_of( valid, tokenStart );

    // loop through input string
    while ( tokenStart != (signed) string::npos )
    {
        // only emebeded single quotes are valid
        if ( str[tokenStart] == ''')
            tokenStart++;
        if ( str[tokenEnd] == ''')
            tokenEnd--;

        // found a token, add it to the vector
        tokens.push_back( str.substr( tokenStart, tokenEnd - tokenStart ) );

        // skip delimiters to start of next token
        tokenStart = str.find_first_of( valid, tokenEnd );

        // find next delimiter (end of token)
        tokenEnd = str.find_first_not_of( valid, tokenStart );
    }
}
--------------------------------------------------------------------------------------------------------
Tokenizer.h
--------------------------------------------------
#ifndef _Tokenizer_H_
#define _Tokenizer_H_


#include <string>
#include <vector>
#include <iostream>

using namespace std;

// string tokenizer that searches for symbols of a delimiter string
void Tokenize( const string& , vector<string>& , const string& );

#endif