Zipf’s Law is a curious observation about the statistical distribution of words
ID: 3600587 • Letter: Z
Question
Zipf’s Law is a curious observation about the statistical distribution of words in text: the frequency of any word is inversely proportional to its rank in the frequency table. Frequency is the number of times a word appears in the text. When words are ranked according to frequency, the most frequent word is given rank 1, the next most frequent word is given rank 2, etc. Ties are handled by assigning the middle value in a range of ranks. For example, in a two-way tie for most frequent word, each word is given the rank of 1.5.
Write a C++ program to investigate the validity of Zipf’s Law. Read in words from a text file, counting their frequency of occurrence. Print out the words sorted by frequency (highest frequency first). Within each frequency group, sort the words alphabetically. Then print a table that lists word rank, frequency, and rank * frequency to determine whether Zipf’s Law holds true. According to Zipf’s Law, the product of rank and frequency should be (roughly) constant.
Test your program on this sample textfile:
This is a pile of nonsense, of no particular importance other than that of demonstrating how this program is supposed to work. Tra la la la la, la la la la.
Implementation Details
The input text file consists of a stream of ASCII characters that you must parse into words. For this assignment, restrict words to strings of letters, possibly containing embedded single quotes (contractions). Words are separated from the surrounding text by anything that is not a letter. Make your program case-insensitive by converting all words to lower case. (Hint: the strtok() routine is useful for parsing text into words.)
When a great deal of searching is required, a hash table is an appropriate data structure. Hash table performance is O(1) average for find, insert and delete operations, allowing you to insert new words and increment word frequencies very efficiently. Using a hash table, write an application to test Zipf’s Law.
Implementation notes
Store the words and their frequencies in a hash table that you implement with a hash table class. (Implement your own hash table. Do not use any STL functionality.) Include a constructor, a destructor, and member functions for inserting, deleting, and finding items in the hash table. (I recognize that there are no deletions in this assignment. Humor me.) The hash function is up to you, but make it a good one for strings. Use open addressing with linear probing to resolve collisions.
Start with a dynamically-allocated hash table of about 1K entries (remember, the hash table size should generally be prime). Whenever the hash table gets over 75% full, rehash the contents into a new hash table, approximately twice as large as the previous one. Print a message to the screen, notifying the user when rehashing occurs.
Open addressing makes it easy to sort the hash table after counting the word frequencies. To sort the table, use the standard library routine qsort(). Words should be sorted first by frequency, and then alphabetically within frequency groups.
Output the results to two files, with the same name as the input text file, but different extensions. For example, given an input file named textfile.txt, print the word concordance to a file named textfile.wrd, and the rank vs. frequency table to a file named textfile.zpf. Output files should include headers with the input filename, the total number of words, and the total number of distinct words.
Program output
The output of your program should include the following:
Print the word frequencies to a file with the same name as the input file, but with the extension .wrd. For example, given an input text file file.txt, write the output to file.wrd. File header info should include the input filename, total number of words processed, and number of distinct words found.
Print words in frequency groups, sorted alphabetically within each group. Do not print empty frequency groups. Print multiple words per line, left justified within columns. For formatting purposes, you may assume that words will not exceed 15 characters in length.
Print rank and frequency information to a CSV (comma separated value) file, suitable for importing into an Excel spreadsheet. This file should have the same name as the input text file, with a .csv extension. Include the same file header info as in the .wrd file.
Print timing results to the console. This will give an indication of how efficiently your program executes.
Program execution
% zipf testfile.txt
Read 31 words from the file testfile.txt.
Inserted 20 distinct words into the hash table.
Elapsed time = 0.0 msec
Contents of input file: testfile.txt
This is a pile of nonsense, of no particular importance other than that of demonstrating how this program is supposed to work.
Tra la la la la, la la la la.
Contents of output file: testfile.wrd
Zipf's Law: word concordance
----------------------------
File: testfile.txt
Total words: 31
Distinct words: 20
Word Frequencies Ranks Avg Rank
---------------- ----- --------
Words occurring 8 times: 1 1.0 la
Words occurring 3 times: 2 2.0 of
Words occurring 2 times: 3-4 3.5 is
this
Words occurring once: 5-20 12.5 a
demonstrating how importance no nonsense
other particular pile program supposed
than that to tra work
Contents of output file: testfile.csv
Zipf's Law: rank * freq = const
-------------------------------
File: testfile.txt
Total words: 31
Distinct words: 20
rank, freq,
r*f 1.0, 8,
8.0 2.0, 3,
6.0 3.5, 2,
7.0
12.5, 1, 12.5
Explanation / Answer
Main.cpp
--------------------------------------------------------------------------
#include <iostream>
#include <iomanip>
#include <string>
#include <vector>
#include <fstream>
#include <algorithm>
#include <ctime>
#include "Tokenizer.h"
#include "Hashtable.h"
using namespace std;
typedef pair<int, std::string> tableEntry;
const char* VALID = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'";
int TableComparator( const void* a, const void* b );
void WriteFiles( char* origFile, tableEntry* freq, int nWords, int maxStrLen, int tabSize );
int main( int argc, char** argv )
{
Hashtable table;
vector<string> tokens;
string temp;
// track longest string count and total word count
int maxStrLen = 0;
int wordCount = 0;
// must be two arguments
if ( argc != 2 )
{
cout << "Wrong number of arguments Usage: zipf <file.txt>" << endl;
return 1;
}
// make sure file opens
ifstream infile( argv[1] );
if ( !infile )
{
cout << "Unable to open the file " << argv[1] << endl;
return 2;
}
// time process
clock_t t = clock();
while ( !infile.eof() )
{
// read in line, make all characters lowercase and tokenize into words
getline( infile, temp );
transform( temp.begin(), temp.end(), temp.begin(), ::tolower );
Tokenize( temp, tokens, VALID );
}
for ( auto t: tokens )
{
// insert each token, count total words, and check string length
table.Insert( t );
wordCount++;
if ( t.length() > (unsigned) maxStrLen )
maxStrLen = t.length();
}
// retrieve total entry count and table size
int count = table.GetEntryCount();
int size = table.GetSize();
// copy table data to array of pairs
tableEntry* wordList = new pair<int, string> [count+1];
int j = 0;
for ( int i = 0; i < size; i++ )
{
int c = table.GetCount( i );
if ( c != 0 )
{
wordList[j].first = c;
wordList[j].second = table.GetKey( i );
j++;
}
}
// sort list and write data to file
qsort( wordList, count, sizeof( tableEntry ), TableComparator );
WriteFiles( argv[1],wordList, wordCount, maxStrLen, count );
// display program runtime
t = clock() - t;
cout << "Time to read in, hash, sort and write out data: " << (float) t / CLOCKS_PER_SEC << " seconds" << endl;
return 0;
}
int TableComparator( const void* a, const void* b )
{
// if integers are same, compare strings
int result = ( (tableEntry *) b )->first - ( (tableEntry *) a )->first;
if ( !result )
result = ( (tableEntry *) a )->second.compare( ( (tableEntry *) b )->second );
return result;
}
void WriteFiles( char* origFile, tableEntry* freq, int nWords, int maxStrLen, int tabSize )
{
//determine output filenames
char* wrdFname = strdup( origFile );
strcpy ( wrdFname + strlen( wrdFname ) - 3, "wrd" );
char* csvFname = strdup( origFile );
strcpy( csvFname + strlen( csvFname ) - 3, "csv" );
ofstream wrdout( wrdFname );
ofstream csvout( csvFname );
//print header of wrd file
wrdout << "Zipf's Law: word concordance" << endl;
wrdout << "----------------------------" << endl;
wrdout << "File: " << right << origFile << endl;
wrdout << "Total words: " << right << nWords << endl;
wrdout << "Distinct words: " << right << tabSize << endl;
wrdout << endl;
wrdout << "Word Frequencies Ranks Avg Rank" << endl;
wrdout << "---------------- ----- --------" << endl;
//print header of csv file
csvout << "Zipf's Law: word concordance" << endl;
csvout << "----------------------------" << endl;
csvout << "File: " << right << origFile << endl;
csvout << "Total words: " << right << nWords << endl;
csvout << "Distinct words: " << right << tabSize << endl;
csvout << endl;
csvout << "rank, freq, r*f" << endl;
auto freqit = &freq[tabSize+1];
auto freqend = &freq[tabSize + 1];
int rank = 1;
int nOccs = freq[0].first;
int rankCount = 0;
auto rankBegin = freq;
//iterate over frequency entries detecting changes in number of occurences
for ( freqit = freq; freqit < freqend; ++freqit )
{
if ( nOccs != freqit->first )
{
int colIdx = 0;
float avgRank = rank + double (rankCount - 1) / 2;
//output line of csv data
csvout << avgRank << ", " << nOccs << ", " << avgRank*nOccs << endl;
//handle pluralization
string wordHeader;
if ( nOccs > 1 )
wordHeader = "Words occurring " + to_string( nOccs ) + " times:";
else
wordHeader = "Words occurring once:";
wrdout << wordHeader;
//handle rank ranges
if ( rankCount == 1 )
wrdout << setw( 68 - wordHeader.length() ) << setfill( ' ' ) << right << rank;
else
{
string rankstring = to_string( rank ) + "-" + to_string( rank + rankCount - 1 );
wrdout << string( 68 - wordHeader.length() - rankstring.length(), ' ' ) << rankstring;
}
//finish the heading for words of a given rank
wrdout << setw( 12 ) << right << avgRank << endl;
int nCols = 80 / maxStrLen;
//write the words for a given rank category
for ( auto rankit = rankBegin; rankit < freqit; ++rankit )
{
wrdout << setw( maxStrLen + 1 ) << setfill( ' ' ) << left << rankit->second;
if ( ++colIdx == nCols )
{
colIdx = 0;
wrdout << endl;
}
}
wrdout << endl;
wrdout << endl;
//update ranking information
rank += rankCount;
rankCount = 0;
nOccs = freqit->first;
rankBegin = freqit;
}
++rankCount;
}
//clean heap-allocated memory
delete[] freq;
free( wrdFname );
free( csvFname );
}
-----------------------------------------------------------------------------------------
HashTable.cpp
------------------------------------------------
#include "Hashtable.h"
Hashtable::Hashtable( int n )
{
// create new table of size n, set size to n
table = new tableEntry[n];
size = n;
}
Hashtable::Hashtable( const Hashtable& hash )
{
// create new table same size as hash
size = hash.size;
table = new tableEntry[size];
// copy values from hash into new table
for ( int i = 0; i < size; i++ )
{
if ( table[i].second != "" )
{
table[i].first = hash.table[i].first;
table[i].second = hash.table[i].second;
}
}
}
Hashtable::~Hashtable()
{}
int Hashtable::GetCount( int i )
{
// invalid index
if ( ( i < 0 ) || ( i >= size ) )
return -1;
return table[i].first;
}
int Hashtable::GetCount( const string& k )
{
int i = FindHash( k );
if ( i > -1 )
return table[i].first;
// string not found in table
return 0;
}
string Hashtable::GetKey( int i )
{
if ( ( i < 0 ) || ( i >= size ) )
return "*BAD INDEX*";
return table[i].second;
}
int Hashtable::GetIndex( const string& k ){ return FindHash( k ); }
int Hashtable::GetSize() { return size; }
int Hashtable::GetEntryCount() { return entries; }
float Hashtable::GetLoadFactor() { return (float) entries / size; }
void Hashtable::Insert( const string& k )
{
int i = InsertHash( k );
// increase frequency count if key exists
if ( table[i].second == k )
table[i].first++;
else
{
// insert and increment table entries
table[i].first++;
table[i].second = k;
entries++;
// if table is more than 75% full after insert, rehash it
if ( ( (float) entries / size ) > 0.75 )
{
cout << "Table is at least 75% full! Rehashing now..."<< endl;
Rehash();
}
}
}
void Hashtable::Remove( int i )
{
// invalid index
if ( ( i < 0 ) || ( i >= size ) )
return;
// reset count, mark as deleted and decrement entries
table[i].first = 0;
table[i].second = "*DELETED*";
entries--;
}
void Hashtable::Remove( const string& k )
{
int i = FindHash( k );
// if key is there reset count, mark as deleted and decrement entries
if ( table[i].second == k )
{
table[i].first = 0;
table[i].second = "*DELETED*";
entries--;
}
}
void Hashtable::Decrease( int i )
{
// invalid index
if ( ( i < 0 ) || ( i >= size ) )
return;
table[i].first--;
// if count hits 0, mark key as deleted
if ( table[i].first == 0 )
table[i].second = "*DELETED*";
}
void Hashtable::Decrease( const string& k )
{
int i = FindHash( k );
// if key is there, decrement count
if ( table[i].second == k )
table[i].first--;
// if count hits 0, mark key as deleted
if ( table[i].first == 0 )
table[i].second = "*DELETED*";
}
unsigned long Hashtable::HashFunction( const string& k )
{
unsigned long hash = 5381;
for ( unsigned int i = 0; i < k.length(); i++ )
hash = ( hash * 33 ) + k[i];
return hash;
}
int Hashtable::InsertHash( const string& k )
{
int i = ( HashFunction( k ) ) % size;
// probe table until either empty/deleted spot or matching key is found
while ( (table[i].first != 0 ) && ( table[i].second != k ) )
i = ( i + 1 ) % size;
return i;
}
int Hashtable::FindHash( const string& k )
{
int i = ( HashFunction( k ) ) % size;
// probe table until either empty string or matching key is found
while ( ( table[i].second != "" ) && ( table[i].second != k ) )
i = ( i + 1 ) % size;
if ( table[i].second == k )
return i;
// not in table
else
return -1;
}
void Hashtable::Rehash()
{
// store data values
tableEntry* temp = table;
int oldSize = size;
// get new table size
size = size*2;
while ( !IsPrime( size ) )
size++;
// resize table
table = new tableEntry[size];
int index;
// hash entries from orginal table into new table
for ( int i = 0; i < oldSize; i++ )
{
// skip empty and deleted entries
if ( ( temp[i].second != "" ) && ( temp[i].second != "*DELETED*" ) )
{
index = InsertHash( temp[i].second );
table[index].first = temp[i].first;
table[index].second = temp[i].second;
}
}
}
bool Hashtable::IsPrime( int n )
{
// if n is divisible by 2 it's not prime
if ( n % 2 == 0 )
return false;
// if n is divisible by i, it's not prime
for ( int i = 3; ( i * i ) <= n; i += 2 )
if ( n % i == 0 )
return false;
return true;
}
----------------------------------------------------------------------------------------------------
Hashtable.h
----------------------------------------------------------------
#ifndef _Hashtable_H_
#define _Hashtable_H_
#include <string>
#include <iostream>
#include <cstdlib>
#include <cstring>
#include <cmath>
using namespace std;
// pair ints default to 0
typedef pair<int, std::string> tableEntry;
class Hashtable
{
private:
//table counts and keys
tableEntry* table;
// how big the table is
int size;
// how many entries are in the table
int entries = 0;
unsigned long HashFunction( const string& );
int InsertHash( const string& );
int FindHash( const string& );
void Rehash();
bool IsPrime( int );
public:
Hashtable( int n = 1033);
Hashtable( const Hashtable& );
~Hashtable();
int GetCount( int );
int GetCount( const string& );
string GetKey( int );
int GetIndex( const string& );
int GetSize();
int GetEntryCount();
float GetLoadFactor();
void Insert( const string& );
void Remove( int );
void Remove( const string& );
void Decrease( int );
void Decrease( const string& );
};
#endif
--------------------------------------------------------------------------------------------------------
Tokenizer.cpp
--------------------------------------------------------
#include "Tokenizer.h"
void Tokenize( const string& str, vector<string>& tokens, const string& valid = " " )
{
// skip delimiters to start of first token
int tokenStart = str.find_first_of( valid, 0 );
// find next delimiter (i.e., end of first token)
int tokenEnd = str.find_first_not_of( valid, tokenStart );
// loop through input string
while ( tokenStart != (signed) string::npos )
{
// only emebeded single quotes are valid
if ( str[tokenStart] == ''')
tokenStart++;
if ( str[tokenEnd] == ''')
tokenEnd--;
// found a token, add it to the vector
tokens.push_back( str.substr( tokenStart, tokenEnd - tokenStart ) );
// skip delimiters to start of next token
tokenStart = str.find_first_of( valid, tokenEnd );
// find next delimiter (end of token)
tokenEnd = str.find_first_not_of( valid, tokenStart );
}
}
--------------------------------------------------------------------------------------------------------
Tokenizer.h
--------------------------------------------------
#ifndef _Tokenizer_H_
#define _Tokenizer_H_
#include <string>
#include <vector>
#include <iostream>
using namespace std;
// string tokenizer that searches for symbols of a delimiter string
void Tokenize( const string& , vector<string>& , const string& );
#endif
Related Questions
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.