Search is a problem where given a long string, the text, you are to find a list

ID: 3761684 • Letter: S

Question

Search is a problem where given a long string, the text, you are to find a list of k-letter words hidden in the text. For example, the text “thequickbrownfoxjumpsoverthelazydog” contains the words (“quick”, “overt”) within it.

Design and implement an algorithm that would find the number of occurrences of a k-letter word within the text, in O(K) average time, after a preprocessing step on the text. Design and implement the preprocessing step too, and find its Big-Oh time complexity.

*Use hashing*

Explanation / Answer

Here is the general approach:

Read a dictionary file and organize all words in a trie data structure. Many Unix systems have such files in the /usr/share/dict/ directory.

Find possible matches of a prefix of your input in the trie. This will usually produce multiple matches, for example theologyisabout begins with theology and the.

If we remove the matched prefixes, we get a set of possible continuations, on which we repeat step 2.

We then end up with a vast tree of possible interpretations.

There are two problems with this:

We can solve both of these problems by fuzzy matching. When we look up prefixes in the trie, we allow letters to be missing, inserted or changed. However, each such aberration increases the Levenshtein distance. If one interpretation has a too high summed Levenshtein distance, we can prune that interpretation and concentrate on other branches. You could also keep the branches in a priority queue and always investigate the branches with the lowest current edit distance, which is most likely to be a sensible interpretation – not unlike Dijkstra's pathfinding algorithm.

Note that multiple prefix sequences with different edit distances might lead to the same remaining string. You can therefore keep your progress in a data structure that allows parts to be shared. This caching will likely be beneficial for performance. If you in fact try to implement a variant of Dijkstra's algorithm here, a known tail would correspond to a visited node in the graph.

The difficult part is how to actually perform the fuzzy matching. E.g. you could decide on a maximum edit density of x edits per character (0 <= x <= 1), and abort an interpretation if it is guaranteed that this interpretation will have a higher density. For a given string with length l we can therefore determine an edit budget b = x · l. This budget is less important when matching prefixes in the trie, but this trie is only useful if there are less edits than characters in the prefix. An edit budget like b = floor(c / 2) with a prefix of length c might be sensible. How much edits you allow is not only a metric for how garbled texts you allow your system to “understand”, but also a performance setting – smaller budgets run faster, as less alternatives have to be investigated.

#include <stdio.h>

#include <string.h>

#include <ctype.h>

# define MAX_CHARS 26

# define MAX_WORD_SIZE 30

// A Trie node

struct TrieNode

{

bool isEnd; // indicates end of word

unsigned frequency; // the number of occurrences of a word

int indexMinHeap; // the index of the word in minHeap

TrieNode* child[MAX_CHARS]; // represents 26 slots each for 'a' to 'z'.

};

// A Min Heap node

struct MinHeapNode

{

TrieNode* root; // indicates the leaf node of TRIE

unsigned frequency; // number of occurrences

char* word; // the actual word stored

};

// A Min Heap

struct MinHeap

{

unsigned capacity; // the total size a min heap

int count; // indicates the number of slots filled.

MinHeapNode* array; // represents the collection of minHeapNodes

};

// A utility function to create a new Trie node

TrieNode* newTrieNode()

{

// Allocate memory for Trie Node

TrieNode* trieNode = new TrieNode;

// Initialize values for new node

trieNode->isEnd = 0;

trieNode->frequency = 0;

trieNode->indexMinHeap = -1;

for( int i = 0; i < MAX_CHARS; ++i )

trieNode->child[i] = NULL;

return trieNode;

}

// A utility function to create a Min Heap of given capacity

MinHeap* createMinHeap( int capacity )

{

MinHeap* minHeap = new MinHeap;

minHeap->capacity = capacity;

minHeap->count = 0;

// Allocate memory for array of min heap nodes

minHeap->array = new MinHeapNode [ minHeap->capacity ];

return minHeap;

}

// A utility function to swap two min heap nodes. This function

// is needed in minHeapify

void swapMinHeapNodes ( MinHeapNode* a, MinHeapNode* b )

{

MinHeapNode temp = *a;

*a = *b;

*b = temp;

}

// This is the standard minHeapify function. It does one thing extra.

// It updates the minHapIndex in Trie when two nodes are swapped in

// in min heap

void minHeapify( MinHeap* minHeap, int idx )

{

int left, right, smallest;

left = 2 * idx + 1;

right = 2 * idx + 2;

smallest = idx;

if ( left < minHeap->count &&

minHeap->array[ left ]. frequency <

minHeap->array[ smallest ]. frequency

)

smallest = left;

if ( right < minHeap->count &&

minHeap->array[ right ]. frequency <

minHeap->array[ smallest ]. frequency

)

smallest = right;

if( smallest != idx )

{

// Update the corresponding index in Trie node.

minHeap->array[ smallest ]. root->indexMinHeap = idx;

minHeap->array[ idx ]. root->indexMinHeap = smallest;

// Swap nodes in min heap

swapMinHeapNodes (&minHeap->array[ smallest ], &minHeap->array[ idx ]);

minHeapify( minHeap, smallest );

}

// A standard function to build a heap

void buildMinHeap( MinHeap* minHeap )

{

int n, i;

n = minHeap->count - 1;

for( i = ( n - 1 ) / 2; i >= 0; --i )

minHeapify( minHeap, i );

}

// Inserts a word to heap, the function handles the 3 cases explained above

void insertInMinHeap( MinHeap* minHeap, TrieNode** root, const char* word )

{

// Case 1: the word is already present in minHeap

if( (*root)->indexMinHeap != -1 )

{

++( minHeap->array[ (*root)->indexMinHeap ]. frequency );

// percolate down

minHeapify( minHeap, (*root)->indexMinHeap );

}

// Case 2: Word is not present and heap is not full

else if( minHeap->count < minHeap->capacity )

{

int count = minHeap->count;

minHeap->array[ count ]. frequency = (*root)->frequency;

minHeap->array[ count ]. word = new char [strlen( word ) + 1];

strcpy( minHeap->array[ count ]. word, word );

minHeap->array[ count ]. root = *root;

(*root)->indexMinHeap = minHeap->count;

++( minHeap->count );

buildMinHeap( minHeap );

}

// Case 3: Word is not present and heap is full. And frequency of word

// is more than root. The root is the least frequent word in heap,

// replace root with new word

else if ( (*root)->frequency > minHeap->array[0]. frequency )

{

minHeap->array[ 0 ]. root->indexMinHeap = -1;

minHeap->array[ 0 ]. root = *root;

minHeap->array[ 0 ]. root->indexMinHeap = 0;

minHeap->array[ 0 ]. frequency = (*root)->frequency;

// delete previously allocated memoory and

delete [] minHeap->array[ 0 ]. word;

minHeap->array[ 0 ]. word = new char [strlen( word ) + 1];

strcpy( minHeap->array[ 0 ]. word, word );

minHeapify ( minHeap, 0 );

}

// Inserts a new word to both Trie and Heap

void insertUtil ( TrieNode** root, MinHeap* minHeap,

const char* word, const char* dupWord )

{

// Base Case

if ( *root == NULL )

*root = newTrieNode();

// There are still more characters in word

if ( *word != '' )

insertUtil ( &((*root)->child[ tolower( *word ) - 97 ]),

minHeap, word + 1, dupWord );

else // The complete word is processed

{

// word is already present, increase the frequency

if ( (*root)->isEnd )

++( (*root)->frequency );

else

{

(*root)->isEnd = 1;

(*root)->frequency = 1;

}

// Insert in min heap also

insertInMinHeap( minHeap, root, dupWord );

}

// add a word to Trie & min heap. A wrapper over the insertUtil

void insertTrieAndHeap(const char *word, TrieNode** root, MinHeap* minHeap)

{

insertUtil( root, minHeap, word, word );

}

// A utility function to show results, The min heap

// contains k most frequent words so far, at any time

void displayMinHeap( MinHeap* minHeap )

{

int i;

// print top K word with frequency

for( i = 0; i < minHeap->count; ++i )

{

printf( "%s : %d ", minHeap->array[i].word,

minHeap->array[i].frequency );

}

// The main funtion that takes a file as input, add words to heap

// and Trie, finally shows result from heap

void printKMostFreq( FILE* fp, int k )

{

// Create a Min Heap of Size k

MinHeap* minHeap = createMinHeap( k );

// Create an empty Trie

TrieNode* root = NULL;

// A buffer to store one word at a time

char buffer[MAX_WORD_SIZE];

// Read words one by one from file. Insert the word in Trie and Min Heap

while( fscanf( fp, "%s", buffer ) != EOF )

insertTrieAndHeap(buffer, &root, minHeap);

// The Min Heap will have the k most frequent words, so print Min Heap nodes

displayMinHeap( minHeap );

}

// Driver program to test above functions

int main()

{

int k = 5;

FILE *fp = fopen ("file.txt", "r");

if (fp == NULL)

printf ("File doesn't exist ");

else

printKMostFreq (fp, k);

return 0;

}

{

}

{

}

Navigate

Search in Sheet Home | Layout Tables Charts SmartArt Formulas Data Review Alignm

Search l 1:33 PM +1 (754) 422-9701 9/23/18, 12:59 PM All Media School 0/4 pts Qu

Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.

Search is a problem where given a long string, the text, you are to find a list

Question

Explanation / Answer

Related Questions

Navigate