1. A certain hash function for a string of characters $C = c_0 c_1 \ldots c_{n-1}$ is given

ID: 3736565 • Letter: 1

Question

1. A certain hash function for a string of characters $C = c_0 c_1 \ldots c_{n-1}$ is given by:

$$\mathrm{hash}(C) = \left( \sum_{i=0}^{n-1} c_i \times 31^{\,n-1-i} \right) \bmod 2^{32}$$

(a) Suppose we want to find the first occurrence of a string $P = p_0 p_1 \ldots p_{k-1}$ in a string $Q = q_0 q_1 \ldots q_{N-1}$, where $N \gg k$. We can first find the hash code for $P$ and then compare it with the hash codes of the $k$-length substrings of $Q$: $Q_0 = q_0 \ldots q_{k-1}$, $Q_1 = q_1 q_2 \ldots q_k$, etc., until the hash codes match or no further $k$-length substrings of $Q$ are left. If a match occurs, then $P$ must be compared character by character to the substring of $Q$ for which the hash codes match. If no match of hash codes is found, no substring of $Q$ is equal to $P$. From the above equation, show that, if the hash code for $Q_i$ is known, the hash code for $Q_{i+1}$ can be found in constant time.

(b) Give the run time of the string matching algorithm in big-O notation in terms of $N$ and $k$. [One of the best algorithms (Boyer-Moore) takes O(N) time and, if there is no match, takes N/M time.]
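The sliding comparison described in part (a) can be sketched in Python. This is only an illustration under stated assumptions: the exponent of 31 is taken to be n-1-i (first character most significant), character codes come from `ord`, and the names `poly_hash` and `find_first` are hypothetical, not from the question.

```python
MOD = 2 ** 32   # modulus from the hash definition
BASE = 31       # multiplier from the hash definition


def poly_hash(s: str) -> int:
    """Hash of s: sum of ord(s[i]) * 31**(len(s)-1-i), reduced mod 2**32."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def find_first(p: str, q: str) -> int:
    """Return the index of the first occurrence of p in q, or -1."""
    k, n = len(p), len(q)
    if k == 0:
        return 0
    if k > n:
        return -1
    target = poly_hash(p)
    h = poly_hash(q[:k])            # hash of the first window Q_0
    top = pow(BASE, k - 1, MOD)     # weight of the window's leading character
    for i in range(n - k + 1):
        # Hashes can collide, so confirm character by character.
        if h == target and q[i:i + k] == p:
            return i
        if i + k < n:
            # Constant-time update: drop q[i], shift by 31, append q[i+k].
            h = ((h - ord(q[i]) * top) * BASE + ord(q[i + k])) % MOD
    return -1


print(find_first("abc", "xxabcxx"))
```

Each window update uses a fixed number of arithmetic operations regardless of k, which is the property part (a) asks about; the explicit string comparison is only reached when the two hash codes agree.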

Explanation / Answer

Since you mentioned a composite key, I am assuming that you have a PK like PRIMARY KEY(PartionKey, StringVal), where StringVal is 6 to 7 characters, and that you want to know how C* efficiently gets to the record for this PK. If that is your question, the answer lies in how C* stores data.

In this example, all the data for a given partition key is stored as one physical row, using StringVal as the sort order. So if you have, say, 1 million unique StringVal values for a given PartionKey value, all of them will be stored as one physical row on disk, on a node determined by the hash of the partition key, sorted in the default ascending order of StringVal. All the columns in the PK other than the partition key are called 'clustering' columns, as they decide the clustering order. So in this example, the first column of the composite key is the partition key and the second column is the clustering column, which decides the clustering order for all the records of a partition key.

Now, if you want to get a specific PK record: since C* stores the offsets for the primary keys in index files (the -Index.db files for a column family), getting to a specific record for a PK is very efficient, as it involves a seek to that location. This also allows C* to do efficient range queries; e.g., you could get a slice of the physical row corresponding to the partition key by specifying a range of StringVal such as 'mmm' < StringVal < 'nnn', which in your case will be a lexical-order comparison. The point is that since the data is in a specific order on disk and C* has offsets to the various records corresponding to values of StringVal, it can do very efficient seeks.
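To make the idea of "sorted by clustering column, then seek" concrete, here is a toy in-memory model in Python. It is not how C* is implemented on disk; the class `Partition` and its methods are hypothetical names, and a sorted list with binary search merely stands in for the on-disk ordering and offset index.

```python
import bisect


class Partition:
    """Toy model of one partition: rows kept sorted by clustering key."""

    def __init__(self):
        self._keys = []   # clustering values (e.g. StringVal), ascending
        self._rows = []   # row payloads, parallel to _keys

    def insert(self, string_val, row):
        i = bisect.bisect_left(self._keys, string_val)
        if i < len(self._keys) and self._keys[i] == string_val:
            self._rows[i] = row              # overwrite an existing key
        else:
            self._keys.insert(i, string_val)
            self._rows.insert(i, row)

    def get(self, string_val):
        """Point lookup by binary search over the sorted keys."""
        i = bisect.bisect_left(self._keys, string_val)
        if i < len(self._keys) and self._keys[i] == string_val:
            return self._rows[i]
        return None

    def slice(self, low, high):
        """Rows with low < key < high, in lexical order."""
        lo = bisect.bisect_right(self._keys, low)
        hi = bisect.bisect_left(self._keys, high)
        return list(zip(self._keys[lo:hi], self._rows[lo:hi]))


part = Partition()
for sv in ["mmz", "abc", "naa", "nnn", "mmm"]:
    part.insert(sv, {"value": sv.upper()})
print(part.get("naa"))
print(part.slice("mmm", "nnn"))
```

Because the keys are kept in order, both the point lookup and the range slice locate their starting position by binary search and then read contiguous entries, which mirrors the seek-then-scan behaviour the answer describes.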