Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.


Question

The quadratic probing strategy has a clustering problem related to the way it looks for open slots. Namely, when a collision occurs at bucket h(k), it checks buckets A[(h(k) + j^2) mod N], for j = 1,2,...,N-1.
a) Show that j^2 mod N will assume at most (N+1)/2 distinct values, for N prime, as j ranges from 1 to N-1. As part of this justification, note that j^2 mod N = (N-j)^2 mod N for all j.
b) A better strategy is to choose a prime N such that N mod 4 = 3 and then to check the buckets A[(h(k) ± j^2) mod N] as j ranges from 1 to (N-1)/2, alternating between plus and minus. Show that this alternate version is guaranteed to check every bucket in A.
Note: h(k) is the hash function.

Explanation / Answer

A hash table, put simply, is an abstraction of an array that allows any value to be used as an index. While an array requires that indices be integers, a hash table can use a floating-point value, a string, another array, or even a structure as the index. This index is called the key, and the contents of the array element at that index is called the value. So a hash table is a data structure that stores key/value pairs and can be quickly searched by key. Because insertion and removal depend on the speed of the search, they tend to be fast as well.

To achieve this magic, a hash table uses a helper function that converts any object into an integral index suitable for subscripting the array. For example, to index an array with characters representing the digits 0 through 9, a hash function might look like this:

```c
unsigned hash ( const char key )
{
    return key - '0';
}
```

The character is expected to always be '0', '1', '2', '3', '4', '5', '6', '7', '8', or '9', and the trick of subtracting '0' from one of those characters evaluates to the integer value that the character represents. This works because you can subtract the lowest value in a contiguous range from another value in that range and find the zero-based distance between them. So ('2'-'0') results in 2, ('0'-'0') results in 0, and ('9'-'0') results in 9. For those of you who aren't familiar with this trick, it might help to work with actual values rather than character constants. Take an arbitrary range of [12, 16). You want 12 to represent 0, 13 to represent 1, 14 to represent 2, and so on up to 15, which should represent 3. To do this you can take any value in the range and subtract the lowest value, 12, from it. Try it with a calculator.

The hash function shown above can then be used to index a suitable array. For illustrative purposes, I'll include a complete test program for you to play with. Notice how problems occur when two entries are hashed to the same index.
This is called a collision, and properly handling collisions is where most of the effort in implementing hash tables comes in. Here is the example program:

```c
#include <stdio.h>

/* Find the number of elements in an array */
#define length(a) ( sizeof a / sizeof *a )

static size_t hash ( const char key )
{
    return key - '0';
}

static void jsw_flush ( void );
static int get_digit ( char *ch );

static void fill_table ( char table[] );
static void show_table ( const char table[], size_t size );

int main ( void )
{
    /* This is our hash table */
    char table[10] = {0};

    fill_table ( table );
    show_table ( table, length ( table ) );

    return 0;
}

/*
    Discard remaining characters in stdin
    up to the next newline or end-of-file
*/
void jsw_flush ( void )
{
    int ch;

    do
        ch = getchar();
    while ( ch != '\n' && ch != EOF );

    clearerr ( stdin );
}

/*
    Get a single character digit, '0'-'9'.
    Return 1 on success, 0 on input error,
    end-of-file, or invalid input
*/
int get_digit ( char *ch )
{
    int rc;

    printf ( "Enter a single digit: " );
    fflush ( stdout );

    if ( ( rc = scanf ( "%1[0123456789]", ch ) ) == 1 ) {
        /* At least a newline is left over; clear it all */
        jsw_flush();
    }

    return rc;
}

/* Populate the hash table */
void fill_table ( char table[] )
{
    char ch;

    while ( get_digit ( &ch ) ) {
        /* Create an index from the character */
        unsigned i = hash ( ch );

        if ( table[i] != 0 ) {
            /* Duplicate indexes are called collisions */
            printf ( "That index has been filled\n" );
        }
        else {
            /* The element is empty; fill it in */
            table[i] = ch;
        }
    }
}

/* Display the contents of the hash table */
void show_table ( const char table[], size_t size )
{
    size_t i;

    /* Print each bucket; '~' marks an empty slot */
    for ( i = 0; i < size; i++ )
        printf ( "%c\n", table[i] != 0 ? table[i] : '~' );
}
```

The simplest way to resolve a collision within the table itself is linear probing: when the hashed bucket is taken, step forward one bucket at a time, wrapping around at the end of the table, until an empty bucket or the key itself is found. The wraparound can be written as a branch, `if ( ++h == size ) h = 0;`, or by taking the new index modulo the table size, `h = ( h + 1 ) % size;`. This may or may not be more efficient due to the cost of a division as opposed to the cost of a branch.
In general you should refrain from worrying about such micro-optimizations and use whichever is clearer. Also notice that no test is made for the table being full. This might be done by saving the initial index and comparing each successive probe against it, or by maintaining a counter of how many iterations have been made and stopping when that counter hits the table size. Any technique for finding out whether you're at an index you've already visited will suffice. However, this test should only be present as a stopgap against exceptional failure. Open addressing schemes do not handle high load factors well, and you should avoid filling the table completely. In fact, the table should be at most half full for a generally safe open addressing load factor. Some open addressing techniques allow more, but half full is a good time to start worrying.

The step size is almost always 1 with linear probing, but it is acceptable to use other step sizes as long as the step size is relatively prime to the table size, so that every index is eventually visited. If this restriction isn't met, some indices may never be visited, and the search may fail or loop infinitely even if the key exists in the table.

Insertion into a hash table with linear probing is simply a variation of searching, where the key is inserted into the first empty bucket. If the table does not allow duplicates, an early return should be made. This works because the probes will always pass over a previously entered duplicate before hitting an empty slot.
However, that does introduce a problem with deletion that we'll look at shortly:

```c
struct hash_table {
    void **table;
    size_t size;
};

int jsw_insert (
    struct hash_table *table,
    void *key, size_t size,
    int allow_duplicates )
{
    /* Get the initial index for the key */
    size_t h = hash ( key, size ) % table->size;

    while ( table->table[h] != EMPTY ) {
        if ( !allow_duplicates && compare ( key, table->table[h] ) == 0 )
            return 0;

        if ( ++h == table->size )
            h = 0;
    }

    table->table[h] = key;

    return 1;
}
```

Removal in any open addressing strategy is where things get interesting. A direct removal is not possible because future probes could rely on the key being removed; if an empty bucket is created where a full bucket once was, searches will incorrectly fail. All in all, the obvious way to remove from a hash table with open addressing will destroy the data structure and cause it to work improperly. With linear probing, it is possible to remove a key and then re-insert each of the keys in the same cluster, but that solution is somewhat complicated. A more general solution that works for all of the open addressing schemes is to mark a bucket as deleted in such a way that searches will not fail.
This is a rather hackish solution, but it works well in practice:

```c
struct hash_table {
    void **table;
    size_t size;
};

int jsw_remove (
    struct hash_table *table,
    void *key, size_t size,
    int allow_duplicates )
{
    int item_found = 0;

    /* Get the initial index for the key */
    size_t h = hash ( key, size ) % table->size;

    while ( table->table[h] != EMPTY ) {
        if ( compare ( key, table->table[h] ) == 0 ) {
            item_found = 1;
            table->table[h] = DELETED;

            if ( !allow_duplicates ) {
                /* No need to keep searching the cluster */
                break;
            }
        }

        if ( ++h == table->size )
            h = 0;
    }

    return item_found;
}
```

The catch is that if duplicate items are allowed, you can't just delete the first occurrence and be satisfied. You have to keep going through the entire cluster and make sure that all occurrences are marked as deleted. Only the current cluster needs to be searched, though, because once you hit an empty slot, it's guaranteed that no duplicates could be probed beyond it. Keep in mind that the insertion algorithm must also be modified to insert into deleted buckets, rather than skipping over them, for this technique to work properly. The change is quite easy. Note that because the removal algorithm takes care not to leave any duplicates, the insertion algorithm can safely assume that an EMPTY or DELETED slot marks a valid insertion point: inserting into a DELETED slot won't leave an accidental duplicate further along the probe line.
```c
struct hash_table {
    void **table;
    size_t size;
};

int jsw_insert (
    struct hash_table *table,
    void *key, size_t size,
    int allow_duplicates )
{
    /* Get the initial index for the key */
    size_t h = hash ( key, size ) % table->size;

    while ( table->table[h] != EMPTY && table->table[h] != DELETED ) {
        if ( !allow_duplicates && compare ( key, table->table[h] ) == 0 )
            return 0;

        if ( ++h == table->size )
            h = 0;
    }

    table->table[h] = key;

    return 1;
}
```

Occasionally these deleted items will begin to fill the table as they take the place of real keys. In that case, the deleted items must be removed and the entire table rebuilt from scratch. We will cover this operation shortly.

Quadratic probing

In an attempt to alleviate the problem of primary clustering, a non-constant step size can be used. In general, it has been found that a quadratically growing step size works well. Simply start with a step size of 1 and increase the step quadratically until the desired index is found. This strategy is called quadratic probing. The search algorithm is only slightly more complicated than linear probing, and insertion and deletion are equally simple changes. Just add a step and use it to increase the index:

```c
struct hash_table {
    void **table;
    size_t size;
};

int jsw_find (
    struct hash_table *table,
    void *key, size_t size )
{
    size_t step;

    /* Get the initial index for the key */
    size_t h = hash ( key, size ) % table->size;

    for ( step = 1; table->table[h] != EMPTY; step++ ) {
        if ( compare ( key, table->table[h] ) == 0 )
            return 1;

        /* Move forward quadratically (triangular increments
           1, 3, 6, ...), wrapping if necessary */
        h = ( h + ( step * step + step ) / 2 ) % table->size;
    }

    return 0;
}
```

Quadratic probing alleviates primary clustering because probes are no longer close together according to some small constant.
Rather, the probe sequence covers buckets that are far apart, and because the step size is not constant, primary clustering is no longer apparent. However, because the probe sequence is always the same from each starting bucket, the effect of secondary clustering can be seen. Secondary clustering is not nearly as big a problem as primary clustering, and the slowdown is hardly noticeable, because quadratic probing only works if the load factor is less than .5 and the table size is prime; both of these requirements go a long way toward reducing clustering anyway. In general, quadratic probing is fast, avoids primary clustering, and is a marked improvement over linear probing, but it is restrictive in basic implementations, and overcoming those restrictions can be tricky. For example, basic quadratic probing is only guaranteed to work if the table size is prime and the load factor is less than .5. Beyond that you are on your own. While this is a nice step forward, double hashing is generally a better method for open addressing because it avoids primary clustering without the annoying restrictions.

Double hashing

Double hashing uses two independent hash functions. The first hash function is used as usual, and the second is used to create a step size. Because the key itself determines the step size, primary clustering is avoided. The algorithm is simple, but two rules must be adhered to for double hashing to work correctly. First, the second hash can never return 0, or there will be an infinite loop. Second, much as with linear probing, the table size must either be prime, or a power of two with the second hash returning an odd number. Beyond that, the implementation is similar to the other open addressing methods discussed so far. So similar, in fact, that I'll show only the search algorithm.
The insertion and removal algorithms are identical to linear probing with the step changes, just as with quadratic probing:

```c
struct hash_table {
    void **table;
    size_t size;
};

int jsw_find (
    struct hash_table *table,
    void *key, size_t size )
{
    /* Get the initial index for the key */
    size_t h = hash ( key, size ) % table->size;

    /* Get the step size; adding 1 keeps it from ever being 0 */
    size_t step = hash2 ( key ) % ( table->size - 1 ) + 1;

    while ( table->table[h] != EMPTY ) {
        if ( compare ( key, table->table[h] ) == 0 )
            return 1;

        /* Move forward by the step size, wrapping if necessary */
        h = ( h + step ) % table->size;
    }

    return 0;
}
```

Coalesced chaining

Note: This is a modification of my Wikipedia article on Coalesced Hashing.

Coalesced chaining is a lesser known strategy that forms a hybrid of separate chaining and open addressing. In a separately chained hash table, items that hash to the same index are placed on a list at that index. This can result in a great deal of wasted memory because the table itself must be large enough to maintain a load factor that performs well (typically twice the expected number of items), and extra memory must be used for all but the first item in a chain (unless list headers are used, in which case extra memory must be used for all items in a chain). Given a sequence of randomly generated three-character strings, the following table might be generated with a table of size 10:

(null)
"clq"
"qur"
(null)
(null)
"dim"
"aty" -> "qsu"
"rhv"
"qrj" -> "ofu" -> "gcl" -> "ecd"
(null)

This strategy is effective and very easy to implement. However, sometimes the extra memory use can be prohibitive, and the most common alternative, open addressing, has uncomfortable disadvantages when it comes to primary clustering. Coalesced hashing uses a similar technique to separate chaining, but instead of allocating new nodes for the linked list, buckets in the table are used, just as in open addressing.
The first empty bucket in the table is considered a collision bucket. When a collision occurs anywhere in the table, the key is placed in the collision bucket and a link is made between the colliding index and the collision bucket. Then the next empty bucket is found to serve as the next collision bucket. Because the collision bucket moves in a predictable pattern unrelated to how the keys are hashed, the effect of primary clustering is avoided, and search times remain efficient. The lists are interleaved around one another, so they are said to coalesce; hence the name, coalesced chaining. The code to insert into a coalesced hash table is surprisingly short for how convoluted the concept seems at first. (Here `make_node` stands for a helper, not shown, that allocates a `jsw_node` holding the key.)

```c
struct jsw_node {
    void *key;
    struct jsw_node *next;
};

struct hash_table {
    struct jsw_node **table;
    size_t size;
    size_t cursor;
};

int jsw_insert (
    struct hash_table *table,
    void *key, size_t size )
{
    /* Get an index for the key */
    size_t h = hash ( key, size ) % table->size;

    if ( table->table[h] == EMPTY ) {
        /* The slot is empty; no more work needed */
        table->table[h] = make_node ( key );
    }
    else {
        struct jsw_node *it;

        /* Find the next open slot */
        while ( table->cursor < table->size
            && table->table[table->cursor] != EMPTY )
        {
            ++table->cursor;
        }

        if ( table->cursor == table->size )
            return 0;

        table->table[table->cursor] = make_node ( key );

        /* Find the end of the coalesced chain */
        for ( it = table->table[h]; it->next != NULL; it = it->next )
            ;

        it->next = table->table[table->cursor];
    }

    return 1;
}
```

Search with coalesced chaining is identical to separate chaining, but removal is not as simple. In fact, because coalesced chaining is technically an open addressing strategy, a sentinel marking the bucket as deleted must be used, just as with the other open addressing schemes.
Rehashing

Occasionally a table needs to be resized, or the deleted sentinel items need to be removed to make way for real keys. Unfortunately, the only way to do this is to take every key out of the table and rebuild it completely from scratch. Naturally, this process is very slow compared to the other operations, so it should be performed rarely. An example of rehashing can be found in the jsw_hlib source code.

Performance

The performance properties of hash tables are well known. In my quest to avoid complex mathematical calculations in my tutorials, and thus cater to the other 99% of programmers, I will summarize. A hash table has O(1) performance in the best case. The majority of the work in implementing a hash table is ensuring that this best case is also the average case, by choosing a good hash function that minimizes collisions. It is more informative to say that the average expected performance of a hash table is somewhere between O(1) and O(log N) with a good hash function, with a strong bias toward O(1). The worst case for a hash table is O(N), and there is no way to avoid it with a basic hash table. If guaranteed good performance is required, a more suitable structure should be used, such as a deterministic skip list or a balanced binary search tree.

The biggest factor in the performance of a hash table, assuming the hash function is good, is the load factor of the table. The load factor is the number of used buckets divided by the size of the table, or M/N, where M is the number of used buckets and N is the table size. All collision resolution strategies perform equally well when the load factor is .5, which is the optimum load factor most of the time, and each begins to degrade as the load factor increases. The general recommendation is to maintain a .5 load factor when possible, and never to go above a .7 load factor.
However, if memory is scarce, separately chained hash tables degrade more slowly and more gracefully than open addressing tables, and you can get away with higher load factors. Open addressing schemes are limited by this guideline because of the pronounced slowdown at larger load factors, and implementations of open addressing typically assume that the table is largely empty. Separately chained schemes are not bound as tightly to the guideline, and a table sized to the expected number of keys will not perform as badly as it would with an open addressing strategy; the searches will simply be a bit slower due to longer chains. For the speed conscious, more empty buckets means shorter chains, so to ensure good performance a larger table is desirable.

Comparison

Of these five collision resolution strategies, which one is best? Much like everything in computer science, it depends. Typically the choice is between separate chaining and double hashing, because those two offer the most flexibility and the best performance properties. Linear probing breaks down too easily because of primary clustering, quadratic probing is too restrictive, and coalesced chaining is somewhat the worst of both worlds, where the complexity of chains is coupled with the disadvantages of open addressing. In the end, the choice should be made on a case-by-case basis.

Disadvantages

Hash tables are not a silver bullet. If the hash function is poor, there will be excessive collisions and performance will quickly approach O(N). On top of this, some hash functions are slower than others, and a slow hash function increases the constant factor behind O(1); the hash table could end up slower than data structures that are theoretically worse. Of the collision resolution strategies discussed, only separate chaining can easily handle deletion of a key.
The open addressing schemes do not lend themselves well to deleting a key, because other keys may rely on it as part of a search probe sequence. Separate chaining is also the only collision resolution strategy that can easily handle duplicate keys. A few moments' thought will show this to be true: to find all duplicates, only the chain must be searched with separate chaining, whereas the entire cluster must be searched with linear probing, and potentially the entire table with non-linear probing!

Hash tables are based on arrays, which are notoriously difficult to resize once they have been allocated. As such, a hash table is said to have a constant size unless you implement the table as a dynamic array and rehash all of the keys whenever the size changes. This is an incredibly expensive operation if the table is even remotely large, and it should be made rare by increasing the size of the table by a fairly large amount (such as half again the current table size, or N * 1.5). That way the reorganizations are minimal for real-time applications and can be amortized into other operations for everything else. The keys in a table need to be rehashed when the size changes because each insertion and subsequent search depends on the table size to give the correct result. If the table size changes, the indices for all of the previously hashed keys are invalidated, and future searches will no longer work correctly.

Because a hash table is an unordered data structure, certain operations are difficult and expensive. Range queries, selection, and sorted traversals are possible only if the keys are copied into a sorted data structure. There are hash table implementations that keep the keys in order, but they are far from efficient. A solution to this problem is a hybrid data structure called a hashlist.
The idea is to use a sorted data structure such as a binary search tree or sorted linked list as the primary storage unit. Then the keys in the primary structure are hashed, and references to them are stored in the hash table. This has excellent performance properties, but in practice the hashlist can be more work than it is worth.

The keys in a hash table are ideally spread in a uniform, pseudorandom pattern throughout the table. This is a good thing in theory, but in practice, processors cache frequently accessed memory under the assumption that adjacent memory is more likely to be accessed next than non-adjacent memory. As such, unless a fix like the hashlist is used, searching in a hash table may result in excessive cache misses for medium to large tables, because the jumping around performed by a hash table does not fit the usage pattern for efficient caching. For smaller tables that fit completely in cache, any performance problems are unlikely to be noticeable.

Conclusion

Hash tables are a powerful and flexible data structure for unordered searching and storage. There are subtle issues when resolving collisions, but all in all the hash table is easy to implement and provides near constant performance in the average case.
Alternatively, here is a C++ implementation that uses quadratic probing for collision resolution:

```cpp
#include <string>
using std::string;

#define ERROR_TABLE_FULL -1
#define ERROR_RECORD_NOT_FOUND -1
#define ERROR_DUPLICATE_RECORD -1

class HashTable {
public:
    HashTable(int table_size);
    int insert(const string &record);
    int retrieve(const string &record);
private:
    int hash(const string &record);
    int hash_size;
    int record_count;
    string *table;
};

/**
 * Constructor accepting table_size
 */
HashTable::HashTable(int table_size)
{
    hash_size = table_size;
    table = new string[hash_size];
    record_count = 0;
}

/**
 * Hash function: multiplies together all the characters of the
 * key, then maps the product into the table with product % hash_size
 */
int HashTable::hash(const string &key)
{
    int value = 1;
    for (int position = 0; position < (int)key.length(); position++) {
        value *= key[position];
    }
    // Force a non-negative index even if the product overflows
    return ((value % hash_size) + hash_size) % hash_size;
}

/**
 * Insert a record; returns its index, or an error code
 */
int HashTable::insert(const string &record)
{
    int hash_key = hash(record);
    int increment = 1;
    int probe_counter = 0;
    int hash_mid = (hash_size + 1) / 2;

    while (1) {
        // Overflow encountered if the probe count passes (hash_size+1)/2
        if (probe_counter > hash_mid) {
            return ERROR_TABLE_FULL;
        }

        string tmp = table[hash_key];
        // Empty slot
        if (tmp.empty()) {
            break;
        } // Position already taken; duplicate keys are not allowed
        else if (tmp.compare(record) == 0) {
            return ERROR_DUPLICATE_RECORD;
        }
        else {
            // Handles collision using h + i^2: adding the successive
            // odd increments 1, 3, 5, ... steps through the squares
            hash_key = (hash_key + increment) % hash_size;
            increment += 2;
            probe_counter++;
        }
    }

    table[hash_key] = record;
    record_count++;
    return hash_key;
}

/**
 * Retrieve a record's index
 */
int HashTable::retrieve(const string &record)
{
    int hash_key = hash(record);
    int increment = 1;
    int probe_counter = 0;
    int hash_mid = (hash_size + 1) / 2;

    while (1) {
        // Overflow encountered if the probe count passes (hash_size+1)/2
        if (probe_counter > hash_mid) {
            return ERROR_RECORD_NOT_FOUND;
        }

        string tmp = table[hash_key];
        // Record empty for the key
        if (tmp.empty()) {
            break;
        }
        else if (tmp.compare(record) == 0) {
            return hash_key;
        }
        else {
            hash_key = (hash_key + increment) % hash_size;
            increment += 2;
            probe_counter++;
        }
    }

    return ERROR_RECORD_NOT_FOUND;
}
```

Note that the probe limit of (hash_size + 1) / 2 is exactly the bound from part (a) of the question: with a prime table size, quadratic probing is only guaranteed that many distinct buckets, so probing further is pointless.