Suppose you are hired by RPI for the daunting task of improving the student records system. The administration wants a system that is blazingly fast in storing and retrieving student information. You know that every student can be identified by a unique Social Security Number. What is the best way to store student data so that it can be saved and accessed quickly?
One possibility we have studied is binary search trees. We saw that
searching for an element in such a tree, if it is ``balanced'' (all
leaves are about the same distance from the root node), is
.
This performance is not bad, but we can do better. Besides, we have
not addressed the problems of making sure that the tree remains
balanced when adding and deleting items, and the additional work
involved.
We could also use an array. A student with ID number 123 would have all of his or her information stored in array element 123. Just by looking at a student's ID number, we can immediately go to the correct position in the data structure to save or retrieve the student's information. Do you see a problem with this method? Student ID numbers are 9 digits long, which means your array will have to have one billion elements! (So if a single student record takes up only 1 kilobyte of memory, this array will require 1 gigabyte!) Besides, there are probably no more than 12,000 active student records, so 99.99 percent of the array would be wasted space.
A hash table can provide an excellent compromise between the speedy performance of an array and the prohibitive memory requirements of making an array big enough to hold all possible data values. It still requires more storage space than would be necessary to hold your data, but it can give significantly faster performance than other ADTs if you are willing to commit enough memory to it.
A hash table is usually implemented using an array. We call each element a bucket. For the best performance, the array should be larger than the total number of items to be stored in it. Usually a hash table should be 2 to 3 times the size needed to store all items.
For our student ID example, let's say that we want to choose a table size of 24,000. How do we choose which bucket will hold a given student's information?
In order to implement a hash table, we need a hash function. A hash function takes a key as input, and computes an integer which represents a particular bucket in the table. The function returns this bucket number. The key is often a part of the item being stored: typically it is one or more data elements of the class, and it is usually guaranteed to be unique (i.e., duplicate keys are not allowed). Clearly the nature of the hash function is dependent on the type of the key.
In our example, the nine-digit Student ID number is an excellent choice for the key. It is guaranteed to be unique, unlike student names or birthdays. How can we turn a nine-digit number into one in the range 0 .. 23,999? The simplest way is to use the remainder function. So one way to write the hash function for our example is this:
const int TABLESIZE = 24000; // defined earlier in the program
int HashFunction(int id) {
return id % TABLESIZE;
}
A good hash function has the following three properties:
The first property is necessary for correct storage and retrieval (the same item must hash to the same bucket every time), and the remaining properties are necessary for fast performance.
There has been a considerable amount of research on hash functions,
but for this course we will only consider one simple but fairly
effective one. Convert the key to a large integer, and then
compute this value modulo
, where
is the number of buckets.
This function is guaranteed to produce an integer in the range
.
If your key field is a character string, here is a good method to convert
it into a large integer: Multiply the position number of
each character in the string times the ASCII value of the character and
accumulate the sum.
For example, if the string is cat, you would get:
where 99 is the ASCII value of c, 97 is the ASCII value of
a and 116 is the ASCII value of t.
Collision Resolution
It is very likely that, no matter how cleverly you construct your hash function, you can find two different keys that will be hashed to the same bucket number. When this happens, it is referred to as a collision. Therefore, besides developing a hash function that will evenly distribute the keys in the hash table, we must also be concerned with how to resolve these collisions when they occur.
A simple solution, and the only method we will demonstrate in this worksheet, is to make your hash table an array of lists of items instead of a simple array of items. In this way, a bucket can hold more than one entry. When searching, after the bucket number is computed via the hash function, you must search that bucket's list for the target item. However, if the list is short, searching is not time consuming. (And if it is not usually short, you have not designed your hash table properly; you need to have either more buckets or a more evenly-distributing hash function.)
Research has shown that you get a more even distribution of hash values, and thus fewer collisions, if you choose your table size to be a prime number. For this course, you don't have to prove that your table size is prime, but you should be able to quickly pick a size that is not a multiple of 2, 3, or 5.
Here is the code for a simple hash table which stores information on students with the key being the student's social security number. Note that it uses the STL list class.
#include <iostream>
#include <list>
using namespace std;
#include "Student.h"
#define TABLESIZE 997
class HashTable {
private:
list<Student> the_data[TABLESIZE]; // an array of 997 lists of students
int HashFunction(int key) {
return key % TABLESIZE;
}
public:
HashTable(){}
void Insert(Student s) {
int h = HashFunction(s.getssnum()); // get the bucket number
the_data[h].push_front(s); // add student into list of that bucket
}
// the find function returns true if a student with ssnum equal
// to key is found, and a copy of the data is put in s.
// If no student with an ssnum equal to key is found, the
// function returns false and the value of s is indeterminate
bool find(Student &s, int key) {
int h = HashFunction(key);
list<Student>::iterator i = the_data[h].begin();
while (i != the_data[h].end()) {
if (i->getssnum() == key) {
s = *i;
return true;
}
else i++;
}
return false;
}
};
Here is a simple example on a toy hash table with only seven buckets.
Initially the list is empty. To insert Mickey Mouse (ss num 101010101),
we first compute the hash value to be 3 (
), so we
insert Mickey into Bucket 3. We then insert Donald Duck (ss num
123456789) into bucket 1 (
), and then insert Snow
White (ss num 987654321) into bucket 3 (
). The list
is shown on the next page.
Exercise 1: Show what the hash table on the next page would look like after inserting the following additional students:
Goofy, ssnum 676767676 (this number mod 7 is 4)
Minnie Mouse, ssnum 333333333 (this number mod 7 is 4)
Elvis Presley, ssnum 654365432 (this number mod 7 is 0)
OJ Simpson, ssnum 333333340 (this number mod 7 is 4)
Exercise 2: Implement a hash table using the STL list class and the code above. Make TABLESIZE seven. Create a simple Student class with Name and SSN data members, and write a short main to duplicate the example above. Add a function void printbucket(int b) to your hash table so you can see if your hash table stored everything in the correct places as you observed them in Exercise 1.
Analysis of Hash Tables
The performance of the find operation for a hash table is a function of
the load factor, often denoted by the symbol
(lambda). The
load factor is the number of items in the table divided by the size
of the table (number of buckets).
The hash function should always be
, i.e., constant time. The linear
search operation in an STL list is
where
is the length
of the list. The average length of the lists in a hash table is
the total number of items in the table divided by the number of lists.
Since the number of lists is the table size, the average length of
the lists is the load factor. For example, if there are 97 buckets in the table
and there are 97 items in the table, the load factor is 1, and that is
average length of the lists. Of course some lists will have two
or three or more items, and others will have none, but on average
a list will have one element.
If an item is guaranteed to be in the list, then average time that it
takes to locate an item in the list is the list length divided by 2.
Thus, the time that it takes, on average, to find an item in a table is
It is clear that with a sparsely filled table (say, a
less than
), the performance of a hash table is excellent, far better
than any of the other data structures that we have seen. Since
having a small
means having a large table size relative to the
number of elements actually stored, this means we are trading off
memory usage for speed.
Exercise 3: What would the average load factor be for each of the following hash tables:
| NUMBER | |
| TABLESIZE | OF ITEMS |
| 97 | 97 |
| 97 | 50 |
| 97 | 970 |
| 997 | 97 |
| 997 | 970 |
Notice that, with a poor hash function/table size combination, it is
possible that some buckets could become extremely overused with
respect to the others. When this occurs, the worst-case performance
of the hash table find function can become
.
Exercise 4: Play around with the hash table you have implemented. Add some additional students to main, and experiment with changing the table size to different values. Try rewriting your hash function as well. Observe how your changes affect the distribution of students in the different buckets.
Open Addressing Hash Tables
You may find yourself in a situation where you may not wish or be able to use dynamic memory allocation (as is done when using lists) for your hash table. In this case, you can create a hash table as an array of the type of item you are storing instead of an array of lists.
Handling collisions must now be done differently, since each bucket can hold only one item. When two different items hash to the same bucket number, you must find some other empty bucket in which to store the second item. The method of finding that space is called open addressing.
To resolve the bucket conflict, we need a secondary hash function which computes a new hash value based on the old hash value -- not on the original key. This secondary hashing function may need to be called repeatedly. A simple secondary hashing function is simply to add 1 to the old hash value, rolling the number back to zero if it reaches the value of the table size.
The find algorithm for open addressing is as follows:
If an operation for deleting an item from a hash table is added, searching and inserting into an open addressing hash table becomes more complicated. In contrast, if the hash table is implemented as an array of lists, these operations are unaffected by adding a delete function. This is another reason why, in most instances, a list-based hash table is the preferred implementation.
Note: While it would be nice to create a template class for a hash table, it is more involved than other data types we have seen so far. One complication is that, as was mentioned earlier, the hash function must be tailored to the specific data being stored in the table. It is not impossible to do, but we will not cover it here.