You begin your exploration by implementing a hash table in Java, then expand that to add hash map capabilities.
Hash Tables
As mentioned previously, hash tables apply a hash function to the data being stored to generate the index. This means you need to pick a good hash function, determine where the index and data will be stored, and figure out how to handle collisions. You start at the beginning with a hash function.
Picking a Good Hash Function
Tip: Both Java and Python have a hash() or hashCode() method build in that can be used to get the hash code of any object. That said, building your understanding how a hash works, what a hash code is, and how useful they are for things like HashMaps is fantastic practice.
Not every hash function is good for every application. Picking a good function requires some knowledge of the data being stored, and analysis of hash functions is a research topic in itself. Most hash functions are "good enough" as long as they perform well. For this example, you will be storing single words and will implement a hash function that converts them into numbers. It's time to take a look at a potential function which will do this:
public int hash(String data){
int SIZE = 501;
int seed = 131;
long hashValue = 0;
for (char ch: data.toCharArray()){
hashValue = ((hashValue * seed) + (int) ch) % SIZE;
}
return (int)(hashValue % SIZE);
}
You can see that the function loops through all the letters of the data being provided and multiplies each ASCII value by a seed number. It then adds all these numbers together and uses the modulo operator (%) to ensure the value stays in a set range.
The values of SIZE and seed are somewhat carefully chosen -- seed is a prime number, while SIZE is not. However, what is more important is that neither is divisible by the other. This means that multiplying by one and dividing by the other will result in fewer chances of hash collisions.
Now that you have a hash function that can turn any given String value into a number between 0 and 500 inclusive, you can figure out the next step -- where to store everything.
Storing Hashes and Data
Most hash tables use an array to store data. The hash value is used as the key index into the array:
public void add(String data){
int index = this.hash(data);
this.hashTable[index] = data;
}
To add new data to the hash table, you generate a new hash value and store the data in that location. This assumes you have already declared and defined this.hashTable[] as an array of SIZE strings.
At this point, you may be asking yourself: But what about hash collisions? Won't this result in lost data if two words generate the same hash?
Yes, it will. This is why handling collisions is the last important piece of the puzzle.
Handling Hash Collisions
How do you store two pieces of data that generate the same hash value? They cannot be stored in the same location in the hash table, so you have to find a new place for the second and subsequent collisions. There are three main techniques used, although only one is useful in a practical sense.
The first is to rehash the data. This would either use a different seed value or a different algorithm to generate a new hash value. However, there is no guarantee that this new value will not also collide with some other piece of data, requiring yet another rehash.
The second technique is even less appealing. If a hash value collides with another, you increment it until you find an empty space to store the data. This avoids performing multiple calculations at the cost of a linear search for an empty slot. This also makes finding data more difficult, as the initial hash value is now just a starting point for a linear search.
Neither of these techniques works well if the amount of data being stored fills or exceeds the size of the table, which makes them less than ideal for large data sets.
The last and best technique for managing hash table collisions is the chain or list technique. Instead of storing each data value directly, each entry in the hashTable[] array is the head of a list of data values. If no collision occurs, then a new list is created at the specific index. When a collision occurs, the new data value is added to the existing list. Finding data is done by computing the hash value and then walking the specific list to find the data.
Here is what add() looks like with this modification:
public void add(String data){
int index = this.hash(data);
if (this.hashTable[index] == null){
this.hashTable[index] = new ArrayList<String>();
}
this.hashTable[index].add(data);
}
In this case, hashTable[] is declared as List<String> hashTable[]. The data value is added to the ArrayList at the computed index, which is created if one doesn't already exist.
Finding data is equally straightforward:
public boolean find(String data){
int index = this.hash(data);
if (this.hashTable[index] != null) {
for (String item : this.hashTable[index]) {
if (item.compareTo(data) == 0) return true;
}
}
return false;
}
Depending on the amount of data being stored and the relative size of the underlying array, you could also use a BST instead of a List.
Hash Maps
As mentioned earlier, a hash map is similar to a hash table, so most of what you have already learned will apply directly to hash maps.
The primary difference between the two is that two pieces of data are stored in a hash map:
- A key value used to index and find the data being stored.
- The data itself.
In a hash table, the data is hashed to provide the index into the underlying hash table. In a hash map, the key is usually hashed, although a combination of key and value can be hashed together to reduce hash collisions and improve performance.
Storing both the key and data value as a unit requires an object:
class StorageUnit{
K key;
V dataValue;
StorageUnit(K key, V dataValue){
this.key = key;
this.dataValue = dataValue;
}
}
Here, you create an internal class called StorageUnit to keep both the key and data value together. The types K and V are both declared in the outer HashMap class:
public class HashMap<K extends Comparable, V>
This allows you to use any key and any data value in your hash maps. You will see why extending Comparable is required for the key data type shortly.
As before, a hash table is defined as List<StorageUnit> hashTable[SIZE], with SIZE being a constant to provide a large enough store to prevent hash collisions.
Because our key values extend the Object class, you can simplify your hash() method considerably:
private int hash(K key){
return key.hashCode() % SIZE;
}
Using the built-in .hashCode() method provides a great alternative to rolling your own hash function.
Adding data to a hash map is similar to a hash table with one change:
public void add(K key, V dataValue){
int index = hash(key);
// Is this a new key?
if (this.hashTable[index] == null) {
this.hashTable[index] = new ArrayList<StorageUnit>();
this.hashTable[index].add(new StorageUnit(key, dataValue));
} else {
// Check if you already have this key stored
for (StorageUnit item: this.hashTable[index]){
if (item.key.equals(key)) {
// Change the data and return
item.dataValue = dataValue;
return;
}
}
// Not a duplicate, so store this new key and data pair
this.hashTable[index].add(new StorageUnit(key, dataValue));
}
}
Hash maps require unique keys. Therefore, whenever you detect a hash collision, you need to check if this is because the key is the same. This is why the key data type extends Comparable If so, you replace the data with the new value - otherwise, you can add the new key-value pair.
If you'd like to continue learning about HashMaps in Java, check out the following resources in the Java 301 course:
Summary: Java HashMap
Hash tables and hash maps are very similar but differ in key ways.
Hash tables use the hash of the data being stored as an index into an array or other storage structure. If two different pieces of data result in the same hash value, a collision occurs, which needs to be resolved. The best way to resolve most hash table collisions is storing the data in a List or other structure at the hash index -- this reduces the amount of work needed to find each piece of data.
Hash maps build on the hash table concept by introducing the concept of a key-value pairing. The key is used to reference the data being stored. Hash maps will hash the key, or the key-value pair, to provide an index into an underlying hash table. Since hash maps do not allow duplicate keys, adding data to a hash map requires an extra step of verifying the key during hash collisions and replacing the data for duplicate keys.