2) Data Structures Lesson

Python Dictionaries

13 min to complete · By Jon Fincher

Dictionaries in Python

You begin your exploration by implementing a hash table in Python, then expand that to add dictionary capabilities.

Hash Tables

As mentioned previously, hash tables apply a hash function to the data being stored to generate the index. This means you need to pick a good hash function, determine where the index and data will be stored, and figure out how to handle collisions. You start at the beginning with a hash function.

Picking a Good Hash Function

Not every hash function is good for every application. Picking a good function requires some knowledge of the data being stored, and analysis of hash functions is a research topic in itself. Most hash functions are "good enough", as long as they perform well. For this example, you will be storing single words as string values, and will implement a hash function which converts them into numbers. Here is a potential hash function which will do this:

def hash(data):
    __SIZE = 501
    seed = 131
    hash_value = 0
    for ch in data:
        hash_value = ((hash_value*seed) + ord(ch)) % self.__SIZE
    return hash_value

You can see that the function loops through all the letters of the data being provided, gets the ASCII value with the ord() function, and multiplies that by a seed number. It then adds all these numbers together, and uses the modulo operator (%) to ensure the value stays in a set range.

The value of __SIZE and seed are somewhat carefully chosen -- you may notice that seed is a prime number, while __SIZE is not. However, what is more important is that neither is divisible by the other, a property called relatively prime or coprime. This means that multiplying by one and dividing by the other will result in less chances of hash collisions.

Now that you have a hash function which can turn any given String value into a number between 0 and 500 inclusive, you can figure out the next step -- where to store everything.

Storing Hashes and Data

Most hash tables use a list to store data. The hash value is used as the key index into the list:

def add(self, data):
    index = self.hash(data)
    self.hash_table[index] = data

To add new data to the hash table, you generate a new hash value and store the data in that location. This assume you have already declared and defined self.hash_table[] as an array of __SIZE strings.

At this point, you may be asking yourself: But what about hash collisions? Won't this result in lost data if two words generate the same hash?

Yes, it will. Which is why handling collisions is the last important piece of the puzzle.

Handling Hash Collisions

How do you store two pieces of data which generate the same hash value? They cannot be stored in the same location in the hash table, so you have to find a new place for the second and subsequent collisions. There are three main techniques used, although only one is useful in a practical sense.

The first is to rehash the data. This would either use a different seed value, or a different algorithm, to generate a new hash value. However, there is no guarantee that this new value will not also collide with some other piece of data, requiring yet another rehash.

The second technique is even less appealing. If a hash value collides with another, you can increment it until you find an empty space to store the data. This avoids performing multiple calculations, at the cost of a linear search for an empty slot. This also makes finding data more difficult, as the initial hash value is now just a starting point for a linear search.

Neither of these techniques works well if the amount of data being stored fills or exceeds the size of the table, which make them less than ideal for large data sets.

The last, and best, technique for managing hash table collisions is the chain or list technique. Instead of storing each data value directly, each entry in the hash_table[] list is the head of a list of data values. If no collision occurs, then a new list is created at the specific index. When a collision occurs, the new data value is added to the existing list. Finding data is done by computing the hash value, then walking the specific list to find the data.

Here is what add() looks like with this modification:

def add(self, data):
    index = self.hash(data)
    if not self.hash_table[index]:
        self.hash_table[index] = []
    self.hash_table[index].add(data)

In this case, the data value is added to the list at the computed index, which is created if one doesn't already exist.

Finding data is equally straightforward:

def find(self, data) -> bool:
    index = self.hash(data)
    if self.hash_table[index]:
        return data in self.hash_table[index]
    return False

Depending on the amount of data being stored and the relative size of the underlying array, each entry into self.hash_table could also be a BST instead of a list.

Dictionaries

As mentioned earlier, a dictionary is similar to a hash table, so most of what you have already learned will apply directly to dictionaries.

The primary difference between the two is that two pieces of data are stored in a dictionary:

  1. A key value, used to index and find the data being stored.
  2. The data itself.

In a hash table, the data is hashed to provide the index into the underlying hash table. In a dictionary, the key is usually hashed, although a combination of key and value can be hashed together to reduce hash collisions and improve performance.

Storing both the key and data value as a unit requires an object:

class StorageUnit:
    def __init__(self, key, data):
        self.key = key
        self.data = data

Here, you create an internal class called StorageUnit to keep both the key and data value together. This is used by the outer Dictionary class:

class Dictionary:
    def __init__(self):
        self.__SIZE = 1001
        self.hash_table = [None]*self.__SIZE

As before, you define a hash table as a list of __SIZE items, where __SIZE is large enough to prevent many hash collisions.

Here, you make use of the built in Python function hash(), which simplifies your hash() method considerably:

def hash(self, key) -> int:
    return hash(key) % self.__SIZE

Using the built-in hash() function provides a great alternative to rolling your own hash function, and will work with any built in Python types, as well as Python objects.

Adding data to a dictionry is similar to a hash table with one small change:

def add(self, key, data):
    index = hash(key)
    if not self.hash_table[index]:
        self.hash_table[index] = []
        self.hash_table[index].add(StorageUnit(key, data))
    else:
        for item in self.hash_table[index]:
            if item.key == key:
                item.data = data
                return
        this.hash_table[index].add(StorageUnit(key, data))

Dictionaries require unique keys. Therefore, whenever you detect a hash collision, you need to check if this is because the key is the same. If so, you replace the data with the new value -- otherwise, you can add the new key-value pair as you would with any collision.

Colorful illustration of a light bulb

If you'd like to continue learning about Dictionaries in Python, check out the following resources in the Python 301 course:

Summary: Dictionaries in Python

Hash tables and dictionaries are very similar, but differ in key ways.

Hash tables use the hash of the data being stored as an index into an array, or other storage structure. If two different pieces of data result in the same hash value, a collision occurs which needs to be resolved. The best way to resolve most hash table collisions is storing the data in a list or other structure at the hash index -- this reduces the amount of work needed to find each piece of data.

Dictionaries build on the hash table concept by introducing the concept of a key-value pairs. The key is used to reference the data being stored. Dictionaries hash the key, or the key-value pair, to provide an index into an underlying hash table. Since dictionaries do not allow duplicate keys, adding data to a dictionaries requires an extra step of verifying the key during hash collisions, and replacing the data for duplicate keys.