2) Data Structures Lesson

Hash Functions

8 min to complete · By Jon Fincher

As you've been learning, there are many ways to impose order on your data. However, all the data has been of different sizes -- strings of all different lengths and numbers of all different sizes. As you learned when studying arrays, if every item is the same size, finding a specific item can be done using simple math. Following links can be time-consuming -- it would be much easier if all your data were the same size so it could be stored and retrieved quickly. Is there a way to make all your data the same size?

It's time to explore how this can be done, which also opens the door to other applications as well.

What is a Hash Function

In computer science, a hash or hash function is a function that accepts arbitrary length data and returns a fixed length result. The result returned is dependent on the input, and the function will always return the same result given the same input. In many cases, the result returned bears little resemblance to the input that generated it. Like hash you may make for dinner, the hash function chops the input into small pieces and recombines them into something new.

It's time to look at an example. Invented in the early 20th century, Soundex is an algorithm used to index names by sound. It was originally developed to help match names regardless of how they were written. It works by turning each name into code consisting of a single letter followed by three numbers. Here's how it works:

  1. First, retain the first letter of the name.
  2. Next, ignore all other occurrences of the following letters: a, e, i, o, u, y, h, w
  3. Replace all other consonants with digits as follows: a. b, f, p, v --> 1 b. c, g, j, k, q, s, x, z --> 2 c. d, t --> 3 d. l --> 4 e. m, n --> 5 f. r --> 6
  4. If two or more letters with the same number are adjacent, only keep the first one.
  5. If there are less than three numbers, add zeroes to pad it out to three.
  6. If there are more than three numbers, keep the first three.

For example, using this algorithm, the name "Michael" would be encoded as "M240", "Frank" as "F165", and both "Robert" and "Rupert" result in "R163". Because this algorithm maps names of arbitrary length to a four-character long code, it can be called a hash function.

Why use a Hash Function

Computing a hash function is much faster than searching through a data set looking for a specific data item. Hash functions can also be used to quickly verify that the data being input matches what is expected by hashing each and comparing the results. Since the hash values are fixed in size, a comparison is quicker and easier to implement.

Use Cases for a Hash Function

Hash functions are often used to help associate data with a key which can be used to retrieve it. This type of association is called hash table, also called associative arrays, maps, or dictionaries. If the hash function returns an integer, that can be used as an index into an array to find the associated data.

Hash functions are also used extensively in cryptography. Well-known algorithms such as MD5, SHA1, and SHA256 are all hash functions specialized for use in cryptography.

Benefits of a Hash Function

As mentioned above, hash functions are most frequently used in constructing hash tables, where the speed of computing the hash is much faster than storing data in other structures. Hash functions such as MD5 are also used to verify data integrity. By comparing the data received against a known MD5 value, you can quickly ensure the data hasn't been altered.

One major issue that needs to be addressed when using a hash function is a collision. This occurs when two different inputs to the hash function generate the same result. Recall the example above where the names "Robert" and "Rupert" both generated the same Soundex code "R163". If a unique code for each name is required, then this situation needs to be handled. You'll see how to do this practically in the next section.

Terminology

  • Hash Function: Any function that accepts variable length data and returns a fixed length result based on the input data.
  • Hash Table: A structure that uses hash values to index data values in a table format.
  • Hash Map: A structure that stores key-value pairs in an underlying hash table. Also known as dictionaries and associative arrays.
  • Collision: Occurs when a hash function returns the same result for two distinct inputs.

Summary: Hash Functions

Hash functions work by turning variable-length data into uniformly sized data. They are used in hash tables and associative arrays to efficiently store and retrieve data and in cryptography to encrypt, decrypt, and validate data.