2) Data Structures Lesson

Tries in Python

22 min to complete · By Jon Fincher

The best way to learn about tries is to try them, no pun intended, in code. You'll see how to add, remove, and find data in a trie using Python.

If you have not already read the previous page on the Trie Data Structure, please do so first and then return here.

Nodes in Python Tries

Like any other dynamic data structure, you need to define the nodes in which you will store your data. Here's what a Node might look like in Python:

class Node:
    def __init__(self):
        """
        Defines a Node for use in a Trie
        """
        self.children = {}
        self.is_end = False

So you may have some questions, like:

  • Where is the data stored?
  • Why is children a dictionary instead of just links?
  • What does is_end mean?

All will be revealed as you delve deeper into the code.

Trie Data Structure

Let's continue with the Trie structure itself:

class Trie:

    def __init__(self, data=None):
        self.root = Node()
        if data:
            self.add(data)

You should see some similarities between the Trie and the BST. The Trie has a root member variable, and the constructor creates the root node. From here, however, things get more interesting.

For example, how do you add words to this thing?

Inserting Data into Trie

Let's look at code that will add words to the trie, which you can analyze to understand how this structure works:

def add(self, data):
    current_node = self.root

    for ch in data:
        if ch not in current_node.children.keys():
            current_node.children[ch] = Node()
        current_node = current_node.children[ch]

    current_node.is_end = True

Like all dynamic data structures, you need a variable to indicate which Node you care about, so you define current_node to start at the root. However, things diverge from BSTs after that.

Remember that tries are also called prefix trees -- each node stores the prefix of a piece of data. The complete data item is stored in a sequence of Node objects, beginning at the root, which stores the initial character. The next character is stored in one of the children of root, and so on, until the final character is stored.

To do this, you loop through every character in data. This is then used as the key in a dictionary to identify which child to use next. If that child doesn't exist, you create it by adding it to the dictionary. Then you set current_node to that child and continue the loop. Once you have run out of characters in data, you set is_end to True to indicate this is a complete word.

Of course, an example is in order. Assume you have the following code:

trie = Trie()
trie.add("can")

To add the word can, current_node is set to root, and the loop starts.

  • The first character is c, which doesn't exist as a key in the current_node.children dictionary.
    • A new child Node is created as the data value associated with the key c.
  • current_node is now changed to refer to the node associated with c.

Your trie now looks like this:

#TBD: Image of trie after adding c

The next iteration of the loop adds the character a, which doesn't exist in the current_node.children dictionary, so it gets added, and current_node is changed again.

Now your trie looks like this:

#TBD: Image of trie after adding a

The final iteration of the loop adds the character n, which again doesn't exist in the current_node.children dictionary. After it is added, current_node is changed again, and the loop ends. The last thing you do is set current_node.is_end to True to indicate you have added a complete word, and your trie looks like this:

#TBD: Image of trie after adding n

You can see that the data you added isn't stored in a single Node structure but in a sequence of Nodes starting at the root and continuing to the end of the trie.

So, what is the deal with .is_end? Isn't it obvious that the last Node in the sequence is the end of the word? Actually, it's not -- let's say you now want to add the word cane to your trie:

trie.add("cane")

The code will follow almost the same steps after initializing current_node and starting the loop:

  • The first character c is found as a key in current_node.children, so no new Node is created, and current_node is changed to follow that path.
  • The next character a is found as a key in current_node.children, so no new Node is created, and current_node is changed to follow that path.
  • The next character n is found as a key in current_node.children, so no new Node is created, and current_node is changed to follow that path.
  • The last character e is not found as a key in current_node.children, so a new Node is created, and current_node is changed to follow that path.
  • Finally, current_node.is_end is set to True.

The difference is that the first three letters in the word cane are all found in the trie -- they represent the word can. Since can is a prefix of the word cane, those nodes are used when adding cane as well.

Your trie now contains two words, can and cane. You can follow the sequence of characters c, a, and n to find the word can, and know that it is a complete word. Why? Because .is_end is True when you get to the n node, even though there is another child to follow. Similarly, you know cane is a complete word in the trie because the e node has .is_end set to True. Further, you know the word ca is not a complete word in the trie because .is_end for the a node is set to False.

So how do you find a word in the trie?

Searching a Trie

Let's look at some code to find a word in a trie, then analyze it to understand what's going on:

def contains(self, data):
    current_node = self.root

    for ch in data:
        if ch in current_node.children.keys():
            current_node = current_node.children[ch]
        else:
            current_node = None
            break

    return current_node and current_node.is_end

Just like adding a word to a trie, you set up a variable current_node to help you walk through the trie starting at the root. You then start looping through all the letters in data.

The loop checks if the current character is a child of current_node. If so, you set current_node equal to that child node and continue the loop. Otherwise, you set current_node to None and terminate the loop.

When the loop ends, it's because you checked every character in data, or one of those characters was not a child of the previous character. If current_node is a valid Node, then it must point to a valid word, right?

Well, not completely -- remember a word is only in the trie if the final character's node has .is_end set to True. So you have to check that as well. Both of these checks have to be True for the word to be found, which is why they are combined with the and operator in the return statement.

Still confused? Let's look at two examples using the trie we created above, which contains can and cane. First, let's look for the word car in the trie:

print("Is 'car' in the trie? " + trie.contains("car"))

After setting current_node to it's initial value, you start the loop:

  • The first character c is a child of current_node, so you update current_node.
  • The next character a is a child of current_node, so you update current_node.
  • The next character r is not a child of current_node, so you set current_node = None.

The return statement checks first that current_node is valid -- since the truthiness of None is False, the method returns False.

OK, what if you were looking for can instead?

print("Is 'can' in the trie? " + trie.contains("can"))

A similar sequence of steps occurs after setting current_node to its initial value:

  • The first character c is a child of current_node, so you update current_node.
  • The next character a is a child of current_node, so you update current_node.
  • The next character n is a child of current_node, so you update current_node.

This time, the return statement sees that current_node is valid (its truthiness is True), so it also checks current_node.is_end. In this case, that is True as well, so the method returns True and True, which is just True.

Finding Multiple Words in a Trie

Modern word processors and editors often suggest words to the user as they type. This feature relies on finding all the words that begin with the letters the user has already typed. You can do the same thing in your trie, traversing the trie to return a collection of words that start with the same prefix.

As before, let's look at the code and then dig into the analysis:

def contains_prefix(self, prefix):
    current_node = self.root

    for ch in prefix:
        if ch in current_node.children.keys():
            current_node = current_node.children[ch]
        else:
            current_node = None
            break

    if current_node:
        return __contains_prefix(current_node, prefix)
    else:
        return None

def __contains_prefix(self, current_node, prefix):
    result = []
    if current_node.is_end:
        result.append(prefix)

    for ch in current_node.children.keys():
        result.extend(__contains_prefix(current_node.children[ch], prefix+ch))

    return result

For this method, you are using a recursive helper to get the job done -- you'll see why in a minute.

The first method should look familiar -- it is almost the same code as the .contains() method. In fact, it's doing the same job, but when it gets to the final node in prefix, it calls the recursive helper, passing in the current_node and the prefix to dig deeper into the trie. In effect, the first method is finding the first node where you can start traversing the trie.

The recursive dunder helper method first creates a new list to hold all the words found under the current_node. You check if this node is a word using .is_end and add it to the results list if so.

Next, you check all the children of current_node by getting the children's keys one by one from current_node.children.keys(). Recall that the keys in the .children dictionary represent the next letters in valid prefixes in the trie, so you can use each key to find and visit all the child nodes from current_node.

Once you know the character of the child node, you use that to:

  • Get the child node using current_node.children[ch].
  • Add that character to the prefix using prefix+ch.
  • Passing those into a recursive call to the dunder method .__contains_prefix().
  • Adding those results to the results list using .extend(), which is returned after the loop ends.

So, how does this work in practice? Let's return to the example above and run the following:

print("Words starting with 'can':")
for s in trie.containsPrefix("can"):
    print(f"  {s}")

The first method finds can (as seen in the previous section) and calls the dunder method, passing in the n node and the prefix set to can.

The dunder method starts by creating a new list, then checks current_node.is_end. Since that is True, it appends can to results.

The loop gets all the children of current_node, starting with e, and calls itself, passing in the e node and a new prefix set to cane.

The second call to the dunder method creates another list and adds cane to it since current_node.is_end is True for the e node. However, since it has no children, the loop never starts, so this instance ends, returning the results list.

The first call gets that return value and adds it to its version of results, which now holds can and cane. Since there are no more children of the n node, the loop ends, and that list is returned to the first method, which returns it to the caller:

Words starting with 'can':
  can
  cane

Deleting Data from a Trie

Of course, you can also remove data from a trie. Here's what that code might look like:

def delete(self, data):
    current_node = self.root

    for ch in data:
        if ch in current_node.children.keys():
            current_node = current_node.children[ch]
        else:
            current_node = None
            break

    if current_node:
        current_node.is_end = False

The initial traversal code should look familiar -- it's doing the same thing here as it does in .add() and .contains(), making sure the prefix is a valid word in the trie. However, when the loop ends and you know you found a valid ending node, you simply mark that node as no longer being a valid end by setting .is_end = False.

Why don't you remove that final node? Because it might be a valid prefix for other words. Let's look at an example:

trie.delete("can")

Since can is a valid prefix for the word cane, you can't delete the n node without losing all its children.

Summary: Tries in Python

Working with tries requires you to break apart your data so you can store the individual parts in nodes to act as prefixes for longer words. Most of the operations on tries can be done without recursion, with the exception of traversal. For this reason, most operations on tries are $`O(n)`$, where $`n`$ is the size of the word being added to or removed from the trie.