2) Data Structures Lesson

Tries in Java

26 min to complete · By Jon Fincher

The best way to learn about tries is to try them, no pun intended, in code. You'll see how to add, remove, and find data in a trie using Java.

If you have not already read the previous page on the Trie Data Structure, please do so first and then return here.

Nodes in Java Tries

Like any other dynamic data structure, you need to define the nodes in which you will store your data. Here's what a Node might look like in Java:

class Node<T extends CharSequence>{
    public Map<Character, Node<T>> children;
    public boolean isEnd;

    public Node() {
        this.children = new HashMap<Character, Node<T>>();
        this.isEnd = false;
    }

For this unit, you will use your tree to store String data, so each node is declared to store a CharSequence, which is implemented by String.

Of course, you may also have some other questions, like:

  • Where is the data stored?
  • Why is children a Map type instead of just links?
  • What does isEnd mean?

All will be revealed as you delve deeper into the code.

Trie Data Structure

Let's continue with the Trie structure itself:

public class Trie<T extends CharSequence> {
    Node<T> root;

    public Trie(){
        this.root = new Node<>();
    }
}

You should see some similarities between the Trie and the BST. The Trie stores the same data type as its Node structure; it has a root member variable, and the constructor creates the root node. From here, however, things get more interesting.

For example, how do you add words to this thing?

Inserting Data into Trie

Let's look at code that will add words to the trie, which you can analyze to understand how this structure works:

/**
 * Add a new item to the Trie
 * @param data The data to add.
 */
public void add(T data)
{
    // Start at the root
    Node<T> currentNode = this.root;

    // Look at every character
    for (int i=0; i < data.length(); i++){
        Character ch = data.charAt(i);
        if (!currentNode.children.containsKey(ch))
            currentNode.children.put(ch, new Node<T>());
        currentNode = currentNode.children.get(ch);
    }
    currentNode.isEnd = true;
}

Like all dynamic data structures, you need a variable to indicate which Node you care about, so you define currentNode to start at the root. However, things diverge from BSTs after that.

Remember that tries are also called prefix trees -- each node stores the prefix of a piece of data. The complete data item is stored in a sequence of Node objects, beginning at the root, which stores the initial character. The next character is stored in one of the children of root, and so on, until the final character is stored.

To do this, you loop through every character in data -- since the type is CharSequence, you use the .charAt() method to get them all. This is then used as the key in a Map to identify which child of root to use next. If that child doesn't exist, you make it using the .put() method. Then, you set currentNode to that child and continue the loop. Once you have run out of characters in data, you set isEnd to true to indicate this is a complete word.

Of course, an example is in order. Assume you have the following code:

Trie trie = new Trie();
trie.add("can");

To add the word can, currentNode is set to root, and the loop starts.

  • The first character is c, which doesn't exist as a key in the currentNode.children map.
    • A new child Node is created as the data value associated with the key c.
  • currentNode is now changed to refer to the node associated with c.

Your trie now looks like this:

#TBD: Image of trie after adding c

The next iteration of the loop adds the character a, which doesn't exist in the currentNode.children map, so it gets added, and currentNode is changed again.

Now your trie looks like this:

#TBD: Image of trie after adding a

The final iteration of the loop adds the character n, which again doesn't exist in the currentNode.children map. After it is added, currentNode is changed again, and the loop ends. The last thing you do is set currentNode.isEnd to true to indicate you have added a complete word, and your trie looks like this:

#TBD: Image of trie after adding n

You can see that the data you added isn't stored in a single Node structure but in a sequence of Nodes starting at the root and continuing to the end of the trie.

So what is the deal with .isEnd? Isn't it obvious that the last Node in the sequence is the end of the word? Actually, it's not -- let's say you now want to add the word cane to your trie:

trie.add("cane");

The code will follow almost the same steps after initializing currentNode and starting the loop:

  • The first character c is found as a key in currentNode.children, so no new Node is created, and currentNode is changed to follow that path.
  • The next character a is found as a key in currentNode.children, so no new Node is created, and currentNode is changed to follow that path.
  • The next character n is found as a key in currentNode.children, so no new Node is created, and currentNode is changed to follow that path.
  • The last character e is not found as a key in currentNode.children, so a new Node is created, and currentNode is changed to follow that path.
  • Finally, currentNode.isEnd is set to true.

The difference is that the first three letters in the word cane are all found in the trie -- they represent the word can. Since can is a prefix of the word cane, those nodes are used when adding cane as well.

Your trie now contains two words, can and cane. You can follow the sequence of characters c, a, and n to find the word can and know that it is a complete word. Why? Because .isEnd is true when you get to the n node, even though there is another child to follow. Similarly, you know cane is a complete word in the trie because the e node has .isEnd set to true. Further, you know the word ca is not in the trie because .isEnd for the a node is set to false.

So how do you find a word in the trie?

Searching a Trie

Let's look at some code to find a word in a trie, then analyze it to understand what's going on:

/**
 * Does the trie contain this data item?
 * @param data What are you looking for?
 * @return True if the data is in the tree, False otherwise
 */
public boolean contains(T data){
    Node<T> currentNode = this.root;

    int i = 0;
    while (i<data.length() && 
          currentNode.children.containsKey(data.charAt(i)))
        currentNode = currentNode.children.get(data.charAt(i++));

    return (i==data.length() && currentNode.isEnd);
}

Just like adding a word to a trie, you set a variable currentNode to help you walk through the trie starting at the root. You also set up an index i, which will look at each letter in the data you are looking for.

The loop checks two conditions for every node:

  • Have you looked at every character in data? As long as i is less than the length of data, there are still characters to check.
  • Is the current character a child of currentNode? If the character is a key in the .children map, then it is.

If both conditions are true, then you set currentNode equal to the proper child node, increment i to the next character position in data, and continue the loop.

When the loop ends, it's because you either ran out of characters to check or one of those characters was not a child of the previous character. You can check which one is true by comparing i to data.length() -- if they are equal, then you have looked at all the characters in data, and therefore the word is in the trie. Right?

Well, not completely -- remember a word is only in the trie if the final character's node has .isEnd set to true. So you have to check that as well. Both of these checks have to be true for the word to be found, which is why they are combined with the && operator in the return statement.

Still confused? Let's look at two examples using the trie we created above, which contains can and cane. First, let's look for the word car in the trie:

System.out.println("Is 'car' in the trie? " + trie.contains("car"));

After setting currentNode and i to their initial values, you start walking through the loop:

  • i is 0, which is less than 3, and c is a child of currentNode, so you update currentNode and increment i.
  • i is 1, which is less than 3, and a is a child of currentNode, so you update currentNode and increment i.
  • i is 2, which is less than 3, but r is not a child of currentNode, so the loop ends.

The return statement checks that i==data.length() is false, and so the method returns false.

OK, what if you were looking for can instead?

System.out.println("Is 'can' in the trie? " + trie.contains("can"));

A similar sequence of steps occurs after setting currentNode and i to their initial values:

  • i is 0, which is less than 3, and c is a child of currentNode, so you update currentNode and increment i.
  • i is 1, which is less than 3, and a is a child of currentNode, so you update currentNode and increment i.
  • i is 2, which is less than 3, and n is a child of currentNode, so you update currentNode and increment i.
  • i is now 3, which is not less than 3, so the loop ends.

This time, the return statement checks that i==data.length() is true, so it also checks currentNode.isEnd. In this case, that is true as well, so the method returns true && true, which is just true.

Finding Multiple Words in a Trie

Modern word processors and editors often suggest words to the user as they type. This feature relies on finding all the words that begin with the letters the user has already typed. You can do the same thing in your trie, traversing the trie to return a collection of words that start with the same prefix.

As before, let's look at the code and then dig into the analysis:

/**
 * Returns a list of items in the trie which contain the prefix given
 *
 * @param prefix The prefix to find
 * @return A List containing the items containing the prefix
 */
public List<T> containsPrefix(T prefix){
    Node<T> currentNode = this.root;

    int i = 0;
    while (i<prefix.length() && 
            currentNode.children.containsKey(prefix.charAt(i)))
        currentNode = currentNode.children.get(prefix.charAt(i++));

    // If the prefix isn't present, no need to go further
    if (i!=prefix.length()) return null;

    // Find everything from here down
    return containsPrefix(currentNode, prefix);
}

/**
 * Recursive helper method to find all prefix words in the trie
 *
 * @param currentNode What node do we start at
 * @param prefix What is the current prefix?
 * @return A List of items which are marked as final in the trie
 */
private List<T> containsPrefix(Node<T> currentNode, T prefix){
    ArrayList<T> results = new ArrayList<>();

    // Is the prefix a word? Make sure we add it
    if (currentNode.isEnd) results.add(prefix);

    // Follow all of its children to find words
    for (Character ch: currentNode.children.keySet())
        results.addAll(containsPrefix(currentNode.children.get(ch),
                   (T) prefix.toString().concat(ch.toString())));

    return results;
}

For this method, you are using a recursive helper to get the job done -- you'll see why in a minute.

The public method should look familiar -- it is almost the same code as the .contains() method. In fact, it's doing the same job, but when it gets to the final node in prefix, it calls the recursive helper, passing in the currentNode and the prefix to dig deeper into the trie. In effect, the public method is finding the first node where you can start traversing the trie.

The recursive private helper first creates a new ArrayList to hold all the words found under the currentNode. You check if this node is a word and add it to the results list if so.

Next, you check all the children of currentNode by getting the children's keys one by one from currentNode.children.keySet(). Recall that the keys in the .children map represent the next letters in valid prefixes in the trie, so you can use each key to find and visit all the child nodes from currentNode.

Once you know the character of the child node, you use that to:

  • Get the child node using currentNode.children.get(ch).
  • Add that character to the prefix using prefix.toString().concat()(ch.toString()).
  • Passing those into a recursive call to the private .containsPrefix() method.
  • Adding those results to the results list, which are returned after the loop ends.

So, how does this work in practice? Let's return to the example above and run the following:

System.out.println("Words starting with 'can':");
for (String s: (ArrayList<String>) trie.containsPrefix("can"))
    System.out.println("  " + s);

The public method will find can (as seen in the previous section), and call the private method, passing in the n node and the prefix set to can.

The private method starts by creating a new ArrayList, then it checks currentNode.isEnd. Since that is true, it adds can to results.

The loop gets all the children of currentNode, starting with e, and calls itself, passing in the e node and a new prefix set to cane.

The second call to the private method creates another ArrayList and adds cane to it since currentNode.isEnd is true for the e node. However, since it has no children, the loop never starts, so this instance ends, returning the results list.

The first call gets that return value and adds it to its version of results, which now holds can and cane. Since there are no more children of the n node, the loop ends, and that list is returned to the public method, which returns it to the caller:

Words starting with 'can':
  can
  cane

Deleting Data from a Trie

Of course, you can also remove data from a trie. Here's what that code might look like:

/**
 * Delete a word from the trie
 *
 * @param data The word to remove
 */
public void delete(T data){
    Node<T> currentNode = this.root;
    int i=0;

    // Walk down the trie to find the data
    while (i<data.length() 
            && currentNode.children.containsKey(data.charAt(i)))
        currentNode = currentNode.children.get(data.charAt(i++));

    // If data exists, mark the node as no longer being an end-point
    if (i==data.length())
        currentNode.isEnd = false;
}

The traversal code should look familiar -- it's doing the same thing here as it does in .add() and .contains(), making sure the prefix is a valid word in the trie. However, when the loop ends and you know you found a valid ending node, you simply mark that node as no longer being a valid end.

Why don't you remove that final node? Because it might be a valid prefix for other words. Let's look at an example:

trie.delete("can");

Since can is a valid prefix for the word cane, you can't delete the n node without losing all its children.

Summary: Tries in Java

Working with tries requires you to break apart your data so you can store the individual parts in nodes to act as prefixes for longer words. Most of the operations on tries can be done without recursion, with the exception of traversal. For this reason, most operations on tries are $`O(n)`$, where $`n`$ is the size of the word being added to or removed from the trie.