Natural Language Processing (NLP), a branch of Artificial Intelligence, is all about making machines understand and process human language. This is not an easy task, since machines work with numbers, not text. 💻 NLP is such a vast and widely studied field that every now and then we hear of a new advancement in the domain. Researchers are trying hard to make machines understand human language and the context behind it.
One of the main roles in understanding human language is played by tokenizers. Tokenization algorithms can be word-, subword-, or character-based. Each type of tokenizer helps machines process text in a different way, and each has advantages over the others. If you want to know more about the different types of tokenizers used in NLP, the hands-on TDS tutorial below will give you a good understanding of the topic. 😇
Word, Subword, and Character-Based Tokenization: Know the Difference
The most popular among these is the subword-based tokenizer, used by most state-of-the-art NLP models. So let's first get to know what subword-based tokenizers are, and then understand the Byte-Pair Encoding (BPE) algorithm used by many of these models. 🙃
Subword-based tokenization
Subword-based tokenization sits between word-based and character-based tokenization. 😎 The main idea is to solve the issues faced by word-based tokenization (very large vocabulary size, large number of OOV tokens, and different meanings of very similar words) and character-based tokenization (very long sequences and less meaningful individual tokens).
Subword-based tokenization algorithms do not split frequently used words into smaller subwords. Rather, they split rare words into smaller, meaningful subwords. For example, "boy" is not split, but "boys" is split into "boy" and "s". This helps the model learn that the word "boys" is formed from the word "boy", with a slightly different meaning but the same root word.
Some of the popular subword tokenization algorithms are WordPiece, Byte-Pair Encoding (BPE), Unigram, and SentencePiece. We will go through Byte-Pair Encoding (BPE) in this article. BPE is used in language models like GPT-2, RoBERTa, XLM, FlauBERT, etc. A few of these models use space tokenization as the pre-tokenization method, while a few use more advanced pre-tokenization methods provided by Moses, spaCy, and ftfy. So, let’s get started. 🏃
Byte-Pair Encoding (BPE)
BPE is a simple data compression algorithm in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur in that data. It was first described in the 1994 article "A New Algorithm for Data Compression". The example below, taken from Wikipedia, will explain BPE.
Suppose we have data aaabdaaabac which needs to be encoded (compressed). The byte pair aa occurs most often, so we will replace it with Z, as Z does not occur in our data. We now have ZabdZabac where Z = aa. The next common byte pair is ab, so let’s replace it with Y. We now have ZYdZYac where Z = aa and Y = ab. The only byte pair left is ac, which appears just once, so we will not encode it. We can use recursive byte pair encoding to encode ZY as X. Our data has now transformed into XdXac where X = ZY, Y = ab, and Z = aa. It cannot be compressed further, as there are no byte pairs appearing more than once. We decompress the data by performing the replacements in reverse order.
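As a sketch, the compression steps above can be written in a few lines of Python. The function names and the tie-breaking rule (among equally frequent pairs, the lexicographically larger one wins, which happens to reproduce the walkthrough) are my own assumptions, not part of the original algorithm:

```python
from collections import Counter

def compress(data, fresh_symbols="ZYX"):
    """Repeatedly replace the most frequent byte pair with a fresh symbol."""
    table = {}
    for symbol in fresh_symbols:
        pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
        if not pairs:
            break
        # Pick the most frequent pair; break ties toward the larger pair.
        pair, count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:          # no pair repeats -> nothing left to compress
            break
        data = data.replace(pair, symbol)
        table[symbol] = pair
    return data, table

def decompress(data, table):
    """Undo the replacements in reverse order."""
    for symbol, pair in reversed(table.items()):
        data = data.replace(symbol, pair)
    return data

encoded, table = compress("aaabdaaabac")
print(encoded)  # XdXac
print(table)    # {'Z': 'aa', 'Y': 'ab', 'X': 'ZY'}
print(decompress(encoded, table))  # aaabdaaabac
```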
A variant of this is used in NLP. Let us understand the NLP version of it together. 🤗
BPE ensures that the most common words are represented in the vocabulary as a single token, while rare words are broken down into two or more subword tokens. This is exactly what a subword-based tokenization algorithm should do.
Suppose we have a corpus that contains the words (after pre-tokenization based on space) – old, older, finest, and lowest – and we count the frequency of occurrence of these words in the corpus. Suppose the frequencies are as follows:
{"old": 7, "older": 3, "finest": 9, "lowest": 4}
Let us add a special end-of-word token "</w>" at the end of each word.
{"old</w>": 7, "older</w>": 3, "finest</w>": 9, "lowest</w>": 4}
The "</w>" token at the end of each word is added to identify a word boundary, so that the algorithm knows where each word ends. This helps the algorithm look through each character and find the highest-frequency character pairing. I will explain this part in detail later, when we include "</w>" in our byte pairs.
Next, we split each word into characters and count their occurrences. The initial tokens will be all the characters and the "</w>" token.
{"</w>": 23, "o": 14, "l": 14, "d": 10, "e": 16, "r": 3, "f": 9, "i": 9, "n": 9, "s": 13, "t": 13, "w": 4}
Since we have 23 words in total, we have 23 "</w>" tokens. The second most frequent token is "e". In total, we have 12 different tokens.
The next step in the BPE algorithm is to look for the most frequent pair, merge it, and repeat the same iteration again and again until we reach our token limit or iteration limit.
Merging lets us represent the corpus with the smallest number of tokens, which is the main goal of the BPE algorithm: compression of data. To merge, BPE looks for the most frequent byte pair. Here, we consider a character to be the same as a byte; this is the case for English and can vary for other languages. We merge the most common byte pair into one token, add it to the list of tokens, and recalculate the frequency of occurrence of each token. This means our frequency counts change after each merging step. We keep on merging until we hit the number of iterations or reach the token limit.
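The pair counting and the merge step described above can be sketched as follows, assuming the pre-tokenized word frequencies from our toy corpus; the helper names are illustrative, not from any library. Running one step performs exactly the first merge:

```python
from collections import Counter

# Each word is pre-split into characters plus the end-of-word marker </w>.
corpus = {
    ("o", "l", "d", "</w>"): 7,
    ("o", "l", "d", "e", "r", "</w>"): 3,
    ("f", "i", "n", "e", "s", "t", "</w>"): 9,
    ("l", "o", "w", "e", "s", "t", "</w>"): 4,
}

def count_pairs(corpus):
    """Count every adjacent token pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with a single merged token."""
    new_corpus = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

best = count_pairs(corpus).most_common(1)[0]
print(best)  # (('e', 's'), 13)
corpus = merge_pair(corpus, best[0])
```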
Iterations
Iteration 1: We will start with the second most frequent token, "e". The most frequent byte pair in our corpus involving "e" is "e" and "s" (in the words finest and lowest), which occurs 9 + 4 = 13 times. We merge them to form a new token "es" and note down its frequency as 13. We also subtract 13 from the counts of the individual tokens "e" and "s", which tells us how many standalone "e" and "s" tokens are left: "s" no longer occurs alone at all, and "e" occurs 3 times. Here is the updated table:
{"</w>": 23, "o": 14, "l": 14, "d": 10, "e": 3, "r": 3, "f": 9, "i": 9, "n": 9, "s": 0, "t": 13, "w": 4, "es": 13}
Iteration 2: We will now merge the tokens "es" and "t" as they have appeared 13 times in our corpus. So, we have a new token "est" with frequency 13 and we will reduce the frequency of "es" and "t" by 13.
{"</w>": 23, "o": 14, "l": 14, "d": 10, "e": 3, "r": 3, "f": 9, "i": 9, "n": 9, "s": 0, "t": 0, "w": 4, "es": 0, "est": 13}
Iteration 3: Let’s now work with the "</w>" token. We see that the byte pair "est" and "</w>" occurred 13 times in our corpus.
{"</w>": 10, "o": 14, "l": 14, "d": 10, "e": 3, "r": 3, "f": 9, "i": 9, "n": 9, "s": 0, "t": 0, "w": 4, "es": 0, "est": 0, "est</w>": 13}
Note: Merging the stop token "</w>" is very important. It helps the algorithm understand the difference between words like "estimate" and "highest". Both words have "est" in common, but one has it at the start and the other at the end. Thus, tokens like "est" and "est</w>" are handled differently. If the algorithm sees the token "est</w>", it knows that it belongs to a word ending in "est" (like "highest") and not to a word like "estimate".
Iteration 4: Looking at the other tokens, we see that the byte pair "o" and "l" occurred 7 + 3 = 10 times in our corpus.
{"</w>": 10, "o": 4, "l": 4, "d": 10, "e": 3, "r": 3, "f": 9, "i": 9, "n": 9, "s": 0, "t": 0, "w": 4, "es": 0, "est": 0, "est</w>": 13, "ol": 10}
Iteration 5: We now see that the byte pair "ol" and "d" occurred 10 times in our corpus.
{"</w>": 10, "o": 4, "l": 4, "d": 0, "e": 3, "r": 3, "f": 9, "i": 9, "n": 9, "s": 0, "t": 0, "w": 4, "es": 0, "est": 0, "est</w>": 13, "ol": 0, "old": 10}
If we now look at our table, we see that the frequencies of "f", "i", and "n" are all 9, but we have just one word with these characters, so we are not merging them. For the sake of simplicity, let us now stop our iterations and look closely at our tokens.
{"</w>": 10, "o": 4, "l": 4, "e": 3, "r": 3, "f": 9, "i": 9, "n": 9, "w": 4, "est</w>": 13, "old": 10}
The tokens with a frequency count of 0 have been removed from the table. We can now see that the total token count is 11, which is less than our initial count of 12. This is a small corpus, but in practice the size reduces a lot. This list of 11 tokens will serve as our vocabulary.
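As a sketch, the five iterations above can be reproduced with a small training loop (the names here are illustrative, not from any library):

```python
from collections import Counter

# The toy corpus: each word pre-split into characters plus the marker </w>.
corpus = {
    ("o", "l", "d", "</w>"): 7,
    ("o", "l", "d", "e", "r", "</w>"): 3,
    ("f", "i", "n", "e", "s", "t", "</w>"): 9,
    ("l", "o", "w", "e", "s", "t", "</w>"): 4,
}

for step in range(5):                      # five merges, as in the walkthrough
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):   # count adjacent token pairs
            pairs[(a, b)] += freq
    pair, count = pairs.most_common(1)[0]  # most frequent pair wins
    print(f"merge {step + 1}: {pair}, seen {count} times")
    new_corpus = {}
    for word, freq in corpus.items():      # fuse the pair into one token
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] = freq
    corpus = new_corpus

# Count the surviving tokens: 11 remain, with "est</w>" at frequency 13.
vocab = Counter()
for word, freq in corpus.items():
    for token in word:
        vocab[token] += freq
print(len(vocab))        # 11
print(vocab["est</w>"])  # 13
```

The merges printed by the loop are exactly the five from the walkthrough: ("e", "s"), ("es", "t"), ("est", "</w>"), ("o", "l"), and ("ol", "d").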
You may also have noticed that as we add tokens, the total token count sometimes increases, sometimes decreases, and sometimes stays the same. In practice, the token count first increases and then decreases. The stopping criterion can be either the token count or the number of iterations, chosen so that our dataset can be broken down into tokens in the most efficient way.
Encoding and Decoding
Let us now see how to decode our example. To decode, we simply concatenate all the tokens together to get back the whole words. For example, the encoded sequence ["the</w>", "high", "est</w>", "range</w>", "in</w>", "Seattle</w>"] will be decoded as ["the", "highest", "range", "in", "Seattle"] and not as ["the", "high", "estrange", "in", "Seattle"]. Notice the presence of the "</w>" token in "est</w>".
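A minimal decoding sketch, assuming the "</w>" convention used above:

```python
# Decoding is plain concatenation: the </w> markers tell us where words end.
def decode(tokens):
    return "".join(tokens).replace("</w>", " ").split()

print(decode(["the</w>", "high", "est</w>", "range</w>", "in</w>", "Seattle</w>"]))
# ['the', 'highest', 'range', 'in', 'Seattle']
```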
For encoding new data, the process is again simple; however, encoding in itself is computationally expensive. Suppose the sequence of words is ["the", "highest", "range", "in", "Seattle"]. We iterate through all the tokens we found in our corpus, from longest to shortest, and try to replace substrings in our sequence of words with these tokens. Eventually, all substrings that match tokens in our token list get replaced. If some substrings are left over (for words our model did not see in training), we replace them with unknown tokens.
In general, the vocabulary size is big, but there is still the possibility of an unknown word. In practice, we save the tokenizations of pre-tokenized words in a dictionary. For unknown (new) words, we apply the encoding method above to tokenize the new word and add its tokenization to the dictionary for future reference. This way, our vocabulary keeps getting stronger.
Isn’t it greedy? 🤔
To represent the corpus in the most efficient way, BPE looks at the frequency of every potential merge option at each iteration and picks the most frequent one. So, yes, it follows a greedy approach to optimize for the best possible solution.
Anyway, BPE is one of the most widely used subword tokenization algorithms, and it performs well despite being greedy. 💃
I hope this article helped you understand the idea and logic behind the BPE algorithm. 😍
References:
- https://aclanthology.org/P16-1162.pdf
- https://huggingface.co/transformers/tokenizer_summary.html
- https://www.drdobbs.com/a-new-algorithm-for-data-compression/184402829
- https://en.wikipedia.org/wiki/Byte_pair_encoding
Thank you, everyone, for reading this article. Do share your valuable feedback or suggestion. Happy reading! 📗 🖌




