Poster

CipherBusters: Automatically Busting Classical Ciphers 👻

Team: John Chung, Alex Ding, Megan Frisella.

Final project for CSCI 1470: Deep Learning at Brown University, Fall of 2021.

Poster

Introduction

A cipher is a procedure that defines how to transform plaintext into ciphertext. Cracking a cipher is the task of reverse-engineering a given ciphertext to recover the plaintext without knowledge of the cipher. This is a popular adversarial task used to uncover flaws in encryption schemes with the goal of producing more robust ciphers. Statistical methods have proven powerful for decrypting a variety of simple ciphers using the concept of frequency analysis. Frequency analysis methods are particularly useful for 1:1 ciphers, which are ciphers where there is a 1:1 mapping between characters in the plaintext alphabet and ciphertext alphabet.

Substitution cipher falls under this category. Substitution cipher assigns a random 1:1 mapping of plaintext to ciphertext characters. Transformation from plaintext to ciphertext means replacing each character with the character it is mapped to. Analyzing the frequency of individual characters, pairs of characters (2-grams), or N-grams works well to decrypt this sort of ciphertext because the frequencies can be compared against known frequencies in the plaintext alphabet. Caesar’s cipher is a simplification of substitution cipher, where the 1:1 mapping is defined by shifting the alphabet over N characters. This cipher is considered easier to crack because the mapping between plaintext and ciphertext characters has a consistent offset, so if you figure out the mapping for one character (or N-gram) then you get all of them. Vigènere cipher is slightly more complicated, where multiple different Caesar ciphers are applied at different characters in the plaintext according to some key.

Plaintext	Caesar (key=3)	Vigènere (key=bust)	Substitution
the 1935 washington huskies football team was an american football team	wkh 4c68 zdvklqjwrq kxvnlhv irrwedoo whdp zdv dq dphulfdq irrwedoo whdp	u1w kann fbcz1o0b7o 1cbl2wb g86ccu34 uys5 xua to u4xs2uto z67uvs4m dwtn	j1p hial tcf15v0jqv 1bfs5pf uqqj2c66 jpco tcf cv cop85dcv uqqj2c66 jpco

Table 1. Examples of all three cipher schemes used in this paper.

You might ask, “How do we know which decryption method to apply if we don’t know which cipher we are trying to crack?” Our work aims to make headway on this problem by utilizing deep learning to create a generalized algorithm that can reliably decrypt multiple different ciphers.

Related Work

The idea of taking advantage of the generalizability of deep learning to crack ciphers has been explored in the past. In particular, Focardi et al. trained various simple feed-forward networks using letter frequencies to crack Caesar cipher, Vigènere, and arbitrary substitution ciphers. David Z. expanded on the work to also include affine cipher. Their methodology requires preexisting knowledge about the specific type of cipher, as the model architectures are distinct for different cipher types. Kopal trained a feed-forward network using manually engineered features to classify between different types of ciphers. We build on their work to evaluate whether more powerful sequential models can generalize the decryption pipeline such that a single model can handle cipher identification and decryption of various different simple ciphers while maintaining good performance in these ciphers.

We aim to develop an even more generalized algorithm by removing the statistical analysis. We hypothesize that a sequential encoding of the ciphertext will mimic this step by learning latent frequency information. Although we don’t expect as high of accuracy as classic statistical methods, a generalized cipher decryption algorithm has much potential for academic interest (i.e. to see how powerful deep learning methods are when applied to increasingly generalized tasks) and practical importance because we may not know which cipher we are trying to crack (e.g. substitution vs. Vigènere) so a generalized decryption algorithm is able to handle multiple cases. We restrict the domain of ciphers we are trying to crack to Caesar, substitution, and Vigènere. We begin with easier tasks (e.g. decrypting one particular Caesar cipher) on simple machine translation algorithms as a proof of concept. As we increase the generality of the task (e.g. decrypting any arbitrary Caesar cipher), we move on to more robust machine translation models.

Methodology

Our approach is to identify various decryption tasks of varying difficulty/generality, then develop a machine translation model that is successful on each task. As the tasks increase in generality, we change the architecture to make the model more robust if need be. We identify the following tasks, in order of increasing generality.

Decrypt ciphertext guaranteed to come from one particular Caesar cipher.
Decrypt ciphertext that may come from an arbitrary Caesar cipher (“All Caesar” task).
Decrypt ciphertext that may come from any arbitrary substitution cipher.
Decrypt ciphertext that may come from any arbitrary Vigènere cipher.
Decrypt ciphertext that may come from any arbitrary substitution or Vigènere cipher.

For all tasks, we use a Wikipedia corpus of English text as our body of plaintext. In total, we sample 20,000 Wikipedia articles for a total of 58,222,708 characters. We reduce the character space to 26 English letters, 10 Arabic numbers, and whitespace by lowercasing, normalizing spacing between words, and removing all other characters. We then tokenize and one-hot encode fixed-sized windowed character sequences as matrices. For the train-test split, we use an 80:20 split.

The plaintext serves as labels for the corresponding ciphertext, which is enciphered differently according to the task. Tasks 1 and 2 use Caesar ciphers, so we enciphered the entire corpus with every Caesar cipher, of which there are 36. Whitespaces characters are ignored during enciphering. For task 1, we randomly choose one Caesar cipher and generate batches of train/test data using the plaintext and the corresponding ciphertext for this cipher. For task 2, we generate batches of train/test data for every cipher then randomly shuffle these together. Thus, each batch comes from one Caesar cipher but the model sees every Caesar cipher during training. For tasks 3, 4, and 5, it is unreasonable to encipher the entire corpus with every substitution and Vigènere cipher because the keyspace is very large (there are 36! keys for substitution cipher and infinitely many for Vigènere). Instead, to generate train/test batches we iterate through batches plaintext and randomly choose a key to encipher each batch. This introduces a training/testing bottleneck because the data is generated on the spot, but it allows the model to see arbitrarily many different substitution and/or Vigènere ciphers during training without having to store ciphertext for arbitrarily many ciphers.

Figure 1. Our overall pipeline for models [c], [d], and [e]. The encoder and decoder layers can be either LSTMs or transformers. Models [a] and [b] follow the same pipeline, except they do not have the encoder layer.

Figure 1. Our overall pipeline for models [c], [d], and [e]. The encoder and decoder layers can be either LSTMs or transformers. Models [a] and [b] follow the same pipeline, except they do not have the encoder layer.

In the following, we will describe the progression of model architectures that we built in accordance with increasingly difficult/generalized tasks. We will hint at model performance to motivate the changes to architectures, however we will provide more formal coverage of the results in the next section. We first develop a simple RNN architecture (model [a]) and simple transformer architecture (model [b]). Each consists of one LSTM or transformer layer and a few fully connected layers. Both quickly learn task 1 but are not as robust to task 2, especially the simple RNN.

Next, we make the RNN and transformer more robust by adding encoder and decoder layers, which are two LSTMs for the RNN (model [c]) and two transformer layers for the transformer (model [d]), followed by a few fully connected layers. Both learn task 2 with high accuracy but are not as robust to tasks 3 or 4. To address this, we make the transformer more robust by adding one additional transformer layer each to the encoder and decoder, still followed by multiple fully connected layers (model [e]).

An important hyperparameter for our machine translation task is window size. The basic intuition is that a larger window size may elicit better performance because the model can learn more reliable information about the frequencies of ciphertext characters if more ciphertext is supplied. We ran an experiment with model [c] on task 2 where we varied window size and kept all other hyperparameters constant to determine the optimal window size.

Results

For all results, the reader can assume that we train models for 20 epochs with a learning rate of 1e-3. Our first task is decrypting one particular Caesar cipher. Our second task is decrypting any arbitrary Caesar cipher, of which there are 36. In Figure 2, we present results from 36 experiments, where each represents the task of decrypting N particular Caesar ciphers. Clearly, N=1 is task 1 and N=36 is task 2. These experiments are trained and tested on the simple RNN for a window size of 50. We observe a negative trend in model performance as the number of Caesar ciphers we train the model to decrypt grows. The simple RNN achieves outstanding performance on task 1 but is not robust to task 2. This motivates us to explore the simple transformer and an encoder-decoder based RNN and transformer to achieve better performance on task 2.

Figure 2. Simple RNN model performance significantly declines as we train it on an increasing number of ciphers.

Figure 2. Simple RNN model performance significantly declines as we train it on an increasing number of ciphers.

Figure 3 presents results for task 2 on the simple RNN, simple transformer, RNN, and transformer. All experiments are trained and tested on a window size of 50. The encoder-decoder architecture of the RNN and transformer, outlined in the pipeline in Figure 1, offers the best performance on this task. For this reason, we move forward with these architectures for tasks 3 and 4.

Encrypted	Decrypted
mtzwzrt343 nzyqt2x zn4z053p3 3xl24 m54 wtvp 4sp9 34tww zn4z053p3 wp43 yz4 rp4 lsplo zq z523pw6p3	biologists confirm octopuses smart but like they still octopuses lets not get ahead of ourselves
fgy60d ywd7ea9 d0fdwyfe x4ff0d d08wd6e w1f0d 149w77k 20ff492 wyy0bf0z fa xdai9	tucker carlson retracts bitter remarks after finally getting accepted to brown
qjy ymj wzqnsl hqfxxjx ywjrgqj fy f htrrzsnxynh wj0tqzynts	let the ruling classes tremble at a communistic revolution

Table 2. Examples of trained transformer decrypting arbitrary Caesar ciphers.

Figure 3. The encoder-decoder architectures exhibit the best performance on the “All Caesar” task.

Figure 3. The encoder-decoder architectures exhibit the best performance on the “All Caesar” task.

Before presenting results for tasks 3 and 4, we motivate our chosen window size of 50 by the following experiment. In Figure 4, we present results from 13 window size experiments, where each experiment is the test accuracy of the model [c] on the “All Caesar” task. Each experiment has a different window size. We observe that the optimal window size is ~64. Window sizes larger than this maximum begin to see decreases in performance, which we did not expect. This contradicts our intuition that the model learns more reliable information about the frequencies of ciphertext characters if more ciphertext is supplied.

Figure 4. The optimal window size for the RNN model on the 36 Caesar cipher task is ~64. Window size does not always positively vary with accuracy.

Figure 4. The optimal window size for the RNN model on the 36 Caesar cipher task is ~64. Window size does not always positively vary with accuracy.

Finally, we present results for tasks 3 and 4. Figure 5 displays performance for the transformer (model [d]) and robust transformer (model [e]) on task 3 (substitution) and task 4 (Vigènere). Worse performance on substitution compared to Vigènere indicates that the substitution task is more difficult than the Vigènere task. This is expected because Vigènere is a complexification of Caesar’s cipher, for which the models have shown extremely high performance. Overall, the robust transformer does not offer major performance gain compared to the transformer. Although 56% accuracy on Vigènere has merit given the infinite keyspace of Vigènere ciphers, it is not nearly high enough to produce sensible plaintext. Thus, we believe that these methods have reached their limit with regard to generalizing simple decryption tasks.

Figure 5. The transformer and robust transformer exhibit better performance on the Vigènere task compared to the substitution task. The robust transformer only offers slight performance gains.

Figure 5. The transformer and robust transformer exhibit better performance on the Vigènere task compared to the substitution task. The robust transformer only offers slight performance gains.

Challenges

Our challenges pertained to our model architecture and training methods. Our first major challenge regarding model architecture was that teacher forcing failed during inference time. Initially, we built encoder-decoder RNN and transformer models that used teacher forcing during training (i.e. the model saw the plaintext during training). When testing these models without teacher forcing during inference time, they did not generalize to new ciphertext. To overcome this challenge, we started from scratch with the simple RNN and simple transformer architectures, which return probability distributions over one-hot encodings for input ciphertext without ever seeing the plaintext. This method was significantly more successful. We eventually returned to an encoder-decoder architecture without teacher forcing. Additionally, we had to determine the best window size for our RNN model that struck a balance between accuracy and memory. Finally, our models did not perform as well on the generalized tasks.

As for our training methods, we could not encode the training plaintext for the substitution and Vigènere tasks in the way we did with the Caesar task because the key space for these ciphers is enormous. Instead, we dynamically generated training data for each batch by randomly choosing a cipher and encoding a batch of plaintext. Although this method introduced a training bottleneck, the model was able to see arbitrarily many different ciphers during training without significant storage needs.

Conclusion

We accomplished Tasks 1 and 2 outlined in our Methodology. Our best models achieve near-perfect accuracy for the Caesar cipher task. However, our models underperformed for Tasks 3 and 4, exhibiting only 38% accuracy for the substitution cipher task and 56% accuracy for the Vigènere cipher task. For this reason, our model certainly cannot generalize to Task 5.

Tasks 3 and 4 were unsuccessful because we incorrectly hypothesized that a larger window size leads to better performance. Figure 4 suggests that the model is not learning latent information about frequencies. Perhaps while the encoder guesses the key and the decoder iterates through each character of the ciphertext, the LSTM slowly loses information about the encoded key. Additionally, our model exhibited better performance for the Vigènere cipher task than the substitution cipher task because of differences between the ciphers’ complexities. The Vigènere cipher, an extension of the Caesar cipher, shifts each character by some arbitrary amount. However, the substitution cipher makes a random 1:1 mapping between plaintext and ciphertext characters, which makes it more complicated and difficult to decrypt.

One direction of future work is to add an adversarial component to our training pipeline to punish implausible decryptions. As it is, our model often devolves into guessing the most common letters during training. The adversarial loss can help mitigate such loss-gaming behavior. Another direction of work is to handcraft features about ciphertext, such as frequency analysis, and feed that as part of the input to the model. This may lead to better performance by limiting the size of the problem and encouraging more probable outcomes. Coupling the model with an adversarial component and initial frequency analysis may address the underperformance seen in the generalization tasks.

Bibliography

Focardi, R. and Luccio, F.L., 2018. Neural Cryptanalysis of Classical Ciphers. In ICTCS (pp. 104-115).
Kopal, N., 2020, May. Of Ciphers and Neurons–Detecting the Type of Ciphers Using Artificial Neural Networks. In Proceedings of the 3rd International Conference on Historical Cryptology HistoCrypt 2020 (No. 171, pp. 77-86). Linköping University Electronic Press.
Z., D., 2021, February. An Exposition of Neural Cryptanalysis of Classical Ciphers.

Check-In 1

Introduction

Traditionally, algorithms relying on statistical methods like frequency analysis have been used to decrypt simple ciphers, such as Caesar ciphers and Vigenere ciphers. We are interested in using deep learning as an alternate approach to these decryption tasks because they have the potential to generalize better than any one statistical method can. We hope to first find out whether neural networks can mimic the frequency-based technique used to solve one particular cipher. If so, we hope to develop an architecture that can generalize to multiple ciphers (e.g. a model that can decrypt both Caesar and Vigenere ciphers). This is a machine translation task, where our input-label pairs are enciphered and deciphered english text.

Related Work

Focardi et al. trained various simple feed-forward networks using letter frequencies to crack Caesar cipher, Vigenère, and arbitrary substitution ciphers. David Z. expanded on the work to also include affine cipher. Their methodology requires preexisting knowledge about the specific type of cipher, as the model architectures are distinct for different cipher types. Kopal trained a feed-forward network using manually engineered features to classify between different types of ciphers. We build on their work to evaluate whether more powerful sequential models can generalize the decryption pipeline such that a single model can handle cipher ID and decryption of various different ciphers while maintaining good performance in these simpler ciphers.

Focardi, R. and Luccio, F.L., 2018. Neural Cryptanalysis of Classical Ciphers. In ICTCS (pp. 104-115).
https://www.cs.umd.edu/~gasarch/TOPICS/cryptoml/DavidZMLcrypto.pdf
Kopal, N., 2020, May. Of Ciphers and Neurons–Detecting the Type of Ciphers Using Artificial Neural Networks. In Proceedings of the 3rd International Conference on Historical Cryptology HistoCrypt 2020 (No. 171, pp. 77-86). Linköping University Electronic Press.

Data

We will use a standard corpus of English text (e.g. a Wikipedia corpus) to create our input data and labels. We will implement a few simple ciphers (Caesar cipher, general substitution cipher, Vigenere cipher) in order to generate enciphered text as input for our various tasks (to be described below). The corresponding labels are English plaintext. The corpus will be relatively large, so this will take a considerable amount of preprocessing time since we would like to encrypt English plaintext with multiple ciphers.

Methodology

Our project consists of multiple stages. At each stage, we will increase the complexity of the machine translation task by generalizing the types of enciphered text we expect the model to decrypt. The specific task-stages include

decrypt inputs guaranteed to come from one particular Caesar cipher,
decrypt inputs that may come from an arbitrary Caesar cipher,
decrypt inputs that may come from an arbitrary Vigenere cipher,
decrypt inputs that may come from any arbitrary substitution cipher, and
decrypt inputs that may come from any arbitrary Caesar, affine, or substitution cipher.

We would like to begin with a simple model and, once it can no longer generalize to a new task, we will consider a more robust architecture. In particular, we plan to start with an RNN to decrypt inputs from one particular cipher. As we move on to later task-stage, we plan to switch to a transformer model because it encodes the entire sequence rather than considering only the previous characters like and RNN does. By encoding the entire sequence, we hope that the model learns something about the character frequencies, including single-character frequencies and n-gram character frequencies. If this model fails to generalize at any task-stage, we will graduate to using a GAN to discriminate between decrypted text and actual plaintext. We hope that this will encourage the transformer to generate decodings that are believable english text.

Regardless of the model, we will train in batches on the input-label pairs described previously. Our loss will penalize the model according to the number of characters that are mismatched between the predicted decryption and the plaintext. If we find that our models fail to perform well on any of our prediction tasks or fail to generalize, we will implement the statistical/deep learning algorithm described in the related work.

Metrics

Accuracy is character based for each of our models. We train on consistent window sizes across models, so calculating the number of correct characters per window size will give a consistent measure. For task 2 (tasks 3 and 4 also?), we compare the model’s accuracy to the statistical model from the paper. For all other tasks, we will decide on an acceptable “good” accuracy and compare it to the accuracy of our model. Success is determined by these comparisons. Our base goal is to achieve good accuracy on tasks 1 and 2. Our target goal is to achieve good accuracy on tasks 1-4. Our stretch goal is to achieve good accuracy on task 5.

Ethics

Our dataset will be an English corpus from Wikipedia, which we will then process into input/label pairs. A possible concern is that the entries from Wikipedia may reflect the biases of the volunteers who wrote or edited them.
For each of our three tasks, we plan to measure success by comparing the model’s accuracy to the expectation we have for it. A possible concern is that apart from task 2, we have no baseline for how well the model should be performing, so our threshold for success may be too low or too high.

Division of Labor

John will be responsible for pre-processing. This includes cleaning the English corpus obtained from Wikipedia, implementing ciphers to encrypt the data, and generating data for the statistical model from the paper. Alex will be responsible for implementing the RNN that will decrypt a particular Caesar cipher and the baseline frequency-based feed-forward model used in related works. Megan will be responsible for implementing the Transformer that will be used to decrypt various ciphers. After we finish the first tasks (and if the transformer model proves insufficient for the generalization task), we will work on the GAN model together.

Check-In 2

Challenges

Model architecture:

Teacher forcing failed during inference time. When testing our RNN and Transformer models without teacher forcing, they did not generalize to new ciphertext. This led us to construct the simple RNN and simple transformer architectures, which return “probability distributions” over one-hot encodings for input ciphertext without ever seeing the plaintext. This method was much more successful.
Adding a discriminator module? As we move on to more generalized tasks (e.g. general substitution cipher, vigenere cipher) we may need a stronger architecture. Adding a discriminator (effectively GAN training) may encourage the generator to produce more English-looking plaintext. But GAN training may be quite finicky...

Training method:

How to train on multiple Caesar task. We decided to encode our training plaintext 36 different times (this represents all possible Caesar ciphers) and train in batches where each batch comes entirely from one cipher. This method is not feasible for more complicated ciphers…
How to train on multiple substitution task. It is not feasible to encode our training plaintext with all possible substitution ciphers (there are 36! of these). We plan to dynamically generate training data per batch by randomly choosing a substitution cipher and encoding a batch of plaintext. This introduces a training bottleneck due to processing each batch, but means that the model can see arbitrarily many different substitution ciphers during training without significant storage needs.

Insights

The task is similar to language translation, but not exactly the same. Teacher forcing causes the model to learn how to predict a likely next character given previous characters, but the model effectively learns how to generate probable English sentences, instead of learning a decryption algorithm. To achieve good performances, we had to impose some structure to the model--namely that each input character maps to exactly one output character; in other words, we had to sacrifice the generalizability of our architecture to cipher schemes that do not preserve sequence lengths to get competitive performance on our ciphers.

We also experimented between simple vs. encoder-decoder architecture. After having success using the simple RNN and simple Transformer architectures, we were curious if using an encoder-decoder architecture without teacher forcing would further boost performance, and we did manage to improve the accuracy from 96.3% to 99.4%. We also ran qualitative experiments with the trained models, trying to decrypt arbitrary Caesar-enciphered texts, and the model performs well. Our architecture allows the model to take in arbitrary input lengths (even though the training data are split in fixed-size windows), and it seems like longer inputs result in better decryption. We plan on exploring this further.

Models’ accuracies on all Caesar cipher tasks:

Simple RNN (not encoder-decoder architecture): 89%
Simple Transformer (not encoder-decoder architecture): 96%
Encoder-decoder Transformer: 99.4%
Adversarial Transformer: 99.1%

We expected high accuracies for the Caesar cipher task. Therefore, these models performed in line with our expectations. The additional adversarial module in the adversarial transformer does not seem to offer additional performance boosts, but the two models already achieve near perfect results, so the module could prove helpful for more complicated ciphers.

Plan

Are you on track with your project?

Yes. We have already implemented Vigenere’s and general substitution ciphers, and we are planning to run experiments with the models we’ve implemented as soon as we write the data loading code.

What do you need to dedicate more time to?

We plan to spend more time writing dynamic data loaders for substitution and Vigenere ciphers. Then, we just need to spend most of our time running experiments. We also need to run qualitative experiments exploring the various insights we’ve had.

What are you thinking of changing, if anything?

Other than the changes we’ve discussed above, we are sticking to the plan.

Built With

python
tensorflow

CipherBusters: Automatically Busting Classical Ciphers 👻

Introduction

Related Work

Methodology

Results

Challenges

Conclusion

Bibliography

Check-In 1

Introduction

Related Work

Data

Methodology

Metrics

Ethics

Division of Labor

Check-In 2

Challenges

Insights

Plan

Built With

Updates