Skip to content

andres-erbsen/engcode

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

engcode - binary2english

Motivation

Words, even unfamiliar ones, are easier to remember than numbers or cryptic strings.

XKCD: Password Strength

Algorithm

Each word represents its 0-based index in the alphabetically ordered word list. Assuming that the word list has 2^k entries, encoding binary data can be done as follows:

  • Break the input into groups of k bits.
  • Interpret each group as an unsigned k-bit integer where the most significant bit is the first one. Lets denote this value n.
  • Output nth word from the word list.

Decoding is similar:

  • For each word in the input, find its position in the word list
  • Encode it as an unsigned k-bit integer (MSB first again)
  • Concatenate values from all groups

Implementations of the described procedures for k=18 are in engcode.py and engdecode.py. In the case of k=18, 9 bytes can be encoded using 4 words. There is a much more efficient encoder implemented in engcode.c that uses this fact. Decoding could also be implemented faster, for example using a precalculated prefix trie of words with their numbers in leaves.

The bits left over

If the number of bits to be encoded is not divisble by k, the missing at the end are assumed to be 0. This means that encodings of 0b1 and 0b10 are indistinguishable. To avoid this, one could either transmit the length separately or use a padding scheme as for block ciphers.

The word list

Based on Yet Another Word List by Mendel Cooper and Alan Beale from 12-Oct-2008, a list of 2^18 words was generated by selecting 2^18 shortest words and breaking the ties in favor of including words earlier in the alphabetical ordering. The resulting list starts with words aa and aah, ends with zyzzyvas and zzz and when delimited with newlines has an MD5 checksum of 4e0766283a0b1892b2ecbc28691e64b7. The script used is prepare.sh.

Legal...

Yet Another Word List was released to the public domain and so is the derived list and this specification. The code accompanying it is GPL3.

About

Encode binary data as english words

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors