Words, even unfamiliar ones, are easier to remember than numbers or cryptic strings.
Each word represents its 0-based index in the alphabetically ordered word list. Assuming that the word list has 2^k entries, encoding binary data can be done as follows:
- Break the input into groups of k bits.
- Interpret each group as an unsigned k-bit integer where the most significant bit is the first one. Lets denote this value n.
- Output nth word from the word list.
Decoding is similar:
- For each word in the input, find its position in the word list
- Encode it as an unsigned k-bit integer (MSB first again)
- Concatenate values from all groups
Implementations of the described procedures for k=18 are in engcode.py
and engdecode.py. In the case of k=18, 9 bytes can be encoded
using 4 words. There is a much more
efficient encoder implemented in engcode.c that uses this fact.
Decoding could also be implemented faster, for example using a precalculated
prefix trie of words with their numbers in leaves.
If the number of bits to be encoded is not divisble by k, the missing
at the end are assumed to be 0. This means that encodings of 0b1 and
0b10 are indistinguishable. To avoid this, one could either transmit the length separately or use a padding scheme as for block ciphers.
Based on Yet Another Word List by Mendel Cooper and Alan Beale from
12-Oct-2008, a list of 2^18 words was generated by
selecting 2^18 shortest words and breaking the ties in favor of
including words earlier in the alphabetical ordering. The resulting list
starts with words aa and aah, ends with zyzzyvas and zzz and
when delimited with newlines has an MD5 checksum of
4e0766283a0b1892b2ecbc28691e64b7. The script used is prepare.sh.
Yet Another Word List was released to the public domain and so is the derived list and this specification. The code accompanying it is GPL3.
