For each guess, you get a pattern consisting of black, yellow, and green colours, one for each letter of the guess.
A guess of “PARSE” would return a hint with a pattern, like this:
| Word | Meaning |
|---|---|
| Hidden word | A word that could be the answer, but it is hidden from the player |
| Test word | A word that can be used as a guess |
| Pattern | The hint given by the game after making a guess |
| “Easy” mode | The original game where the set of test words is not restricted |
| “Hard” mode | A variation where the set of test words gets smaller depending on the previous guesses and patterns. In this variation, you can only use guesses that match the patterns provided |
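As a concrete illustration of how a pattern is produced, here is a minimal sketch of the standard Wordle colouring rules (the function name and the 'g'/'y'/'b' encoding are my own, not from the solver):

```cpp
#include <array>
#include <cassert>
#include <string>

// Compute the colour pattern for a guess against a hidden word, using the
// standard Wordle rules: greens are assigned first, then yellows consume the
// remaining counts of unmatched hidden letters, and everything else is black.
// 'g' = green, 'y' = yellow, 'b' = black.
std::string wordlePattern(const std::string& guess, const std::string& hidden) {
    std::string pattern(guess.size(), 'b');
    std::array<int, 26> remaining{};  // hidden letters not matched green

    // First pass: greens.
    for (size_t i = 0; i < guess.size(); ++i) {
        if (guess[i] == hidden[i]) pattern[i] = 'g';
        else remaining[hidden[i] - 'a']++;
    }
    // Second pass: yellows, limited by the remaining letter counts.
    for (size_t i = 0; i < guess.size(); ++i) {
        if (pattern[i] == 'b' && remaining[guess[i] - 'a'] > 0) {
            pattern[i] = 'y';
            remaining[guess[i] - 'a']--;
        }
    }
    return pattern;
}
```

For example, guessing "parse" against the hidden word "spare" gives four yellows and a green. The two-pass structure matters for repeated letters: a letter only goes yellow while unmatched copies of it remain in the hidden word.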
The regular wordle list from 15-Feb-22 has hidden words, and test words.
The regular wordle problem has been looked at in great depth by Alex Selby in The best strategies for Wordle, part 2.
I focused on BIGHIDDEN mode, a harder version with hidden words and test words. In particular, I looked at wordle “easy” mode, which is ironically harder to compute than wordle “hard” mode, where the test words you can use are restricted by your previous guesses.
With BIGHIDDEN mode, it can be proven (by verifying the models) that there are at least unique starting test words that have a solution. Unless there are mistakes in the code, there should be exactly unique starting test words for this problem, since the answer was calculated exhaustively.
Finding the solution is quite difficult, but verifying that it is correct is simple. The complete 6460 word list is available here, and all the models are here.
Using solve from domsleee/wordle:
| Goal | Command |
|---|---|
| Find all solutions | ./bin/solve --guesses ext/wordle-combined.txt --answers ext/wordle-combined.txt |
| Verify a model | ./bin/solve --guesses ext/wordle-combined.txt --answers ext/wordle-combined.txt --verify model.json |
Refer to the EndGameAnalysis/ folder.
This is a cutoff that caches all powersets of sets of words. Currently it uses 2276 sets of words that match on 4 letters + positions, e.g. the pattern .arks.

The main idea is that if you are querying for a lower bound for a set of hidden words when you have known lower bounds for other sets of hidden words:
This only makes sense for easy mode, since for hard mode you would also need to consider the test words.
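A minimal sketch of this lower-bound lookup, assuming hidden-word sets are represented as bitmasks (the struct and names are my own simplification, and the linear scan stands in for the real data structures):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Cache of known lower bounds for hidden-word sets. A query for a set S may
// use the bound of any cached subset of S, because solving a superset of a
// hidden-word set can never require fewer guesses than solving the subset.
struct LowerBoundCache {
    std::vector<std::pair<uint64_t, int>> entries;  // (word-set mask, lower bound)

    void add(uint64_t wordSet, int lowerBound) {
        entries.push_back({wordSet, lowerBound});
    }

    // Best known lower bound among cached subsets of `query`.
    int lowerBound(uint64_t query) const {
        int best = 0;
        for (const auto& [set, bound] : entries)
            if ((set & ~query) == 0)  // set is a subset of query
                best = std::max(best, bound);
        return best;
    }
};
```

The interesting part is the subset query, which a linear scan handles poorly at scale; that is exactly the problem the inverted-index and trie approaches address.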
In terms of implementation details, I tried an inverted-index approach (SubsetCache.h), and also a trie approach.
See this PR.
We know the following:
normal: bin/solve --guesses ext/wordle-guesses.txt --answers ext/wordle-answers.txt
bighidden: bin/solve --guesses ext/wordle-combined.txt --answers ext/wordle-combined.txt
| --other-program | normal | bighidden |
|---|---|---|
| any3in2 | ✅ | ✅ |
| any4in2 | ❌ (13 groups, e.g. batch,catch,hatch,patch) | ❌ (11423 groups) |
| any6in3 | ✅ | ❌ (e.g. fills,jills,sills,vills,bills,zills) |
Refer to RemoveGuessesBetterGuess/
Goal: reduce the number of test words by removing test words that have a better alternative test word.
Let be the partitioning of the hidden word set using test word , i.e.
where is the set of words in which would give pattern when guessing the word.
Define to mean is at least as good as if every partition of is a subset of a partition of :
implies that can be deleted from .
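As a sketch, the domination check can be phrased over partitions represented as bitmasks of hidden words (the function name is mine, and this assumes the subset-of-a-partition reading of the definition above):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// True if every part of partitioning `parts1` (induced by one guess on the
// hidden set) is a subset of some part of partitioning `parts2` (induced by
// another guess). If so, the first guess is at least as good: it never leaves
// a larger set of candidates than the second in any outcome.
bool refines(const std::vector<uint64_t>& parts1,
             const std::vector<uint64_t>& parts2) {
    for (uint64_t p1 : parts1) {
        bool contained = false;
        for (uint64_t p2 : parts2) {
            if ((p1 & ~p2) == 0) { contained = true; break; }  // p1 ⊆ p2
        }
        if (!contained) return false;
    }
    return true;
}
```

For example, the partitioning {001, 010, 100} refines {011, 100}, so a guess producing the former dominates a guess producing the latter, and the latter can be deleted from the test-word set.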
Consider this definition of solvability, which returns iff there is a solution in guesses:
We want to prove that will return an equivalent value if we remove values from .
For every that is used as a better guess than :
We can prove this by cases:
If any value of the RHS is , the claim is clearly true because the domain is .
If all values of the RHS are , then all values of the LHS are , because every partition of the LHS is a subset of a partition of the RHS.
qed
The definition for least possible guesses, taken from The best strategies for Wordle:
The same definition for “better guess” doesn’t work because the sum can exclude a partition (). So for least guesses, we must add an additional check that.
This is nice; however, I could not find a fast enough way of doing it.
Using the partitioning approach, I could find an algorithm, but with a reasonable approximation, there is an approach. Typical values are and .
An approximation of this partitioning check is to reason about which partitions would be split, by reasoning about how two words of are differentiated:
If a letter does not occur in any word of , then any occurrence of that letter in a test word will not differentiate any two words of .
If every position of one letter across all has the same position in every word of (for all positions of the letter in the word), then any occurrence of that letter in a test word will not differentiate any two words of .
If there are exactly occurrences of in every word in , then any occurrence of in that is in a position other than all the positions of in can be ignored.
If there are at most occurrences of in every word in , a word with of can mark up to positions that don’t occur in any position of - they are guaranteed to be useless
If there are at least occurrences of in every word in , a word with of can mark all letters that don’t occur in any position of - they are guaranteed to be useless.
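The first rule above can be sketched as follows (the function name is mine): positions of a test word whose letter appears in no hidden word can never differentiate two hidden words, so they carry no information.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Mark the positions of a test word whose letter occurs in no hidden word.
// Such positions cannot split the hidden set, so for the purposes of the
// partitioning check they can be treated as wildcards.
std::vector<bool> uselessPositions(const std::string& testWord,
                                   const std::vector<std::string>& hidden) {
    std::array<bool, 26> occurs{};
    for (const auto& w : hidden)
        for (char c : w) occurs[c - 'a'] = true;

    std::vector<bool> useless(testWord.size());
    for (size_t i = 0; i < testWord.size(); ++i)
        useless[i] = !occurs[testWord[i] - 'a'];
    return useless;
}
```

With the hidden set {barks, parks}, only the 'a' of the test word "loamy" can possibly matter; the other four positions are useless.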
See papers:
The above paper and technical document describe a technique that can be used to reduce the number of nodes traversed by IDA* search when finding an optimal solution for the n-puzzle.
IDA* search does not use any memory to store which nodes have already been visited, and so it revisits the same nodes many times during its depth-first search. The technique described in the paper is to create a state machine that governs the successor nodes for a given path.
The technique is to find a list of forbidden operator strings, and then use the Aho-Corasick algorithm to build a finite state machine that can be used to skip operator strings that are redundant in the search. For example, exploring up to depth 2 produces operator strings lr, rl, ud, du, which builds an FSM with 9 states.
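A minimal sketch of that construction (my own code, not the solver's): build a trie of the forbidden strings, then fill in every missing edge via failure links so the search can step through the FSM with one table lookup per operator. Landing on a word state means the path ends in a forbidden operator string and the branch can be skipped. For {lr, rl, ud, du} this produces the 9-state FSM mentioned above.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Aho-Corasick automaton over the operator alphabet {l, r, u, d}.
struct OperatorFsm {
    static int idx(char c) { return c == 'l' ? 0 : c == 'r' ? 1 : c == 'u' ? 2 : 3; }
    std::vector<std::array<int, 4>> next;  // -1 = missing edge during build
    std::vector<bool> isWord;

    explicit OperatorFsm(const std::vector<std::string>& forbidden) {
        next.push_back({-1, -1, -1, -1});
        isWord.push_back(false);
        for (const auto& s : forbidden) {  // plain trie insertion
            int v = 0;
            for (char c : s) {
                if (next[v][idx(c)] == -1) {
                    next[v][idx(c)] = (int)next.size();
                    next.push_back({-1, -1, -1, -1});
                    isWord.push_back(false);
                }
                v = next[v][idx(c)];
            }
            isWord[v] = true;
        }
        // BFS: replace every missing edge with the failure transition, so all
        // four edges of every state are populated.
        std::vector<int> fail(next.size(), 0);
        std::queue<int> q;
        for (int c = 0; c < 4; ++c) {
            if (next[0][c] == -1) next[0][c] = 0;
            else q.push(next[0][c]);
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            isWord[v] = isWord[v] || isWord[fail[v]];
            for (int c = 0; c < 4; ++c) {
                if (next[v][c] == -1) next[v][c] = next[fail[v]][c];
                else { fail[next[v][c]] = next[fail[v]][c]; q.push(next[v][c]); }
            }
        }
    }

    // True if the operator string contains a forbidden substring.
    bool rejects(const std::string& ops) const {
        int v = 0;
        for (char c : ops) {
            v = next[v][idx(c)];
            if (isWord[v]) return true;
        }
        return false;
    }
    int numStates() const { return (int)next.size(); }
};
```

During the IDA* search the FSM state is carried along the path, so rejecting a successor is a single transition plus a flag check.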
The branching factor was calculated by finding the number of paths of lengths 23, 24, and 25. The geometric mean of the branching factors of odd and even parity was taken, e.g. .
The result for duplicate operators matches what has already been shown in this paper:
Branching factor, calculated by:
./bin/puzzle -d databases/6666-reg --fsmDepthLimit 2 -e
The set of 16366272 duplicate operator strings for depth 22 can be downloaded here:
| Type | Number of strings, number of FSM states | Branching factor |
|---|---|---|
| idfs2 | 4 strings, 5 states | 2.36762 = sqrt(2.39446*2.34108) |
| idfs14 | 18414 strings, 65558 states | 2.24915 = sqrt(2.27542*2.22317) |
| idfs22 | 16366272 strings, 61701626 states | 2.16475 = sqrt(2.18836*2.1414) |
Using the first two problems from Solving the 24-Puzzle.
The single-threaded solver uses plain IDA*, not a hybrid search. See log file: duplicate-nodes-log.txt.
Number of nodes
| problem# | idfs2 | idfs14 | idfs22 |
|---|---|---|---|
| 1 | 58,644,915,709 | 23,045,422,923 (39.29%) | 14,531,111,159 (24.78%) |
| 2 | 21,770,762,064 | 7,842,119,256 (36.02%) | 5,089,706,902 (23.38%) |
Runtime
| problem# | idfs2 | idfs14 | idfs22 |
|---|---|---|---|
| 1 | 7,029.19s | 2,977.23s (42.3%) | 2,527.94s (36.0%) |
| 2 | 2,447.82s | 956.079s (39.0%) | 870.295s (35.6%) |
Nodes per second
| fsm type | nodes per second |
|---|---|
| idfs2 | 8,485,342.7 nodes/s (80,415,677,773/9,477.01) |
| idfs14 | 7,852,813.5 nodes/s (30,887,542,179/3,933.309) |
| idfs22 | 5,773,826.1 nodes/s (19,620,818,061/3,398.235) |
These results show that as the state machine grows in number of states, the nodes per second decreases, probably due to cache locality. However, the decrease in the branching factor drastically decreases the number of nodes that need to be expanded in the search.
In these two problems idfs22 outperformed idfs14; however, there will be easier problems where idfs14 performs better.
The technique suggested in the paper uses a BFS to explore a larger version of the puzzle. For example, for the 24-puzzle (5x5), you would do a BFS on the 9x9 puzzle so that all paths are found. In the search, paths where the bounding box of the blank tile is larger than 5 horizontally or vertically would be excluded. For example, lrrrrr cannot occur in the 24-puzzle, but can occur on the 9x9 grid.
By keeping track of the board states that have been visited before, we can exclude operator strings that revisit a known board state. We must ensure that the two board states can occur for every starting position of the blank tile. Taylor describes how to check this “Operator precondition”.
To deal with this situation, a routine was written to test the “bounding box” of the actions of a pair of strings, A and B. B can be a duplicate if it is a match of A, and A has a bounding box contained within or identical to that of B.
For example:
When the blank tile begins at position 1 as in the video, then one of these operator strings can be considered a duplicate. However, if the blank tile begins in position 0, only the right operator string is valid, so there is no duplicate operator string in this case.
So to consider either of these as a duplicate operator is wrong, and could result in missing the optimal solution.
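A sketch of that bounding-box test (my own reconstruction of the idea): walk the blank tile through an operator string, record the rectangle of cells it visits, and only let A stand in for B when A's box fits inside B's, so that A is legal from every starting position where B is.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>

// Width and height (in cells) of the rectangle the blank tile sweeps out
// while executing an operator string, starting from an arbitrary origin.
std::pair<int, int> boundingBox(const std::string& ops) {
    int x = 0, y = 0, minX = 0, maxX = 0, minY = 0, maxY = 0;
    for (char c : ops) {
        if (c == 'l') --x;
        else if (c == 'r') ++x;
        else if (c == 'u') --y;
        else ++y;  // 'd'
        minX = std::min(minX, x); maxX = std::max(maxX, x);
        minY = std::min(minY, y); maxY = std::max(maxY, y);
    }
    return {maxX - minX + 1, maxY - minY + 1};
}

// A may replace B only if A's bounding box fits within B's.
bool bboxContained(const std::string& a, const std::string& b) {
    auto [aw, ah] = boundingBox(a);
    auto [bw, bh] = boundingBox(b);
    return aw <= bw && ah <= bh;
}
```

For instance, lrrrrr sweeps a 6-cell-wide box, which is why it cannot occur on the 5x5 board.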
When using BFS and keeping track of seen board states, memory is quickly exhausted.
A key observation - if we have a list of operator strings of length n-1, and we know all shortest paths of length n for a given board state, we can make a function that will correctly choose the same subset of these shortest paths to be considered duplicate operators.
For this set of operator strings, we want to split it into two groups - a set of permitted operators, and a set of forbidden operators. We then must not violate the following constraint:
For each forbidden operator, there must be at least one permitted operator which has a length less than or equal to it, and whose bounding box is a subrange of the forbidden operator’s.
To compute this, we look at all 2-partitions of the set of operators for a given board state, and we have a sorting function that always determines the same “best” 2-partition to use.
Since the forbidden operator strings we are computing form a set, it does not matter if we process the same board state more than once! Because of this, an iterative DFS can be used to find all board states, and all paths to each board state, of a given depth. When searching at depth , an FSM with operator strings of length is used to find new operator strings of length . When memory is running low, a clean-up process can begin: a DFS of depth that does not track any new board states, and only finds paths to known board states with paths of length . This clean-up process finds new duplicate operators and removes those board states from memory, since their duplicate operators have already been computed.
In this way, we can compute very long duplicate operators, at the cost of significantly more computation. When the clean-up process is used, the branching factor is squared, so it moves from to . The memory requirement is reduced and is now bounded by the number of paths of a given length for a board state, which is minuscule up to depth 22.
A naive scoring function is to count the number of possible starting positions of the blank tile for a forbidden operator. For example, a 2x2 bounding box in a 5x5 grid will have 16 possible starting positions. When scoring a 2-partition, we can take the sum of the number of possible starting positions of each forbidden operator in the forbidden operator partition.
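For an n-by-n grid this count is just a product of the slack in each dimension; a sketch (the function name is mine):

```cpp
#include <cassert>

// Number of starting positions of the blank tile for which an operator
// string with a w-by-h bounding box fits in an n-by-n grid. The naive score
// of a 2-partition is the sum of this over its forbidden operators.
int startingPositions(int w, int h, int n) {
    if (w > n || h > n) return 0;  // bounding box does not fit at all
    return (n - w + 1) * (n - h + 1);
}
```

A 2x2 bounding box in a 5x5 grid gives 4 * 4 = 16 starting positions, matching the example above.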
I could not find any improvement on this scoring function. An idea that was explored was looking at the sum of the probabilities of the starting positions of the blank tile. However this did not produce a different scoring order.
The ordering used was the sum of the tiles, with a fallback to ordering lexicographically if they had the same score. Note that changing the fallback ordering will result in a different set of forbidden operator strings, which suggests that there may be a scoring function that produces better run times.
The first four duplicate operator strings are lr, rl, ud, du. The basic Aho-Corasick structure is represented like this:
This is a trie whose nodes have four children and a boolean indicating whether the node is a word. Our implementation does not need a failure function because we populate every edge of the trie - it is not space efficient to use a double-array trie because our character set is small.
So each trie node is four ints plus a boolean, which is 17 bytes per node:
struct TrieNode
{
int children[4];
bool isWord;
};
TrieNode trie[NUM_WORDS];
We can improve on this by noticing that we never visit word nodes. When the IDA* search detects a word through the FSM, it does not search that branch. So we can replace every word node in the trie with a dummy integer value.
In this example, there are 5 states instead of the original 9. This is also faster because checking whether a child is a word has one less indirection:
// old structure
bool isChildAWord = trie[node.children[0]].isWord;
// new structure
bool isChildAWord = node.children[0] != WORD_STATE;
The amount of space used in this representation is now 16 bytes per node, and there are fewer nodes. For a dictionary of 178302 strings, the state count is reduced from 832012 to 653710, which is about 21.5% smaller.
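The state saving can be checked with a small sketch (my own code, not the solver's) that stores the WORD_STATE sentinel in place of word nodes while building the trie. For {lr, rl, ud, du} it stores 5 nodes instead of 9. This assumes no forbidden string is a prefix of another, which holds when all strings share a depth.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

constexpr int WORD_STATE = -2;  // dummy child value marking a word
constexpr int NO_EDGE = -1;

int opIdx(char c) { return c == 'l' ? 0 : c == 'r' ? 1 : c == 'u' ? 2 : 3; }

// Build the trie over {l, r, u, d}, replacing each word leaf with the
// WORD_STATE sentinel instead of allocating a node, and return the number of
// nodes actually stored.
int countStates(const std::vector<std::string>& words) {
    std::vector<std::array<int, 4>> next{{NO_EDGE, NO_EDGE, NO_EDGE, NO_EDGE}};
    for (const auto& s : words) {
        int v = 0;
        for (size_t i = 0; i < s.size(); ++i) {
            int c = opIdx(s[i]);
            if (i + 1 == s.size()) { next[v][c] = WORD_STATE; break; }
            if (next[v][c] == NO_EDGE) {
                next[v][c] = (int)next.size();
                next.push_back({NO_EDGE, NO_EDGE, NO_EDGE, NO_EDGE});
            }
            v = next[v][c];
        }
    }
    return (int)next.size();
}
```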
In the 1996 paper, 10 instances of the 24-puzzle problem are looked at. Nine of those ten instances are solved in the paper, and a lower bound is established for the tenth.
I have implemented a solver that can solve all 10 instances. This can be reproduced by running ./run from this commit 4f2a1be.
| Instance | Nodes Generated | Time taken (s) | Optimal Solution |
|---|---|---|---|
| 17 1 20 9 16 2 22 19 14 5 15 21 0 3 24 23 18 13 12 7 10 8 6 4 11 | 6,508,216,548 | 168 | 100 |
| 14 5 9 2 18 8 23 19 12 17 15 0 10 20 4 6 11 21 1 7 24 3 16 22 13 | 3,949,323,558 | 176 | 95 |
| 7 13 11 22 12 20 1 18 21 5 0 8 14 24 19 9 4 17 16 10 23 15 3 2 6 | 74,382,671,211 | 3172 | 108 |
| 18 14 0 9 8 3 7 19 2 15 5 12 1 13 24 23 4 21 10 20 16 22 11 6 17 | 75,528,769,943 | 1928 | 98 |
| 2 0 10 19 1 4 16 3 15 20 22 9 6 18 5 13 12 21 8 17 23 11 24 7 14 | 127,858,033,287 | 5536 | 101 |
| 16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14 | 373,302,608,938 | 8572 | 96 |
| 21 22 15 9 24 12 16 23 2 8 5 18 17 7 10 14 13 4 0 6 20 11 3 1 19 | 330,453,737,334 | 17656 | 104 |
| 6 0 24 14 8 5 21 19 9 17 16 20 10 13 2 15 11 22 1 3 7 23 4 18 12 | 68,097,975,369 | 3146 | 97 |
| 3 2 17 0 14 18 22 19 15 20 9 7 10 21 16 6 24 23 8 5 1 4 11 12 13 | 492,114,567,189 | 24793 | 113 |
| 23 14 0 24 17 9 20 21 2 18 10 13 22 1 3 11 4 16 6 5 7 12 8 15 19 | 10,172,974,643,392 | 422230 | 114 |
Run on Ubuntu 20.04 with a Ryzen 9 3900X CPU and 32 GB of RAM.
The solution to the tenth instance, using a demo from Michael Kim:
The demo can be reproduced using these strings:
To generate the board from a reverse board (empty on top left):
L U R U L L D R D R U U L L L D L D R U R U U R R D L D L U L D R D L U L U R U U R D R U R D D D L D L U L L D R U U U U L D D D R U U U R D D D L U U U R D D R D L L U R R U U R D D D D L U U U L L D R D D L U R R U R D D L L
To solve the board (a reverse sequence of above with inverted operators):
R R U U L D L L D R U U L U R R D D D R U U U U L D D L L D R R U L U U L D D D R U U U L D D D L U U U R D D D D L U R R D R U R U U U L D L U L D D L D R D R U L U R D R U R U L L D D L D L U R U R R R D D L U L U R R D L D R
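The solving sequence is the generating sequence reversed, with each operator inverted (L↔R, U↔D), since undoing a move means sliding the blank back the opposite way. A sketch (the function name is mine):

```cpp
#include <cassert>
#include <string>

// Reverse an operator sequence and invert each move (L<->R, U<->D). Spaces
// and any other characters pass through unchanged, so space-separated
// sequences like the ones above work directly.
std::string invertSequence(const std::string& ops) {
    std::string out(ops.rbegin(), ops.rend());  // reversed copy
    for (char& c : out) {
        switch (c) {
            case 'L': c = 'R'; break;
            case 'R': c = 'L'; break;
            case 'U': c = 'D'; break;
            case 'D': c = 'U'; break;
        }
    }
    return out;
}
```

Applying the function twice returns the original sequence, which is a quick sanity check that the inversion is its own inverse.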
A zip containing the log file of all the results is here.
Differences from the implementation in the Korf & Taylor paper:
The techniques used are similar to Nintendo Tetris AI Revisited by Meat Fighter. It may be worth reading his post first, as this post builds upon his work.
We define an evaluate function as a heuristic that gives a score to a tetris board and block type (e.g. I-PIECE). evaluate is calculated by taking the dot product of a feature vector and a weights vector of the same size, i.e. . Particle swarm optimisation (PSO) was used to find optimal weights for the weights vector of evaluate. The feature vector and the PSO setup used are similar to those of Meat Fighter’s post. Note that the branching algorithm is not incorporated in the PSO training phase; evaluate is used directly.
Notable differences include:
- PSO hyperparameters of {'c1': 2.5, 'c2': 2.5, 'w': 0.92}
- removing lock_height from the evaluation function

Our configuration for each particle is as follows:
The branching algorithm is inspired by alpha-beta pruning algorithms, except with only one agent. The inputs are a board and the next two block types from the piece set (call these and ). The idea is to search promising branches to find the best placement for .
Define the following methods:
- findAllMoves(board, blockType): finds all moves (a.k.a. PieceInfo) given a board and a blockType
- evaluate(board, PieceInfo): uses predefined weights to give a score for the game state

Firstly, if there is a move that results in a tetris with , select that move.
If not, observe that we can apply these functions two times each to find an ideal position for just by using and . This would result in a tree of height , where we take the branch with the leaf node that has the best evaluation. This is the simplest way of utilising and , and describes what is used in Bfs in the results.
For more predictive power, we aim to maximise the expected evaluation of a partial search tree. To do this, we sum the product of the evaluation if we had block type , multiplied by the probability based on the piece RNG algorithm. Adding a depth parameter to make the results computable, the recursive relation looks like:
From this, we see that the branching factor for evalGivenBlockType calls is the number of block types () multiplied by the number of moves from findAllMoves (typically ). To mitigate this, we define a pruning factor , and only continue the search in the top scores, bringing the branching factor down to (however, evaluate still needs to be called on all of these states to establish which should be pruned).
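The pruning step can be sketched like this (the function name and the lower-is-better score convention are my own assumptions; the -1e9 tetris offset suggests lower scores are preferred):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Given (score, move id) pairs for every candidate move, keep only the best
// `pruneFactor` moves for further search. Note that evaluate must still be
// called on every candidate to produce the scores in the first place - the
// pruning only reduces how many branches are expanded, not how many states
// are evaluated.
std::vector<int> topMoves(std::vector<std::pair<double, int>> scored,
                          int pruneFactor) {
    int k = std::min<int>(pruneFactor, (int)scored.size());
    // Partially sort so the k best (lowest) scores come first.
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end());
    std::vector<int> kept;
    for (int i = 0; i < k; ++i) kept.push_back(scored[i].second);
    return kept;
}
```

With a prune factor of 2, only the two best-scoring moves at each node are expanded, which is what brings the branching factor down.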
An implementation detail is that evaluate adds an offset of -1e9 when a tetris is made. Slight modifications are made to ensure that this offset is carried through to the leaf node of the tree, to promote aggressive decisions.
With pruning on and included (prune factors of 3 for the first layer and 2 for subsequent layers), only one extra layer could be computed in a reasonable time (132x slower than no branching). Profiling reveals that the ratio of evaluate to findAllMoves is about the same between bfs and bfs+branching ( and respectively), and the number of calls to evaluate and findAllMoves grows by 154x and 146x respectively.
Below is a graph that visualises the branching factor. The numbers on the edges indicate how many children the parent node has. and are special because the block types are known. Note that, for each rectangle node, all the possible moves result in a call of evaluate, for pruning or finding the best value in the leaf node. This explains the number of calls to evaluate despite the selective search (and therefore why it is much slower to compute).
After training finished, a different set of 150 games was used to see how the bot performs. This was run with two configurations - with and without branching - using an identical state evaluation function evaluate:
| name | median | mean | maxout% |
|---|---|---|---|
| Bfs | 1066910 | 1053015 | 81.33 |
| Bfs+branching | 1230460 | 1193441 | 94.67 |
Note that the highest score that can be achieved is 1542000. Raw data: bfs, bfs_branch.csv
We can also compare the probability of a bot achieving a certain score:
From this data we draw the following conclusions:
There are a few things that can be done to improve the performance of the agent:
There are a few improvements that will be possible with better hardware:
- evaluate

Ultimately I’m satisfied with the results - a 13% improvement is significant, especially considering it is approaching the theoretical maximum score. It would be nice if it were less computationally expensive. There is potential for offloading the evaluation function (which is 75% of the processing) to the GPU, which would be a substantial improvement.