Dom’s Tech blog

Solving Wordle - Novel Strategies for the NP hard problem

2022-10-14T14:27:31+00:00

Wordle is the popular game where you try to guess a hidden word in $6$ guesses.

Each guess you get a pattern consisting of black, yellow and green colours for each letter of the guess.

A guess of “PARSE” would return a hint with a pattern, like this:

Terminology

Word	Meaning
Hidden word	A word that could be the answer, but it is hidden from the player
Test word	A word that can be used as a guess
Pattern	The hint given by the game after making a guess
“Easy” mode	The original game where the set of test words is not restricted
“Hard” mode	A variation where the set of test words gets smaller depending on the previous guesses and patterns. In this variation, you can only use guesses that match the patterns provided

Results

The regular wordle list from 15-Feb-22 has $2315$ hidden words $H$ , and $12972$ test words $T$ .

The regular wordle problem has been looked at in great depth by Alex Selby on The best strategies for Wordle, part 2.

BIGHIDDEN mode, which is a harder version where there are $12972$ hidden words and $12972$ test words, is what I focused on. In particular, the wordle “easy” mode, which is ironically harder to compute than wordle hard mode, where the test words you can use are restricted by your previous guesses.

With BIGHIDDEN mode, it can be proven (by verifying the models) that there are at least $6460$ unique starting test words that have a solution. Unless there are mistakes in the code, there should be exactly $6460$ unique starting test words for this problem, because it was exhaustively calculated.

Finding the solution is quite difficult, but verifying that it is a correct solution is simple. The complete 6460 word list is available here, and the all the models are here.

Using solve from domsleee/wordle.

Goal	Command
Find all solutions	`./bin/solve --guesses ext/wordle-combined.txt --answers ext/wordle-combined.txt`
Verify a model	`./bin/solve --guesses ext/wordle-combined.txt --answers ext/wordle-combined.txt --verify model.json`

New strategies

1. EndGame Db

Refer to EndGameAnalysis/ folder.

This is a cutoff that caches all powersets of sets of words $E$ . Currently it uses 2276 sets of words that match on 4 letters + positions, e.g. pattern .arks.

2. Lower bounds using a cache entry that is a subset

The main idea is that if you are querying for a lowerbound for a set of hidden words $H$ when you have known lower bounds for sets of hidden words $L = \{H_1,H_2,...,H_n\}$ :

l = \textrm{max}_{H \in L}\{lb(H) \mid H \subseteq Q\}

This only makes sense for easy mode, since for hard mode you would also need to consider the test words.

In terms of implementation details, I tried an inverted index approach SubsetCache.h, and also a trie approach.

3. Most solved in 2 or 3 guesses

See this PR.

We know the following

2 guesses ⇒ any 3 words can be solved (by exhaustion)
3 guesses ⇒ any 5 words can be solved (reasoning, {1,1,1} can add any 2)

normal: bin/solve --guesses ext/wordle-guesses.txt --answers ext/wordle-answers.txt

bighidden: bin/solve --guesses ext/wordle-combined.txt --answers ext/wordle-combined.txt

`--other-program`	normal	bighidden
`any3in2`	✅	✅
`any4in2`	❌ (13 groups, eg batch,catch,hatch,patch)	❌ (11423 groups)
`any6in3`	✅	❌ (eg fills,jills,sills,vills,bills,zills)

4. Remove guesses that have a better guess

Refer to RemoveGuessesBetterGuess/

Goal: reduce the size of test words $T$ by removing test words that have a better test word.

Let $\mathcal{P}(H, t)$ be the partitioning of the hidden word set $H$ using test word $t$ , i.e.

\mathcal{P}(H,t)=\{P(H,t,x)\mid x\in X\}

Where $P(H,t,x)$ is the words in $H$ which would give pattern $x$ when guessing the word $t$ .

Define $t_1 \geq t_2$ to mean $t_1$ is at least as good as $t_2$ if every partition of $t_1$ is a subset of a partition of $t_2$ :

\forall_{p_1\in\mathcal{P}(H,t_1)}\exists_{p_2\in \mathcal{P}(H,t_2)}(p_1 \subseteq p_2)

$t_1 \geq t_2$ implies that $t_2$ can be deleted from $T$ .

proof

Consider this definition of solvability, which returns $1$ iff there is a solution in $d$ guesses:

\mathcal{F}(H,d) = \begin{cases} |H|\leq d &d\leq1\\ \text{max}_{t\in T}\prod_{p\in\mathcal{P}(H,t)}\mathcal{F}(p,d-1) &\text{otherwise} \end{cases}

We want to prove that $\mathcal{F}(H,d)$ will return an equivalent value if we remove values from $T$ .

For every $t_1$ that is used as a better guess than $t_2$ :

\prod_{p\in\mathcal{P}(H,t_1)}\mathcal{F}(p,d-1) \geq \prod_{p\in\mathcal{P}(H,t_2)}\mathcal{F}(p,d-1)

We can prove this by cases:

If any value of the RHS is $0$ , the claim is clearly true because the domain is $\{0,1\}$ .

If all values of the RHS is $1$ , then all values of the LHS is $1$ because every partition of the LHS is a subset of a partition on the RHS.

qed

For least possible guesses

The definition for least possible guesses, taken from The best strategies from Wordle:

f(H) = \begin{cases} |H| &|H|\leq1\\ |H|+\text{min}_{t\in T}\sum_{s \ne GGGGG}f(P(H,t,s)) &\text{otherwise} \end{cases}

The same definition for “better guess” doesn’t work because the sum can exclude a partition ( $s \ne GGGGG$ ) . So for least guesses, we must add an additional check that $P(H,t_1,GGGGG)=P(H,t_2,GGGGG)$ .

Approximation

This is nice however I could not find a fast enough way of doing this.

Using the partitioning approach, I could find an $O(T^2H)$ algorithm, but with a reasonable approximation, there is an $O(T\log{T})$ approach. Typical values are $T = 4000$ and $H = 40$ .

An approximation of this partitioning check is to reason about which partitions $p_2$ would be split, by reasoning about how two words of $H$ are differentiated:

1 Non-letter rule

If a letter does not occur in any word of $H$ , then any occurrence of that letter in a test word will not differentiate any two words of $H$ .

2 Green letter rule

If every position of one letter across all $H$ has the same position in every word of $H$ (for all positions of the letter in the word), then any occurrence of that letter in a test word will not differentiate any two words of $H$ .

3 exactly-n letter rule (yellow letter rule)

If there are exactly $n$ occurrences of $c$ in every word in $H$ , then any occurrence of $c$ in $T$ that is in a position other than all the positions of $c$ in $H$ can be ignored.

If there are at most two $c$ $c$ in every word in $H$ $H$ , and one of them is green, the same rule does not apply
- reason: for guess $t$ , 1Y gives no information when $t$ has $1c$ but gives information when $t$ has $2c$ . Example: $1Y$ from $2c$ implies $h$ has $1c$

4 most-n letter rule

If there are at most $n$ occurrences of $c$ in every word in $H$ , a word $t$ with $≥n+m$ of $c$ can mark up to $m$ positions that don’t occur in any position of $H$ - they are guaranteed to be useless $B$

5 least-n letter rule

If there are at least $n$ occurrences of $c$ in every word in $H$ , a word with $t$ with $\le n$ of $c$ can mark all letters that don’t occur in any position of $H$ - they are guaranteed to be useless $Y$

Pruning Duplicate Nodes in N-Puzzle

2022-01-29T14:27:31+00:00

Introduction

See papers:

The above paper and technical document talks about a technique that can be used to reduce the number of nodes traversed by the IDA* search when finding an optimal solution for the n-puzzle.

IDA* search does not use any memory to store which nodes have already been visited, and so it will revisit the same nodes many times in the DFS search. The technique described in the paper is to create a state machine to govern successor nodes for a given path.

The technique is to find a list of forbidden operator strings, and then use the Aho-Corasick algorithm to build a finite state machine that can be used to skip operator strings that are redundant in the search. For example, exploring up to depth 2 produces operator strings lr, rl, ud, du, which builds an FSM with 9 states.

Results

The branching factor was calculated by finding the number of paths of lengths 23, 24, 25. The geometric mean was taken of the branching factors of odd and even parity, e.g $\sqrt{\frac{N_{25}}{N_{24}} \cdot \frac{N_{24}}{N_{23}}}$ .

The result for duplicate operators matches what has already been shown in this paper:

2000 - Time complexity of iterative-deepening-A in 2.4 Results.

Branching factor, calculated by:

./bin/puzzle -d databases/6666-reg --fsmDepthLimit 2 -e

The set of 16366272 duplicate operator strings for depth 22 can be downloaded here:

fsm-idfs-5x5-22.zip.

Type	Number Words, Number FSM States	Branching factor
idfs2	4 strings, 5 states	2.36762 = sqrt(2.39446*2.34108)
idfs14	18414 strings, 65558 states	2.24915 = sqrt(2.27542*2.22317)
idfs22	16366272 strings, 61701626 states	2.16475 = sqrt(2.18836*2.1414)

Experimental results

Using the first two problems from Solving the 24-Puzzle.

The single-threaded solver uses plain IDA*, not a hybrid search. See log file: duplicate-nodes-log.txt.

Number of nodes

problem#	idfs2	idfs14	idfs22
1	58,644,915,709	23,045,422,923 (39.29%)	14,531,111,159 (24.78%)
2	21,770,762,064	7,842,119,256 (36.02%)	5,089,706,902 (23.38%)

Runtime

problem#	idfs2	idfs14	idfs22
1	7,029.19s	2,977.23s (42.3%)	2,527.94s (36.0%)
2	2,447.82s	956.079s (39.0%)	870.295s (35.6%)

Nodes per second

fsm type	nodes per second
idfs2	8,485,342.7 nodes/s (80,415,677,773/9,477.01)
idfs14	7,852,813.5 nodes/s (30,887,542,179/3,933.309)
idfs22	5,773,826.1 nodes/s (19,620,818,061/3,398.235)

These results show that as the state machine grows in number of states, the nodes per second decreases, probably due to cache locality. However, the decrease in the branching factor drastically decreases the number of the nodes that need to be expanded in the search.

In these two problems, idfs22 outperformed idfs14, however there will be easier problems where idfs14 will perform better.

Algorithm to find the duplicate operator strings

The technique suggested in the paper uses a BFS to explore a larger version of the puzzle. For example, the 24-puzzle (5x5) - you would do a BFS on the 9x9 puzzle so that all paths would be found. In the search, paths where the bounding box of the blank tile is larger than 5 horizontally or vertically would be excluded. For example lrrrrr can not occur in the 24-puzzle, but can occur on the 9x9 grid.

By keeping track of the board states that have been visited before, we can exclude operator strings that revisit a known board state. We must ensure that the two board states can occur for every starting position of the blank tile. Taylor describes how to check this “Operator precondition”

To deal with this situation, a routing was written to test the “bounding box” of the actions of a pair of strings, A and B. B can be a duplicate if it is a match of A, and A has a bounding box contained within or identical to that of B.

For example:

When the blank tile begins at position 1 as in the video, then one of these operator strings can be considered a duplicate. However, if the blank tile begins in position 0, only the right operator string is valid, so there is no duplicate operator string in this case.

So to consider either of these as a duplicate operator is wrong, and could result in missing the optimal solution.

How to handle running out of memory

When using BFS and keeping track of seen board states, memory is quickly exhausted.

A key observation - if we have a list of operator strings of length n-1, and we know all shortest paths of length n for a given board state, we can make a function that will correctly choose the same subset of these shortest paths to be considered duplicate operators.

For this set of operator strings, we want to split it into two groups - a set of permitted operators, and a set of forbidden operators. We then must not violate the following constraint:

For each forbidden operator, there must be at least one permitted operator which has a length less than or equal to it, which bounding box is a subrange of the forbidden operator.

To compute this, we look at all 2-partitions of the set of operators for a given board state, and we have a sorting function that always determines the same “best” 2-partition to use.

Since the forbidden operator strings we are computing is a set, it does not matter if we process the same board state more than once! Because of this, an iterative DFS can be used to find all board states and all paths to that board state of a given depth. When searching at depth $n$ , an FSM with operator strings of length $n-1$ is used to find new operator strings of length $n$ . When memory is running low, a clean up process can begin, which is a DFS of depth $n$ which does not track any more board states, it will only find paths to known board states with paths of length $n$ . This clean-up process will find new duplicate operators, and remove the board states from memory since the duplicate operators of these board states has already been computed.

In this way, we can compute very long duplicate operators, at the cost of significantly more computation. When the clean up process is used, the branching factor is squared. So the branching factor moves from $O(2.36^n)$ to $O(5.57^n)$ . The memory requirement is reduced and is now bound by the number of paths for a board state of the same length, which is miniscule up to depth 22.

Improving the scoring function

A naive scoring function is to count the number of possible starting positions of the blank tile for a forbidden operator. For example, a 2x2 bounding box in a 5x5 grid will have 16 possible starting positions. When scoring a 2-partition, we can take the sum of the number of possible starting positions of each forbidden operator in the forbidden operator partition.

I could not find any improvement on this scoring function. An idea that was explored was looking at the sum of the probabilities of the starting positions of the blank tile. However this did not produce a different scoring order.

The ordering used was the sum of the tiles, with a fallback to ordering lexicographically if they had the same score. Note that changing the fallback ordering will result in a different set of forbidden operator strings, which suggests that there may be a scoring function that produces better run times.

FSM Structural Improvement

The first four duplicate operator strings are lr, rl, ud, du. The basic Aho-corasick algorithm is represented like:

This is a trie with nodes that have four children and a boolean if it is a word or not. Our implementation does not need a failure function because we populate every edge of the trie - it is not space efficient to use a double-array trie because our character set is small.

So the size of the trie is four ints + a boolean, which is 17 bytes per word:

struct TrieNode
{
    int children[4];
    bool isWord;
};
TrieNode trie[NUM_WORDS];

We can improve on this by noticing that we never visit words. When the IDA* detects a word through the FSM, it does not search on the branch. So we can replace every word in the trie with a dummy integer value.

In this example, there are 5 states from the original 9 states. This is also faster because checking if a child is a word has one less indirection:

// old structure
bool isChildAWord = trie[node.children[0]].isWord;

// new structure
bool isChildAWord = node.children[0] != WORD_STATE

The amount of space used in this representation is now 16 bytes per word, and there are now fewer words. For a dictionary of 178302 strings, the state reduction is from 832012 to 653710 states, which is about 21.5% smaller.

Solving the 24-Puzzle

2021-11-25T14:27:31+00:00

See papers:

Finding Optimal Solutions to the Twenty-Four Puzzle (1996) by Richard E. Korf and Larry A. Taylor.
Disjoint pattern database heuristics (2002) by Richard E. Korf, Ariel Felner.

In the 1996 paper, 10 instances of the 24-puzzle problem are looked at. Nine of ten of those instances are solved in the paper, with a lower bound for the tenth being established.

I have implemented a solver that can solve all 10 instances. This can be reproduced by running ./run from this commit 4f2a1be.

Instance	Nodes Generated¹	Time taken (s)	Optimal Solution
17 1 20 9 16 2 22 19 14 5 15 21 0 3 24 23 18 13 12 7 10 8 6 4 11	6,508,216,548	168	100
14 5 9 2 18 8 23 19 12 17 15 0 10 20 4 6 11 21 1 7 24 3 16 22 13	3,949,323,558	176	95
7 13 11 22 12 20 1 18 21 5 0 8 14 24 19 9 4 17 16 10 23 15 3 2 6	74,382,671,211	3172	108
18 14 0 9 8 3 7 19 2 15 5 12 1 13 24 23 4 21 10 20 16 22 11 6 17	75,528,769,943	1928	98
2 0 10 19 1 4 16 3 15 20 22 9 6 18 5 13 12 21 8 17 23 11 24 7 14	127,858,033,287	5536	101
16 5 1 12 6 24 17 9 2 22 4 10 13 18 19 20 0 23 7 21 15 11 8 3 14	373,302,608,938	8572	96
21 22 15 9 24 12 16 23 2 8 5 18 17 7 10 14 13 4 0 6 20 11 3 1 19	330,453,737,334	17656	104
6 0 24 14 8 5 21 19 9 17 16 20 10 13 2 15 11 22 1 3 7 23 4 18 12	68,097,975,369	3146	97
3 2 17 0 14 18 22 19 15 20 9 7 10 21 16 6 24 23 8 5 1 4 11 12 13	492,114,567,189	24793	113
23 14 0 24 17 9 20 21 2 18 10 13 22 1 3 11 4 16 6 5 7 12 8 15 19	10,172,974,643,392	422230	114

The nodes generated does not include the last iteration, so these values are in reality a bit higher.

Run on Ubuntu 20.04 with a Ryzen r9-3900x CPU with 32GB ram.

The solution to the tenth instance, using a demo from Michael Kim:

The demo can be reproduced using these strings:

To generate the board from a reverse board (empty on top left):
L U R U L L D R D R U U L L L D L D R U R U U R R D L D L U L D R D L U L U R U U R D R U R D D D L D L U L L D R U U U U L D D D R U U U R D D D L U U U R D D R D L L U R R U U R D D D D L U U U L L D R D D L U R R U R D D L L

To solve the board (a reverse sequence of above with inverted operators):
R R U U L D L L D R U U L U R R D D D R U U U U L D D L L D R R U L U U L D D D R U U U L D D D L U U U R D D D D L U R R D R U R U U U L D L U L D D L D R D R U L U R D R U R U L L D D L D L U R U R R R D D L U L U R R D L D R

A zip containing the log file of all the results is here.

Differences to the implementation in Korf & Taylor paper:

Better heuristics were used. This implementation uses walking distance and disjoint pattern databases. Michael Kim talks about it in his blog post. These heuristics made some of the problems easier to solve.
Uses multiprocessing to utilise 24 threads. A small BFS is used to generate a pool of starting nodes, which are then used in the IDA* search. Threads are allocated groups of starting nodes to solve for each iteration of the IDA* search. This could be generalised in a distributed system, since it is possible to generate a large pool of nodes to start with. For implementation details, see IdaStarMulti.cpp.
The FSM used to prune nodes was smaller. The forbidden words dictionary for this implementation was generated from 16 BFS searches of depth 14, which had 10018 words and 33451 states. Korf & Taylor generated a structure with 619,000 states, so this structure would have pruned many more nodes.

High score tetris AI using branching algorithms

2020-04-12T14:27:31+00:00

This post will explore how branching algorithms can improve the performance of agents playing tetris without gravity.

The techniques used are similar to Nintendo Tetris AI Revisited by Meat Fighter. It may be worth reading his post first, as this post builds upon his work

PSO configuration

We define an evaluate function as a heuristic that gives a score to a tetris board and block type (e.g. I-PIECE). evaluate is calculated by getting the dot product of a feature vector and a weights vector of the same size, i.e. $\text{score} = W_v\cdot F_v$ . PSO was used to find optimal weights for the weights vector of evaluate. The feature vector and the PSO setup used is similar to that of Meat Fighter’s post. Note that the branching algorithm is not incorporated in the PSO training phase, evaluate is used directly.

Notable differences include:

Using global PSO from pyswarms, probably with different parameters
- {'c1': 2.5, 'c2': 2.5, 'w': 0.92}
- num particles: 96
- num iterations: 3600
Removing lock_height from the evaluation function

Our configuration for each particle is as follows:

each particle plays 150 games using the same piece sets
the score for a particle is the average of the top 50% of games

Branching algorithm description

The branching algorithm is inspired by alpha-beta pruning algorithms, except with only one agent. The inputs are a board and the next two block types from the piece set (call these $p_1$ and $p_2$ ). The idea is to search promising branches to find the best placement for $p_1$ .

Bfs algorithm

Define the following methods:

findAllMoves(board, blockType): finds all moves (a.k.a PieceInfo) given a board a blockType
evaluate(board, PieceInfo): uses predefined weights to give a score for the game state

Firstly, if there is a move that results in a tetris with $p_1$ , select that move.

If not, observe that we can apply these functions two times each to find an ideal position for $p_1$ just by using $p_1$ and $p_2$ . This would result in a tree of height $2$ , where we take the branch with the leaf node that has the best evaluation. This is the simplest way of utilising $p_1$ and $p_2$ , and describes what is used in Bfs in the results.

Bfs+branching algorithm

For more predictive power, we aim to maximise the expected evaluation of a partial search tree. To do this, we sum the product of the evaluation if we had block type $t$ , multiplied by the probability $P(t)$ based on the piece RNG algorithm. Adding a depth parameter $d$ to make the results computable, the recursive relation looks like:

\texttt{evalBoard}(b, d) = \sum_{t \in T}{P(t)\cdot\texttt{evalGivenBlockType}(b, t, d)}\\ \text{where }T\text{ is the set of block types}\\ \texttt{evalGivenBlockType}(b, t, d) = \begin{cases} \max\limits_{p\in M(b, t)}{\texttt{evaluate}(b, p)} & \text{if } d = 0 \\ \max\limits_{p\in M(b, t)}\texttt{evalBoard}(\texttt{applyPiece}(b, p), d-1) & \text{otherwise} \end{cases} \\ \text{where }M = \texttt{findAllMoves}

From this, we see that the branching factor for evalGivenBlockType calls is the number of block types ( $7$ ) multiplied by the number of moves from findAllMoves (typically $10-20$ ). To mitigate this, we define a pruning factor $n_1$ , and only continue the search in the top $n_1$ scores, bringing the branching factor down to $7n_1$ (however, evaluate still needs to be called on all of these states to establish which should be pruned).

An implementation detail is that evaluate adds an offset of -1e9 when a tetris is made. Slight modifications are made to ensure that this offset is carried through to the leaf node of the tree, to promote aggressive decisions

With pruning on $p_1$ and $p_2$ included (prune factors of first layer 3, subsequent layers 2), only one extra layer was able to be computed in a reasonable time (132x slower than no branching). Profiling reveals that the ratio of evaluate to findAllMoves is about the same between bfs and bfs+branching ( $73:19$ and $78:16$ respectively), and the number of calls to each function evaluate and findAllMoves is 154x and 146x respectively.

Bfs+branching visualisation

Below is a graph that visualises the branching factor. The numbers on the edges indicate how many children the parent node has. $s_1$ and $s_2$ are special because the block types are known. Note that, for each rectangle node, all the possible moves result in a call of evaluate, for pruning or finding the best value in the leaf node. This explains the number of calls to evaluate, despite the selective search (and therefore explains why it is much slower to compute)

Results

After training is finished, a different set of 150 games were used to see how the bot performs. This was run with two configurations - with and without branching - using an identical state evaluation function evaluate:

name	median	mean	maxout%
Bfs	1066910	1053015	81.33
Bfs+branching	1230460	1193441	94.67

Note that the highest score that can be achieved is 1542000. Raw data: bfs, bfs_branch.csv

We can also compare the probability of a bot achieving a certain score:

From this data we draw the following conclusions:

96% of the time, bfs+branching is superior
On average, the bfs+branching achieves an additional 140426 points (or a 13% improvement)
Notably, bfs+branching is 132x less efficient than bfs (913.5s vs 6.9s to compute)

Possible improvements

There are a few things that can be done to improve the performance of the agent:

Adjusting the parameters
Adding or finding more features
Using a neural network instead of a linear combination of parameters is an idea. Using a neural network can allow for cases where one feature can have more or less weight depending on another feature. For example, the agent should play more safely when the pile height is high or the board is too jagged

There are a few improvements that will be possible with better hardware:

Training with a larger dataset
Training with the branching algorithm instead of evaluate

Ultimately I’m satisfied with the results - a 13% improvement is significant, especially when considering it is approaching the theoretical maximum score. It would be nice if it was less computationally expensive. There is potential for offloading the evaluation function (which is 75% of the processing) to the GPU, which would be a substantial improvement