Q-learning agent that learns and plays Alan Parr’s Traffic Lights.
Trained Q-Tables are available here: Google Drive Folder
- Components
- A 3×4 rectangular grid board (3 rows, 4 columns).
- Each cell can be in one of four states: Empty, Green, Yellow, Red.
- Players
- Two players alternate turns.
- In simulations, Player 1 starts unless stated otherwise.
- Starting Position
- The game begins with all cells in the Empty state.
- Turn Actions
- On a turn, a player must select one cell and advance its state by exactly one step:
- Empty → Green
- Green → Yellow
- Yellow → Red
- Once a cell reaches Red, it cannot be changed further.
- A turn cannot be skipped.
- Winning Condition
- A player immediately wins if their move results in a line of three adjacent cells sharing the same non‑Empty color (Green, Yellow, or Red).
- A valid line may be:
- Horizontal (across a row);
- Vertical (down a column);
- Diagonal (across adjacent rows and columns).
- A line of four still qualifies, since it contains a segment of three (in fact, two such segments).
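The board and winning condition above can be sketched in a few lines of Python. The representation (one character per cell) and the function names here are illustrative, not the project's actual API:

```python
# Illustrative board model: 3x4 grid, one character per cell
# ("E" Empty, "G" Green, "Y" Yellow, "R" Red). Names are hypothetical.

ROWS, COLS = 3, 4
NEXT_STATE = {"E": "G", "G": "Y", "Y": "R"}      # Red cells cannot advance further
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]   # row, column, two diagonals

def legal_moves(board):
    """Every cell that is not yet Red can be advanced one step."""
    return [(r, c) for r in range(ROWS) for c in range(COLS) if board[r][c] != "R"]

def has_winning_line(board):
    """True if three adjacent cells share the same non-Empty color."""
    for r in range(ROWS):
        for c in range(COLS):
            color = board[r][c]
            if color == "E":
                continue
            for dr, dc in DIRECTIONS:
                r2, c2 = r + 2 * dr, c + 2 * dc
                if 0 <= r2 < ROWS and 0 <= c2 < COLS:
                    if board[r + dr][c + dc] == color and board[r2][c2] == color:
                        return True
    return False
```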
- `TrafficLights_UI.py`: PvP Tkinter UI (local two-player match; highlights a winning line when it appears).
- `TrafficLights_QBot_Arena.py`: PvE Tkinter UI (play against the learned Q-bot). Looks for `qtable.json` in the working directory.
- `TrafficLights_TrainingCode.py`: Train, resume from checkpoint, or evaluate against baseline bots (Random / Alternate / Myopic). Periodically writes:
  - `qtable.json`: current policy
  - `*_checkpoint.json`: resumable state
  - `q*_deltas.jsonl`: incremental updates (for auditability)
License: MIT (see LICENSE).
This code implements Tabular Q‑Learning with an ε‑greedy policy. The agent maintains a value table $Q(s,a)$ and, at each state, selects either:
- a random legal move with probability $\varepsilon$ (exploration), or
- the greedy action $\arg\max_a Q(s,a)$ with probability $1-\varepsilon$ (exploitation).
Exploration is necessary because, from an empty board, the action space is large and most states are unseen early on; without random tries, the agent would get stuck repeating whatever it currently believes is best. The code uses a multiplicative decay per game, $\varepsilon \leftarrow d \cdot \varepsilon$ with $0 < d < 1$, so that exploration gradually gives way to exploitation as training progresses.
Evaluation should be performed with $\varepsilon = 0$, so that reported results reflect the learned greedy policy rather than exploratory noise.
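A minimal sketch of the ε‑greedy selection and decay just described; the table layout, decay rate, and floor value below are assumptions, not the repo's actual parameters:

```python
import random

# Q is assumed to map (state, action) keys to floats; unseen pairs default to 0.0.
# Function names and hyperparameter values are hypothetical.

def choose_action(Q, state, legal_moves, epsilon):
    if random.random() < epsilon:
        return random.choice(legal_moves)    # explore: uniform over legal moves
    # exploit: greedy action under the current table
    return max(legal_moves, key=lambda a: Q.get((state, a), 0.0))

def decay_epsilon(epsilon, decay=0.9995, floor=0.05):
    # multiplicative decay per game, clipped at a floor so some exploration remains
    return max(floor, epsilon * decay)
```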
Update rule. The current training loops apply terminal‑only updates: a Q‑value is adjusted only when the game ends. Let $\alpha$ be the learning rate and $r$ the terminal reward; each state–action pair QBot played during the game is pulled toward that reward: $Q(s,a) \leftarrow Q(s,a) + \alpha\,\big(r - Q(s,a)\big)$. If the opponent wins immediately after QBot's last move, that move is credited with the loss reward, so moves that hand the opponent a win are penalized.
The code alternates between Player 1 and Player 2, so QBot learns to play in both seats.
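A terminal‑only update of this kind can be sketched as follows; the reward convention, learning rate, and function name are illustrative, not the repo's actual implementation:

```python
# Hedged sketch: QBot's moves are buffered during the game, and only at the end
# is each buffered (state, action) pulled toward the terminal reward.

def apply_terminal_update(Q, episode, reward, alpha=0.1):
    """episode: list of (state, action) pairs QBot played, in order.
    reward: e.g. +1.0 for a win, -1.0 for a loss (including an opponent
    win right after QBot's last move), 0.0 for a draw."""
    for state, action in episode:
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (reward - old)
```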
- RandomBot: samples uniformly among legal moves. It is ideal for early training because it exposes many different opening positions quickly. However, once QBot becomes competent, games against RandomBot rarely progress into rich mid/late positions, so it provides diminishing returns. It may also reinforce QBot's blunders whenever QBot happens to win the game anyway.
- AlternateBot: plays myopically (tries to win now, otherwise blocks an immediate loss) for an early, randomly chosen number of plays and then switches to random play. This curriculum forces the game past the opening so QBot experiences, explores, and updates on late‑game structures. The trade‑off is that, after switching to random, AlternateBot may fail to punish certain blunders, so QBot is not always corrected for every mistake.
- MyopicBot: a deterministic, tactical baseline: if a winning move exists this turn, it takes it; else if the opponent could win next turn, it blocks; otherwise it picks randomly among remaining legal moves. Training against MyopicBot provides consistent punishment for obvious tactical errors and requires QBot to spot immediate wins and blocks.
- HorizonBot: a lookahead opponent that searches N plies (half-moves) ahead using negamax/minimax with alpha-beta pruning. For each legal move, it simulates the best play from both sides up to depth N; if the horizon is reached, it falls back to a heuristic that favors immediate wins, avoids moves that allow an immediate loss, and rewards strong 3-in-a-row potential (with extra value on stable RED-based plans vs spoilable GREEN/YELLOW). Stronger than MyopicBot at spotting multi-move tactics, but slower as N increases.
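A HorizonBot‑style search can be sketched with a generic negamax plus alpha‑beta pruning. The game interface below (`legal_moves`, `apply`, `undo`, `is_win`, `heuristic`) is an assumption standing in for the repo's actual board API:

```python
import math

def negamax(game, depth, alpha, beta):
    """Best achievable score for the side to move, searching `depth` plies;
    falls back to game.heuristic() at the horizon."""
    if game.is_win():          # the previous mover just completed a line
        return -math.inf       # so the side now to move has lost
    moves = game.legal_moves()
    if depth == 0 or not moves:
        return game.heuristic()
    best = -math.inf
    for move in moves:
        game.apply(move)
        score = -negamax(game, depth - 1, -beta, -alpha)
        game.undo(move)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:      # prune: the opponent will avoid this branch
            break
    return best

def best_move(game, depth):
    """Pick the legal move with the highest negamax score at the given depth."""
    best_score, best = -math.inf, None
    for move in game.legal_moves():
        game.apply(move)
        score = -negamax(game, depth - 1, -math.inf, math.inf)
        game.undo(move)
        if best is None or score > best_score:
            best_score, best = score, move
    return best
```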
Curriculum Mode (%): in addition to training against a single fixed opponent, the training includes a curriculum mode that mixes multiple opponents with percentage weights (summing to 100%). Each game, the opponent type is sampled according to these weights (e.g., 70% QBot self-play, 20% HorizonBot, 10% AlternateBot). This reduces overfitting to any single opponent style (it was, in fact, developed with the self-play case in mind) and helps the agent see a wider range of mid/late-game positions.
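Per‑game opponent sampling for curriculum mode can be as simple as a weighted draw; the opponent names and weights below are examples, not the repo's defaults:

```python
import random

# Example mix: weights are percentages summing to 100.
CURRICULUM = {"self_play": 70, "horizon": 20, "alternate": 10}

def sample_opponent(curriculum, rng=random):
    """Draw one opponent type per game, proportionally to its weight."""
    names = list(curriculum)
    weights = [curriculum[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```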
Self‑play note: a QBot‑vs‑QBot mode is a natural extension for covering late‑game slips that baseline bots might miss and, in principle, it would double data throughput (both players learn simultaneously). This training code includes such a QBot‑vs‑QBot mode, though it is recommended only after substantial training against the baseline opponents.
- Atomic Q‑table saves. When writing `qtable.json`, the code first writes to a temporary file (e.g., `qtable.json.tmp`), flushes to disk, and then performs an atomic rename. If the process stops mid‑write, the previous `qtable.json` remains intact. This is crucial if `qtable.json` is huge.
- Delta logger. Every Q‑value update is appended to a newline‑delimited JSON file (e.g., `q_deltas.jsonl`) as a compact record `{s, a, v}` containing the updated entry. On resume, the code replays all logged deltas into memory (or into an existing Q‑table) and then commits a single atomic save. After a successful consolidation, the delta file is rotated (e.g., renamed to `q_deltas.jsonl.done`) so it is not re‑applied.
- Checkpoints and safe interrupt. Periodically, and on Ctrl‑C, the code writes a checkpoint (e.g., `*_checkpoint.json`) storing the next game index and the current exploration rate $\varepsilon$. On interrupt, it also flushes any outstanding deltas and performs one last atomic save, so training can be resumed exactly where it stopped without losing Q‑table progress.
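The write‑temp‑then‑rename pattern can be sketched as follows; the helper name and error handling are illustrative:

```python
import json
import os
import tempfile

def atomic_save(qtable, path="qtable.json"):
    """Write qtable as JSON so that `path` is never left half-written."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(qtable, f)
            f.flush()
            os.fsync(f.fileno())   # ensure bytes reach disk before the rename
        os.replace(tmp, path)      # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)             # clean up the partial temp file
        raise
```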
- Python 3.x. No external pip dependencies.
- On Linux, install Tkinter if missing: `sudo apt install python3-tk`.

