Q-learning agent that learns and plays Alan Parr’s Traffic Lights.
Trained Q-Tables are available here: Google Drive Folder
- Components
- A 3×4 rectangular grid board (3 rows, 4 columns).
- Each cell can be in one of four states: Empty, Green, Yellow, Red.
- Players
- Two players alternate turns.
- In simulations, Player 1 starts unless stated otherwise.
- Starting Position
- The game begins with all cells in the Empty state.
- Turn Actions
- On a turn, a player must select one cell and advance its state by exactly one step:
- Empty → Green
- Green → Yellow
- Yellow → Red
- Once a cell reaches Red, it cannot be changed further.
- A turn cannot be skipped.
- Winning Condition
- A player immediately wins if their move results in a line of three adjacent cells sharing the same non‑Empty color (Green, Yellow, or Red).
- A valid line may be:
- Horizontal (across a row);
- Vertical (down a column);
- Diagonal (across adjacent rows and columns).
- A line of four still qualifies, since it contains a segment of three (in fact, two such segments).
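The board and winning condition above can be sketched in a few lines of Python. The representation (one character per cell) and the function names here are illustrative, not the project's actual API:

```python
# Illustrative board model: 3x4 grid, one character per cell
# ("E" Empty, "G" Green, "Y" Yellow, "R" Red). Names are hypothetical.

ROWS, COLS = 3, 4
NEXT_STATE = {"E": "G", "G": "Y", "Y": "R"}      # Red cells cannot advance further
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]   # row, column, two diagonals

def legal_moves(board):
    """Every cell that is not yet Red can be advanced one step."""
    return [(r, c) for r in range(ROWS) for c in range(COLS) if board[r][c] != "R"]

def has_winning_line(board):
    """True if three adjacent cells share the same non-Empty color."""
    for r in range(ROWS):
        for c in range(COLS):
            color = board[r][c]
            if color == "E":
                continue
            for dr, dc in DIRECTIONS:
                r2, c2 = r + 2 * dr, c + 2 * dc
                if 0 <= r2 < ROWS and 0 <= c2 < COLS:
                    if board[r + dr][c + dc] == color and board[r2][c2] == color:
                        return True
    return False
```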
- `TrafficLights_UI.py`: PvP Tkinter UI (local two-player match; highlights a winning line when it appears).
- `TrafficLights_QBot_Arena.py`: PvE Tkinter UI (play against the learned Q-bot). Looks for `qtable.json` in the working directory.
- `TrafficLights_TrainingCode.py`: Train, resume from checkpoint, or evaluate against baseline bots (Random / Alternate / Myopic). Periodically writes:
  - `qtable.json`: current policy
  - `*_checkpoint.json`: resumable state
  - `q*_deltas.jsonl`: incremental updates (for auditability)
License: MIT (see LICENSE).
This code implements Tabular Q‑Learning with an ε‑greedy policy. The agent maintains a value table $Q(s,a)$ and, at each state, selects either:
- a random legal move with probability $\varepsilon$ (exploration), or
- the greedy action $\arg\max_a Q(s,a)$ with probability $1-\varepsilon$ (exploitation).
Exploration is necessary because, from an empty board, the action space is large and most states are unseen early on; without random tries, the agent would get stuck repeating whatever it currently believes is best. The code uses a multiplicative decay per game, $\varepsilon \leftarrow d \cdot \varepsilon$ with $0 < d < 1$, so that exploration gradually gives way to exploitation as training progresses.
Evaluation should be performed with $\varepsilon = 0$, so that reported results reflect the learned greedy policy rather than exploratory noise.
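A minimal sketch of the ε‑greedy selection and decay just described; the table layout, decay rate, and floor value below are assumptions, not the repo's actual parameters:

```python
import random

# Q is assumed to map (state, action) keys to floats; unseen pairs default to 0.0.
# Function names and hyperparameter values are hypothetical.

def choose_action(Q, state, legal_moves, epsilon):
    if random.random() < epsilon:
        return random.choice(legal_moves)    # explore: uniform over legal moves
    # exploit: greedy action under the current table
    return max(legal_moves, key=lambda a: Q.get((state, a), 0.0))

def decay_epsilon(epsilon, decay=0.9995, floor=0.05):
    # multiplicative decay per game, clipped at a floor so some exploration remains
    return max(floor, epsilon * decay)
```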
Update rule. The current training loops apply terminal‑only updates: a Q‑value is adjusted only when the game ends. Let $\alpha$ be the learning rate and $r$ the terminal reward; each state–action pair QBot played during the game is pulled toward that reward: $Q(s,a) \leftarrow Q(s,a) + \alpha\,\big(r - Q(s,a)\big)$. If the opponent wins immediately after QBot's last move, that move is credited with the loss reward, so moves that hand the opponent a win are penalized.
The code alternates between Player 1 and Player 2, so QBot learns to play in both seats.
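A terminal‑only update of this kind can be sketched as follows; the reward convention, learning rate, and function name are illustrative, not the repo's actual implementation:

```python
# Hedged sketch: QBot's moves are buffered during the game, and only at the end
# is each buffered (state, action) pulled toward the terminal reward.

def apply_terminal_update(Q, episode, reward, alpha=0.1):
    """episode: list of (state, action) pairs QBot played, in order.
    reward: e.g. +1.0 for a win, -1.0 for a loss (including an opponent
    win right after QBot's last move), 0.0 for a draw."""
    for state, action in episode:
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (reward - old)
```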
- RandomBot: samples uniformly among legal moves. It is ideal for early training because it exposes many different opening positions quickly. However, once QBot becomes competent, games against RandomBot rarely progress into rich mid/late positions, so it provides diminishing returns. It may also reinforce QBot's blunders whenever QBot happens to win the game anyway.
- AlternateBot: plays myopically (tries to win now, otherwise blocks an immediate loss) for an early, randomly chosen number of plays and then switches to random play. This curriculum forces the game past the opening so QBot experiences, explores, and updates on late‑game structures. The trade‑off is that, after switching to random, AlternateBot may fail to punish certain blunders, so QBot is not always corrected for every mistake.
- MyopicBot: a deterministic, tactical baseline: if a winning move exists this turn, it takes it; else if the opponent could win next turn, it blocks; otherwise it picks randomly among remaining legal moves. Training against MyopicBot provides consistent punishment for obvious tactical errors and requires QBot to spot immediate wins and blocks.
- HorizonBot: a lookahead opponent that searches N plies (half-moves) ahead using negamax/minimax with alpha-beta pruning. For each legal move, it simulates the best play from both sides up to depth N; if the horizon is reached, it falls back to a heuristic that favors immediate wins, avoids moves that allow an immediate loss, and rewards strong 3-in-a-row potential (with extra value on stable RED-based plans vs spoilable GREEN/YELLOW). Stronger than MyopicBot at spotting multi-move tactics, but slower as N increases.
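A HorizonBot‑style search can be sketched with a generic negamax plus alpha‑beta pruning. The game interface below (`legal_moves`, `apply`, `undo`, `is_win`, `heuristic`) is an assumption standing in for the repo's actual board API:

```python
import math

def negamax(game, depth, alpha, beta):
    """Best achievable score for the side to move, searching `depth` plies;
    falls back to game.heuristic() at the horizon."""
    if game.is_win():          # the previous mover just completed a line
        return -math.inf       # so the side now to move has lost
    moves = game.legal_moves()
    if depth == 0 or not moves:
        return game.heuristic()
    best = -math.inf
    for move in moves:
        game.apply(move)
        score = -negamax(game, depth - 1, -beta, -alpha)
        game.undo(move)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:      # prune: the opponent will avoid this branch
            break
    return best

def best_move(game, depth):
    """Pick the legal move with the highest negamax score at the given depth."""
    best_score, best = -math.inf, None
    for move in game.legal_moves():
        game.apply(move)
        score = -negamax(game, depth - 1, -math.inf, math.inf)
        game.undo(move)
        if best is None or score > best_score:
            best_score, best = score, move
    return best
```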
Curriculum Mode (%): in addition to training against a single fixed opponent, the training includes a curriculum mode that mixes multiple opponents with percentage weights (summing to 100%). Each game, the opponent type is sampled according to these weights (e.g., 70% QBot self-play, 20% HorizonBot, 10% AlternateBot). This reduces overfitting to any single opponent style (it was, in fact, developed with the self-play case in mind) and helps the agent see a wider range of mid/late-game positions.
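Per‑game opponent sampling for curriculum mode can be as simple as a weighted draw; the opponent names and weights below are examples, not the repo's defaults:

```python
import random

# Example mix: weights are percentages summing to 100.
CURRICULUM = {"self_play": 70, "horizon": 20, "alternate": 10}

def sample_opponent(curriculum, rng=random):
    """Draw one opponent type per game, proportionally to its weight."""
    names = list(curriculum)
    weights = [curriculum[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```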
Self‑play note: a QBot‑vs‑QBot mode is a natural extension for covering late‑game slips that baseline bots might miss and, in principle, it would double data throughput (both players learn simultaneously). This training code includes such a QBot‑vs‑QBot mode, though it is recommended only after substantial training against the baseline opponents.
- Atomic Q‑table saves. When writing `qtable.json`, the code first writes to a temporary file (e.g., `qtable.json.tmp`), flushes to disk, and then performs an atomic rename. If the process stops mid‑write, the previous `qtable.json` remains intact. This is crucial if `qtable.json` is huge.
- Delta logger. Every Q‑value update is appended to a newline‑delimited JSON file (e.g., `q_deltas.jsonl`) as a compact record `{s, a, v}` containing the updated entry. On resume, the code replays all logged deltas into memory (or into an existing Q‑table) and then commits a single atomic save. After a successful consolidation, the delta file is rotated (e.g., renamed to `q_deltas.jsonl.done`) so it is not re‑applied.
- Checkpoints and safe interrupt. Periodically, and on Ctrl‑C, the code writes a checkpoint (e.g., `*_checkpoint.json`) storing the next game index and the current exploration rate $\varepsilon$. On interrupt, it also flushes any outstanding deltas and performs one last atomic save, so training can be resumed exactly where it stopped without losing Q‑table progress.
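The write‑temp‑then‑rename pattern can be sketched as follows; the helper name and error handling are illustrative:

```python
import json
import os
import tempfile

def atomic_save(qtable, path="qtable.json"):
    """Write qtable as JSON so that `path` is never left half-written."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(qtable, f)
            f.flush()
            os.fsync(f.fileno())   # ensure bytes reach disk before the rename
        os.replace(tmp, path)      # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)             # clean up the partial temp file
        raise
```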
- Python 3.x. No external pip dependencies.
- On Linux, install Tkinter if missing: `sudo apt install python3-tk`.

