- Elyse Marie Uyiringiye
- Nice Eva Karabaranga
- Best Verie Iradukunda
- Raissa Irutingabo
[Project Drive folder](https://drive.google.com/drive/folders/18KEPQ7AlXoZxDiDu8lFAwod63G7g-8rQ?usp=sharing)
This project implements Deep Q-Network (DQN) agents trained to play the Atari Boxing-v5 environment using Stable-Baselines3 and Gymnasium. The assignment requires training and comparing multiple DQN configurations with different hyperparameters and policy architectures (CnnPolicy vs. MlpPolicy).
- Environment: Gymnasium Atari ALE/Boxing-v5
- Framework: Stable-Baselines3
- Algorithm: DQN
- scripts/train.py: trains the different DQN experiments and saves model artifacts
- scripts/play.py: loads the trained best model and runs evaluation gameplay
- stable-baselines3 (≥2.3.2): DQN algorithm and policy implementations
- gymnasium[atari] (≥1.0.0): Atari environment wrapper
- ale-py (≥0.10.1): Atari Learning Environment interface
- torch: neural network backend
- pandas, matplotlib: data analysis and visualization
```bash
pip install -r requirements.txt
AutoROM --accept-license
```
| Experiment | Policy | Mean Reward | Std Reward | Train Time (min) | Notes |
|---|---|---|---|---|---|
| boxing_exp01_baseline_cnn | CnnPolicy | 5.9 | 5.50 | 14.23 | Best overall |
| boxing_exp05_high_gamma_cnn | CnnPolicy | 4.7 | 3.66 | 14.11 | Higher gamma |
| boxing_exp08_less_exploration_cnn | CnnPolicy | 4.1 | 4.50 | 14.36 | Less exploration |
| boxing_exp02_small_batch_cnn | CnnPolicy | 3.8 | 5.27 | 13.62 | Smaller batch |
| boxing_exp04_low_gamma_cnn | CnnPolicy | 1.2 | 3.91 | 14.08 | Lower gamma |
| boxing_exp06_gamma_zero_cnn | CnnPolicy | -0.3 | 6.33 | 14.16 | No future reward |
| boxing_exp03_large_batch_cnn | CnnPolicy | -0.6 | 3.29 | 14.75 | Large batch |
| boxing_exp07_more_exploration_cnn | CnnPolicy | -0.7 | 5.08 | 13.96 | More exploration |
| boxing_exp09_more_updates_cnn | CnnPolicy | -1.3 | 4.27 | 19.49 | More gradient steps |
| boxing_exp11_small_batch_mlp | MlpPolicy | -14.4 | 5.64 | 7.04 | MLP policy |
| boxing_exp10_baseline_mlp | MlpPolicy | -27.3 | 4.47 | 7.18 | Worst performance |
The best-performing model is boxing_exp01_baseline_cnn, with a mean reward of 5.9.
The baseline CNN model outperformed others because it used balanced default hyperparameters that ensured stable learning. With only 100,000 timesteps, modified settings like high gamma or altered exploration failed to converge. The CNN also captured spatial features effectively, making it more robust than other configurations, especially under limited training conditions.
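The role of gamma in these results (baseline gamma=0.99 vs. the gamma-zero run) comes down to the discounted return the Q-values estimate. A minimal sketch, using made-up per-step rewards for illustration:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t -- the quantity a DQN agent's Q-values estimate."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical per-step rewards: a punch lands (+1) only after a few setup steps.
rewards = [0, 0, 0, 1, 0, 0, 1]

print(discounted_return(rewards, 0.0))   # gamma=0: only the immediate reward counts, so 0
print(discounted_return(rewards, 0.99))  # gamma=0.99: future punches still matter, ~1.91
```

With gamma=0 the agent sees no value in actions that set up a future punch, which is consistent with boxing_exp06_gamma_zero_cnn landing near the bottom of the table.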
- experiments/BestVerie_experiments.ipynb: Full training pipeline with callbacks and visualization
- Hyperparameter_tables/BestVerie_hyperparameter_results.csv: Summary results table
- In the Drive link: /BestVerie.zip contains the logs, models, and everything else related to the training process
[My model playing](https://drive.google.com/file/d/14716OGsd0Bl2DVU9waerE_U_6jVZnpA7/view?usp=sharing)
Raissa Experiments (10 Configurations)

Notebook used: experiments/irutingabo-experiments.ipynb

This section defines 10 experiment configurations arranged as 5 paired CNN/MLP groups; each pair shares identical hyperparameters so that CnnPolicy and MlpPolicy can be compared fairly under the same conditions.

- CSV results: results/raissa/tables/raissa_hyperparameter_results.csv
- Markdown results: results/raissa/tables/raissa_hyperparameter_results.md
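The paired layout can be sketched by expanding each shared hyperparameter set into a CnnPolicy/MlpPolicy pair. The helper below is hypothetical (not the notebook's actual code); the names and values are copied from the table that follows:

```python
# Five shared hyperparameter sets; each yields one CnnPolicy and one MlpPolicy experiment.
SHARED_SETS = [
    ("baseline",    dict(lr=1e-4,   gamma=0.99, batch=32,  eps_start=1.0, eps_end=0.05, eps_decay=0.1)),
    ("low_lr",      dict(lr=5e-5,   gamma=0.99, batch=32,  eps_start=1.0, eps_end=0.05, eps_decay=0.1)),
    ("high_lr",     dict(lr=2.5e-4, gamma=0.99, batch=32,  eps_start=1.0, eps_end=0.05, eps_decay=0.1)),
    ("slow_eps",    dict(lr=1e-4,   gamma=0.99, batch=32,  eps_start=1.0, eps_end=0.1,  eps_decay=0.5)),
    ("large_batch", dict(lr=1e-4,   gamma=0.99, batch=128, eps_start=1.0, eps_end=0.05, eps_decay=0.1)),
]

def paired_experiments(shared_sets):
    """Expand every shared set into a CnnPolicy/MlpPolicy pair with identical hyperparameters."""
    experiments = {}
    n = 0
    for name, params in shared_sets:
        for policy, tag in (("CnnPolicy", "cnn"), ("MlpPolicy", "mlp")):
            n += 1
            experiments[f"raissa_exp{n:02d}_{tag}_{name}"] = {"policy": policy, **params}
    return experiments

configs = paired_experiments(SHARED_SETS)
# configs["raissa_exp01_cnn_baseline"] and configs["raissa_exp02_mlp_baseline"]
# differ only in the "policy" entry.
```

Generating both members of a pair from one shared dict guarantees the CNN-vs-MLP comparison is not confounded by any other hyperparameter.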
| Member Name | Experiment | Hyperparameter Set | Noted Behavior |
|---|---|---|---|
| Raissa | raissa_exp01_cnn_baseline | lr=0.0001, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.1 | CNN configuration for Boxing; compare stability and final reward against baseline. Mean reward: -9.80 +/- 9.41. |
| Raissa | raissa_exp02_mlp_baseline | lr=0.0001, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.1 | MLP ablation on Boxing; typically weaker than CNN on pixel observations. Mean reward: -6.80 +/- 6.85. |
| Raissa | raissa_exp03_cnn_low_lr | lr=5e-05, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.1 | CNN configuration for Boxing; compare stability and final reward against baseline. Mean reward: -1.20 +/- 1.17. |
| Raissa | raissa_exp04_mlp_low_lr | lr=5e-05, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.1 | MLP ablation on Boxing; typically weaker than CNN on pixel observations. Mean reward: -2.40 +/- 6.34. |
| Raissa | raissa_exp05_cnn_high_lr | lr=0.00025, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.1 | CNN configuration for Boxing; compare stability and final reward against baseline. Mean reward: -7.20 +/- 8.01. |
| Raissa | raissa_exp06_mlp_high_lr | lr=0.00025, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.1 | MLP ablation on Boxing; typically weaker than CNN on pixel observations. Mean reward: -7.00 +/- 4.94. |
| Raissa | raissa_exp07_cnn_slow_eps | lr=0.0001, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.1, epsilon_decay=0.5 | CNN configuration for Boxing; compare stability and final reward against baseline. Mean reward: -8.00 +/- 9.90. |
| Raissa | raissa_exp08_mlp_slow_eps | lr=0.0001, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.1, epsilon_decay=0.5 | MLP ablation on Boxing; typically weaker than CNN on pixel observations. Mean reward: -15.00 +/- 3.74. |
| Raissa | raissa_exp09_cnn_large_batch | lr=0.0001, gamma=0.99, batch=128, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.1 | CNN configuration for Boxing; compare stability and final reward against baseline. Mean reward: -4.00 +/- 6.20. |
| Raissa | raissa_exp10_mlp_large_batch | lr=0.0001, gamma=0.99, batch=128, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.1 | MLP ablation on Boxing; typically weaker than CNN on pixel observations. Mean reward: 2.80 +/- 5.46. |
| Experiment Name | Policy | Mean Reward | Std Dev | Train Min |
|---|---|---|---|---|
| raissa_exp10_mlp_large_batch | MlpPolicy | 2.8 | 5.46 | 1.53 |
| raissa_exp03_cnn_low_lr | CnnPolicy | -1.2 | 1.17 | 5.08 |
| raissa_exp04_mlp_low_lr | MlpPolicy | -2.4 | 6.34 | 1.55 |
| raissa_exp09_cnn_large_batch | CnnPolicy | -4.0 | 6.20 | 6.34 |
| raissa_exp02_mlp_baseline | MlpPolicy | -6.8 | 6.85 | 1.53 |
| raissa_exp06_mlp_high_lr | MlpPolicy | -7.0 | 4.94 | 1.53 |
| raissa_exp05_cnn_high_lr | CnnPolicy | -7.2 | 8.01 | 5.04 |
| raissa_exp07_cnn_slow_eps | CnnPolicy | -8.0 | 9.90 | 4.82 |
| raissa_exp01_cnn_baseline | CnnPolicy | -9.8 | 9.41 | 5.13 |
| raissa_exp08_mlp_slow_eps | MlpPolicy | -15.0 | 3.74 | 1.46 |
Findings and Insights
- Best model: raissa_exp10_mlp_large_batch, with a mean reward of 2.8. It was the only experiment of the 10 to achieve a positive score.
- Surprising result: MLP outperformed CNN overall. The average mean reward across all five CNN experiments was -6.04, while the MLP average was -5.68. This is unexpected since Boxing is a pixel-based environment. However, with only 50,000 timesteps, the CNN likely did not have enough time to learn useful visual features. In contrast, the MLP trained faster (about 1.5 minutes vs. 5 minutes) and made better use of the limited training budget.
- What helped: lower learning rate. Reducing the learning rate to 5e-5 produced the best CNN result (-1.2) and the second-best MLP result (-2.4). More cautious updates appear to stabilize training under a short timestep budget.
- What helped: larger batch size for MLP. The MLP with a batch size of 128 performed best overall. Smoother gradient estimates provided a stronger learning signal, allowing the model to develop a basic positive strategy.
- What hurt: higher learning rate. At 2.5e-4, updates were too aggressive, leading to unstable training and inconsistent Q-value estimates.
- What hurt the most: slow epsilon decay. Extending exploration to 50% of training was the worst-performing decision. Spending too much of a short 50k-step budget on random exploration left insufficient time for effective exploitation.
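The slow-decay finding can be illustrated with a reimplementation of the linear schedule that Stable-Baselines3's `exploration_fraction` parameter applies (assuming the `epsilon_decay` values in the tables above map to that fraction; this sketch is for illustration, not the library's code):

```python
def epsilon(step, total_steps, eps_start=1.0, eps_end=0.05, exploration_fraction=0.1):
    """Linear decay from eps_start to eps_end over the first
    exploration_fraction * total_steps steps, then flat at eps_end."""
    decay_steps = exploration_fraction * total_steps
    if step >= decay_steps:
        return eps_end
    return eps_start + (eps_end - eps_start) * (step / decay_steps)

TOTAL = 50_000  # training budget per experiment in this section

# Fast decay (fraction=0.1): greedy after 5k steps, leaving 45k steps to exploit.
print(epsilon(5_000, TOTAL, exploration_fraction=0.1))                   # 0.05
# Slow decay (fraction=0.5): still more than half-random a quarter of the way in.
print(epsilon(12_500, TOTAL, eps_end=0.1, exploration_fraction=0.5))     # 0.55
```

With a 0.5 fraction, fully 25,000 of the 50,000 steps are spent partly random, which matches the poor raissa_exp07/exp08 results.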
| Property | Value |
|---|---|
| Environment | ALE/Boxing-v5 |
| Action Space | Discrete(18) |
| Observation Space | Box(0, 255, (210, 160, 3), uint8) |
| Algorithm | DQN + CnnPolicy |
| Total Timesteps | 100,000 per experiment |
| Member | Hyperparameter Set | Noted Behavior |
|---|---|---|
| Elyse | lr=1e-4, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.10 | [Exp 1 - Baseline] Stable training. Reward improves steadily. Agent learns to land punches over time. Reference point for all other experiments. Mean Reward: 3.8 |
| Elyse | lr=5e-4, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.10 | [Exp 2 - High LR] Q-values diverge catastrophically. Training loss spikes. lr=5e-4 is too aggressive — worst experiment. Mean Reward: -41.0 |
| Elyse | lr=1e-5, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.10 | [Exp 3 - Low LR] Very slow convergence. Agent still near-random at midpoint. Rewards far below baseline. Mean Reward: -4.4 |
| Elyse | lr=1e-4, gamma=0.90, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.10 | [Exp 4 - Low Gamma] Agent is myopic — ignores long-term scoring. Rewards plateau lower than baseline. Mean Reward: 1.4 |
| Elyse | lr=1e-4, gamma=0.999, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.10 | [Exp 5 - High Gamma] Agent values long-term strategy. Slightly slower early learning but more deliberate play. Mean Reward: 3.4 |
| Elyse | lr=1e-4, gamma=0.99, batch=128, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.10 | [Exp 6 - Large Batch] Smoother loss curve but fewer updates per timestep. Final performance below baseline. Mean Reward: -1.2 |
| Elyse | lr=1e-4, gamma=0.99, batch=16, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.10 | [Exp 7 - Small Batch] Noisy gradients but frequent updates. Surprisingly good performance. Mean Reward: 7.4 |
| Elyse | lr=1e-4, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.10, epsilon_decay=0.50 | [Exp 8 - Slow Epsilon Decay] Extended exploration fills replay buffer with diverse transitions. Mean Reward: 3.4 |
| Elyse | lr=1e-4, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.05 | [Exp 9 - Fast Epsilon Decay] Commits to exploitation early. Best result — Boxing is simple enough that fast exploitation beats extended exploration. Mean Reward: 11.2 Best |
| Elyse | lr=1e-4, gamma=0.99, batch=32, epsilon_start=1.0, epsilon_end=0.05, epsilon_decay=0.10 [MlpPolicy] | [Exp 10 - MLP Ablation] Same hyperparameters as Exp 1, only policy differs. Cannot extract spatial features. 37-point gap vs CNN confirms CnnPolicy is essential. Mean Reward: -33.2 |
| # | Experiment | Policy | lr | gamma | batch | ε_start | ε_end | ε_decay | Mean Reward |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Baseline | CnnPolicy | 1e-4 | 0.99 | 32 | 1.0 | 0.05 | 0.10 | 3.80 ± 6.05 |
| 2 | High LR | CnnPolicy | 5e-4 | 0.99 | 32 | 1.0 | 0.05 | 0.10 | -41.00 ± 1.67 |
| 3 | Low LR | CnnPolicy | 1e-5 | 0.99 | 32 | 1.0 | 0.05 | 0.10 | -4.40 ± 2.06 |
| 4 | Low Gamma | CnnPolicy | 1e-4 | 0.90 | 32 | 1.0 | 0.05 | 0.10 | 1.40 ± 3.61 |
| 5 | High Gamma | CnnPolicy | 1e-4 | 0.999 | 32 | 1.0 | 0.05 | 0.10 | 3.40 ± 5.12 |
| 6 | Large Batch | CnnPolicy | 1e-4 | 0.99 | 128 | 1.0 | 0.05 | 0.10 | -1.20 ± 3.71 |
| 7 | Small Batch | CnnPolicy | 1e-4 | 0.99 | 16 | 1.0 | 0.05 | 0.10 | 7.40 ± 7.09 |
| 8 | Slow ε Decay | CnnPolicy | 1e-4 | 0.99 | 32 | 1.0 | 0.10 | 0.50 | 3.40 ± 3.01 |
| 9 | Fast ε Decay | CnnPolicy | 1e-4 | 0.99 | 32 | 1.0 | 0.01 | 0.05 | 11.20 ± 6.76 |
| 10 | MLP Ablation | MlpPolicy | 1e-4 | 0.99 | 32 | 1.0 | 0.05 | 0.10 | -33.20 ± 2.40 |
The best-performing model was exp09_fast_eps (CnnPolicy), with a mean reward of 11.2, the highest in the group. Fast epsilon decay worked well here because Boxing has a simple, dense reward signal: the agent receives feedback on every punch, so it quickly learns a good strategy and benefits more from exploiting it early than from spending time on random exploration. The high-learning-rate experiment (lr=5e-4) was the worst, scoring -41.0, because overly large updates destabilised the Q-values and the agent never converged. The CNN policy consistently outperformed the MLP under identical settings: CNN scored 3.8 while MLP scored -33.2, confirming that convolutional layers are essential for pixel-based games, where spatial understanding of the opponent's position and movement is critical to learning an effective policy.
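The CNN/MLP gap is consistent with what MlpPolicy does to the observation: it flattens the image into one long vector, discarding which pixels are adjacent. A quick size check using the raw observation shape from the environment table (note that in practice Atari wrappers usually downscale frames first, so this is an upper-bound illustration):

```python
# Raw ALE/Boxing-v5 observation shape: Box(0, 255, (210, 160, 3), uint8).
height, width, channels = 210, 160, 3

flat_inputs = height * width * channels
print(flat_inputs)        # 100800 inputs to MlpPolicy's first dense layer

# A hypothetical first dense layer of 256 units would already need this many weights,
# none of which encode spatial locality the way a convolution kernel does:
print(flat_inputs * 256)  # 25804800
```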
At evaluation, the agent acts fully greedily (exploration rate 0.0, always choosing argmax Q(s, a)).
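Greedy action selection is just an argmax over the Q-values for the current state; a minimal sketch (the Q-values below are made up for illustration):

```python
def greedy_action(q_values):
    """With exploration rate 0.0, always pick the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for the 18 discrete Boxing actions.
q = [0.1, -0.3, 0.7, 0.2] + [0.0] * 14
print(greedy_action(q))  # 2
```

In Stable-Baselines3 the same behaviour is obtained by passing `deterministic=True` to `model.predict`.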
| Property | Value |
|---|---|
| Environment | ALE/Boxing-v5 |
| Action Space | Discrete(18) |
| Observation Space | Box(0, 255, (210, 160, 3), uint8) |
| Algorithm | DQN + CnnPolicy |
| Total Timesteps | 100,000 per experiment |
| Member | Hyperparameter Set | Noted Behavior |
|---|---|---|
| Nice | lr=2.5e-4, gamma=0.95, batch=64, eps_start=1.0, eps_end=0.02, eps_fraction=0.20 | [Exp 1 - Baseline] Moderate config, stable reference point. Mean Reward: -0.20 |
| Nice | lr=2.5e-4, gamma=0.95, batch=64, eps_start=1.0, eps_end=0.0, eps_fraction=0.20 | [Exp 2 - Zero Eps End] Fully greedy at end, there's no residual exploration. Best performer. Mean Reward: +4.40 ✅ Best |
| Nice | lr=2.5e-4, gamma=0.0, batch=64, eps_start=1.0, eps_end=0.02, eps_fraction=0.20 | [Exp 3 - Zero Gamma] Fully myopic agent ignores all future rewards. Expected poor performance confirmed. Mean Reward: -2.80 |
| Nice | lr=1e-3, gamma=0.95, batch=64, eps_start=1.0, eps_end=0.02, eps_fraction=0.20 | [Exp 4 - Very High LR] Aggressive updates cause unstable Q-values. Worst CNN experiment. Mean Reward: -24.80 |
| Nice | lr=1e-6, gamma=0.95, batch=64, eps_start=1.0, eps_end=0.02, eps_fraction=0.20 | [Exp 5 - Tiny LR] Near-zero learning rate: agent barely updates. Stagnant performance. Mean Reward: -11.80 |
| Nice | lr=2.5e-4, gamma=0.999, batch=256, eps_start=1.0, eps_end=0.02, eps_fraction=0.20 | [Exp 6 - Large Batch + High Gamma] Stable gradients with strong future valuation. Mean Reward: +0.40 |
| Nice | lr=2.5e-4, gamma=0.95, batch=64, eps_start=1.0, eps_end=0.02, eps_fraction=0.01 | [Exp 7 - Instant Exploit] Epsilon collapses in the first 1% of training; almost no exploration phase. Mean Reward: +2.00 |
| Nice | lr=2.5e-4, gamma=0.95, batch=64, eps_start=1.0, eps_end=0.02, eps_fraction=0.90 | [Exp 8 - Full Explore] Explores for 90% of training: agent learns slowly and struggles to exploit. Mean Reward: 0.00 |
| Nice | lr=2.5e-4, gamma=0.99, batch=64, eps_start=1.0, eps_end=0.01, eps_fraction=0.15 | [Exp 9 - Best Guess] Balanced config combining good gamma and low eps_end. Mean Reward: +1.00 |
| Nice | lr=2.5e-4, gamma=0.99, batch=64, eps_start=1.0, eps_end=0.0, eps_fraction=0.20 | [Exp 11 - Zero Eps + High Gamma] Greedy convergence combined with strong future valuation. Mean Reward: -0.20 |
| Nice | lr=2.5e-4, gamma=0.95, batch=128, eps_start=1.0, eps_end=0.0, eps_fraction=0.20 | [Exp 12 - Zero Eps + Large Batch] Larger batch stabilises gradients with fully greedy end. Mean Reward: -0.40 |
| Nice | lr=5e-4, gamma=0.99, batch=64, eps_start=1.0, eps_end=0.0, eps_fraction=0.15 | [Exp 13 - Tuned LR + Zero Eps] Slightly higher LR with zero eps and high gamma; LR too high for zero eps. Mean Reward: -20.20 |
| Metric | Value |
|---|---|
| Mean Reward | 4.40 ± 3.72 |
| Policy | CnnPolicy |
| Key Setting | eps_end=0.0 (fully greedy convergence) |
- Best config: exp02_zero_eps_end. Setting epsilon_end to 0.0 forces fully greedy exploitation at the end of training, giving the highest reward.
- Worst config: exp04_very_high_lr. lr=1e-3 caused Q-value instability, scoring -24.80.
- Zero gamma finding: setting gamma=0.0 makes the agent fully myopic, meaning it only optimises for immediate reward and cannot learn long-term boxing strategy.
- Exploration tradeoff: both instant exploitation (exp07) and full exploration (exp08) underperformed, confirming that a balanced epsilon decay is important.
- Notebook: experiments/Nice_experiments.ipynb
- Best model: Nice_dqn_model.zip
- Hyperparameter table: hyperparameter_table_nice.csv
- Reward comparison: assets/nice_reward_comparison.png
- Training curves: assets/nice_training_curves.png