A comparative study of classical and deep reinforcement learning algorithms on a multi-goal sequential navigation task with dynamic obstacles, implemented in a Pygame gridworld environment.
Screen capture of the trained skrl PPO agent evaluated over 100 episodes
This project was developed as part of the coursework for the Machine Learning class at Hochschule Bonn-Rhein-Sieg during Winter Semester 2025. We implemented and then compared two reinforcement learning approaches on a grid world with dynamic obstacles:
- Q-Learning — classical tabular RL (model-free, discrete)
- PPO (Proximal Policy Optimization) — deep RL with Actor-Critic neural networks
Both agents are trained to navigate a 10×10 grid, visiting 4 goals in sequence while avoiding 5 dynamically moving obstacles.
The environment is a custom 10×10 grid built with Pygame.
| Index | Feature | Description |
|---|---|---|
| 0 | agent_x | Agent row position (0–9) |
| 1 | agent_y | Agent column position (0–9) |
| 2 | goal_idx | Current goal index (0–3) |
| 3 | sensor_up | Obstacle/wall directly above (0 or 1) |
| 4 | sensor_right | Obstacle/wall directly right (0 or 1) |
| 5 | sensor_down | Obstacle/wall directly below (0 or 1) |
| 6 | sensor_left | Obstacle/wall directly left (0 or 1) |
| Action | Direction |
|---|---|
| 0 | Up |
| 1 | Right |
| 2 | Down |
| 3 | Left |
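The action table above can be sketched as a simple delta lookup. This is an illustrative reconstruction, not code from the repo; the names `ACTION_DELTAS` and `apply_action` are assumptions.

```python
# Hypothetical mapping of the four discrete actions to grid moves.
# Action indices follow the table above; agent_x is the row, agent_y the column.
ACTION_DELTAS = {
    0: (-1, 0),  # Up: decrease row
    1: (0, 1),   # Right: increase column
    2: (1, 0),   # Down: increase row
    3: (0, -1),  # Left: decrease column
}

def apply_action(pos, action, grid_size=10):
    """Return the new (row, col), clamped to the grid bounds."""
    dr, dc = ACTION_DELTAS[action]
    r = min(max(pos[0] + dr, 0), grid_size - 1)
    c = min(max(pos[1] + dc, 0), grid_size - 1)
    return (r, c)
```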
GOAL_SEQUENCE = [(2, 2), (5, 5), (8, 2), (1, 8)]  # Same for both algorithms

The agent must reach all 4 goals in order within 200 steps.
| Event | Reward |
|---|---|
| Reaching a goal | +100 |
| Completing all 4 goals | +50 (sequence bonus) |
| Each timestep | −1 (step penalty) |
| Obstacle collision | −10 (applies when the agent steps into an obstacle, or an obstacle moves onto the agent's position) |
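The reward table above can be collapsed into one small function. This is a sketch for illustration; the function name and flag arguments are assumptions, not the repo's actual signature.

```python
# Illustrative reward computation matching the reward table.
def compute_reward(reached_goal, all_goals_done, collided):
    reward = -1          # step penalty, applied every timestep
    if reached_goal:
        reward += 100    # reaching the current goal
    if all_goals_done:
        reward += 50     # sequence-completion bonus
    if collided:
        reward -= 10     # obstacle collision
    return reward
```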
- 5 obstacles move randomly each step (up/right/down/left/stay)
- They never overlap with goal positions
- They can move into the agent's position, triggering a collision penalty
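The obstacle rules above can be sketched as follows; `step_obstacle` and `MOVES` are illustrative names, and the goal-rejection strategy (stay put if the sampled move lands on a goal) is an assumption about how the constraint might be enforced.

```python
import random

# Sketch of one random obstacle step: pick up/right/down/left/stay,
# clamp to the grid, and never land on a goal cell.
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1), (0, 0)]

def step_obstacle(pos, goals, grid_size=10, rng=random):
    dr, dc = rng.choice(MOVES)
    r = min(max(pos[0] + dr, 0), grid_size - 1)
    c = min(max(pos[1] + dc, 0), grid_size - 1)
    # Obstacles never overlap goal positions; stay in place instead.
    return pos if (r, c) in goals else (r, c)
```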
Below are the TensorBoard results for the Q-learning and PPO agents, each evaluated over 100 episodes on the same goal sequence and environment configuration.

| Metric | Q-Learning | PPO (Deep RL) |
|---|---|---|
| Success Rate | 100% | 100% |
| SPL (Path Efficiency) | 0.91 | 0.83 |
| Avg Steps (Success) | 32.23 | 35.57 |
| Collision Rate | 83% | 65% |
| Optimal Episodes (29 steps) | 31 / 100 | 0 / 100 |
| Training Time | ~30 min | ~12 min |
Optimal path length = 29 steps (Manhattan distance through all 4 goals).
- Both algorithms achieved perfect 100% task completion.
- Q-Learning produced more optimal paths (SPL 0.91 vs 0.83) and reached exact optimal length in 31% of episodes.
- PPO navigated more safely with significantly fewer collisions (65% vs 83%).
- PPO converged by 100K timesteps; the remaining 300K timesteps went into stabilising the loss curves while maintaining the high cumulative reward.
- We conclude that for a gridworld with a discrete, manageable state-action space, tabular Q-learning is better suited than PPO for deriving a near-optimal policy.
| Metric | Definition |
|---|---|
| Success Rate | % of episodes where all 4 goals were reached within the 200-step timeout |
| Collision Rate | % of episodes where the agent hit at least one obstacle |
| SPL | (Optimal Steps / Actual Steps) × Success — path efficiency (1.0 = perfect) |
| Avg Steps | Mean timesteps per successful episode (lower = more efficient) |
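The SPL definition from the table can be written as a one-liner. A minimal sketch, assuming the document's formula as given; the function name is illustrative.

```python
# SPL (path efficiency) as defined in the metrics table:
# (optimal steps / actual steps) × success, so 1.0 means a perfect path.
def spl(success, optimal_steps, actual_steps):
    return (optimal_steps / actual_steps) * (1.0 if success else 0.0)
```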
The repository is divided into two main sections for each algorithm:
GridWorld-RL/
├── GridWorld_Q-Learning/ # Q-Learning Implementation
│ ├── evaluate_q_learning.py # Evaluation & video recording script
│ ├── grid_world.py # GridWorld environment (Pygame)
│ └── train_q_learning.py # Q-learning training script
├── gridworld_rl/ # PPO Implementation
│ ├── configs/
│ │ └── ppo_config.py # PPO hyperparameter configuration (skrl)
│ ├── envs/
│ │ ├── grid_world.py # Core GridWorld logic
│ │ └── gym_wrapper.py # Gym-compatible wrapper required to register the GridWorld as an OpenAI Gym environment for skrl
│ ├── models/
│ │ └── networks.py # Actor (PolicyNetwork) & Critic (ValueNetwork)
│ ├── utils/
│ │ └── metrics.py # Evaluation metrics
│ ├── evaluate.py # PPO evaluation script
│ ├── train.py # PPO training script
│ ├── ppo_eval_log.csv # PPO evaluation results
│ └── ppo_eval_summary.txt # PPO evaluation summary
├── ML_Course_Project_Presentation_Cuthinho-Ramamoorthy.pdf # Project Presentation
└── README.md
Classical tabular reinforcement learning. Stores a Q-value for every (state, action) pair.
Update Rule:
Q(s, a) ← Q(s, a) + α [r + γ · max Q(s', a') − Q(s, a)]
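The update rule above translates directly into one line of NumPy. This sketch uses the α and γ values from the hyperparameter table below; the function and table names are illustrative, not taken from `train_q_learning.py`.

```python
import numpy as np

# One tabular Q-learning update:
# Q(s, a) ← Q(s, a) + α [r + γ · max_a' Q(s', a') − Q(s, a)]
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q
```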
Hyperparameters:
| Parameter | Value |
|---|---|
| Learning Rate (α) | 0.1 |
| Discount Factor (γ) | 0.99 |
| Epsilon Start | 1.0 |
| Epsilon End | 0.05 |
| Epsilon Decay | 0.998 |
| Training Episodes | 3,000 |
| Max Steps per Episode | 200 |
Deep RL with separate Actor and Critic networks, implemented with skrl.
PPO Clipped Objective:
L(θ) = E[min(r_t(θ) · Â_t, clip(r_t(θ), 1−ε, 1+ε) · Â_t)]
where r_t(θ) = π_θ(a|s) / π_θ_old(a|s) and ε = 0.2.
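skrl computes this loss internally; the standalone PyTorch version below is only a sketch of the clipped objective above, with illustrative names. The sign is negated because optimizers minimize.

```python
import torch

# Clipped surrogate objective:
# L(θ) = E[min(r_t(θ) · Â_t, clip(r_t(θ), 1−ε, 1+ε) · Â_t)]
def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(log_probs - old_log_probs)  # r_t(θ) = π_θ / π_θ_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negate to maximize
```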
Hyperparameters:
| Parameter | Value |
|---|---|
| Learning Rate | 3.0 × 10⁻⁴ |
| Discount Factor (γ) | 0.99 |
| GAE Lambda (λ) | 0.95 |
| PPO Clip Range (ε) | 0.2 |
| Value Clip | 0.2 |
| Entropy Coefficient | 0.02 |
| Value Loss Scale | 0.5 |
| Max Gradient Norm | 0.5 |
| Rollout Length | 2,048 timesteps/env |
| Learning Epochs | 10 |
| Mini-Batches | 32 |
| Parallel Environments | 16 |
| Effective Batch Size | 32,768 (2,048 × 16) |
| Total Timesteps | 400,000 |
| Optimizer | Adam |
Both Actor and Critic networks share the same layer configuration but are independently trained:
PolicyNetwork (Actor): 7 → [256, ReLU] → [256, ReLU] → [256, ReLU] → 4
ValueNetwork (Critic): 7 → [256, ReLU] → [256, ReLU] → [256, ReLU] → 1
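The layer layout above can be mirrored with plain PyTorch modules. In the repo these are wrapped in skrl model classes (`networks.py`); this sketch reproduces only the stated sizes, and the `mlp` helper is an assumption.

```python
import torch.nn as nn

# Shared layer configuration: 7 inputs → three 256-unit ReLU layers → head.
def mlp(out_dim):
    return nn.Sequential(
        nn.Linear(7, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, out_dim),
    )

actor = mlp(4)   # 4 action logits (softmax applied downstream)
critic = mlp(1)  # scalar state value V(s)
```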
| Component | Details |
|---|---|
| Activation | ReLU |
| Total Parameters | ~35,717 |
| Actor Output | 4 action logits → softmax → probabilities |
| Critic Output | 1 scalar V(s) — expected cumulative reward |
# Clone the repository
git clone https://github.com/saiga006/GridWorld-RL.git
cd GridWorld-RL
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install torch numpy pygame opencv-python skrl tensorboard
cd GridWorld_Q-Learning
# Train
python train_q_learning.py
# Saves policy to: q_learning_policy.pkl
# Logs to: q_learning_training_log.csv
# Evaluate
python evaluate_q_learning.py
# Evaluates 100 episodes
# Saves results to: q_learning_eval_log.csv
# Records video to: q_learning_evaluation.mp4
cd gridworld_rl
# Train
python train.py
# Trains for 400,000 timesteps using 16 parallel envs
# Logs to: runs/ (TensorBoard)
# Saves model to: runs/gridworld_ppo_*/final_agent.pt
# Evaluate
python evaluate.py --model runs/gridworld_ppo_*/final_agent.pt
# Pass --render to watch the trained agent navigate the obstacles
# Saves results to: ppo_eval_log.csv
cd gridworld_rl
tensorboard --logdir runs/
# Open: http://localhost:6006
- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
- skrl library: https://skrl.readthedocs.io/
- OpenAI Spinning Up: https://spinningup.openai.com/
- Sai Mukkundan Ramamoorthy (skrl environment setup and PPO agent)
- Aaron Cuthinho (GridWorld environment setup and Q-learning agent)