This project covers basic reinforcement learning: algorithms and main equations for dynamic programming (DP), Monte Carlo (MC) methods, temporal difference (TD) learning, and deep reinforcement learning (DRL). The robot's task is to collect data from all sensors in the shortest possible time while avoiding collisions with obstacles.
- 5x5 grid env. (grid_env_55.ipynb)
  - Fast convergence, recommended
- 10x10 grid env. (grid_env.ipynb)
  - Slow convergence
- DP comes in two versions: state-value based and action-value based
- Policy evaluation, policy improvement
- Policy iteration
- Value iteration
- On-policy first-visit MC
- Off-policy first-visit MC
- TD: SARSA, Q-learning, Expected SARSA, Double Q-learning
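
The TD methods listed above differ mainly in the target they bootstrap from. The following is a minimal tabular sketch of that difference, not the notebooks' actual code: the function names (`epsilon_greedy_probs`, `td_target`, `td_update`), the nested-list Q-table, and the epsilon-greedy behaviour policy are all assumptions made for illustration. It handles non-terminal transitions only, and Double Q-learning is left out for brevity.

```python
def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state (assumed policy)."""
    n = len(q_row)
    greedy = max(range(n), key=lambda a: q_row[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def td_target(Q, r, s_next, a_next, gamma, epsilon, method):
    """One-step TD target for a non-terminal transition."""
    q_next = Q[s_next]
    if method == "sarsa":
        # On-policy: bootstrap from the action actually taken next.
        return r + gamma * q_next[a_next]
    if method == "q_learning":
        # Off-policy: bootstrap from the greedy action in the next state.
        return r + gamma * max(q_next)
    if method == "expected_sarsa":
        # Bootstrap from the expectation over the epsilon-greedy policy.
        probs = epsilon_greedy_probs(q_next, epsilon)
        return r + gamma * sum(p * q for p, q in zip(probs, q_next))
    raise ValueError(f"unknown method: {method}")


def td_update(Q, s, a, r, s_next, a_next=None,
              alpha=0.1, gamma=0.9, epsilon=0.1, method="q_learning"):
    """Move Q[s][a] a step of size alpha toward the TD target; return the new value."""
    target = td_target(Q, r, s_next, a_next, gamma, epsilon, method)
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]


# Toy Q-table with 2 states and 2 actions (hypothetical values).
Q = [[0.0, 0.0], [1.0, 3.0]]
new_value = td_update(Q, s=0, a=0, r=1.0, s_next=1,
                      alpha=0.5, gamma=0.5, method="q_learning")
print(new_value)  # 1.25
```

With the same transition, `method="sarsa"` and `a_next=0` would bootstrap from `Q[1][0]` instead of the maximum, which is what makes SARSA on-policy.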
We have a robot that aims to collect data from several low-powered IoT sensors. Because the sensors are low-powered, they cannot communicate over long ranges, so the robot must approach each sensor to collect its data. The robot starts its mission from the start terminal. The environment contains a charging station where the robot can recharge its battery when it is running low on energy, as well as several obstacles.
A sample result:
The following image depicts the environment:
- Red square: starting position
- Green square: charging station
- Black circles: IoT sensors
- Blue blocks: obstacles
In this project, we define the state as a four-channel image, shown below.
Because the state is image-like, we can use convolutional neural networks (CNNs) to solve the MDP.
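
As a rough illustration of what a four-channel state could look like in code, here is a minimal sketch. The meaning assigned to each channel (robot, remaining sensors, obstacles, charging station), the function name `encode_state`, and the coordinates are assumptions for the example; the project's actual channel definition may differ. The channels-first shape `(4, H, W)` is also an assumption, chosen because it is a common input layout for CNN libraries.

```python
import numpy as np


def encode_state(grid_size, robot, sensors, obstacles, charger):
    """Return a (4, grid_size, grid_size) float32 array.

    Channel meanings are hypothetical, not taken from the project:
      0: robot position
      1: sensors still to be collected
      2: obstacles
      3: charging station
    """
    state = np.zeros((4, grid_size, grid_size), dtype=np.float32)
    state[0, robot[0], robot[1]] = 1.0
    for row, col in sensors:
        state[1, row, col] = 1.0
    for row, col in obstacles:
        state[2, row, col] = 1.0
    state[3, charger[0], charger[1]] = 1.0
    return state


# Hypothetical 5x5 layout with two sensors and one obstacle.
state = encode_state(
    grid_size=5,
    robot=(0, 0),
    sensors=[(1, 3), (4, 2)],
    obstacles=[(2, 2)],
    charger=(4, 4),
)
print(state.shape)  # (4, 5, 5)
```

Under this encoding, collecting a sensor would clear its cell in channel 1, so the image alone tells the agent which sensors remain.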





