TD learning methods are model-free and do not require the task to be episodic, so they can be used for both evaluation and control of continuing tasks. They are computationally cheaper than Monte Carlo methods because updates are made online at every step; there is no need to wait for the end of an episode. TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap): the target is the reward plus the discounted value estimate of the next state, not the true return.
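The TD(0) prediction update described above can be sketched as follows. The random-walk environment here is a hypothetical toy stand-in used only for illustration, not one of the environments in this repo; the assumed setup is states 0–6 with 0 and 6 terminal and a reward of +1 for reaching state 6.

```python
import random

# Hypothetical 1-D random-walk environment (illustration only):
# states 0..6, states 0 and 6 terminal, reward +1 on reaching state 6.
def step(state):
    next_state = state + random.choice([-1, 1])
    reward = 1.0 if next_state == 6 else 0.0
    done = next_state in (0, 6)
    return next_state, reward, done

def td0_evaluate(episodes=5000, alpha=0.1, gamma=1.0):
    V = [0.0] * 7  # value estimate per state; terminal states stay 0
    for _ in range(episodes):
        s = 3  # start in the middle
        done = False
        while not done:
            s2, r, done = step(s)
            # Bootstrapped target: reward plus estimated value of next state,
            # updated online at every step (no waiting for the episode to end).
            target = r if done else r + gamma * V[s2]
            V[s] += alpha * (target - V[s])
            s = s2
    return V
```

Running `td0_evaluate()` should drive the estimates for states 1–5 toward the true values 1/6, 2/6, …, 5/6.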
Whether TD or MC is better remains an open question for debate and research.
- SARSA
- Q-Learning
- Comparative study of SARSA and Q-Learning
- SARSA on WindyGridworld
- Q-Learning on Cliff-walk
- Comparison of Q-Learning and SARSA on Cliff-walk
The red curve is from Q-Learning and the blue one from SARSA.
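The difference between the two control methods compared above comes down to their update targets. A minimal sketch, assuming a tabular Q stored as a dict keyed by (state, action) pairs (the function names and the epsilon-greedy helper are illustrative, not this repo's API):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma, done):
    # SARSA is on-policy: the target uses Q(s', a') for the action a'
    # actually selected by the behaviour policy in s'.
    target = r if done else r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma, done):
    # Q-learning is off-policy: the target uses the greedy (max) action
    # in s', regardless of which action the behaviour policy takes.
    target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

This on-policy versus off-policy distinction is what produces the different paths on Cliff-walk: SARSA's target accounts for the exploratory actions it will actually take, so it learns the safer path, while Q-learning learns the greedy path along the cliff edge.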
- For detailed proofs and theory, refer to Sutton and Barto.
- For an explanation of the environments, visit the GitHub page.


