I had previously written a simpler PPO implementation, but the algorithm's growing popularity for training reasoning models (where it is applied differently than in robotic control) motivated me to study its details more closely. That investigation led me to the resources listed in the references section, which I used directly to design and implement this improved version of PPO.

[Figure: Walker2d Training Performance]

Hyperparameters

The implementation uses the following key hyperparameters:

gamma = 0.99         # Discount factor
gae_lambda = 0.95    # GAE lambda parameter
clip_coef = 0.2      # PPO clipping coefficient
vf_coef = 0.5        # Value function loss coefficient
ent_coef = 0.0       # Entropy coefficient
lr = 3e-4            # Learning rate
batch_size = 2048    # Steps collected per update
minibatch_size = 64  # Size of minibatches for updates
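
As a rough illustration of where these values enter the algorithm, the sketch below computes GAE advantages and a per-sample clipped PPO loss in plain Python. It is a minimal sketch and not the code from this repository: the helper names `compute_gae` and `ppo_loss`, the convention that `dones[t]` marks an episode ending at step `t`, and the 0.5-scaled squared-error value loss are all assumptions made for the example.

```python
import math

# Values copied from the hyperparameter list above.
gamma = 0.99
gae_lambda = 0.95
clip_coef = 0.2
vf_coef = 0.5
ent_coef = 0.0
batch_size = 2048
minibatch_size = 64


def compute_gae(rewards, values, dones, last_value):
    """Generalized Advantage Estimation over one rollout (hypothetical helper).

    Assumes dones[t] is True when the episode ended at step t, so no value
    is bootstrapped across that boundary.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error, then the exponentially weighted sum of TD errors.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
    return advantages


def ppo_loss(new_logp, old_logp, advantage, value_pred, value_target, entropy):
    """Per-sample loss: clipped surrogate + value loss - entropy bonus."""
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip_coef, min(1.0 + clip_coef, ratio))
    # Take the pessimistic (smaller) surrogate, negated to form a loss.
    policy_loss = -min(ratio * advantage, clipped_ratio * advantage)
    # Assumed form of the value loss: half the squared error.
    value_loss = 0.5 * (value_pred - value_target) ** 2
    return policy_loss + vf_coef * value_loss - ent_coef * entropy


# Each pass over a 2048-step rollout is split into this many minibatches.
minibatches_per_epoch = batch_size // minibatch_size
```

With `batch_size = 2048` and `minibatch_size = 64`, every pass over the collected rollout performs 32 gradient steps. With `ent_coef = 0.0` the entropy term drops out of the loss entirely.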

References

  1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms

  2. The 37 Implementation Details of Proximal Policy Optimization

  3. Arena Chapter 2: PPO Implementation

  4. What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study