While I had previously created a simpler PPO implementation, the growing popularity of the algorithm in training reasoning models (where it is applied quite differently than in robotic control) motivated me to explore its details further. That investigation led me to the resources listed in the references section, which I drew on directly to design and implement this improved version of PPO.

The implementation uses the following key hyperparameters:
```python
gamma = 0.99          # Discount factor
gae_lambda = 0.95     # GAE lambda parameter
clip_coef = 0.2       # PPO clipping coefficient
vf_coef = 0.5         # Value function loss coefficient
ent_coef = 0.0        # Entropy coefficient
lr = 3e-4             # Learning rate
batch_size = 2048     # Steps collected per update
minibatch_size = 64   # Size of minibatches for updates
```
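To show where `gamma`, `gae_lambda`, and `clip_coef` enter the algorithm, here is a minimal NumPy sketch of the two core computations they control: Generalized Advantage Estimation over a rollout and the clipped surrogate policy loss. The function names and array layout are my own illustrative choices, not necessarily those used in this implementation.

```python
import numpy as np

gamma = 0.99       # Discount factor
gae_lambda = 0.95  # GAE lambda parameter
clip_coef = 0.2    # PPO clipping coefficient

def compute_gae(rewards, values, dones, last_value):
    """Backward-recursive GAE over one rollout of length T.

    rewards, values, dones are arrays of shape (T,); last_value is the
    critic's estimate for the state following the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]  # zero out bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        lastgaelam = delta + gamma * gae_lambda * next_nonterminal * lastgaelam
        advantages[t] = lastgaelam
    return advantages

def clipped_policy_loss(new_logprobs, old_logprobs, advantages):
    """PPO clipped surrogate objective, expressed as a loss to minimize."""
    ratio = np.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```

With `gae_lambda = 0.95` the advantage estimate interpolates between the high-variance Monte Carlo return (`lambda = 1`) and the high-bias one-step TD error (`lambda = 0`), while the clip at `1 ± clip_coef` keeps each minibatch update from moving the policy too far from the one that collected the data.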