While I had previously created a simpler PPO implementation, the growing popularity of the algorithm in training reasoning models (where it is applied quite differently than in robotic control) motivated me to explore its details further. That investigation led me to the resources listed in the references section, which I drew on directly to design and implement this improved version of PPO.

The implementation uses the following key hyperparameters:
```python
gamma = 0.99          # Discount factor
gae_lambda = 0.95     # GAE lambda parameter
clip_coef = 0.2       # PPO clipping coefficient
vf_coef = 0.5         # Value function loss coefficient
ent_coef = 0.0        # Entropy coefficient
lr = 3e-4             # Learning rate
batch_size = 2048     # Steps collected per update
minibatch_size = 64   # Size of minibatches for updates
```
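To show where `gamma`, `gae_lambda`, and `clip_coef` enter the algorithm, here is a minimal NumPy sketch of the two core computations they control: Generalized Advantage Estimation over a rollout and the clipped surrogate policy loss. The function names and array layout are my own illustrative choices, not necessarily those used in this implementation.

```python
import numpy as np

gamma = 0.99       # Discount factor
gae_lambda = 0.95  # GAE lambda parameter
clip_coef = 0.2    # PPO clipping coefficient

def compute_gae(rewards, values, dones, last_value):
    """Backward-recursive GAE over one rollout of length T.

    rewards, values, dones are arrays of shape (T,); last_value is the
    critic's estimate for the state following the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]  # zero out bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        lastgaelam = delta + gamma * gae_lambda * next_nonterminal * lastgaelam
        advantages[t] = lastgaelam
    return advantages

def clipped_policy_loss(new_logprobs, old_logprobs, advantages):
    """PPO clipped surrogate objective, expressed as a loss to minimize."""
    ratio = np.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```

With `gae_lambda = 0.95` the advantage estimate interpolates between the high-variance Monte Carlo return (`lambda = 1`) and the high-bias one-step TD error (`lambda = 0`), while the clip at `1 ± clip_coef` keeps each minibatch update from moving the policy too far from the one that collected the data.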