PPO based on stable_baselines3
A reinforcement learning practice
What is stable_baselines3?
stable_baselines3 is a package of reinforcement learning algorithm implementations developed by awesome researchers and engineers from Germany, France, the USA, and Finland, as you can discover from the author list of the paper:
Also, I really like the abstract, as it is concise and informative. I especially appreciate that the researchers and engineers take clean code seriously and provide high-quality implementations with comprehensive testing. One thing the abstract does not mention is that the library is built on PyTorch; however, you can discover that from the cute logo.
I am going to explore how to use this package to implement the PPO algorithm in reinforcement learning.
Implemented algorithms (as of July 30, 2022)
From [1]
What is PPO algorithm?
Proximal Policy Optimization (PPO) is a policy-based reinforcement learning algorithm. Some nice tutorials can be found in [3].
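To give a flavor of what PPO actually optimizes, here is the clipped surrogate objective from the original PPO paper (standard background, not something specific to stable_baselines3):

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

where \hat{A}_t is the advantage estimate and \epsilon is the clip range. The clipping keeps each update close to the previous policy, which is what makes PPO comparatively stable to train.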
CartPole
Code from [6]
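The full snippet is linked above; as a minimal sketch of what the training script looks like (the timestep budget below is my own choice, not necessarily that of [6]):

import gym
from stable_baselines3 import PPO

# Create the CartPole environment
env = gym.make("CartPole-v1")

# PPO with a simple MLP policy; verbose=1 prints the training log
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Watch the trained agent balance the pole
obs = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
env.close()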
After training, we can achieve a good final state: the agent keeps the pole balanced.
The training log can be found here. From the log, we can see that as training progresses, loss, policy_gradient_loss, and value_loss are all greatly reduced.
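If you want to inspect these curves interactively rather than in the text log, stable_baselines3 can also write them to TensorBoard via its tensorboard_log argument. A minimal sketch (the log directory name is my own choice):

from stable_baselines3 import PPO

# Passing the environment id as a string lets SB3 call gym.make internally.
# View the curves with: tensorboard --logdir ./ppo_cartpole_tb/
model = PPO("MlpPolicy", "CartPole-v1", verbose=1,
            tensorboard_log="./ppo_cartpole_tb/")
model.learn(total_timesteps=100_000)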
LunarLander
To run this environment on Linux, you need:
conda install swig # needed to build Box2D in the pip install
pip install box2d-py # a repackaged version of pybox2d
Without these packages installed, you will get the following error:
AttributeError: module 'gym.envs.box2d' has no attribute 'LunarLander'
Code and log can be found here.
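For reference, a minimal sketch of the LunarLander training script (the timestep budget and save path are my own assumptions; LunarLander typically needs far more timesteps than CartPole):

import gym
from stable_baselines3 import PPO

# Requires swig and box2d-py, installed as shown above
env = gym.make("LunarLander-v2")

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)

# Save the trained model and reload it later
model.save("ppo_lunarlander")
model = PPO.load("ppo_lunarlander", env=env)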
References
[1] stable_baselines3 GitHub
[2] stable_baselines3 paper
[5] stable-baselines blog post on Medium by the authors of the paper