PPO based on stable_baselines3

A reinforcement learning practice

Jimmy (xiaoke) Shen
Aug 3, 2022

What is stable_baselines3?

Figure: the stable_baselines3 logo

stable_baselines3 is a reinforcement learning implementation package developed by awesome researchers and engineers from Germany, France, the USA, and Finland, as you can discover from the author list of the paper:

From [2]

I also really like the abstract, as it is concise and informative. I appreciate that the researchers and engineers take clean code into consideration and provide a high-quality implementation with comprehensive testing and elegant code. One thing the abstract does not mention is that the system is implemented on top of PyTorch; however, you can discover that from the cute logo.

From [2]

I am going to explore how to use this package to implement the PPO algorithm in reinforcement learning.

Implemented algorithms (As of July 30, 2022)

From [1]

What is PPO algorithm?

Proximal Policy Optimization (PPO) is a policy-gradient reinforcement learning algorithm. Its key idea is a clipped surrogate objective that keeps each policy update close to the previous policy. Some nice tutorials can be found in [3].
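To make the clipped objective concrete, here is a minimal NumPy sketch (the function name and array shapes are my own illustration, not taken from [3]):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper.

    ratio: pi_new(a|s) / pi_old(a|s) for each sampled action.
    advantage: estimated advantage for each sampled action.
    """
    unclipped = ratio * advantage
    # Clipping the ratio to [1 - eps, 1 + eps] caps how much a single
    # update can benefit from moving far away from the old policy.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Taking the minimum makes the bound pessimistic (a lower bound on
    # the unclipped objective).
    return np.minimum(unclipped, clipped)

# With a positive advantage and ratio 1.5, the objective is capped at 1.2.
print(ppo_clip_objective(np.array([1.5]), np.array([1.0])))  # capped at 1.2
```

PPO maximizes the mean of this quantity over a batch of transitions, which is what stable_baselines3 does internally on each update.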

CartPole

Code From [6]

After training, we can achieve a good final state.

Figure: the final state of the trained CartPole model

The training log can be found here. From the log, we can see that, as training progresses, loss, policy_gradient_loss, and value_loss all decrease substantially.

LunarLander

In order to run on Linux, you need:

conda install swig # needed to build Box2D in the pip install
pip install box2d-py # a repackaged version of pybox2d

If these are not installed, you will get the following error:

AttributeError: module 'gym.envs.box2d' has no attribute 'LunarLander'

Code and log can be found here

Reference

[1] stable_baselines3 GitHub

[2] stable_baselines3 paper

[3] PPO tutorial list

[4] stable_baselines3 blog

[5] stable_baselines blog by the authors of the paper on Medium

[6] stable_baselines3 blog on Medium (Towards Data Science)
