Feature/benchmarks #317

Merged · 23 commits · Mar 8, 2024
2 changes: 1 addition & 1 deletion .github/workflows/python-publish.yml
@@ -21,7 +21,7 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v3
with:
python-version: 3.12
python-version: 3.11
- name: Install dependencies
run: |
python -m pip install --upgrade pip
7 changes: 4 additions & 3 deletions README.md
@@ -21,10 +21,11 @@ Additionally, we provide an [example project](https://github.com/cpnota/all-examples

## High-Quality Reference Implementations

The `autonomous-learning-library` separates reinforcement learning agents into two modules: `all.agents`, which provides flexible, high-level implementations of many common algorithms that can be adapted to new problems and environments, and `all.presets`, which provides specific instantiations of these agents tuned for particular sets of environments, including Atari games, classic control tasks, and PyBullet robotics simulations. Benchmark results on par with published results can be found below:
The `autonomous-learning-library` separates reinforcement learning agents into two modules: `all.agents`, which provides flexible, high-level implementations of many common algorithms that can be adapted to new problems and environments, and `all.presets`, which provides specific instantiations of these agents tuned for particular sets of environments, including Atari games, classic control tasks, and MuJoCo/PyBullet robotics simulations. Benchmark results on par with published results can be found below:

![atari40](benchmarks/atari40.png)
![pybullet](benchmarks/pybullet.png)
![atari40](benchmarks/atari_40m.png)
![atari40](benchmarks/mujoco_v4.png)
![pybullet](benchmarks/pybullet_v0.png)

As of today, `all` contains implementations of the following deep RL algorithms:

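As a rough illustration of the agents/presets split described in the README hunk above — a sketch modeled on the benchmark scripts added in this PR, not canonical usage; the preset, environment id, and logdir are just examples:

```python
from all.presets.continuous import sac          # tuned preset for continuous control
from all.environments import MujocoEnvironment
from all.experiments import SlurmExperiment

# The preset module is handed directly to the experiment runner, which builds
# the underlying all.agents implementation with the preset's hyperparameters.
SlurmExperiment(
    [sac],
    [MujocoEnvironment("HalfCheetah-v4", device="cuda")],
    int(5e6),
    logdir="benchmarks/example",
)
```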
2 changes: 1 addition & 1 deletion all/environments/pybullet.py
@@ -5,8 +5,8 @@ class PybulletEnvironment(GymEnvironment):
short_names = {
"ant": "AntBulletEnv-v0",
"cheetah": "HalfCheetahBulletEnv-v0",
"humanoid": "HumanoidBulletEnv-v0",
"hopper": "HopperBulletEnv-v0",
"humanoid": "HumanoidBulletEnv-v0",
"walker": "Walker2DBulletEnv-v0",
}

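For reference, the `short_names` mapping lets a PyBullet environment be constructed from an abbreviated name. A small sketch, assuming the constructor resolves entries of this dict (the class and mapping suggest as much, but the constructor itself is not shown in this hunk):

```python
from all.environments import PybulletEnvironment

# "cheetah" should resolve to "HalfCheetahBulletEnv-v0" via short_names above.
env = PybulletEnvironment("cheetah", device="cpu")
```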
6 changes: 4 additions & 2 deletions all/experiments/slurm.py
@@ -89,10 +89,12 @@ def create_sbatch_script(self):
"output": os.path.join(self.outdir, "all_%A_%a.out"),
"error": os.path.join(self.outdir, "all_%A_%a.err"),
"array": "0-" + str(num_experiments - 1),
"partition": "1080ti-short",
"partition": "gpu-long",
"ntasks": 1,
"cpus-per-task": 4,
"mem-per-cpu": 4000,
"gres": "gpu:1",
"gpus-per-node": 1,
"time": "7-0",
}
sbatch_args.update(self.sbatch_args)

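The defaults above correspond to standard Slurm directives. Purely for illustration (the actual body of `create_sbatch_script` is not shown in this hunk), a minimal sketch of how such a dict could be rendered into an sbatch header:

```python
def render_sbatch_header(sbatch_args):
    # Turns {"partition": "gpu-long", ...} into "#SBATCH --partition=gpu-long" lines.
    return "\n".join(f"#SBATCH --{key}={value}" for key, value in sbatch_args.items())

print(render_sbatch_header({
    "partition": "gpu-long",
    "ntasks": 1,
    "cpus-per-task": 4,
    "mem-per-cpu": 4000,
    "gpus-per-node": 1,
    "time": "7-0",
}))
```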
27 changes: 21 additions & 6 deletions all/policies/soft_deterministic.py
@@ -20,18 +20,32 @@ class SoftDeterministicPolicy(Approximation):
kwargs (optional): Any other arguments accepted by all.approximation.Approximation
"""

def __init__(self, model, optimizer=None, space=None, name="policy", **kwargs):
model = SoftDeterministicPolicyNetwork(model, space)
def __init__(
self,
model,
optimizer=None,
space=None,
name="policy",
log_std_min=-20,
log_std_max=4,
**kwargs
):
model = SoftDeterministicPolicyNetwork(
model, space, log_std_min=log_std_min, log_std_max=log_std_max
)
self._inner_model = model
super().__init__(model, optimizer, name=name, **kwargs)


class SoftDeterministicPolicyNetwork(RLNetwork):
def __init__(self, model, space):
def __init__(self, model, space, log_std_min=-20, log_std_max=4, log_std_scale=0.5):
super().__init__(model)
self._action_dim = space.shape[0]
self._tanh_scale = torch.tensor((space.high - space.low) / 2).to(self.device)
self._tanh_mean = torch.tensor((space.high + space.low) / 2).to(self.device)
self._log_std_min = log_std_min
self._log_std_max = log_std_max
self._log_std_scale = log_std_scale

def forward(self, state):
outputs = super().forward(state)
@@ -41,9 +55,10 @@ def forward(self, state):

def _normal(self, outputs):
means = outputs[..., 0 : self._action_dim]
logvars = outputs[..., self._action_dim :]
std = logvars.mul(0.5).exp_()
return torch.distributions.normal.Normal(means, std)
log_stds = outputs[..., self._action_dim :] * self._log_std_scale
clipped_log_stds = torch.clamp(log_stds, self._log_std_min, self._log_std_max)
stds = clipped_log_stds.exp_()
return torch.distributions.normal.Normal(means, stds)

def _sample(self, normal):
raw = normal.rsample()
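The change above replaces the old log-variance parameterization with a scaled and clamped log standard deviation, which keeps the policy's noise in a numerically safe range. A self-contained sketch of the same trick using the defaults from this diff (illustrative only, not the exact network code):

```python
import torch

def clamped_normal(outputs, action_dim, log_std_min=-20, log_std_max=4, log_std_scale=0.5):
    # Split the network head into means and scaled log standard deviations.
    means = outputs[..., :action_dim]
    log_stds = outputs[..., action_dim:] * log_std_scale
    # Clamping bounds the std to roughly [2e-9, 5.5e1], preventing collapsed or exploding noise.
    stds = torch.clamp(log_stds, log_std_min, log_std_max).exp()
    return torch.distributions.Normal(means, stds)

dist = clamped_normal(torch.randn(8, 6), action_dim=3)
action = dist.rsample()  # reparameterized sample, as used downstream in _sample
```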
4 changes: 2 additions & 2 deletions all/presets/continuous/ddpg.py
@@ -16,8 +16,8 @@
# Common settings
"discount_factor": 0.99,
# Adam optimizer settings
"lr_q": 3e-4,
"lr_pi": 3e-4,
"lr_q": 1e-3,
"lr_pi": 1e-3,
# Training settings
"minibatch_size": 256,
"update_frequency": 1,
4 changes: 2 additions & 2 deletions all/presets/continuous/sac.py
@@ -17,7 +17,7 @@
"discount_factor": 0.99,
# Adam optimizer settings
"lr_q": 1e-3,
"lr_pi": 3e-4,
"lr_pi": 1e-3,
# Training settings
"minibatch_size": 256,
"update_frequency": 1,
@@ -26,7 +26,7 @@
"replay_start_size": 5000,
"replay_buffer_size": 1e6,
# Exploration settings
"temperature_initial": 0.1,
"temperature_initial": 1.0,
"lr_temperature_scaling": 3e-5,
"entropy_backups": True,
"entropy_target_scaling": 1.0,
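For context on the `temperature_initial` change: in SAC the temperature (often written α) weights the entropy bonus against reward, so raising the initial value from 0.1 to 1.0 encourages broader exploration early in training, before the learned temperature adjusts it (the preset also exposes `lr_temperature_scaling`). A minimal sketch of the role the temperature plays in the soft backup — textbook SAC, not the preset's exact implementation:

```python
def soft_q_target(reward, discount, q_next, log_prob_next, temperature):
    # Entropy-regularized backup: the next-state value is credited with an
    # entropy bonus of -temperature * log pi(a'|s').
    return reward + discount * (q_next - temperature * log_prob_next)
```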
Binary file removed benchmarks/atari40.png
Binary file added benchmarks/atari_40m.png
4 changes: 2 additions & 2 deletions benchmarks/atari40.py → benchmarks/atari_40m.py
@@ -20,8 +20,8 @@ def main():
agents,
envs,
10e6,
logdir="benchmarks/atari40",
sbatch_args={"partition": "gpu-long"},
logdir="benchmarks/atari_40m",
sbatch_args={"partition": "gypsum-1080ti"},
)


Binary file added benchmarks/mujoco_v4.png
34 changes: 34 additions & 0 deletions benchmarks/mujoco_v4.py
@@ -0,0 +1,34 @@
from all.environments import MujocoEnvironment
from all.experiments import SlurmExperiment
from all.presets.continuous import ddpg, ppo, sac


def main():
frames = int(5e6)

agents = [ddpg, ppo, sac]

envs = [
MujocoEnvironment(env, device="cuda")
for env in [
"Ant-v4",
"HalfCheetah-v4",
"Hopper-v4",
"Humanoid-v4",
"Walker2d-v4",
]
]

SlurmExperiment(
agents,
envs,
frames,
logdir="benchmarks/mujoco_v4",
sbatch_args={
"partition": "gpu-long",
},
)


if __name__ == "__main__":
main()
Binary file removed benchmarks/pybullet.png
Binary file added benchmarks/pybullet_v0.png
16 changes: 12 additions & 4 deletions benchmarks/pybullet.py → benchmarks/pybullet_v0.py
@@ -4,21 +4,29 @@


def main():
frames = int(1e7)
frames = int(5e6)

agents = [ddpg, ppo, sac]

envs = [
PybulletEnvironment(env, device="cuda")
for env in PybulletEnvironment.short_names
for env in [
"AntBulletEnv-v0",
"HalfCheetahBulletEnv-v0",
"HopperBulletEnv-v0",
"HumanoidBulletEnv-v0",
"Walker2DBulletEnv-v0",
]
]

SlurmExperiment(
agents,
envs,
frames,
logdir="benchmarks/pybullet",
sbatch_args={"partition": "gpu-long"},
logdir="benchmarks/pybullet_v0",
sbatch_args={
"partition": "gpu-long",
},
)


28 changes: 17 additions & 11 deletions docs/source/guide/benchmark_performance.rst
@@ -28,7 +28,7 @@ Additionally, we use the following agent "bodies":

The results were as follows:

.. image:: ../../../benchmarks/atari40.png
.. image:: ../../../benchmarks/atari_40m.png

For comparison, we look at the results published in the paper, `Rainbow: Combining Improvements in Deep Reinforcement Learning <https://arxiv.org/abs/1710.02298>`_:

@@ -40,23 +40,29 @@ Our ``dqn`` and ``ddqn`` in particular were better almost across the board.
While there are some minor implementation differences (for example, we use ``Adam`` for most algorithms instead of ``RMSprop``),
our agents achieved very similar behavior to the agents tested by DeepMind.

MuJoCo Benchmark
------------------

`MuJoCo <https://mujoco.org>`_ is "a free and open source physics engine that aims to facilitate research and development in robotics, biomechanics, graphics and animation, and other areas where fast and accurate simulation is needed."
The MuJoCo Gym environments are a common benchmark in RL research for evaluating agents with continuous action spaces.
We ran each continuous preset for 5 million timesteps (in this case, timesteps are equal to frames).
The learning rate was decayed over the course of training using cosine annealing.
The results were as follows:

.. image:: ../../../benchmarks/mujoco_v4.png

These results are similar to those reported elsewhere, and in some cases better.
However, results can vary based on hyperparameter tuning, implementation specifics, and the random seed.

PyBullet Benchmark
------------------

`PyBullet <https://pybullet.org/wordpress/>`_ provides a free alternative to the popular MuJoCo robotics environments.
While MuJoCo requires a license key and can be difficult for independent researchers to afford, PyBullet is free and open.
Additionally, the PyBullet environments are widely considered more challenging, making them a more discriminant test bed.
For these reasons, we chose to benchmark the ``all.presets.continuous`` presets using PyBullet.

Similar to the Atari benchmark, we ran each agent for 10 million timesteps (in this case, timesteps are equal to frames).
We ran each agent for 5 million timesteps (in this case, timesteps are equal to frames).
The learning rate was decayed over the course of training using cosine annealing.
To reduce the variance of the updates, we added an extra time feature to the state (t * 0.001, where t is the current timestep).
The results were as follows:

.. image:: ../../../benchmarks/pybullet.png

PPO was omitted from the plot for Humanoid because it achieved very large negative returns which interfered with the scale of the graph.
Note, however, that our implementation of soft actor-critic (SAC) is able to solve even this difficult environment.
.. image:: ../../../benchmarks/pybullet_v0.png

Because most research papers still use MuJoCo, direct comparisons are difficult to come by.
However, George Sung helpfully benchmarked TD3 and DDPG on several PyBullet environments `here <https://github.com/georgesung/TD3>`_.
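The presets handle the cosine learning-rate decay internally; purely as an illustration of the schedule referenced in the docs above (the model, optimizer, and step count below are hypothetical, not the library's own):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Anneal the learning rate from 1e-3 toward 0 over 1000 updates along a cosine curve.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for _ in range(1000):
    optimizer.step()    # gradient computation omitted in this sketch
    scheduler.step()
```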
2 changes: 1 addition & 1 deletion setup.py
@@ -75,7 +75,7 @@
"numpy~=1.22", # math library
"matplotlib~=3.7", # plotting library
"opencv-python-headless~=4.0", # used by atari wrappers
"torch~=2.0", # core deep learning library
"torch~=2.2", # core deep learning library
"tensorboard~=2.8", # logging and visualization
"cloudpickle~=2.0", # used to copy environments
],