
ElegantRL is lightweight, efficient, and stable, built for researchers and practitioners.
- Lightweight: the core code is fewer than 1,000 lines, built on PyTorch, OpenAI Gym, and NumPy.
- Efficient: performance is comparable with Ray RLlib.
- Stable: as stable as Stable Baselines3.
Model-free deep reinforcement learning (DRL) algorithms:
- DDPG, TD3, SAC, A2C, PPO(GAE) for continuous actions
- DQN, DoubleDQN, D3QN for discrete actions
For algorithm details, please check out OpenAI Spinning Up.
More policy-gradient (Actor-Critic style) algorithms are listed in Policy gradient algorithms.
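For orientation, here is a minimal sketch of the one-step actor-critic update that these policy-gradient methods build on. It is not taken from elegantrl/net.py or elegantrl/agent.py; the network sizes and the fixed Gaussian exploration noise are placeholder choices.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

state_dim, action_dim = 8, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(state, action, reward, next_state, done, gamma=0.99):
    # critic: regress the value toward the one-step TD target
    with torch.no_grad():
        td_target = reward + gamma * (1.0 - done) * critic(next_state)
    value = critic(state)
    critic_loss = (td_target - value).pow(2).mean()
    critic_optim.zero_grad(); critic_loss.backward(); critic_optim.step()

    # actor: policy gradient weighted by the (detached) advantage
    advantage = (td_target - critic(state)).detach()
    log_prob = Normal(actor(state), 1.0).log_prob(action).sum(dim=1, keepdim=True)
    actor_loss = -(advantage * log_prob).mean()
    actor_optim.zero_grad(); actor_loss.backward(); actor_optim.step()

# example call with random tensors, batch size 32
b = 32
update(torch.randn(b, state_dim), torch.randn(b, action_dim),
       torch.randn(b, 1), torch.randn(b, state_dim), torch.zeros(b, 1))
```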
----- kernel files -----
elegantrl/net.py    # Neural networks.
elegantrl/agent.py  # Model-free RL algorithms.
elegantrl/main.py   # Run and learn DEMO 1 ~ 3 in run__demo().
----- utility files -----
elegantrl/env.py          # A gym env or a custom env (e.g. MultiStockEnv for finance).
Examples.ipynb            # Run and learn DEMO 1 ~ 3 in a Jupyter notebook (new version).
ElegantRL-examples.ipynb  # Run and learn DEMO 1 ~ 3 in a Jupyter notebook (old version).
Results using ElegantRL
BipedalWalkerHardcore is a difficult task with a continuous action space. Only a few RL implementations can reach the target reward.
Check out a video on bilibili: Crack the BipedalWalkerHardcore-v2 with total reward 310 using IntelAC.
Necessary:
| Python 3.7
| PyTorch 1.0.2
Not necessary:
| NumPy 1.19.0    | For the ReplayBuffer. NumPy is installed automatically with PyTorch.
| gym 0.17.2      | For the RL training env. Gym provides standard environments for DRL training.
| box2d-py 2.3.8  | For gym. Use pip install Box2D (instead of box2d-py).
| matplotlib 3.2  | For plots that evaluate the agent's performance.
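A quick way to confirm the environment is set up is to import the packages above and print their versions. This snippet is just a convenience check, not part of ElegantRL:

```python
# Sanity check that the dependencies listed above are importable;
# version numbers are only printed, not enforced.
import torch, numpy, gym, matplotlib

print("PyTorch:   ", torch.__version__)
print("NumPy:     ", numpy.__version__)
print("Gym:       ", gym.__version__)
print("Matplotlib:", matplotlib.__version__)
```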
It is lightweight.
python3 Main.py
# You can see run__demo(gpu_id=0, cwd='AC_BasicAC') in Main.py.
- By default, it will train a stable DDPG on LunarLanderContinuous-v2 for 2,000 seconds.
- It chooses CPU or GPU automatically. Don't worry, I never use .cuda().
- It saves the log and model parameter files in the current working directory cwd='AC_BasicAC'.
- It prints the total reward while training. Maybe I should use TensorBoardX?
- There are many comments in the code. I believe these comments can answer some of your questions.
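The automatic device selection mentioned above boils down to a pattern like the following. This is a minimal sketch of the idea, not the exact code in Main.py:

```python
# Minimal sketch of automatic CPU/GPU selection (no hard-coded .cuda()).
import torch

gpu_id = 0  # e.g. the gpu_id passed to run__demo(gpu_id=0, ...)
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")

net = torch.nn.Linear(4, 2).to(device)                          # move networks with .to(device)
state = torch.as_tensor([0.1, 0.2, 0.3, 0.4], device=device)    # and tensors too
print(device, net(state))
```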
The following steps:
- See run__xxx() in Main.py.
- Use run__zoo() to run an off-policy algorithm. Use run__ppo() to run an on-policy algorithm such as PPO.
- Choose a DRL algorithm: from Agent import AgentXXX.
- Choose a gym environment: args.env_name = "LunarLanderContinuous-v2".
- Initialize the hyper-parameters using args.
- Initialize agent = AgentXXX(): create the DRL agent based on the algorithm.
- Initialize buffer = ReplayBuffer(): store the transitions.
- Initialize evaluator = Evaluator(): evaluate and save the trained model.
- After training starts, the while-loop breaks when the conditions are met (achieving the target score, reaching the maximum number of steps, or a manual break).
Inside the loop, three calls do the work:
- agent.update_buffer(...): the agent explores the environment within the target number of steps, generates transition data, and stores it in the ReplayBuffer. Runs in parallel.
- agent.update_policy(...): the agent uses a batch sampled from the ReplayBuffer to update the network parameters. Runs in parallel.
- evaluator.evaluate_and_save(...): evaluates the agent's performance and keeps the model with the highest score. Independent of the training process.
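The sketch below mirrors this training loop with self-contained stub classes so it runs end-to-end. The real AgentXXX, ReplayBuffer, and Evaluator in ElegantRL have different constructors and method signatures; the stubs only reproduce the call pattern described above (they do not learn anything), and the snippet assumes the classic gym 0.17 reset/step API listed in the requirements.

```python
# Toy illustration of the training loop: explore -> update -> evaluate.
import gym
import numpy as np


class StubAgent:
    def update_buffer(self, env, buffer, target_step):
        # explore the env for target_step steps and store transitions
        state, steps = env.reset(), 0
        while steps < target_step:
            action = env.action_space.sample()             # random policy stand-in
            next_state, reward, done, _ = env.step(action)
            buffer.append((state, action, reward, done))
            state = env.reset() if done else next_state
            steps += 1
        return steps

    def update_policy(self, buffer, batch_size):
        batch = buffer.sample(batch_size)                  # a real agent would take a gradient step here
        return len(batch)


class StubBuffer(list):
    def sample(self, batch_size):
        idx = np.random.randint(len(self), size=batch_size)
        return [self[i] for i in idx]


class StubEvaluator:
    best_reward = -np.inf

    def evaluate_and_save(self, agent, env, target_reward):
        # run one evaluation episode and remember the best score
        episode_reward, state, done = 0.0, env.reset(), False
        while not done:
            state, reward, done, _ = env.step(env.action_space.sample())
            episode_reward += reward
        self.best_reward = max(self.best_reward, episode_reward)  # a real evaluator would save the model here
        return self.best_reward >= target_reward


env = gym.make("LunarLanderContinuous-v2")
agent, buffer, evaluator = StubAgent(), StubBuffer(), StubEvaluator()

total_step, break_step, target_reward = 0, 10_000, 200
while True:
    total_step += agent.update_buffer(env, buffer, target_step=512)   # explore
    agent.update_policy(buffer, batch_size=64)                        # learn
    reached_goal = evaluator.evaluate_and_save(agent, env, target_reward)
    if reached_goal or total_step > break_step:                       # target score or max steps
        break
```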