Commit 833e64b

update

shixun404 committed Dec 11, 2021
1 parent 72845a5 commit 833e64b
Showing 9 changed files with 578 additions and 114 deletions.
70 changes: 17 additions & 53 deletions docs/source/algorithms/maddpg.rst
@@ -1,69 +1,33 @@
.. _ddpg:
.. _maddpg:


MADDPG
==========

`Multi-Agent Deep Deterministic Policy Gradient (MADDPG) <https://arxiv.org/abs/1706.02275>`_ extends DDPG to multi-agent settings. This implementation is based on DDPG and supports the following extensions:

- Experience replay: ✔️
- Target network: ✔️
- Gradient clipping: ✔️
- Reward clipping: ❌
- Prioritized Experience Replay (PER): ✔️
- Ornstein–Uhlenbeck noise (see the sketch after this list): ✔️
- Implementation is based on DDPG: ✔️
- Initializes n DDPG agents inside MADDPG: ✔️
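
The Ornstein–Uhlenbeck process provides temporally correlated exploration noise for the deterministic DDPG actors. The following is a minimal, self-contained sketch of such a noise generator, assuming nothing beyond NumPy; it is illustrative and not the ElegantRL class:

.. code-block:: python

    import numpy as np

    class OrnsteinUhlenbeckNoise:
        """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""
        def __init__(self, action_dim, theta=0.15, sigma=0.3, mu=0.0, dt=1e-2):
            self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
            self.x = np.zeros(action_dim)

        def __call__(self):
            dx = self.theta * (self.mu - self.x) * self.dt \
                 + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.x.shape)
            self.x = self.x + dx
            return self.x

    # usage: exploration_action = actor(state) + noise()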

Code Snippet
------------

DDPG agents are stored in the list ``agents``. A typical train-and-test loop:

.. code-block:: python

    import torch
    from elegantrl.run import train_and_evaluate
    from elegantrl.config import Arguments
    from elegantrl.envs.gym import build_env
    from elegantrl.agents.AgentMADDPG import AgentMADDPG

    # train and save
    args = Arguments(env=build_env('simple_spread'), agent=AgentMADDPG())
    train_and_evaluate(args)

    # test
    agent = AgentMADDPG()
    agent.init(args.net_dim, args.state_dim, args.action_dim)
    agent.save_or_load_agent(cwd=args.cwd, if_save=False)

    env = build_env('simple_spread')
    state = env.reset()
    episode_reward = 0
    for i in range(2 ** 10):
        action = agent.select_action(state)
        next_state, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            print(f'Step {i:>6}, Episode return {episode_reward:8.3f}')
            break
        else:
            state = next_state
        env.render()

``AgentMADDPG.init`` creates one ``AgentDDPG`` per agent:

.. code-block:: python

    def init(self, net_dim, state_dim, action_dim, learning_rate=1e-4, marl=True,
             n_agents=1, if_use_per=False, env_num=1, agent_id=0):
        self.agents = [AgentDDPG() for i in range(n_agents)]
        self.explore_env = self.explore_one_env
        self.if_off_policy = True
        self.n_agents = n_agents
        for i in range(self.n_agents):
            self.agents[i].init(net_dim, state_dim, action_dim, learning_rate=1e-4,
                                marl=True, n_agents=self.n_agents, if_use_per=False,
                                env_num=1, agent_id=0)
        self.n_states = state_dim
        self.n_actions = action_dim
        self.batch_size = net_dim
        self.gamma = 0.95
        self.update_tau = 0
        self.device = torch.device(f"cuda:{agent_id}" if (torch.cuda.is_available() and (agent_id >= 0)) else "cpu")

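For intuition, each agent's critic in MADDPG is centralized: it conditions on the observations and actions of all agents, while each actor only uses its own observation. Below is a minimal sketch of a single critic TD update under that scheme; the tensors ``obs_all``, ``act_all``, ``next_obs_all`` and the list ``actor_targets`` are hypothetical names, and this is not the ElegantRL implementation:

.. code-block:: python

    import torch

    def maddpg_critic_update(critic_i, critic_i_target, actor_targets, optimizer,
                             obs_all, act_all, rew_i, next_obs_all, done, gamma=0.95):
        """One TD update for agent i's centralized critic (illustrative only)."""
        with torch.no_grad():
            # each target actor acts on its own next observation
            next_act_all = torch.cat([pi(o) for pi, o in zip(actor_targets, next_obs_all)], dim=1)
            next_q = critic_i_target(torch.cat(next_obs_all, dim=1), next_act_all)
            q_target = rew_i + gamma * (1.0 - done) * next_q

        q_value = critic_i(torch.cat(obs_all, dim=1), torch.cat(act_all, dim=1))
        critic_loss = torch.nn.functional.mse_loss(q_value, q_target)

        optimizer.zero_grad()
        critic_loss.backward()
        optimizer.step()
        return critic_loss.item()
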
Parameters
---------------------

.. autoclass:: elegantrl.agents.AgentMADDPG.AgentMADDPG
:members:

.. _ddpg_networks:

Networks
-------------

.. autoclass:: elegantrl.agents.net.Actor
:members:

.. autoclass:: elegantrl.agents.net.Critic
:members:
15 changes: 15 additions & 0 deletions docs/source/algorithms/mappo.rst
@@ -0,0 +1,15 @@
.. _mappo:


MAPPO
==========

Multi-Agent Proximal Policy Optimization (MAPPO) is a variant of PPO specialized for multi-agent settings. Using a single-GPU desktop, MAPPO achieves surprisingly strong performance in two popular multi-agent testbeds: the particle-world environments and the StarCraft Multi-Agent Challenge.

- Shared network parameters for all agents (see the sketch below): ✔️
- This class is under test; all utilities are temporarily included in AgentMAPPO: ✔️

MAPPO achieves strong results while exhibiting sample efficiency comparable to common off-policy methods.
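
As a rough illustration of parameter sharing, the experience of all agents can be stacked into one batch and used to update a single policy network with the standard PPO clipped surrogate. The sketch below assumes a hypothetical ``policy.get_logprob`` helper and is not the AgentMAPPO code:

.. code-block:: python

    import torch

    def clipped_surrogate_loss(policy, obs, actions, old_log_prob, advantage, clip_ratio=0.2):
        """PPO clipped objective; `obs`/`actions` stack samples from all agents,
        so one shared `policy` is trained on every agent's experience."""
        new_log_prob = policy.get_logprob(obs, actions)   # assumed helper
        ratio = (new_log_prob - old_log_prob).exp()
        surr1 = ratio * advantage
        surr2 = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantage
        return -torch.min(surr1, surr2).mean()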



10 changes: 10 additions & 0 deletions docs/source/algorithms/matd3.rst
@@ -0,0 +1,10 @@
.. _matd3:


MATD3
==========

Multi-Agent TD3 (MATD3) is based on MADDPG. Its critic is a twin critic (CriticTwin), similar in spirit to double Q-learning:

- Implementation is based on MADDPG: ✔️
- Uses CriticTwin instead of Critic (see the sketch after this list)
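
The twin critic matters because the TD target takes the minimum over two target critics, which suppresses Q-value overestimation. A minimal sketch of that target computation (illustrative only, not the CriticTwin API):

.. code-block:: python

    import torch

    def twin_q_target(critic1_target, critic2_target, reward, next_state, next_action,
                      done, gamma=0.99):
        """TD3-style target: the min over two target critics reduces overestimation bias."""
        with torch.no_grad():
            q1 = critic1_target(next_state, next_action)
            q2 = critic2_target(next_state, next_action)
            return reward + gamma * (1.0 - done) * torch.min(q1, q2)
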
8 changes: 2 additions & 6 deletions docs/source/algorithms/qmix.rst
@@ -6,14 +6,10 @@ QMix

`QMix <https://arxiv.org/abs/1803.11485>`_ employs a mixing network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. This implementation is based on pymarl. We provide a demo in elegantrl_helloworld.
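
Concretely, the mixer combines per-agent Q-values using state-conditioned, non-negative weights, which keeps the joint value monotonic in every per-agent value. The real QMixer uses a deeper hypernetwork-generated mixing network; the one-layer sketch below only illustrates the monotonicity trick:

.. code-block:: python

    import torch
    import torch.nn as nn

    class SimpleMonotonicMixer(nn.Module):
        """One-layer QMix-style mixer: Q_tot is a state-conditioned, monotonic
        combination of per-agent Q-values (non-negative weights via abs)."""
        def __init__(self, n_agents, state_dim):
            super().__init__()
            self.hyper_w = nn.Linear(state_dim, n_agents)  # one weight per agent, from global state
            self.hyper_b = nn.Linear(state_dim, 1)         # state-dependent bias

        def forward(self, agent_qs, state):
            # agent_qs: (batch, n_agents), state: (batch, state_dim)
            w = torch.abs(self.hyper_w(state))             # non-negative => monotonic in each Q_i
            b = self.hyper_b(state)
            return (w * agent_qs).sum(dim=1, keepdim=True) + b   # (batch, 1)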

Parameters
---------------------

.. autoclass:: elegantrl.agents.AgentMix.AgentMix
:members:

Networks
-------------

.. autoclass:: elegantrl.agents.mixer.QMixer
:members:
.. autoclass:: elegantrl.agents.AgentQmix

61 changes: 7 additions & 54 deletions docs/source/algorithms/redq.rst
@@ -6,65 +6,18 @@ REDQ

`REDQ <https://arxiv.org/abs/2101.05982>`_ has three carefully integrated ingredients which allow it to achieve its high performance: (i) a UTD ratio >> 1; (ii) an ensemble of Q functions; (iii) in-target minimization across a random subset of Q functions from the ensemble. This implementation is based on SAC and supports the following extensions:

- Experience replay: ✔️
- Target network: ✔️
- Gradient clipping: ✔️
- Reward clipping: ❌
- Prioritized Experience Replay (PER): ✔️
- Learnable entropy regularization coefficient: ✔️
- Implements the G, M, N parameters
- Based on the SAC class
- Works well on MuJoCo tasks


Code Snippet
------------

You can change G, M, and N when calling ``AgentREDQ.init``:

.. code-block:: python

    AgentREDQ.init(self, net_dim=256, state_dim=8, action_dim=2, reward_scale=1.0, gamma=0.99,
                   learning_rate=3e-4, if_per_or_gae=False, env_num=1, gpu_id=0, G=20, M=2, N=10)

A typical train-and-test loop:

.. code-block:: python

    import torch
    from elegantrl.run import train_and_evaluate
    from elegantrl.config import Arguments
    from elegantrl.envs.gym import build_env
    from elegantrl.agents.AgentREDQ import AgentREDQ

    # train and save
    args = Arguments(env=build_env('Pendulum-v0'), agent=AgentREDQ())
    args.cwd = 'demo_Pendulum_SAC'
    args.env.target_return = -200
    args.reward_scale = 2 ** -2
    train_and_evaluate(args)

    # test
    agent = AgentREDQ()
    agent.init(args.net_dim, args.state_dim, args.action_dim)
    agent.save_or_load_agent(cwd=args.cwd, if_save=False)

    env = build_env('Pendulum-v0')
    state = env.reset()
    episode_reward = 0
    for i in range(2 ** 10):
        action = agent.select_action(state)
        next_state, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            print(f'Step {i:>6}, Episode return {episode_reward:8.3f}')
            break
        else:
            state = next_state
        env.render()
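
For reference, the core of the REDQ update is: perform G critic updates per environment step (the UTD ratio), and for each target draw a random subset of M critics from the N-member ensemble and take their minimum. A minimal SAC-style sketch of the target computation (illustrative, not the AgentREDQ code):

.. code-block:: python

    import random
    import torch

    def redq_q_target(critic_targets, reward, next_state, next_action, next_log_prob,
                      done, alpha, gamma=0.99, M=2):
        """In-target minimization over a random subset of M target critics.

        `critic_targets` is a list of N target critics; this function is called
        for each of the G critic updates per environment step.
        """
        with torch.no_grad():
            sampled = random.sample(critic_targets, M)
            next_q = torch.min(torch.stack([q(next_state, next_action) for q in sampled]), dim=0).values
            return reward + gamma * (1.0 - done) * (next_q - alpha * next_log_prob)
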
Parameters
---------------------

.. autoclass:: elegantrl.agents.AgentREDQ.AgentREDQ
:members:

Networks
-------------

.. autoclass:: elegantrl.agents.net.ActorSAC
:members:

.. autoclass:: elegantrl.agents.net.Critic
:members:
13 changes: 13 additions & 0 deletions docs/source/algorithms/vdn.rst
@@ -0,0 +1,13 @@
.. _vdn:


VDN
==========

`VDN <https://arxiv.org/abs/1706.05296>`_ (Value Decomposition Networks): AgentVDN and AgentQmix differ in the network that combines per-agent values; the agent structure is similar.
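
In VDN the joint action-value is simply the sum of the per-agent Q-values, so the mixing step reduces to an addition. A minimal sketch of this decomposition (illustrative, not the ElegantRL code):

.. code-block:: python

    import torch

    def vdn_q_total(agent_qs: torch.Tensor) -> torch.Tensor:
        """VDN value decomposition: Q_tot(s, a_1..a_n) = sum_i Q_i(o_i, a_i).

        agent_qs: (batch, n_agents) chosen-action Q-values, one column per agent.
        """
        return agent_qs.sum(dim=1, keepdim=True)  # (batch, 1)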


Networks
-------------

.. autoclass:: elegantrl.agents.AgentQmix