Commit 833e64b

update

shixun404 committed Dec 11, 2021
1 parent 72845a5 commit 833e64b
Showing 9 changed files with 578 additions and 114 deletions.
70 changes: 17 additions & 53 deletions docs/source/algorithms/maddpg.rst
@@ -1,69 +1,33 @@
.. _ddpg:
.. _maddpg:


MADDPG
==========

`Multi-Agent Deep Deterministic Policy Gradient (MADDPG) <https://arxiv.org/abs/1706.02275>`_ extends DDPG to multi-agent settings. This implementation is based on DDPG and supports the following extensions:

- Experience replay: ✔️
- Target network: ✔️
- Gradient clipping: ✔️
- Reward clipping: ❌
- Prioritized Experience Replay (PER): ✔️
- Ornstein–Uhlenbeck noise (see the sketch after this list): ✔️
- Implementation is based on DDPG: ✔️
- Initializes n DDPG agents inside MADDPG: ✔️
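
The Ornstein–Uhlenbeck process provides temporally correlated exploration noise for the deterministic DDPG actors. The following is a minimal, self-contained sketch of such a noise generator, assuming nothing beyond NumPy; it is illustrative and not the ElegantRL class:

.. code-block:: python

    import numpy as np

    class OrnsteinUhlenbeckNoise:
        """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""
        def __init__(self, action_dim, theta=0.15, sigma=0.3, mu=0.0, dt=1e-2):
            self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
            self.x = np.zeros(action_dim)

        def __call__(self):
            dx = self.theta * (self.mu - self.x) * self.dt \
                 + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.x.shape)
            self.x = self.x + dx
            return self.x

    # usage: exploration_action = actor(state) + noise()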

Code Snippet
------------

DDPG agents are stored in the list ``agents``. A typical train-and-test loop:

.. code-block:: python

    import torch
    from elegantrl.run import train_and_evaluate
    from elegantrl.config import Arguments
    from elegantrl.envs.gym import build_env
    from elegantrl.agents.AgentMADDPG import AgentMADDPG

    # train and save
    args = Arguments(env=build_env('simple_spread'), agent=AgentMADDPG())
    train_and_evaluate(args)

    # test
    agent = AgentMADDPG()
    agent.init(args.net_dim, args.state_dim, args.action_dim)
    agent.save_or_load_agent(cwd=args.cwd, if_save=False)

    env = build_env('simple_spread')
    state = env.reset()
    episode_reward = 0
    for i in range(2 ** 10):
        action = agent.select_action(state)
        next_state, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            print(f'Step {i:>6}, Episode return {episode_reward:8.3f}')
            break
        else:
            state = next_state
        env.render()

``AgentMADDPG.init`` creates one ``AgentDDPG`` per agent:

.. code-block:: python

    def init(self, net_dim, state_dim, action_dim, learning_rate=1e-4, marl=True,
             n_agents=1, if_use_per=False, env_num=1, agent_id=0):
        self.agents = [AgentDDPG() for i in range(n_agents)]
        self.explore_env = self.explore_one_env
        self.if_off_policy = True
        self.n_agents = n_agents
        for i in range(self.n_agents):
            self.agents[i].init(net_dim, state_dim, action_dim, learning_rate=1e-4,
                                marl=True, n_agents=self.n_agents, if_use_per=False,
                                env_num=1, agent_id=0)
        self.n_states = state_dim
        self.n_actions = action_dim
        self.batch_size = net_dim
        self.gamma = 0.95
        self.update_tau = 0
        self.device = torch.device(f"cuda:{agent_id}" if (torch.cuda.is_available() and (agent_id >= 0)) else "cpu")

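For intuition, each agent's critic in MADDPG is centralized: it conditions on the observations and actions of all agents, while each actor only uses its own observation. Below is a minimal sketch of a single critic TD update under that scheme; the tensors ``obs_all``, ``act_all``, ``next_obs_all`` and the list ``actor_targets`` are hypothetical names, and this is not the ElegantRL implementation:

.. code-block:: python

    import torch

    def maddpg_critic_update(critic_i, critic_i_target, actor_targets, optimizer,
                             obs_all, act_all, rew_i, next_obs_all, done, gamma=0.95):
        """One TD update for agent i's centralized critic (illustrative only)."""
        with torch.no_grad():
            # each target actor acts on its own next observation
            next_act_all = torch.cat([pi(o) for pi, o in zip(actor_targets, next_obs_all)], dim=1)
            next_q = critic_i_target(torch.cat(next_obs_all, dim=1), next_act_all)
            q_target = rew_i + gamma * (1.0 - done) * next_q

        q_value = critic_i(torch.cat(obs_all, dim=1), torch.cat(act_all, dim=1))
        critic_loss = torch.nn.functional.mse_loss(q_value, q_target)

        optimizer.zero_grad()
        critic_loss.backward()
        optimizer.step()
        return critic_loss.item()
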
Parameters
---------------------

.. autoclass:: elegantrl.agents.AgentMADDPG.AgentMADDPG
:members:

.. _ddpg_networks:

Networks
-------------

.. autoclass:: elegantrl.agents.net.Actor
:members:

.. autoclass:: elegantrl.agents.net.Critic
:members:
15 changes: 15 additions & 0 deletions docs/source/algorithms/mappo.rst
@@ -0,0 +1,15 @@
.. _mappo:


MAPPO
==========

Multi-Agent Proximal Policy Optimization (MAPPO) is a variant of PPO specialized for multi-agent settings. Using a single-GPU desktop, MAPPO achieves surprisingly strong performance in two popular multi-agent testbeds: the particle-world environments and the StarCraft Multi-Agent Challenge.

- Shared network parameters for all agents (see the sketch below): ✔️
- This class is under test; all utilities are temporarily included in AgentMAPPO: ✔️

MAPPO achieves strong results while exhibiting sample efficiency comparable to common off-policy methods.
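
As a rough illustration of parameter sharing, the experience of all agents can be stacked into one batch and used to update a single policy network with the standard PPO clipped surrogate. The sketch below assumes a hypothetical ``policy.get_logprob`` helper and is not the AgentMAPPO code:

.. code-block:: python

    import torch

    def clipped_surrogate_loss(policy, obs, actions, old_log_prob, advantage, clip_ratio=0.2):
        """PPO clipped objective; `obs`/`actions` stack samples from all agents,
        so one shared `policy` is trained on every agent's experience."""
        new_log_prob = policy.get_logprob(obs, actions)   # assumed helper
        ratio = (new_log_prob - old_log_prob).exp()
        surr1 = ratio * advantage
        surr2 = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantage
        return -torch.min(surr1, surr2).mean()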



10 changes: 10 additions & 0 deletions docs/source/algorithms/matd3.rst
@@ -0,0 +1,10 @@
.. _matd3:


MATD3
==========

Multi-Agent TD3 (MATD3) is based on MADDPG. Its critic is a twin critic (CriticTwin), similar in spirit to double Q-learning:

- Implementation is based on MADDPG: ✔️
- Uses CriticTwin instead of Critic (see the sketch after this list)
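
The twin critic matters because the TD target takes the minimum over two target critics, which suppresses Q-value overestimation. A minimal sketch of that target computation (illustrative only, not the CriticTwin API):

.. code-block:: python

    import torch

    def twin_q_target(critic1_target, critic2_target, reward, next_state, next_action,
                      done, gamma=0.99):
        """TD3-style target: the min over two target critics reduces overestimation bias."""
        with torch.no_grad():
            q1 = critic1_target(next_state, next_action)
            q2 = critic2_target(next_state, next_action)
            return reward + gamma * (1.0 - done) * torch.min(q1, q2)
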
8 changes: 2 additions & 6 deletions docs/source/algorithms/qmix.rst
@@ -6,14 +6,10 @@ QMix

`QMix <https://arxiv.org/abs/1803.11485>`_ employs a mixing network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. This implementation is based on pymarl. We provide a demo in elegantrl_helloworld.
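
Concretely, the mixer combines per-agent Q-values using state-conditioned, non-negative weights, which keeps the joint value monotonic in every per-agent value. The real QMixer uses a deeper hypernetwork-generated mixing network; the one-layer sketch below only illustrates the monotonicity trick:

.. code-block:: python

    import torch
    import torch.nn as nn

    class SimpleMonotonicMixer(nn.Module):
        """One-layer QMix-style mixer: Q_tot is a state-conditioned, monotonic
        combination of per-agent Q-values (non-negative weights via abs)."""
        def __init__(self, n_agents, state_dim):
            super().__init__()
            self.hyper_w = nn.Linear(state_dim, n_agents)  # one weight per agent, from global state
            self.hyper_b = nn.Linear(state_dim, 1)         # state-dependent bias

        def forward(self, agent_qs, state):
            # agent_qs: (batch, n_agents), state: (batch, state_dim)
            w = torch.abs(self.hyper_w(state))             # non-negative => monotonic in each Q_i
            b = self.hyper_b(state)
            return (w * agent_qs).sum(dim=1, keepdim=True) + b   # (batch, 1)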

Parameters
---------------------

.. autoclass:: elegantrl.agents.AgentMix.AgentMix
:members:

Networks
-------------

.. autoclass:: elegantrl.agents.mixer.QMixer
:members:
.. autoclass:: elegantrl.agents.AgentQmix

61 changes: 7 additions & 54 deletions docs/source/algorithms/redq.rst
@@ -6,65 +6,18 @@ REDQ

`REDQ <https://arxiv.org/abs/2101.05982>`_ has three carefully integrated ingredients which allow it to achieve its high performance: (i) a UTD ratio >> 1; (ii) an ensemble of Q functions; (iii) in-target minimization across a random subset of Q functions from the ensemble. This implementation is based on SAC and supports the following extensions:

- Experience replay: ✔️
- Target network: ✔️
- Gradient clipping: ✔️
- Reward clipping: ❌
- Prioritized Experience Replay (PER): ✔️
- Learnable entropy regularization coefficient: ✔️
- Implements the G, M, N parameters
- Based on the SAC class
- Works well on MuJoCo tasks


Code Snippet
------------

You can change G, M, and N when calling ``AgentREDQ.init``:

.. code-block:: python

    AgentREDQ.init(self, net_dim=256, state_dim=8, action_dim=2, reward_scale=1.0, gamma=0.99,
                   learning_rate=3e-4, if_per_or_gae=False, env_num=1, gpu_id=0, G=20, M=2, N=10)

A typical train-and-test loop:

.. code-block:: python

    import torch
    from elegantrl.run import train_and_evaluate
    from elegantrl.config import Arguments
    from elegantrl.envs.gym import build_env
    from elegantrl.agents.AgentREDQ import AgentREDQ

    # train and save
    args = Arguments(env=build_env('Pendulum-v0'), agent=AgentREDQ())
    args.cwd = 'demo_Pendulum_SAC'
    args.env.target_return = -200
    args.reward_scale = 2 ** -2
    train_and_evaluate(args)

    # test
    agent = AgentREDQ()
    agent.init(args.net_dim, args.state_dim, args.action_dim)
    agent.save_or_load_agent(cwd=args.cwd, if_save=False)

    env = build_env('Pendulum-v0')
    state = env.reset()
    episode_reward = 0
    for i in range(2 ** 10):
        action = agent.select_action(state)
        next_state, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            print(f'Step {i:>6}, Episode return {episode_reward:8.3f}')
            break
        else:
            state = next_state
        env.render()
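
For reference, the core of the REDQ update is: perform G critic updates per environment step (the UTD ratio), and for each target draw a random subset of M critics from the N-member ensemble and take their minimum. A minimal SAC-style sketch of the target computation (illustrative, not the AgentREDQ code):

.. code-block:: python

    import random
    import torch

    def redq_q_target(critic_targets, reward, next_state, next_action, next_log_prob,
                      done, alpha, gamma=0.99, M=2):
        """In-target minimization over a random subset of M target critics.

        `critic_targets` is a list of N target critics; this function is called
        for each of the G critic updates per environment step.
        """
        with torch.no_grad():
            sampled = random.sample(critic_targets, M)
            next_q = torch.min(torch.stack([q(next_state, next_action) for q in sampled]), dim=0).values
            return reward + gamma * (1.0 - done) * (next_q - alpha * next_log_prob)
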
Parameters
---------------------

.. autoclass:: elegantrl.agents.AgentREDQ.AgentREDQ
:members:

Networks
-------------

.. autoclass:: elegantrl.agents.net.ActorSAC
:members:

.. autoclass:: elegantrl.agents.net.Critic
:members:
13 changes: 13 additions & 0 deletions docs/source/algorithms/vdn.rst
@@ -0,0 +1,13 @@
.. _vdn:


VDN
==========

`VDN <https://arxiv.org/abs/1706.05296>`_ (Value Decomposition Networks): AgentVDN and AgentQmix differ in the network that combines per-agent values; the agent structure is similar.
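
In VDN the joint action-value is simply the sum of the per-agent Q-values, so the mixing step reduces to an addition. A minimal sketch of this decomposition (illustrative, not the ElegantRL code):

.. code-block:: python

    import torch

    def vdn_q_total(agent_qs: torch.Tensor) -> torch.Tensor:
        """VDN value decomposition: Q_tot(s, a_1..a_n) = sum_i Q_i(o_i, a_i).

        agent_qs: (batch, n_agents) chosen-action Q-values, one column per agent.
        """
        return agent_qs.sum(dim=1, keepdim=True)  # (batch, 1)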


Networks
-------------

.. autoclass:: elegantrl.agents.AgentQmix