Implement Search & Rescue Multi-Agent Environment #259

Open: wants to merge 25 commits into base: main

zombie-einstein
Contributor

@zombie-einstein zombie-einstein commented Nov 4, 2024

Add a multi-agent search and rescue environment where a set of agents has to locate moving targets on a 2d space.

Changes

  • Adds the Esquilax library as a dependency
  • Adds the swarm environment group/type (was not sure the new environment fit into an existing group, but happy to move if you think it would better fit somewhere else)
  • Implement some common swarm/flock functionality (can be used if more environments of this type are added)
  • Implement the search and rescue environment and docs

Todo

  • Need to add images and animations, was waiting to finalise code before adding.

Questions

  • I only forwarded the Environment import to jumanji.environments; do types also need forwarding somewhere?
  • I didn't add an animate method to the environment, but saw that some others do? Easy enough to add.
  • Do you want defaults for all the environment parameters? Not sure there are really "natural" choices, but could add sensible defaults to avoid some typing.
  • Are the API docs auto-generated somehow, or do I need to add a link manually?

* Initial prototype

* feat: Add environment tests

* fix: Update esquilax version to fix type issues

* docs: Add docstrings

* docs: Add docstrings

* test: Test multiple reward types

* test: Add smoke tests and add max-steps check

* feat: Implement pred-prey environment viewer

* refactor: Pull out common viewer functionality

* test: Add reward and view tests

* test: Add rendering tests and add test docstrings

* docs: Add predator-prey environment documentation page

* docs: Cleanup docstrings

* docs: Cleanup docstrings
@CLAassistant

CLAassistant commented Nov 4, 2024

CLA assistant check
All committers have signed the CLA.

@zombie-einstein
Contributor Author

Here you go @sash-a, this is correct now. Will take a look at the contributor license and CI failure now.

@zombie-einstein
Contributor Author

I think the CI issue is that I've set Esquilax to require Python >=3.10. It seems you have a PR open to upgrade the Python version; is it worth holding on for that?

@sash-a
Collaborator

sash-a commented Nov 4, 2024

Python version PR is merged now so hopefully it will pass 😄

Should have time during the week to review this, really appreciate the contribution!

Collaborator

@sash-a sash-a left a comment

An initial review with some high level comments about jumanji conventions. Will go through it more in depth once these are addressed. In general it's looking really nice and well documented!

Not quite sure on the new swarms package, but also not sure where else we would put it. Not sure on it especially if we only have 1 env and no new ones planned.

One thing I don't quite understand is the benefit of amap over vmap specifically in the case of this env?

Please @ me when it's ready for another review or if you have any questions.

Review threads (outdated, resolved):

  • jumanji/environments/swarms/common/types.py (2 threads)
  • jumanji/environments/swarms/common/updates.py (4 threads)
  • jumanji/environments/swarms/predator_prey/updates.py (1 thread)
  • jumanji/environments/swarms/predator_prey/types.py (2 threads)
  • jumanji/environments/swarms/predator_prey/env.py (1 thread)
@sash-a
Collaborator

sash-a commented Nov 5, 2024

As for your questions in the description:

I only forwarded the Environment import to jumanji.environments; do types also need forwarding somewhere?

Nope just the environment is fine

I didn't add an animate method to the environment, but saw that some others do? Easy enough to add.

Please do add animation it's a great help.

Do you want defaults for all the environment parameters? Not sure there are really "natural" choices, but could add sensible defaults to avoid some typing.

We do want defaults, I think we can discuss what makes sense.

Are the API docs auto-generated somehow, or do I need to add a link manually?

It's generated with mkdocs, we need an entry in docs/api/environments and mkdocs.yml, see this recently closed PR for an example of which files we change

One big thing I realized after my review is that this is missing training code. We like to validate that the env works. I'm not 100% sure if this is possible because the env has two teams, so which reward do you optimize? Maybe training with a simple heuristic, e.g. you are the predator and the prey moves randomly? For examples see the training folder; you should only need to create a network. An example of this should also be in the above PR.

* refactor: Formatting fixes

* fix: Implement rewards as class

* refactor: Implement observation as NamedTuple

* refactor: Implement initial state generator

* docs: Update docstrings

* refactor: Add env animate method

* docs: Link env into API docs
@zombie-einstein
Contributor Author

Hi @sash-a, just merged changes that I think address all the comments, plus the animate method and the API docs link.

Not quite sure on the new swarms package, but also not sure where else we would put it. Not sure on it especially if we only have 1 env and no new ones planned.

Could you have something like a multi-agent package? Don't think you have similar at the moment? FYI was intending to add a couple more swarm/flock type envs if this one went ok.

One thing I don't quite understand is the benefit of amap over vmap specifically in the case of this env?

Yeah, in a couple of cases using it is overkill, a hangover from when I was writing this example with an Esquilax demo in mind! It makes sense to use vmap instead if the other arguments are not being used.
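
For reference, a minimal sketch of the plain `jax.vmap` pattern under discussion: a per-agent function mapped over the leading agent axis. The `update_heading` function and its arguments are hypothetical illustrations rather than code from this PR, and no Esquilax API is shown.

```python
import jax
import jax.numpy as jnp


def update_heading(heading: jnp.ndarray, steer: jnp.ndarray) -> jnp.ndarray:
    """Hypothetical per-agent update: rotate the heading and wrap it to [0, 2*pi)."""
    return (heading + steer) % (2 * jnp.pi)


# vmap maps the function over the leading (agent) axis of both arguments.
update_all = jax.vmap(update_heading)

headings = jnp.array([0.0, 1.0, 6.0])
steers = jnp.array([0.5, -0.5, 0.5])
new_headings = update_all(headings, steers)
```

When each agent's update depends only on that agent's own state and action, as here, `vmap` is sufficient.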

@zombie-einstein
Contributor Author

I'll look at adding something to training next. I think random prey with trained predators makes sense, will look to implement.

@sash-a
Collaborator

sash-a commented Nov 6, 2024

Could you have something like a multi-agent package? Don't think you have similar at the moment? FYI was intending to add a couple more swarm/flock type envs if this one went ok.

If you can add more that would be great! Then I'm happy to keep the swarm package as is. What we'd be most interested in is some kind of env with only 1 team that is strictly co-operative, like predators vs heuristic prey or vice versa. Not sure if you planned to make any envs like this?

But I had a quick look at the changes and it mostly looks great! Will leave an in depth review later today/tomorrow 😄

Also I updated the CI yesterday, we're now using ruff, so you will need to update your pre-commit

@sash-a
Collaborator

sash-a commented Nov 6, 2024

One other thing, the only reason I've been hesitant to add this to Jumanji is because it's not that related to industry problems which is a common focus between all the envs. I was thinking maybe we could re-frame the env from predator-prey to something else (without changing any code, just changing the idea). I was thinking maybe a continuous cleaner where your target position is changing or something to do with drones (maybe delivery), do you have any other ideas and would you be happy with this?

@zombie-einstein
Contributor Author

Could you have something like a multi-agent package? Don't think you have similar at the moment? FYI was intending to add a couple more swarm/flock type envs if this one went ok.

If you can add more that would be great! Then I'm happy to keep the swarm package as is. What we'd be most interested in is some kind of env with only 1 team that is strictly co-operative, like predators vs heuristic prey or vice versa. Not sure if you planned to make any envs like this?

Yeah, I was very interested in developing envs for co-operative multi-agent RL, so I was keen to design or implement more environments along these lines. There's a simpler version of this environment which is just the flock, i.e. where the agents move in a co-ordinated way without colliding. I've also seen an environment where the agents have to effectively cover an area, which I was going to look at.

Also I updated the CI yesterday, we're now using ruff, so you will need to update your pre-commit

How do I do this? I did try reinstalling pre-commit, but it raised an error that the config was invalid?

@zombie-einstein
Contributor Author

One other thing, the only reason I've been hesitant to add this to Jumanji is because it's not that related to industry problems which is a common focus between all the envs. I was thinking maybe we could re-frame the env from predator-prey to something else (without changing any code, just changing the idea). I was thinking maybe a continuous cleaner where your target position is changing or something to do with drones (maybe delivery), do you have any other ideas and would you be happy with this?

Yeah definitely open to suggestions. I was thinking more in the abstract for this (will the agents develop some collective behaviour to avoid predators) but happy to modify towards something more concrete.

@sash-a
Collaborator

sash-a commented Nov 6, 2024

Great to hear on the co-operative marl front those both sound like nice envs to have

How do I do this? I did try reinstalling pre-commit, but it raised an error that the config was invalid?

Couple things to try:

pip install -U pre-commit
pre-commit uninstall
pre-commit install

If this doesn't work, check which pre-commit: it should point to your virtual environment. If it's pointing to your system Python or some other system folder, just uninstall that version and rerun the above.

Yeah definitely open to suggestions. I was thinking more in the abstract for this (will the agents develop some collective behaviour to avoid predators) but happy to modify towards something more concrete.

Agreed it would be nice to keep it abstract for the sake of research, but I think it's nice that this env suite is all industry focused. I quite like something to do with drones - seems quite industry focused although we must definitely avoid anything to do with war. I'll give it a think

@zombie-einstein
Contributor Author

Hi @sash-a fixed the formatting and consolidated the predator-prey type.

@sash-a
Collaborator

sash-a commented Nov 7, 2024

Thanks, I'll try to have a look tomorrow; sorry, the previous 2 days were a bit busier than expected.

For the theme, I'm thinking maritime search and rescue works well. It's relatively real-world and fits the current dynamics.

@zombie-einstein
Contributor Author

Thanks, I'll try to have a look tomorrow; sorry, the previous 2 days were a bit busier than expected.

For the theme, I'm thinking maritime search and rescue works well. It's relatively real-world and fits the current dynamics.

Thanks, no worries. Actually, funnily enough, a co-ordinated search was something I'd been looking into. We could have one set of agents drift with random movements, and they would need to be found inside the simulated region.
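
A rough sketch of what "drift with random movements" could look like for the targets. This is only an illustration under stated assumptions: the function name `drift_targets`, the Gaussian noise, the noise scale, and the wrapped (toroidal) boundary are all hypothetical and are not taken from the PR.

```python
import jax
import jax.numpy as jnp


def drift_targets(key, positions, drift, noise_scale=0.01, env_size=1.0):
    """Move targets by a constant drift plus Gaussian noise.

    Positions are wrapped at the boundary, which is an assumption of this sketch.
    """
    noise = noise_scale * jax.random.normal(key, positions.shape)
    return (positions + drift + noise) % env_size


key = jax.random.PRNGKey(0)
positions = jnp.array([[0.1, 0.2], [0.95, 0.5]])
drift = jnp.array([0.1, 0.0])
new_positions = drift_targets(key, positions, drift)
```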

@sash-a
Collaborator

sash-a commented Nov 8, 2024

Sorry still didn't have time to review today and Mondays are usually super busy for me, but I'll get to this next week!

As for the theme do you think we should then change the dynamics a bit to make prey heuristically controlled to move sort of randomly?

@zombie-einstein
Contributor Author

Sorry still didn't have time to review today and Mondays are usually super busy for me, but I'll get to this next week!

As for the theme do you think we should then change the dynamics a bit to make prey heuristically controlled to move sort of randomly?

No worries, sure I'll do a revision this weekend!

* feat: Prototype search and rescue environment

* test: Add additional tests

* docs: Update docs

* refactor: Update target plot color based on status

* refactor: Formatting and fix remaining typos.
@zombie-einstein
Contributor Author

Hi @sash-a, this turned into a larger rewrite (sorry for the extra review work, let me know if you want me to close this PR and just start with a fresh one) but think it's a more realistic scenario

  • A team of agents is searching for targets in the environment region
  • Targets are controlled by a fixed update algorithm (that has an interface to allow other behaviours)
  • Agents are only rewarded the first time a target is located
  • To detect a target, agents must come within a fixed range of it.
  • Agents visualise the local environment, i.e. the location of other agents in their vicinity.
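
The detection and first-time reward rules above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the function name `locate_targets` and its arguments are hypothetical, distances ignore any wrapping at the boundary, and two agents that reach the same target on the same step are both rewarded here.

```python
import jax.numpy as jnp


def locate_targets(agent_pos, target_pos, found, detection_range):
    """Return per-agent rewards and the updated found mask.

    agent_pos: (n_agents, 2), target_pos: (n_targets, 2), found: (n_targets,) bool.
    """
    # Pairwise agent-target distances, shape (n_agents, n_targets).
    deltas = agent_pos[:, None, :] - target_pos[None, :, :]
    dists = jnp.linalg.norm(deltas, axis=-1)
    # A detection only counts if the target was not found on an earlier step.
    newly_found = (dists < detection_range) & ~found[None, :]
    rewards = jnp.sum(newly_found, axis=1).astype(jnp.float32)
    found = found | jnp.any(newly_found, axis=0)
    return rewards, found


agent_pos = jnp.array([[0.1, 0.1], [0.8, 0.8]])
target_pos = jnp.array([[0.12, 0.1], [0.5, 0.5], [0.8, 0.82]])
found = jnp.array([False, False, True])
rewards, new_found = locate_targets(agent_pos, target_pos, found, detection_range=0.05)
```

In this example the first agent is rewarded for a new target, while the second agent sits next to a target that was already found and receives nothing.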

A couple choices we may want to consider:

  • Agents are individually rewarded, we could have some interface for reward shaping (to promote co-operation), but could also leave this external to the environment for the user to implement?
  • At the moment agents only visualise other neighbours. A twist on this I considered was once targets are revealed they are then visualised (i.e. can be seen) by each agent as part of their local view.
  • Do we want to scale rewards with how quickly targets are found, feels like it would make sense?
  • I've assigned a fixed number of steps to locate the targets, but it also seems it would make sense to terminate the episode when all are located?
  • As part of the observation I've included the remaining steps and targets as normalised floats, but not sure if you have some convention for values like this (i.e. just use integer values and let users rescale them)

@sash-a
Collaborator

sash-a commented Nov 12, 2024

Thanks for this @zombie-einstein I'll start having a look now 😄
I think leave the PR as is, no need to create a new one.

that has an interface to allow other behaviors

awesome!

Agents are only rewarded the first time a target is located

Agreed I think we should actually hide targets once they are located so as to not confuse other agents.

Agents are individually rewarded

I think individual is fine and externally users can sum it outside if they want. e.g we do this in mava for connector

At the moment agents only visualise other neighbours. A twist on this I considered was once targets are revealed they are then visualised (i.e. can be seen) by each agent as part of their local view.

Not quite following what you mean here. I would say an agent should observe all agents and targets (that have not yet been rescued) within their local view.

Do we want to scale rewards with how quickly targets are found, feels like it would make sense?

Maybe add this as an optional reward type, I think I prefer 1 if target is saved and 0 otherwise - makes the env quite hard, but we should test what works best.

I've assigned a fixed number of steps to locate the targets, but it also seems it would make sense to terminate the episode when all are located?

Definitely!

As part of the observation I've included the remaining steps and targets as normalised floats, but not sure if you have some convention for values like this (i.e. just use integer values and let users rescale them)

We don't have a convention for this. I wouldn't add remaining steps to the obs directly I don't see why the algorithm would need that, although again needs to be tested. Agreed with remaining targets, makes sense to observe that. I think normalised floats makes sense.

Collaborator

@sash-a sash-a left a comment

Amazing job with this rewrite, haven't had time to fully look at everything but it does look great so far!

Some high level things:

  • Please add a generator, dynamics and viewer test (see examples of the viewer test for other envs)
  • Can you also add tests for the common/updates
  • Can you start looking into the networks and testing for jumanji

Sorry, these are somewhat tedious tasks, but I really like the env we've landed on 😄

Review threads (outdated, resolved):

  • docs/environments/search_and_rescue.md (5 threads)
  • jumanji/environments/swarms/common/updates.py (1 thread)
  • jumanji/environments/swarms/search_and_rescue/env.py (4 threads)
@zombie-einstein
Contributor Author

Thanks @sash-a, just a couple follow ups to your questions:

At the moment agents only visualise other neighbours. A twist on this I considered was once targets are revealed they are then visualised (i.e. can be seen) by each agent as part of their local view.

Not quite following what you mean here. I would say an agent should observe all agents and targets (that have not yet been rescued) within their local view.

So I was picturing (and have currently implemented) a situation where the searchers have to come quite close to the targets to "find" them (as if they are obscured/hard to find), but the agents have a larger vision range to visualise the location of other searcher agents (to allow them to improve search patterns, for example).

My feeling was that this created more of a search task, where if the targets are part of their larger vision range it feels like it could be more of a routing type task.

I then thought it may be good to include found targets in the vision to allow agents to visualise density of located targets.

As part of the observation I've included the remaining steps and targets as normalised floats, but not sure if you have some convention for values like this (i.e. just use integer values and let users rescale them)

We don't have a convention for this. I wouldn't add remaining steps to the obs directly I don't see why the algorithm would need that, although again needs to be tested. Agreed with remaining targets, makes sense to observe that. I think normalised floats makes sense.

I thought if treating it as a time-sensitive task some indication of the remaining time to find targets could be a useful feature of the observation.

Please add a generator, dynamics and viewer test (see examples of the viewer test for other envs)
Can you also add tests for the common/updates
Can you start looking into the networks and testing for jumanji

Yup will do!

@zombie-einstein
Contributor Author

Great, thanks @sash-a, I'll take a look. I suspect that, as it stands with only 2 agents, this may be really difficult and stochastic: the agents would see nothing until they bump into a target pretty much by chance. I actually just added an additional observation type that includes found and unfound targets in the view, so I will try this first; it should be easier (the different observations now treat this as if they have different visual channels for the different agent/target information).

@sash-a
Collaborator

sash-a commented Dec 9, 2024

Sounds great, I'd try other settings that make it easier also like increasing sight range and radius and generally anything else you can think of that would make it easier

@zombie-einstein
Contributor Author

@sash-a I think the latest JAX release, 0.4.36, is breaking something: I'm getting a bunch of IndexError: list index out of range errors in the CI. Testing locally, 0.4.35 was running fine, but upgrading to 0.4.36 seems to fail. Want me to create a new PR to pin it to less than 0.4.36?

@sash-a
Collaborator

sash-a commented Dec 9, 2024

Yes please! I was getting it this morning in Mava also 😄

Seems to be related to jax-ml/jax#25332

@zombie-einstein
Contributor Author

Just wanted to track a couple final design choices I think have got a bit lost in the comments:

  • Do we want to include some reward decay, to reward faster finding of targets
  • Related to this do we want to include time/step in the observation?
  • Do we want to add a velocity field to target states, not a massive change, and it will make it easier to extend target dynamics to something more complex

@sash-a
Collaborator

sash-a commented Dec 9, 2024

  • I think having this as a different type of reward would be a good thing
  • The convention is usually to add step_count to the observation and then you can also pass time_limit to the network so it can infer the remaining steps. On that note could you please change max_steps to time_limit as I'm trying to make this a convention.
  • I think that would be nice, especially if you are finding that targets movement is unnatural or possibly hard to pursue. At least having it there in case we want to improve it in the future would be great!
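
As an illustration of the first two points, here is a minimal sketch of a time-scaled reward and of how remaining time can be derived from `step_count` and `time_limit`. The function names and the exact linear decay are assumptions made for this example; the thread does not show the formula used in the PR.

```python
import jax.numpy as jnp


def time_scaled_reward(newly_found, step_count, time_limit):
    """Scale rewards linearly from 1 at step 0 down to 0 at time_limit (assumed form)."""
    scale = 1.0 - step_count / time_limit
    return scale * newly_found


def normalised_step(step_count, time_limit):
    """Fraction of the episode elapsed, as a float in [0, 1]."""
    return jnp.asarray(step_count / time_limit, dtype=jnp.float32)


# Three agents locating 1, 0 and 2 new targets a quarter of the way through an episode.
rewards = time_scaled_reward(jnp.array([1.0, 0.0, 2.0]), step_count=50, time_limit=200)
```

Keeping this as a separate reward type leaves the simpler "1 per located target" reward available for comparison.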

@zombie-einstein
Contributor Author

zombie-einstein commented Dec 9, 2024

Hey @sash-a, it seems like, at least for the full visibility observation (i.e. agents can see unfound targets), this is relatively straightforward to train.
Below is for a single agent (pink) and 2 agents (blue) with 100 targets (that are mostly stationary).

image

The single agent seems to be doing pretty well to get most of the targets, and with two agents they finish before the time limit. I'm guessing rewards recorded here are for individual agents, right? The only thing I want to double check is that for 2 agents rewards seem to max out at 50. This makes sense in that the agents can receive 100 in total, but it seems unlikely that one agent would not randomly locate more targets than the other.

Will run more experiments, wanted to ask:

  • Is there a way to pass arguments that are classes to the environment constructor in Mava? Can you make a wrapped constructor somewhere that can be passed arguments from a config?
  • Is there a built in way to generate an animation during/after training?

@sash-a
Collaborator

sash-a commented Dec 10, 2024

I'm guessing rewards recorded here are individual agents right?

The rewards are the mean over all agents, and summed over the episode.
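
A small worked example of why this metric tops out at 50 for 2 agents and 100 targets, however unevenly the targets are split between them. The 70/30 split and the one-target-per-step schedule are hypothetical numbers chosen for illustration.

```python
import jax.numpy as jnp

n_agents, n_targets = 2, 100

# Hypothetical episode: agent 0 finds 70 targets and agent 1 finds 30, one per step.
rewards = jnp.zeros((n_targets, n_agents))
rewards = rewards.at[:70, 0].set(1.0)
rewards = rewards.at[70:, 1].set(1.0)

# Mean over agents at each step, then sum over the episode.
episode_return = jnp.sum(jnp.mean(rewards, axis=1))
```

Here `episode_return` equals `n_targets / n_agents`, so the value does not reveal how the finds were distributed across agents.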

Is there a way to pass arguments that are classes to the environment constructor in Mava? Can you make a wrapped constructor somewhere that can be passed arguments from a config?

You can do this through hydra.instantiate and some config trickery, but it's much easier to find the environment.make call in the system file and replace it with exactly what you want. Just be sure to add all the necessary wrappers in the order we do it in utils.make_env. We designed Mava to be hacked around with, so I'd recommend this instead of changing the config.

Is there a built in way to generate an animation during/after training?

No, but we have some scripts for this:

import jax
import jax.numpy as jnp

# Environment, FrozenDict, ActorState, PRNGKey and EvalActFn are the types used in Mava;
# import them from wherever they are defined in your Mava version.


def _rollout_mava_system(
    env: Environment, params: FrozenDict, actor_state: ActorState, key: PRNGKey, act_fn: EvalActFn
) -> list:
    """Roll out one episode and return the list of raw environment states."""
    env_step = jax.jit(env.step)
    jit_act = jax.jit(act_fn)

    key, env_key = jax.random.split(key)
    state, ts = env.reset(env_key)

    states = []

    # Roll out a single episode
    while not ts.last():
        # Eval env is wrapped in the record metrics wrapper. We just store the true env state
        # for jumanji to be able to render.
        states.append(state.env_state)
        # Add a leading batch dimension to every leaf of the timestep
        ts = jax.tree_util.tree_map(lambda x: x[jnp.newaxis, ...], ts)
        key, act_key = jax.random.split(key, 2)
        action, actor_state = jit_act(params, ts, act_key, actor_state)

        # note: dangerous squeeze, but we don't want a batch or time dim here
        state, ts = env_step(state, action.squeeze())

    return states

Then it's used in ff_ippo like this:

states = _rollout_mava_system(eval_env, tree.map(lambda x: x[0], trained_params), {}, key, eval_act_fn)
env.animate(states, save_path="your-path-here")

* Use channels view parameters

* Rename parameters

* Include step-number in observation

* Add velocity field to targets

* Add time scaled reward function
@sash-a
Copy link
Collaborator

sash-a commented Dec 11, 2024

Hey just tracking that issue with JAX, it's been fixed in 0.4.37, so can you please unpin the JAX version in this PR

vision_range=0.1,
view_angle=searcher_view_angle,
agent_radius=0.01,
env_size=self.generator.env_size,
Contributor Author

In this case, do we want to pass these values in from the constructor arguments, to make tweaking them a bit more streamlined, since the arguments are standardised across the different vision models? Though I appreciate this is a standard pattern.

I did this when testing with Mava, passing in the type and constructing here.

Collaborator

Ye it can be quite a mission to instantiate, it's possible we should reconsider this pattern, but for now leave it as is just to stay consistent

###

# Search-and-Rescue environment
register(id="SearchAndRescue-v0", entry_point="jumanji.environments:SearchAndRescue")
Contributor Author

Is it worth registering the environment with different vision models?

Collaborator

Do you mean fully versus partially observable?

Collaborator

Ah I think you mean the different observation functions. I'd say we should aim to use the observation that only visualizes targets and searchers within their vision cones. If we see a good training curve with this then that should be the default.

In general I'd like to have a set of scenarios for most/all environments in jumanji (see #248). So it would be cool to think of a set (3-4) easy/hard envs and we can register those. If they happen to have different observation models that's fine

Contributor Author

Yeah I was picturing something along these lines, like the easy one is where un-found targets are visible, and then harder versions use the version with hidden targets.

@zombie-einstein
Contributor Author

Hi @sash-a, I just pushed a bunch of changes to add those features from the comments above (scaled rewards, time-step, and target velocities), so I think this is now mostly feature complete. I added a couple of last comments above that I wanted to run by you.

I also now want to do a final pass over the docstrings and docs, and add a couple more tests.

From testing, the environment with full visibility is relatively straightforward, but from the animation the agents seem to be essentially routing to the closest next unfound target (I need to find a way to upload a gif here!).

I've not made much more progress with the model with hidden targets, though I need to test a few more configurations, I had a couple thoughts:

  • The observation has no information about absolute agent locations, so it could be hard to learn an effective search pattern that attempts to cover the space.
  • It could be worth trying a global state that contains more information about agent location on the search area? Though this may be partially missing the point of the environment.

@zombie-einstein
Contributor Author

This is with 4 searching agents, 100 targets, and targets hidden until found. They seem to do pretty well and continue (slowly) learning, though looking at their behaviours, again there's not a lot of co-ordination. This is probably to be expected from fully independent agents.

image

* Update docstrings

* Update tests

* Update environment readme
@sash-a
Collaborator

sash-a commented Dec 12, 2024

Hey @zombie-einstein this is really great progress and agreed it seems pretty much feature complete to me 🔥! Just a heads up I will be on holiday from the 16th of December to 6th January. I won't be able to do code review in that time, but I'll be available for any questions you may have. I think it's realistic to expect this to be merged early January. I'll just have a final look once I'm back and hopefully everything will be good to go 😄

The observation has no information about absolute agent locations, so it could be hard to learn an effective search pattern that attempts to cover the space.

Ok this is definitely an issue when learning and something I've noticed as being very important to the performance in other environments. I would add the absolute position of the current agent to its observation, and normalize it by world size.
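
A minimal sketch of this suggestion, assuming the local view is flattened into a vector and the environment is a square of side `env_size`. The function name `add_position` and the view shape are hypothetical.

```python
import jax.numpy as jnp


def add_position(local_view, position, env_size):
    """Append the agent's absolute position, scaled to [0, 1], to its flattened view."""
    return jnp.concatenate([local_view.ravel(), position / env_size])


local_view = jnp.ones((2, 8))  # hypothetical 2-channel, 8-cell vision array
position = jnp.array([2.0, 3.0])
obs = add_position(local_view, position, env_size=4.0)
```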

It could be worth trying a global state that contains more information about agent location on the search area? Though this may be partially missing the point of the environment.

Ye maybe the global state could be the absolute position of all agents and the target part of their observation?

though this is probably to be expected from fully independent agents

I would highly recommend trying rec-mappo in mava (assuming there's a sensible global state) because this problem seems like it would definitely benefit from some recurrence in the policy.

@zombie-einstein
Contributor Author

Ok this is definitely an issue when learning and something I've noticed as being very important to the performance in other environments. I would add the absolute position of the current agent to its observation, and normalize it by world size.

I would highly recommend trying rec-mappo in mava (assuming there's a sensible global state) because this problem seems like it would definitely benefit from some recurrence in the policy.

Great yeah this makes sense, will add this and try it out.

Just to check, the global state should have one entry per agent, something like [n-agents, global-obs]? Thinking if the position part of the observation needs to be specific to each agent, for example if the positions are an array

[p0, p1, p2, ....]

the per-agent position observation might need to be rotated like?

[
    [p0, p1, p2, ...., pn],
    [p1, p2,  ..., pn, p0],
    [p2, ...., pn, p0, p1],
]
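
The rotation pictured above can be sketched with index arithmetic, so that row i starts with agent i's own position. The function name `rotate_per_agent` is hypothetical and this is only one way to build such an array.

```python
import jax.numpy as jnp


def rotate_per_agent(positions):
    """Return an (n_agents, n_agents, dim) array where row i is positions rolled by -i."""
    n_agents = positions.shape[0]
    # idx[i, j] = (i + j) mod n_agents, so row i reads p_i, p_{i+1}, ..., p_{i-1}.
    idx = (jnp.arange(n_agents)[:, None] + jnp.arange(n_agents)[None, :]) % n_agents
    return positions[idx]


positions = jnp.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
rotated = rotate_per_agent(positions)
```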

@sash-a
Collaborator

sash-a commented Dec 12, 2024

Just to check, the global state should have one entry per agent, something like [n-agents, global-obs]? Thinking if the position part of the observation needs to be specific to each agent, for example if the positions are an array

That's the way we do it in mava, 1 per agent. But to be honest it's a bit of an open question as to how the global state should be structured. In my opinion it should be the same across all agents (I'm pretty sure this is what the theory says), but empirically we've seen good results when it's tailored to each agent. So what I would do for a first pass is a global state of shape (num_agents, global_state shape) where it is the same along the first axis. If this doesn't work well, then try tailoring it per agent.

Also I'm assuming this was just an example above, but don't forget the global state should include information about targets. Some global state ideas for targets (in all cases these also include the normalised absolute positions of all agents):

  • Absolute positions of all targets if they are visible to any agent, otherwise some out-of-bounds default value.
  • Always adding target positions to the global state, regardless of whether they are visible (this one is probably cheating a bit though).
  • The target part of all agents' observations.

Honestly, concatenating all observations is often a good global state too; it's hard to know what will work best without testing it.
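As a rough sketch of the first option in the list (normalised agent positions plus target positions that are masked to a default value unless some agent can see them), repeated so the state is identical along the agent axis. The function `make_global_state`, the `target_seen` mask, and the default of `-1.0` are all assumptions made for illustration:

```python
import numpy as np


def make_global_state(agent_pos, target_pos, target_seen, world_size, default=-1.0):
    """Build a shared global state of shape (num_agents, global_state_dim).

    agent_pos:   (num_agents, 2) absolute agent positions
    target_pos:  (num_targets, 2) absolute target positions
    target_seen: (num_targets,) bool, True if any agent can see the target
    """
    agents = agent_pos / world_size
    # Targets nobody can see are replaced by an out-of-bounds default value.
    targets = np.where(target_seen[:, None], target_pos / world_size, default)
    flat = np.concatenate([agents.ravel(), targets.ravel()])
    # Same global state for every agent: repeat along the first axis.
    return np.tile(flat, (agent_pos.shape[0], 1))


agent_pos = np.array([[1.0, 2.0], [3.0, 4.0]])
target_pos = np.array([[5.0, 5.0], [0.0, 10.0], [2.0, 8.0]])
target_seen = np.array([True, False, True])
state = make_global_state(agent_pos, target_pos, target_seen, world_size=10.0)
```

Tailoring the state per agent would then only change the final step, for example by reordering the agent block in each row instead of tiling one flat vector.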

@sash-a
Collaborator

sash-a commented Jan 6, 2025

Hi @zombie-einstein hope you had a great new year! Just wondering what the status of this is, how is the training looking?

@zombie-einstein
Contributor Author

Hi @zombie-einstein hope you had a great new year! Just wondering what the status of this is, how is the training looking?

Happy new year @sash-a, hope you've had a good break. I've been catching up with some other work over the break (including some tweaks to Esquilax, like not requiring random keys). I got the training working but need to tweak the shared observation. It's on my list to hopefully finalise this week.

@zombie-einstein
Contributor Author

Hey @sash-a, some progress on this, with some sensible-looking training curves (though I think there is likely some improvement to be made).

So this is with

  • 4 agents
  • 25 targets
  • Rewards scaled (linearly decreasing) by time
  • Recurrent MAPPO (rec-mappo)
  • Agents share their view on targets and global agent positions
  • Greedy evaluation

image

You can see the agents completing the search more efficiently over the course of training.

Without the reward scaling, I've been struggling to get similar results, possibly the task is still kind of stochastic for individual agents, without some more specific search planning/scheduling.

@sash-a
Collaborator

sash-a commented Jan 9, 2025

@zombie-einstein these curves are looking great!

So when you say linearly scaled rewards, do you mean that finding a target early gives more reward than finding one later on? If so I'd say that's perfect.

It does seem to peak quite early, which may just be that the current env configuration is a bit easy? Can you try tweaking the parameters to make it a bit harder, maybe more agents or a smaller fov?

How many steps was this run for, was it the mava default?

Given these curves I am pretty much happy to merge, I will give it another once over next week and hopefully will be able to merge after that!

@zombie-einstein
Contributor Author

@zombie-einstein these curves are looking great!

So when you say linearly scaled rewards, do you mean that finding a target early gives more reward than finding one later on? If so I'd say that's perfect.

Yeah exactly this, though this is configurable and allows custom implementations of the scaling.
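For illustration, a linearly decreasing scaling could look like the sketch below. This is an assumed form, not the PR's implementation: the name `time_scaled_reward`, its arguments, and the choice to reach zero exactly at the time limit are all hypothetical:

```python
def time_scaled_reward(base_reward, step, time_limit):
    """Scale a reward linearly from full value at step 0 down to 0 at the time limit."""
    # Clamp so steps past the limit never produce a negative reward.
    remaining = max(time_limit - step, 0)
    return base_reward * remaining / time_limit


early = time_scaled_reward(1.0, step=0, time_limit=100)
halfway = time_scaled_reward(1.0, step=50, time_limit=100)
late = time_scaled_reward(1.0, step=100, time_limit=100)
```

Under this form a target found halfway through the episode is worth half as much as one found at the first step, which gives agents a direct incentive to search quickly.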

It does seem to peak quite early, which may just be that the current env configuration is a bit easy? Can you try tweaking the parameters to make it a bit harder, maybe more agents or a smaller fov?

Yeah I think there is a kind of simple local optimum, which is to move around to explore the space (i.e. not just sit in one spot or circle) and then bump into the targets. Feels like a really consistent strategy would be pretty involved, i.e. having a way of truly covering the space. Sure, I'll try a couple more permutations; I could reduce the number of targets again, or reduce agent speed.

How many steps was this run for, was it the mava default?

I increased num_updates to 2,000, but think I need to also play with some of the other parameters? I also increased the critic parameters to handle the larger global observation.

Given these curves I am pretty much happy to merge, I will give it another once over next week and hopefully will be able to merge after that!

Nice one, I'll try to look over it before then, since there have been a lot of changes.

@sash-a
Collaborator

sash-a commented Jan 9, 2025

Yeah I think there is a kind of simple local optimum, which is to move around to explore the space (i.e. not just sit in one spot or circle) and then bump into the targets. Feels like a really consistent strategy would be pretty involved, i.e. having a way of truly covering the space. Sure, I'll try a couple more permutations; I could reduce the number of targets again, or reduce agent speed.

Yeah, makes sense, it's definitely a hard problem. Great, edit whichever parameters you feel make the most sense.

I increased num_updates to 2,000, but think I need to also play with some of the other parameters? I also increased the critic parameters to handle the larger global observation.

Ah, interesting. Then that's not such a surprising plateau, as that would give you around 40M timesteps.
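A back-of-the-envelope check of that figure, assuming the usual relation that total timesteps are the number of updates times the rollout length times the number of parallel environments. The `rollout_length` and `num_envs` values below are purely illustrative and are not claimed to be the mava defaults:

```python
def total_timesteps(num_updates, rollout_length, num_envs):
    """Environment steps collected over a whole training run."""
    return num_updates * rollout_length * num_envs


def transitions_per_update(total, num_updates):
    """Steps collected per update implied by a total step budget."""
    return total // num_updates


# 40M timesteps over 2,000 updates implies 20,000 steps per update.
per_update = transitions_per_update(40_000_000, 2_000)

# One hypothetical split of those 20,000 steps (not the actual config).
example_total = total_timesteps(num_updates=2_000, rollout_length=100, num_envs=200)
```

Any combination of rollout length and environment count whose product is 20,000 would give the same total.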

Nice one, I'll try to look over it before then, since there have been a lot of changes.

Great, let me know when you're ready for me to have a look again 😄
