
Error when Running Training Command #28

Open
aaravnavani opened this issue Oct 8, 2022 · 6 comments

@aaravnavani

aaravnavani commented Oct 8, 2022

Hello,

When I ran the training command given in the README, python train.py task=quadruped_walk, I got this error:

File "/home/anavani/anaconda3/lib/python3.9/site-packages/hydra/_internal/defaults_list.py", line 168, in ensure_overrides_used raise ConfigCompositionException(msg) hydra.errors.ConfigCompositionException: Could not override 'task'. Did you mean to override task@_global_? To append to your default list use +task=quadruped_walk

I changed the command to python train.py +task=quadruped_walk and this seemed to fix the issue. However, after I let it train for a bit, I got this error:

It seems as if +task=quadruped_walk is causing an EOF error, but I'm not sure what is causing the second error. I would really appreciate any help. @denisyarats @Aladoro @desaixie @medric49

@medric49
Contributor

medric49 commented Oct 9, 2022

The first error is probably because you are not using hydra-core==1.1.0 and hydra-submitit-launcher==1.1.5.
Please make sure that you use these versions.
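For reference, a pinned install along these lines should fetch exactly those versions (assuming a pip-based environment; adjust accordingly if you manage packages through conda):

pip install hydra-core==1.1.0 hydra-submitit-launcher==1.1.5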

@aaravnavani
Author

@medric49 Ah, ok. I had hydra-core==1.2.0 and hydra-submitit-launcher==1.2.0. I installed the correct versions, and when I ran the training I did not get the override error. I started the training again, so hopefully I do not get the second error that I listed in my original post. Thanks for the help, and I will let you know if I run into any issues.

@aaravnavani
Author

@medric49 I ran the training command and after about 6 hours, I got this:

| train | F: 916000 | S: 458000 | E: 916 | L: 1000 | R: 461.0975 | BS: 458000 | FPS: 0.7374 | T: 6:45:55

but then I got this error: RuntimeError: DataLoader worker (pid 320456) is killed by signal: Killed.

Here is the full error. I'm not sure what's causing this error (maybe batch size or out of memory), so I would appreciate any help.

@medric49
Contributor

medric49 commented Oct 9, 2022

I am not sure exactly what the issue is, but it looks like an out-of-memory condition, which is why your system killed the process.
Try reducing the replay_buffer_size parameter in config.yaml, which controls how many training steps can be kept in memory.
The more episodes you run, the more training steps the algorithm stores in memory, and ~458000 seems to be the limit of your current machine.
Try a value like 400000.
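A minimal sketch of the change, assuming replay_buffer_size sits at the top level of config.yaml and, like other Hydra config values, can also be overridden on the command line:

replay_buffer_size: 400000        (edited in config.yaml)

python train.py task=quadruped_walk replay_buffer_size=400000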

@aaravnavani
Author

@medric49 Ah, ok. Currently, the replay_buffer_size is 1000000, not 458000. I will change it to 400000 and let you know what happens.

@medric49
Contributor

medric49 commented Oct 9, 2022

Okay.
Yes, the current value is 1000000.
But 458000 is the number of steps at which your program stopped:

| train | F: 916000 | S: 458000 | E: 916 | L: 1000 | R: 461.0975 | BS: 458000 | FPS: 0.7374 | T: 6:45:55

which is lower than 1000000.
So, if memory really is the cause of your issue, then a buffer limit lower than 458000 is preferable.
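To spell out the arithmetic behind that reading (my assumptions about the log fields, not something documented in this thread: F = environment frames, S = agent steps, BS = current buffer occupancy): F/S = 916000/458000 = 2, consistent with an action repeat of 2, and BS: 458000 means the buffer already held 458000 transitions when the worker was killed, well below the configured cap of 1000000.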
