
Error when Running Training Command #28

Open
aaravnavani opened this issue Oct 8, 2022 · 6 comments

@aaravnavani

aaravnavani commented Oct 8, 2022

Hello,

When I ran the training command given in the README, python train.py task=quadruped_walk, I got this error:

File "/home/anavani/anaconda3/lib/python3.9/site-packages/hydra/_internal/defaults_list.py", line 168, in ensure_overrides_used raise ConfigCompositionException(msg) hydra.errors.ConfigCompositionException: Could not override 'task'. Did you mean to override task@_global_? To append to your default list use +task=quadruped_walk

I changed the command to python train.py +task=quadruped_walk and this seemed to fix the issue. However, after I let it train for a bit, I got this error:

It seems as if +task=quadruped_walk is causing an EOF error, but I'm not sure what is causing the second error. I would really appreciate any help. @denisyarats @Aladoro @desaixie @medric49

@medric49
Contributor

medric49 commented Oct 9, 2022

The first error is probably because you are not using hydra-core==1.1.0 and hydra-submitit-launcher==1.1.5.
Please make sure that you use these versions.
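For reference, a pinned install along these lines should fetch exactly those versions (assuming a pip-based environment; adjust accordingly if you manage packages through conda):

pip install hydra-core==1.1.0 hydra-submitit-launcher==1.1.5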

@aaravnavani
Author

@medric49 Ah, ok. I had hydra-core==1.2.0 and hydra-submitit-launcher==1.2.0. I installed the correct versions, and when I ran the training I did not get the override error. I started the training again, so hopefully I do not get the second error that I listed in my original post. Thanks for the help, and I will let you know if I run into any issues.

@aaravnavani
Author

@medric49 I ran the training command and after about 6 hours, I got this:

| train | F: 916000 | S: 458000 | E: 916 | L: 1000 | R: 461.0975 | BS: 458000 | FPS: 0.7374 | T: 6:45:55

but then I got this error: RuntimeError: DataLoader worker (pid 320456) is killed by signal: Killed.

Here is the full error. I'm not sure what's causing this error (maybe batch size or out of memory), so I would appreciate any help.

@medric49
Contributor

medric49 commented Oct 9, 2022

I am not sure exactly what the issue is, but it looks like an out-of-memory condition, which is why your system killed the process.
Try reducing the replay_buffer_size parameter in config.yaml, which controls how many training steps can be kept in memory.
The more episodes you run, the more training steps the algorithm stores in memory, and ~458000 seems to be the limit of your current machine.
Try a value like 400000.
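A minimal sketch of the change, assuming replay_buffer_size sits at the top level of config.yaml and, like other Hydra config values, can also be overridden on the command line:

replay_buffer_size: 400000        (edited in config.yaml)

python train.py task=quadruped_walk replay_buffer_size=400000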

@aaravnavani
Author

@medric49 Ah, ok. Currently, the replay_buffer_size is 1000000, not 458000. I will change it to 400000 and let you know what happens.

@medric49
Contributor

medric49 commented Oct 9, 2022

Okay.
Yes, the current value is 1000000.
But 458000 is the number of steps at which your program stopped:

| train | F: 916000 | S: 458000 | E: 916 | L: 1000 | R: 461.0975 | BS: 458000 | FPS: 0.7374 | T: 6:45:55

which is lower than 1000000.
So, if memory really is the cause of your issue, then a buffer limit lower than 458000 is preferable.
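To spell out the arithmetic behind that reading (my assumptions about the log fields, not something documented in this thread: F = environment frames, S = agent steps, BS = current buffer occupancy): F/S = 916000/458000 = 2, consistent with an action repeat of 2, and BS: 458000 means the buffer already held 458000 transitions when the worker was killed, well below the configured cap of 1000000.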
