Dear Desik Rengarajan,

I recently studied the LOGO algorithm you published at ICLR 2022. Guiding policy learning with demonstration data achieves remarkable results on MuJoCo tasks, and with theoretical guarantees as well. Nice work! I have some questions about how the behavioral data is collected and about the theoretical derivation, and I hope you can clarify them.

Question 1:
Following your instructions, I collected the behavioral data as follows, taking Hopper-v2 as an example:
(1) The default number of iterations is 1500. I trained TRPO in the dense-reward setting for 1000 iterations (i.e., both the training and test environments use dense rewards).
(2) I used the TRPO model from those 1000 iterations to collect about 10 episodes (roughly 3000 rows of data) as the behavioral data; a sketch of this collection step is shown below.
Apart from the behavioral data differing from the default data shipped with the LOGO code, all other training settings are consistent with the LOGO code. However, my training results are very poor (see the picture below). I suspect the problem is in my behavioral-data collection. Could you elaborate on how you constructed the behavioral data, and could you open-source the corresponding data-collection code?
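For reference, here is a minimal sketch of the collection step described above. It assumes the old-style Gym API for Hopper-v2, a TRPO policy checkpoint loaded with torch.load, a hypothetical deterministic policy call, and a flat state-action row format; the checkpoint path, policy interface, and file names are my own assumptions, not taken from the LOGO repository.

```python
# Minimal sketch: roll out a trained TRPO policy on Hopper-v2 and save
# the resulting state-action rows as behavioral data.
import gym
import numpy as np
import torch

env = gym.make("Hopper-v2")
policy = torch.load("trpo_hopper_1000iters.pt")  # hypothetical checkpoint path
policy.eval()

rows = []          # one row per step: concatenated [state, action]
num_episodes = 10  # ~10 episodes gives roughly 3000 transitions on Hopper

for _ in range(num_episodes):
    state, done = env.reset(), False
    while not done:
        with torch.no_grad():
            # Hypothetical interface: the policy returns the mean action
            # (deterministic rollout) for the given state.
            action = policy(torch.as_tensor(state, dtype=torch.float32)).numpy()
        rows.append(np.concatenate([state, action]))
        state, _, done, _ = env.step(action)

np.save("hopper_behavior_data.npy", np.asarray(rows))
```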
Question 2:
Question 3: