
Order of env.step() #2

Open
svarner9 opened this issue Jul 3, 2024 · 7 comments

svarner9 commented Jul 3, 2024

Hello,

I am running the code and trying to make some improvements. One thing I came across that I am questioning is the order of operations in the env.step() function: why are the observation and reward obtained before the action is taken?

From my understanding, the agent makes a decision, which becomes the 'action' argument passed to env.step(action). Then, based on this action, it should expect to get back a reward. Right now the action->reward->action->reward cycle seems out of sync, since the reward is calculated before the current action is taken.

Is there a reason that it was done this way?

Thank you so much for starting development on this! I look forward to your response :)

Best,
Sam

ocram444 (Owner) commented Jul 3, 2024 via email

svarner9 (Author) commented Jul 3, 2024

Hi Marco,

Thank you for your quick reply!

I definitely understand the need for a delay between taking the action and making the observation (and calculating the reward). However, I am concerned that the model does not know there is a step delay between the action it takes and the observation/reward it receives.

It looks like the env.step(action) is called internally during the model.learn() process, which means that model.learn() is assuming by default that the action it gives to .step() corresponds to the reward and state returned (or at least I think so).
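
Schematically, I think the rollout collection inside model.learn() boils down to something like this (simplified pseudocode of a generic on-policy loop, not SB3's actual internals):

```python
# Simplified sketch of an on-policy rollout loop (not SB3's real code):
# the (obs, action, reward) triple is stored together, so the learner
# treats the reward as belonging to the action taken from that obs.
obs = env.reset()
for _ in range(n_steps):
    action, _ = model.predict(obs)                   # policy acts on the current obs
    next_obs, reward, done, info = env.step(action)  # env should return the reward *for* this action
    rollout_buffer.add(obs, action, reward, done)
    obs = env.reset() if done else next_obs
```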

I looked at some of the default gym environments to see if this was the case, and I think it is indeed... Here is the CartPole example (CartPole). In their .step() function, the action is taken at the beginning, then the state is updated, and then the reward is calculated. They don't have to wait at all between the action and the observation because they don't have any real-time dynamics; they are just integrating forward with simple kinematics.

I am thinking maybe the order in our step function should be something like this instead (sketched in code right after the list):

  1. Take action
  2. Wait for ~0.5 seconds
  3. Record observation
  4. Calculate reward
  5. Return current state and reward
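
In code, that would look roughly like this (again just a sketch; the 0.5 s wait and the helper names are placeholders, not the repo's actual code):

```python
import time

def step(self, action):
    self._apply_action(action)           # 1. take the action
    time.sleep(0.5)                      # 2. wait for the game state to react (~0.5 s)
    obs = self._get_observation()        # 3. record the observation
    reward = self._compute_reward(obs)   # 4. reward now corresponds to `action`
    done = self._check_done(obs)
    return obs, reward, done, {}         # 5. return current state and reward
```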

What are your thoughts on this? Do you know if the model is correctly interpreting the current setup with the observation and reward applying to the action from the previous step?

Finally, is there a good place to chat more about ideas? I joined the EldenBot discord server, but I was wondering if this repo has a discord server as well?

Thanks so much!

Best,
Sam

ocram444 (Owner) commented Jul 4, 2024 via email

svarner9 (Author) commented Jul 4, 2024

Okay I see. I'm fairly new to this so I'm still learning some stuff. I found this post which further supports the validity of delayed rewards... StackExchange.

I have been training a model for some time now and improvement seems to either be very slow or altogether nonexistent. I have been thinking of some ideas to improve the learning, and the most reasonable idea so far is to seed a model with some successful trajectories to begin with.

Namely, instead of allowing the model to generate random moves and learn from the resulting reinforcement, I would like to feed a prescribed set of moves (say, from a boss run I have done, where I recorded my inputs and captured the environment every half second) into the model, along with the corresponding rewards. That way the model starts out with some good examples.

Do you know if there is a simple way to train the PPO model (or maybe some other model that can later be resumed as PPO) with predetermined training data? The main idea is that the search for an optimal strategy will be faster if the model is first trained on some near-optimal attempts, before it becomes fully self-sufficient and continues with the current PPO training.
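
For example, something like the following is what I have in mind: a quick behaviour-cloning warm start on the recorded run before handing the policy back to regular PPO training. This is only a rough sketch on my side; the file names, the flat observation arrays, and the 50-epoch loop are assumptions, not anything from this repo.

```python
import numpy as np
import torch
from stable_baselines3 import PPO

model = PPO("MlpPolicy", env)  # env = the Elden Ring gym environment

# Recorded boss run: observations (N, obs_dim) and the actions I pressed (N,)
# (long dtype assumes a discrete action space)
demo_obs = torch.as_tensor(np.load("run_obs.npy"), dtype=torch.float32, device=model.device)
demo_actions = torch.as_tensor(np.load("run_actions.npy"), dtype=torch.long, device=model.device)

optimizer = model.policy.optimizer
for epoch in range(50):
    # evaluate_actions returns (values, log_prob, entropy) under the current policy
    _, log_prob, _ = model.policy.evaluate_actions(demo_obs, demo_actions)
    loss = -log_prob.mean()        # maximise the likelihood of the demonstrated actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Then continue with normal PPO on top of the warm-started policy
model.learn(total_timesteps=100_000)
```

I think the `imitation` package does this kind of behaviour cloning more properly, but a small loop like the above would keep everything inside stable-baselines3.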

Best,
Sam

ocram444 (Owner) commented Jul 4, 2024 via email

svarner9 (Author) commented Jul 4, 2024

Okay, that makes sense. I believe I did see the project you are referring to.

I don't really want to pre-program the optimal moves; rather, I want the framework to remain exactly the same, but I want to feed it successful runs (at least in the beginning) before switching to fully unsupervised training. The current training data is generated by the agent taking random moves and then collecting observations and rewards. I can produce the same type of training data from a successful run and feed it in as if the agent had generated it itself. I am having trouble figuring out how to do this with the PPO model, though; it doesn't seem to have an option for user-supplied training data. Maybe there is a way to modify model.learn() to take actions from a file instead of generating them from the current policy. That is the next thing I will try.
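
Roughly what I have in mind for feeding recorded actions through the existing pipeline is a wrapper like this (purely a sketch; ReplayActionWrapper and the recorded_actions format are made up):

```python
import gym  # or gymnasium, whichever the environment is built on

class ReplayActionWrapper(gym.Wrapper):
    """Ignores the policy's action and plays back a recorded one instead."""

    def __init__(self, env, recorded_actions):
        super().__init__(env)
        self.recorded_actions = recorded_actions
        self.t = 0

    def reset(self, **kwargs):
        self.t = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        # Substitute the recorded action while the recording lasts
        if self.t < len(self.recorded_actions):
            action = self.recorded_actions[self.t]
        self.t += 1
        return self.env.step(action)
```

The catch I can see is that PPO is on-policy, so the rollout buffer would still store the policy's own action rather than the substituted one, which is why I suspect the pretraining route above is the cleaner way to inject demonstrations.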

I have been trying to train on the first mini boss in the DLC (Blackgaol Knight), and the agent ends up dying within the first 2-3 seconds most of the time, which is probably a big reason why the training is so stagnant.

I am on the discord server but I don't see anything there, all of the channels are empty for me. Maybe I can't see the history because I just recently joined?

Best,
Sam

ocram444 (Owner) commented Jul 6, 2024 via email
