Order of env.step() #2
Hi Sam,
Yes, that's true. We collect the observation and calculate the reward before
taking the action because the reward always counts for the previous step, so it
has to be calculated before the current step's action is taken.
In practice, the action-reward cycle is really action -> next observation ->
reward for the previous action -> next action -> ... We wanted to give the game
some time to process the step instead of calculating its reward immediately.
While it is possible to take the action first and calculate its reward right
away, that approach would either delay the observation by the time it takes to
calculate the reward before it is handed to the model to decide the next
action, or the reward would be calculated for the previous step, as it is now.
Thank you so much for your interest in the development! I look forward to
any further questions or feedback you may have.
Best regards,
Marco
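For readers following along, here is a minimal sketch of the ordering Marco describes, in which the reward returned by step() is credited to the previous step's action. The helper names (grab_observation, compute_reward, check_episode_end, send_action) are placeholders for illustration, not the repository's actual functions, and the return value follows the classic 4-tuple Gym API:

```python
def step(self, action):
    # Observe the game state produced by the PREVIOUS action, which has had a
    # full step interval to play out in the game.
    obs = self.grab_observation()       # placeholder: screen capture / game state
    reward = self.compute_reward(obs)   # reward credited to the previous action
    done = self.check_episode_end(obs)

    # Only now send the current action; its effect is observed (and rewarded)
    # on the NEXT call to step().
    self.send_action(action)            # placeholder: keyboard/controller input

    return obs, reward, done, {}
```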
Hi Marco,
Thank you for your quick reply!
I definitely understand the need to have a delay between taking the action and making the observation (and calculating the reward). However, I am concerned that the model does not know there is a step delay between the action it takes and the observation/reward it receives.
It looks like env.step(action) is called internally during the model.learn() process, which means that model.learn() assumes by default that the action it gives to .step() corresponds to the reward and state returned (or at least I think so).
I looked at some of the default gym environments to see if this was the case, and I think it is. Here is the CartPole example (https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py). In its .step() function, the action is taken at the beginning, then the state is updated, then the reward is calculated. They don't have to wait at all between the action and the observation because there are no real-time dynamics; they are just integrating forward with simple kinematics.
I am thinking maybe the order in our step function should be something like this instead:
1. Take action
2. Wait for ~0.5 seconds
3. Record observation
4. Calculate reward
5. Return current state and reward
What are your thoughts on this? Do you know if the model is correctly interpreting the current setup, with the observation and reward applying to the action from the previous step?
Finally, is there a good place to chat more about ideas? I joined the EldenBot Discord server, but I was wondering if this repo has a Discord server as well?
Thanks so much!
Best,
Sam
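For comparison, a sketch of the ordering Sam proposes, using the same placeholder helper names as above; here the returned reward corresponds to the action passed into this very call:

```python
import time

def step(self, action):
    # 1. Take the current action first.
    self.send_action(action)            # placeholder: keyboard/controller input

    # 2. Give the game time to play the action out.
    time.sleep(0.5)

    # 3./4. Observe the result and compute the reward for THIS action.
    obs = self.grab_observation()       # placeholder: screen capture / game state
    reward = self.compute_reward(obs)
    done = self.check_episode_end(obs)

    # 5. Return the state and reward that correspond to the action just taken.
    return obs, reward, done, {}
```

The trade-off Marco mentions is visible here: the observation is only handed back after the wait plus the reward computation, so the model decides its next action on slightly older information.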
Hi Sam,
The approach we're using is quite standard for OpenAI Gym and reinforcement
learning in general. The reward typically corresponds to the action taken
in the previous step, which is why it's calculated first before taking the
current action. This method ensures the environment processes the step and
updates its state correctly before moving on.
I think your suggested approach can work well for real-time environments,
and it’s a valid alternative. However, calculating the reward for the
previous step is a common practice in reinforcement learning. This ensures
a smoother transition and accurate processing within the environment (at least
ChatGPT says so).
For further discussion and more ideas, the EldenBot Discord server is the
right place. The community there is very helpful and knowledgeable about
this and the other projects.
Best regards,
Marco
Okay, I see. I'm fairly new to this, so I'm still learning some stuff. I found this post, which further supports the validity of delayed rewards: https://ai.stackexchange.com/questions/12551/openai-gym-interface-when-reward-calculation-is-delayed-continuous-control-wit
I have been training a model for some time now, and improvement seems to be either very slow or altogether nonexistent. I have been thinking of some ideas to improve the learning, and the most reasonable idea so far is to seed the model with some successful trajectories to begin with.
Namely, instead of allowing the model to generate random moves that create reinforcement, I would like to feed a prescribed set of moves (say from a boss run that I have done, where I recorded my inputs and captured the environment every half-second) into the model, along with the corresponding rewards. This way the model starts out with some good examples.
Do you know if there is a simple way to train the PPO model (or maybe some other model that can later be restarted as PPO) with predetermined training data? The main idea is that the search for an optimal strategy will be faster, since the model will be trained on some attempts that were optimal before becoming fully self-sufficient and progressing with the current PPO training.
Best,
Sam
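Stable-Baselines3's PPO has no built-in option for user-supplied training data, but one common workaround for the kind of seeding Sam describes is to pretrain the policy network with behavior cloning (plain supervised learning on recorded observation/action pairs) and then continue with normal PPO training. The sketch below is only illustrative: it assumes the demonstrations are stored as NumPy arrays of observations and discrete action indices in a file named boss_run_demo.npz (a made-up name), that env is the project's existing environment instance, and that the observation space is a flat Box with a discrete action space; a dict observation space or a different policy class would need adjustments.

```python
import numpy as np
import torch as th
from stable_baselines3 import PPO

# env: the project's existing gym environment instance (assumed to exist here).
model = PPO("MlpPolicy", env, verbose=1)   # adjust to the policy class the repo actually uses

# Hypothetical demonstration file: observations and the discrete actions a human chose.
demos = np.load("boss_run_demo.npz")
demo_obs, demo_actions = demos["obs"], demos["actions"]

policy = model.policy
policy.set_training_mode(True)
for epoch in range(20):
    # Shuffle and iterate in mini-batches.
    idx = np.random.permutation(len(demo_actions))
    for start in range(0, len(idx), 64):
        batch = idx[start:start + 64]
        obs_t = th.as_tensor(demo_obs[batch], dtype=th.float32, device=model.device)
        act_t = th.as_tensor(demo_actions[batch], device=model.device)

        # Maximize the log-probability the current policy assigns to the demonstrated actions.
        _, log_prob, _ = policy.evaluate_actions(obs_t, act_t)
        loss = -log_prob.mean()

        policy.optimizer.zero_grad()
        loss.backward()
        policy.optimizer.step()

# After pretraining, continue with ordinary PPO as before.
model.learn(total_timesteps=100_000)
```

Note that this only pretrains the actor and leaves the value function untrained, so the first PPO updates may partially undo it; the imitation library provides a more complete behavior-cloning implementation that works with Stable-Baselines3 policies if that becomes an issue.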
Hi Sam,
Yes, training in reinforcement learning is known to take a very long time.
That's why it's always parallelized in projects where that's possible, like
running 1000 instances of the game to train simultaneously. That's
obviously not possible with Elden Ring, though.
The training does work, though, and when training on simple bosses like the
Beastman of Farum Azula, Patches, or Mad Pumpkin Head, the agent did
eventually beat them in my test runs. Somewhere on the Discord, there are
training result screenshots, and you can look into TensorBoard logging to
visualize the rewards and episode lengths if you want the hard data.
Regarding the semi-supervised learning approach, there is another project
that was shared on the EldenBot Discord server some time ago. I forget the
name, but it pre-programmed optimal moves, collected and extracted meaning
from sound cues and visuals to determine what move the boss is performing,
and then had an RL agent decide from one of the optimal moves to respond
with. That seemed to work well for specific bosses but takes a lot of work
and will only ever work for bosses you specifically adapt the codebase to.
Our reinforcement learning approach is supposed to be more general, with
the agent being able to choose from basic inputs as actions and learn
strategies it can apply to multiple bosses and situations.
Best regards,
Marco
Okay, that makes sense. I believe I did see the project you are referring to.
I don't really want to pre-program the optimal moves; rather, I want the framework to remain exactly the same, but I just want to feed it successful runs (at least in the beginning) before switching to fully unsupervised training. The current training data is generated by the agent taking random moves and then collecting observations and rewards. I can make the same type of training data from a successful run and feed it in as if the agent had produced it itself. I am having trouble figuring out how to do this with the PPO model, though. It doesn't seem to have an option for user-supplied training data. Maybe there is a way to modify model.learn() to take actions from a file instead of generating them from the current policy. That is the next thing I'll try.
I have been trying to train on the first mini boss in the DLC (Blackgaol Knight), and the agent ends up just dying within the first 2-3 seconds most times, which is probably a big reason why the training is so stagnant.
I am on the Discord server, but I don't see anything there; all of the channels are empty for me. Maybe I can't see the history because I just recently joined?
Best,
Sam
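One way to generate that kind of data without modifying model.learn() at all is to record (observation, action) pairs at roughly the environment's step interval while a human plays, then feed them to a behavior-cloning pass like the one sketched above. A rough helper for that is below; get_obs and get_action are hooks the caller supplies (for example, the env's own observation capture and a mapping from the current keyboard/controller state to the env's discrete action index), since the real capture code depends on the project, and the output file name is made up:

```python
import time
import numpy as np

def record_demo(get_obs, get_action, seconds=180, interval=0.5,
                out_path="boss_run_demo.npz"):
    """Record a human boss attempt as (observation, action) pairs.

    get_obs:    callable returning the same observation the env's step() would produce
    get_action: callable mapping the player's current input to the env's action index
    """
    obs_log, act_log = [], []
    end_time = time.time() + seconds
    while time.time() < end_time:
        obs_log.append(get_obs())
        act_log.append(get_action())
        time.sleep(interval)          # sample at roughly the env's step interval
    np.savez(out_path, obs=np.asarray(obs_log), actions=np.asarray(act_log))
```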
Hey,
Yes, the agent's character should be appropriately leveled so it doesn't get
one-shot by the boss.
The Discord server should work: https://discord.gg/TKaHrukq
Best,
Marco
Hello,
I am running the code and trying to make some improvements. One thing I came across that I am questioning is the order in which things happen in the env.step() function: why are the observation and reward obtained before the action is taken?
From my understanding, the agent makes a decision, which is the 'action' argument passed to env.step(action). Based on this action, it should expect to get back a reward. It seems right now that the action->reward->action->reward cycle is out of sync, since the reward is calculated before the current action is taken.
Is there a reason that it was done this way?
Thank you so much for starting development on this! I look forward to your response :)
Best,
Sam