Pilot status for pilots in PollTime related sleep cycle #7636

Open
MarcusEbert opened this issue Jun 4, 2024 · 5 comments

@MarcusEbert

PILOT_STATES = [SUBMITTED, WAITING, RUNNING, DONE, FAILED, DELETED, ABORTED, UNKNOWN, "Scheduled"]

For the pilot status options defined above, there seems to be no status for a pilot that is running in the batch queue but did not get any payload in its last polling cycle.
Should such a status be made available to the system, both for monitoring and for the SiteDirector's decision about submitting new pilots? (If the total number of pilots in such a sleep mode is larger than the number of available payloads, then no new pilot needs to be submitted.)

@fstagni
Contributor

fstagni commented Jun 4, 2024

Indeed there's no such status. If we are considering making it a status, that would be something like "FINISHED_EMPTY", which would mean "Done with no matched payloads", because of course not matching jobs is not an error! What you are suggesting makes sense, for example, for resources for which pilots are submitted but for which there are no matching payloads, maybe because of non-fully-supported CPU types (I am thinking about ARM).

Introducing such a status is not difficult; using it for decisions in the SiteDirector requires some careful work. To be fair, I would have this in DiracX, because we are reluctant to implement new functionality before that. Unless you want to give it a try yourself...

@MarcusEbert
Author

I'm not sure "Finished_empty" is the right status since a pilot can run multiple payloads and does not need to finish once it could not find a matching payload. It will just go into a sleep mode and try again later.

The main issue I see is the following, which we observe at our site in production:

  • a pilot can run multiple payloads
  • user jobs are very short (let's say ~1h or less)
  • there are often times without available payloads for a site
  • even if there are 1000 payloads available to run on a site, the pilots submitted (one for each payload) will not start all at the same time

Before all pilots submitted to the site are running, the ones that started first have already finished all available payloads. Once there is no more payload available, all pilots are idle and try to poll for new payload periodically depending on "PollingTime" and the hardcoded increase of sleep time once a poll did not return a payload (it would be good to be able to enable/disable that increase via a configuration option).
That means that in the above scenario a large number of pilots can do nothing for a long time, basically wasting CPU resources that could be used otherwise, until new payloads for a specific experiment arrive.
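
The polling behaviour described above, with the requested switch, could look roughly like the sketch below. This is not the actual JobAgent code: the multiplicative `factor`, the `max_delay` cap and the `backoff_enabled` flag are all assumptions made for illustration, since the thread only says that the increase is hardcoded.

```python
from itertools import islice
from typing import Iterator


def poll_delays(
    polling_time: float,
    backoff_enabled: bool = True,  # hypothetical config option
    factor: float = 2.0,           # assumed growth factor, not taken from DIRAC
    max_delay: float = 3600.0,     # assumed upper bound on the sleep time
) -> Iterator[float]:
    """Yield the sleep time before each successive poll while no payload is matched."""
    delay = polling_time
    while True:
        yield delay
        if backoff_enabled:
            # Each unsuccessful poll lengthens the next sleep, up to the cap.
            delay = min(delay * factor, max_delay)


# First four sleep intervals with and without the increase.
with_backoff = list(islice(poll_delays(120), 4))
without_backoff = list(islice(poll_delays(120, backoff_enabled=False), 4))
```

With the flag off, a sleeping pilot keeps polling every "PollingTime" seconds, so it picks up newly arrived payloads sooner, at the cost of more matching requests.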

Also, when there are pilots in "sleep" mode between polling for payload and new payload arrives, the site director does not seem to take into account such pilots. It only seems to take into account pilots that are still idle from a batch system point of view, but not pilots that are running but have no payload. That results in more submitted pilots and again in a larger number of pilots that get no payload and go to sleep, since the payload will already be processed when the currently sleeping pilots poll the next time.

What I suggest is that the site director submits new pilots based on
(available payload for a site - idle pilots in the batch system - running pilots without a payload)

To do so, the status of such running pilots without payload needs to be known.
Alternatively, one could get the number of running pilots without payload via
(running pilots at a site - running payload at a site),
which may not need a new status.
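
As a minimal sketch, the two expressions above can be written as follows. The function and parameter names are hypothetical and do not correspond to anything in the SiteDirector; the estimate also assumes that each pilot runs at most one payload at a time.

```python
def sleeping_pilots_estimate(running_pilots: int, running_payloads: int) -> int:
    """Estimate running pilots without payload, without needing a new status.

    Assumes one payload per pilot at a time; clamped at zero in case the two
    counts were sampled at slightly different moments.
    """
    return max(running_pilots - running_payloads, 0)


def pilots_to_submit(
    available_payloads: int,
    idle_in_batch: int,
    running_without_payload: int,
) -> int:
    """Available payload minus pilots that could already pick it up."""
    return max(available_payloads - idle_in_batch - running_without_payload, 0)


# Example: 300 running pilots, 120 running payloads, 200 pilots still queued.
sleeping = sleeping_pilots_estimate(running_pilots=300, running_payloads=120)
to_submit = pilots_to_submit(
    available_payloads=1000, idle_in_batch=200, running_without_payload=sleeping
)
```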

@fstagni
Contributor

fstagni commented Jun 5, 2024

I'm not sure "Finished_empty" is the right status since a pilot can run multiple payloads and does not need to finish once it could not find a matching payload. It will just go into a sleep mode and try again later.

OK, so I slightly misunderstood your first message: you are not talking about pilots that did not match any job, but pilots for which the last n cycles of the JobAgent did not match jobs.

... try to poll for new payload periodically depending on "PollingTime" and the hardcoded increase of sleep time once a poll did not return a payload (it would be good to be able to enable/disable that increase via a configuration option).

This can be easily done.

Also, when there are pilots in "sleep" mode between polling for payload and new payload arrives, the site director does not seem to take into account such pilots. It only seems to take into account pilots that are still idle from a batch system point of view, but not pilots that are running but have no payload.

The SiteDirector consumes info from the Computing Element. What you are suggesting is to also take into consideration:

  1. the number of jobs pilots can potentially match
  2. knowing that there are sleeping pilots

1) is almost impossible to assess. 2) is potentially possible.

What I suggest is that the site director submits new pilots based on (available payload for a site - idle pilots in the batch system - running pilots without a payload)

This is possible (but won't be very precise anyway)

To do so, the status of such running pilots without payload needs to be known. Alternatively, one could get the number of running pilots without payload via (running pilots at a site - running payload at a site), which may not need a new status.

Instead of having a status (which at the pilot level would be something like "RUNNING_IDLE" or "SLEEPING"), we can also increment or decrement a (central) counter.
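
One way to picture the counter idea is the in-memory sketch below. It is only an illustration under assumptions: the class and method names are hypothetical, and a real central counter would live in a service or database rather than in a Python object.

```python
import threading
from collections import defaultdict


class SleepingPilotCounter:
    """Hypothetical per-site count of running pilots that currently have no payload."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def pilot_went_idle(self, site: str) -> None:
        """Called when a pilot's poll returned no payload and it goes to sleep."""
        with self._lock:
            self._counts[site] += 1

    def pilot_matched_payload(self, site: str) -> None:
        """Called when a previously sleeping pilot gets a payload (or exits)."""
        with self._lock:
            # Never drop below zero, even if a decrement arrives twice.
            self._counts[site] = max(self._counts[site] - 1, 0)

    def sleeping(self, site: str) -> int:
        with self._lock:
            return self._counts.get(site, 0)


counter = SleepingPilotCounter()
counter.pilot_went_idle("SiteA")
counter.pilot_went_idle("SiteA")
counter.pilot_matched_payload("SiteA")
```

A counter of this kind drifts if a sleeping pilot is killed by the batch system without reporting back, so it would probably need a periodic reset or reconciliation against the known pilot states.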

@MarcusEbert
Author

I'm not sure "Finished_empty" is the right status since a pilot can run multiple payloads and does not need to finish once it could not find a matching payload. It will just go into a sleep mode and try again later.

OK, so I slightly misunderstood your first message: you are not talking about pilots that did not match any job, but pilots for which the last n cycles of the JobAgent did not match jobs.

That's correct.

Also, when there are pilots in "sleep" mode between polling for payload and new payload arrives, the site director does not seem to take into account such pilots. It only seems to take into account pilots that are still idle from a batch system point of view, but not pilots that are running but have no payload.

The SiteDirector consumes info from the Computing Element. What you are suggesting is to also take into consideration:

  1. the number of jobs pilots can potentially match
  2. knowing that there are sleeping pilots

1) is almost impossible to assess. 2) is potentially possible.

1) may be possible if the system knows what a specific pilot could match and if the status of each pilot is known to be "Running_Idle"/"Sleeping".

Having 2) and assuming that any job for that site can be matched to any of the sleeping pilots would help as a first step. Adding a config option for a site, e.g. "account for sleeping pilots = yes/no", could disable this feature in case a pilot can only match a specific payload and it is not known which payloads a pilot could potentially match.

What I suggest is that the site director submits new pilots based on (available payload for a site - idle pilots in the batch system - running pilots without a payload)

This is possible (but won't be very precise anyway)

Why would that not be more precise than how it is right now?

To do so, the status of such running pilots without payload needs to be known. Alternatively, one could get the number of running pilots without payload via (running pilots at a site - running payload at a site), which may not need a new status.

Instead of having a status (which at the pilot level would be something like "RUNNING_IDLE" or "SLEEPING"), we can also increment or decrement a (central) counter.

What do you suggest would be counted then?

fstagni self-assigned this Sep 12, 2024
fstagni added the WMS label Sep 12, 2024

@atsareg
Contributor

atsareg commented Dec 13, 2024

I would like to understand if there is indeed an issue here. I mean, if we have idle pilots waiting for payloads, then it means that there are actually no payloads in the TaskQueue. But in this case new pilots would not be submitted, because there is a check before submission (see the SiteDirector._getPilotsWeMayWantToSubmit() method). Maybe you refer to cases when new payloads are added to the previously empty TaskQueue and there is already a number of sleeping pilots out there. In this case, the new "Sleeping" status can be useful, and those pilots can simply be added to the number of Waiting pilots at the site when evaluating the number of pilots to be submitted. This can indeed reduce the number of pilots sent unnecessarily.
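
The decision described here, combined with the per-site switch proposed earlier in the thread, might be sketched as below. This is not the logic of `SiteDirector._getPilotsWeMayWantToSubmit()`; the function name, its parameters and the `account_for_sleeping` flag are hypothetical.

```python
def pilots_needed(
    task_queue_payloads: int,
    waiting_pilots: int,
    sleeping_pilots: int,
    account_for_sleeping: bool = True,  # hypothetical per-site option
) -> int:
    """Number of new pilots to submit, counting sleeping pilots as waiting ones."""
    if task_queue_payloads <= 0:
        # Empty TaskQueue: nothing to match, so nothing is submitted.
        return 0
    effective_waiting = waiting_pilots
    if account_for_sleeping:
        # Sleeping pilots will poll again, so they can absorb new payloads.
        effective_waiting += sleeping_pilots
    return max(task_queue_payloads - effective_waiting, 0)


# 500 payloads arrive in a previously empty TaskQueue; 400 pilots are asleep.
with_sleeping = pilots_needed(500, waiting_pilots=50, sleeping_pilots=400)
ignoring_sleeping = pilots_needed(
    500, waiting_pilots=50, sleeping_pilots=400, account_for_sleeping=False
)
```

Switching the flag off falls back to counting only the pilots that are still waiting in the batch system, which is the safer choice when sleeping pilots cannot match every payload type.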


3 participants