Remove hard limit on events #1233

Open
sethrj opened this issue May 13, 2024 · 9 comments
Labels
physics Particles, processes, and stepping algorithms

Comments


sethrj commented May 13, 2024

Problems:

  • Event "ids" are effectively 64-bit unique identifiers used for global reproducibility. They are not consecutive from zero as we assumed when we first made them.
  • When using multiple streams, events are usually partitioned among streams rather than assigned consecutively, so each stream's per-event track counter array contains num_events * (num_threads - 1) / num_threads zeros (e.g., with 64 events spread over 4 streams, 48 of each stream's 64 counters are never used).
  • Using atomic counters for each event to determine track IDs results in nonreproducible track IDs.
  • Having an event ID higher than max_event raises an assertion.

We could simplify reproducibility by requiring all tracks in flight to be from the same event (i.e., the usual way we integrate into Geant4). We could store the single event ID on the "core state" object since we don't need access to it on GPU. (Or, as of #1447, we just use the "unique event ID" to reset the RNG reproducibly.)

If we want to run multiple events simultaneously, the "event IDs" should be more like event slots so we can have up to N events in flight at once; when a slot's event finishes, we send an end-of-event "action" and let the slot take on a new unique event. This approach could let a single CPU thread + GPU simultaneously handle multiple Geant4 workers.
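For illustration, here is a minimal host-side sketch of what that slot bookkeeping could look like. Everything in it is hypothetical (the `EventSlotManager` name and its methods are assumptions, not existing Celeritas classes): N slots each hold the unique 64-bit event ID currently occupying them, an end-of-event action frees a slot, and the device only ever needs the small slot index, not the 64-bit ID.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

// Hypothetical sketch: map a fixed number of "event slots" to unique
// 64-bit event IDs, recycling a slot when its event finishes.
class EventSlotManager
{
  public:
    using UniqueEventId = std::uint64_t;

    explicit EventSlotManager(std::size_t num_slots) : slots_(num_slots) {}

    // Queue a new event (e.g. from a Geant4 worker) for transport
    void enqueue(UniqueEventId event) { pending_.push_back(event); }

    // Fill any empty slots from the pending queue; return true if any
    // slot is occupied (i.e. there is still work in flight)
    bool fill_slots()
    {
        bool any_active = false;
        for (auto& slot : slots_)
        {
            if (!slot && !pending_.empty())
            {
                slot = pending_.front();
                pending_.pop_front();
            }
            any_active |= slot.has_value();
        }
        return any_active;
    }

    // End-of-event "action": free the slot so a new event can take it
    void end_event(std::size_t slot_index) { slots_[slot_index].reset(); }

    // The unique event ID currently occupying a slot (if any); tracks
    // store only the slot index, and the unique ID stays on the host
    std::optional<UniqueEventId> unique_id(std::size_t slot_index) const
    {
        return slots_[slot_index];
    }

  private:
    std::vector<std::optional<UniqueEventId>> slots_;
    std::deque<UniqueEventId> pending_;
};
```

The point of the sketch is only that the 64-bit unique ID never needs to live on the GPU: the device sees slot indices in [0, N), which is consistent with storing the unique ID on the core state (or using it solely to reset the RNG, as in #1447).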

@sethrj sethrj added the physics Particles, processes, and stepping algorithms label May 13, 2024
@sethrj sethrj changed the title Make track IDs reproducible across runs Require one event at a time to make track IDs reproducible across runs Jun 27, 2024
@sethrj sethrj changed the title Require one event at a time to make track IDs reproducible across runs Require one event at a time Jun 27, 2024

sethrj commented Jul 9, 2024

From @amandalund :

with our standalone app I’ve typically seen better performance transporting multiple events at a time (e.g., ~2x faster for the cms2018+field+msc-vecgeom-gpu regression problem with merge_events=true). Our regression suite merges events for celer-sim on the GPU, right? I don’t think I’ve run those problems in single-event mode in a while, but it would be good to compare. I’d still probably prefer to keep that capability.

My follow-up:

fair enough! I think it'd also be good to run with a few more events (to reduce the load imbalance from each core having only a single event) and make sure that the merge_events=false case has optimal track slots


amandalund commented Sep 15, 2024

Here's where the speedup from running one event at a time stands with all the latest performance improvements (but not partitioning by charge when running with merge_events off, since it doesn't help there):
[Plot: rel-throughput-merge-events]

Still a bit slower when not combining events, but I haven't tried increasing the total number of events or optimizing the number of track slots (which may also be problem dependent).

EDIT: Here's the same plot but with 4x the number of track slots when merge_events is off:
[Plot: rel-throughput-merge-events-4xts]

and the same as above but partitioning by charge with merge_events off as well:
[Plot: rel-throughput-merge-events-partition-4xts]


sethrj commented Sep 16, 2024

@amandalund Why doesn't partitioning by charge help? 😕 (I don't think it helps GPU+G4 since we don't yet pass that option through, right?) And is this after #1405? And so this is just the regression problems with and without "merge event"?


amandalund commented Sep 16, 2024

Right, toggling merge_events and after #1405. And yeah, we don't pass the option to celer-g4 so we shouldn't expect any difference there. Here's the speedup from partitioning by charge with merge_events off (this is with 16 CPU cores and one A100):
[Plot: rel-throughput-merge-events-partition]

It helps a bit in some cases, but more often seems to hurt... I'm not really sure why that is yet.

EDIT: Same plot as above but with 4x the number of track slots:
[Plot: rel-throughput-nomerge-partition-4xts]

amandalund commented:

Ok @sethrj good news: I played around with different numbers of track slots, and increasing this number does improve performance. I added plots above using 4x as many track slots.

With more slots the partitioning does help, and it continues to help relatively more as the state size grows (though the overall performance gets worse past a certain point), so it could be that with a larger state we get less mixing for longer.


sethrj commented Sep 16, 2024

@amandalund To maintain the memory limits (especially with CMS and on V100) I have the runner script reduce the number of track slots when merge_events is off on GPU. Can you verify from the output what the reported track slot count and number of primaries per thread are?

EDIT: and on an unrelated note, do you mind setting transparent=False for the PNGs that you upload to GitHub? They render like this with a dark background 😅
[Screenshot: transparent-background PNG rendered on GitHub's dark theme, 2024-09-16]

amandalund commented:

2^18 track slots and 1300 primaries per thread (I know for celer-sim we do that division in the Runner class).


sethrj commented Sep 16, 2024

Oh ok, that's great; I was afraid that was with the 2^20 number.


sethrj commented Jan 23, 2025

@amandalund I think we can make a "multiple simultaneous event ring buffer" work by trading off storage space for kernel launches. Suppose our event IDs are just the index into 'simultaneous events'. Then we could put a loop around the primary/secondary construction kernels which are super-fast:

  • Filter all the following launches by the event ID to be added
  • Find the number of initializers produced by each thread, then accumulate them (a prefix sum over the threads): the final value is the number of initializers produced by that event in the current step.
  • Set the new track ID using the "max track ID counter" (which is passed as a constant kernel input, not updated atomically), plus the accumulated initializer count, plus the within-kernel loop index over the secondaries of each parent track
  • Update the max track ID counter using the previously accumulated maximum (this could be done using an async copy during the previous kernel).

Using a loop instead of increasing resources makes the bookkeeping trivial and reduces the resource requirements for initialization. We can also "prioritize" which events are completed first by reordering the event IDs before the loop so that the higher-priority initializers are created last and end up on top of the initializer stack.
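A serial sketch of that bookkeeping for a single event slot, with hypothetical names (`ParentRecord`, `assign_track_ids`, and the per-event `max_track_id` vector are all illustrative; the real version would be a filtered GPU kernel plus a device scan such as `thrust::exclusive_scan`):

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using TrackId = std::uint64_t;

// Hypothetical per-parent-track record of initializers produced this step
struct ParentRecord
{
    std::size_t event_slot;       // which in-flight event this track belongs to
    std::size_t num_secondaries;  // initializers it produced this step
};

// Assign reproducible track IDs for one event slot. max_track_id is the
// per-event running counter: read as a constant and updated once at the end
// (not atomically), so the result does not depend on thread scheduling.
void assign_track_ids(std::vector<ParentRecord> const& parents,
                      std::size_t event_slot,
                      std::vector<TrackId>& max_track_id,
                      std::vector<std::vector<TrackId>>& new_ids)
{
    // 1. Count initializers produced by each parent, filtered by event
    std::vector<std::size_t> counts(parents.size(), 0);
    for (std::size_t i = 0; i < parents.size(); ++i)
    {
        if (parents[i].event_slot == event_slot)
            counts[i] = parents[i].num_secondaries;
    }

    // 2. Exclusive prefix sum: offsets[i] is where parent i's secondaries
    //    start; offsets.back() + counts.back() is the event's total
    std::vector<std::size_t> offsets(counts.size(), 0);
    std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(),
                        std::size_t{0});

    // 3. New track ID = running counter + scanned offset + local index
    new_ids.assign(parents.size(), {});
    for (std::size_t i = 0; i < parents.size(); ++i)
    {
        for (std::size_t j = 0; j < counts[i]; ++j)
        {
            new_ids[i].push_back(max_track_id[event_slot] + offsets[i] + j);
        }
    }

    // 4. Update the counter once with the accumulated total (on the GPU this
    //    could be an async copy during the previous kernel)
    std::size_t total = counts.empty() ? 0 : offsets.back() + counts.back();
    max_track_id[event_slot] += total;
}
```

Because the offsets come from a scan rather than an atomic counter, the same set of parents always yields the same track IDs regardless of execution order, which is the reproducibility requirement this issue is about.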
