Remove hard limit on events #1233

Open
sethrj opened this issue May 13, 2024 · 9 comments
Labels
physics Particles, processes, and stepping algorithms

Comments


sethrj commented May 13, 2024

Problems:

  • Event "ids" are effectively 64-bit unique identifiers used for global reproducibility. They are not consecutive from zero as we assumed when we first made them.
  • When using multiple streams, events are usually partitioned among streams rather than assigned consecutively, so each stream's per-event track counter array contains num_events * (num_threads - 1) / num_threads zeros (e.g., with 64 events spread over 4 streams, 48 of each stream's 64 counters are never used).
  • Using atomic counters for each event to determine track IDs results in nonreproducible track IDs.
  • Having an event ID higher than max_event raises an assertion.

We could simplify reproducibility by requiring all tracks in flight to be from the same event (i.e., the usual way we integrate into Geant4). We could store the single event ID on the "core state" object since we don't need access to it on GPU. (Or, as of #1447, we just use the "unique event ID" to reset the RNG reproducibly.)

If we want to run multiple events simultaneously, the "event IDs" should be more like event slots so we can have up to N events in flight at once; when a slot's event finishes, we send an end-of-event "action" and let the slot take on a new unique event. This approach could let a single CPU thread + GPU simultaneously handle multiple Geant4 workers.
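For illustration, here is a minimal host-side sketch of what that slot bookkeeping could look like. Everything in it is hypothetical (the `EventSlotManager` name and its methods are assumptions, not existing Celeritas classes): N slots each hold the unique 64-bit event ID currently occupying them, an end-of-event action frees a slot, and the device only ever needs the small slot index, not the 64-bit ID.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

// Hypothetical sketch: map a fixed number of "event slots" to unique
// 64-bit event IDs, recycling a slot when its event finishes.
class EventSlotManager
{
  public:
    using UniqueEventId = std::uint64_t;

    explicit EventSlotManager(std::size_t num_slots) : slots_(num_slots) {}

    // Queue a new event (e.g. from a Geant4 worker) for transport
    void enqueue(UniqueEventId event) { pending_.push_back(event); }

    // Fill any empty slots from the pending queue; return true if any
    // slot is occupied (i.e. there is still work in flight)
    bool fill_slots()
    {
        bool any_active = false;
        for (auto& slot : slots_)
        {
            if (!slot && !pending_.empty())
            {
                slot = pending_.front();
                pending_.pop_front();
            }
            any_active |= slot.has_value();
        }
        return any_active;
    }

    // End-of-event "action": free the slot so a new event can take it
    void end_event(std::size_t slot_index) { slots_[slot_index].reset(); }

    // The unique event ID currently occupying a slot (if any); tracks
    // store only the slot index, and the unique ID stays on the host
    std::optional<UniqueEventId> unique_id(std::size_t slot_index) const
    {
        return slots_[slot_index];
    }

  private:
    std::vector<std::optional<UniqueEventId>> slots_;
    std::deque<UniqueEventId> pending_;
};
```

The point of the sketch is only that the 64-bit unique ID never needs to live on the GPU: the device sees slot indices in [0, N), which is consistent with storing the unique ID on the core state (or using it solely to reset the RNG, as in #1447).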

@sethrj sethrj added the physics Particles, processes, and stepping algorithms label May 13, 2024
@sethrj sethrj changed the title Make track IDs reproducible across runs Require one event at a time to make track IDs reproducible across runs Jun 27, 2024
@sethrj sethrj changed the title Require one event at a time to make track IDs reproducible across runs Require one event at a time Jun 27, 2024

sethrj commented Jul 9, 2024

From @amandalund :

with our standalone app I’ve typically seen better performance transporting multiple events at a time (e.g., ~2x faster for the cms2018+field+msc-vecgeom-gpu regression problem with merge_events=true). Our regression suite merges events for celer-sim on the GPU, right? I don’t think I’ve run those problems in single-event mode in a while, but it would be good to compare. I’d still probably prefer to keep that capability.

My follow-up:

fair enough! I think it'd also be good to run with a few more events (to reduce the load imbalance from each core having only a single event) and make sure that the merge_events=false case has optimal track slots


amandalund commented Sep 15, 2024

Here's where the speedup from running one event at a time stands with all the latest performance improvements (but not partitioning by charge when running with merge_events off, since it doesn't help there):
[Plot: rel-throughput-merge-events]

Still a bit slower when not combining events, but I haven't tried increasing the total number of events or optimizing the number of track slots (which may also be problem dependent).

EDIT: Here's the same plot but with 4x the number of track slots when merge_events is off:
[Plot: rel-throughput-merge-events-4xts]

and the same as above but partitioning by charge with merge_events off as well:
[Plot: rel-throughput-merge-events-partition-4xts]


sethrj commented Sep 16, 2024

@amandalund Why doesn't partitioning by charge help? 😕 (I don't think it helps GPU+G4 since we don't yet pass that option through, right?) And is this after #1405? And so this is just the regression problems with and without "merge event"?


amandalund commented Sep 16, 2024

Right, toggling merge_events and after #1405. And yeah, we don't pass the option to celer-g4 so we shouldn't expect any difference there. Here's the speedup from partitioning by charge with merge_events off (this is with 16 CPU cores and one A100):
[Plot: rel-throughput-merge-events-partition]

It helps a bit in some cases, but more often seems to hurt... I'm not really sure why that is yet.

EDIT: Same plot as above but with 4x the number of track slots:
[Plot: rel-throughput-nomerge-partition-4xts]

amandalund commented:

Ok @sethrj good news: I played around with different numbers of track slots, and increasing this number does improve performance. I added plots above using 4x as many track slots.

With more slots the partitioning does help, and it continues to help relatively more as the state size grows (though the overall performance gets worse past a certain point), so it could be that with a larger state we get less mixing for longer.


sethrj commented Sep 16, 2024

@amandalund To maintain the memory limits (especially with CMS and on V100) I have the runner script reduce the number of track slots when merge_events is off on GPU. Can you verify from the output what the reported track slot count and number of primaries per thread are?

EDIT: and on an unrelated note, do you mind setting transparent=False for the PNGs that you upload to GitHub? They render like this with a dark background 😅
[Screenshot: transparent-background PNG rendered on GitHub's dark theme, 2024-09-16]

amandalund commented:

2^18 track slots and 1300 primaries per thread (I know for celer-sim we do that division in the Runner class).


sethrj commented Sep 16, 2024

Oh ok, that's great; I was afraid that was with the 2^20 number.


sethrj commented Jan 23, 2025

@amandalund I think we can make a "multiple simultaneous event ring buffer" work by trading off storage space for kernel launches. Suppose our event IDs are just the index into 'simultaneous events'. Then we could put a loop around the primary/secondary construction kernels which are super-fast:

  • Filter all the following launches by the event ID to be added
  • Find the number of initializers produced by each thread, then accumulate them (a prefix sum over the threads): the final value is the number of initializers produced by that event in the current step.
  • Set the new track ID using the "max track ID counter" (which is passed as a constant kernel input, not updated atomically), plus the accumulated initializer count, plus the within-kernel loop index over the secondaries of each parent track
  • Update the max track ID counter using the previously accumulated maximum (this could be done using an async copy during the previous kernel).

Using a loop instead of increasing resources makes the bookkeeping trivial and reduces the resource requirements for initialization. We can also "prioritize" which events are completed first by reordering the event IDs before the loop so that the higher-priority initializers are created last and end up on top of the initializer stack.
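A serial sketch of that bookkeeping for a single event slot, with hypothetical names (`ParentRecord`, `assign_track_ids`, and the per-event `max_track_id` vector are all illustrative; the real version would be a filtered GPU kernel plus a device scan such as `thrust::exclusive_scan`):

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using TrackId = std::uint64_t;

// Hypothetical per-parent-track record of initializers produced this step
struct ParentRecord
{
    std::size_t event_slot;       // which in-flight event this track belongs to
    std::size_t num_secondaries;  // initializers it produced this step
};

// Assign reproducible track IDs for one event slot. max_track_id is the
// per-event running counter: read as a constant and updated once at the end
// (not atomically), so the result does not depend on thread scheduling.
void assign_track_ids(std::vector<ParentRecord> const& parents,
                      std::size_t event_slot,
                      std::vector<TrackId>& max_track_id,
                      std::vector<std::vector<TrackId>>& new_ids)
{
    // 1. Count initializers produced by each parent, filtered by event
    std::vector<std::size_t> counts(parents.size(), 0);
    for (std::size_t i = 0; i < parents.size(); ++i)
    {
        if (parents[i].event_slot == event_slot)
            counts[i] = parents[i].num_secondaries;
    }

    // 2. Exclusive prefix sum: offsets[i] is where parent i's secondaries
    //    start; offsets.back() + counts.back() is the event's total
    std::vector<std::size_t> offsets(counts.size(), 0);
    std::exclusive_scan(counts.begin(), counts.end(), offsets.begin(),
                        std::size_t{0});

    // 3. New track ID = running counter + scanned offset + local index
    new_ids.assign(parents.size(), {});
    for (std::size_t i = 0; i < parents.size(); ++i)
    {
        for (std::size_t j = 0; j < counts[i]; ++j)
        {
            new_ids[i].push_back(max_track_id[event_slot] + offsets[i] + j);
        }
    }

    // 4. Update the counter once with the accumulated total (on the GPU this
    //    could be an async copy during the previous kernel)
    std::size_t total = counts.empty() ? 0 : offsets.back() + counts.back();
    max_track_id[event_slot] += total;
}
```

Because the offsets come from a scan rather than an atomic counter, the same set of parents always yields the same track IDs regardless of execution order, which is the reproducibility requirement this issue is about.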
