GPU start/stop being logged as INT64_MAX #1838

elliottslaughter · 2025-02-28T01:28:04Z

I have a profile where GPU task start/stop times are being logged as 0x8000000000000000, i.e., INT64_MAX.

What is this supposed to mean exactly? What is the profiler supposed to do with this?

[src/state.rs:4746:17] &record = GPUTaskInfo {
    op_id: OpID(
        460,
    ),
    task_id: TaskID(
        454,
    ),
    variant_id: VariantID(
        1192,
    ),
    proc_id: ProcID(
        2089670227099910148,
    ),
    create: Timestamp(
        553760398,
    ),
    ready: Timestamp(
        871627394,
    ),
    start: Timestamp(
        871771612,
    ),
    stop: Timestamp(
        871937060,
    ),
    gpu_start: Timestamp(
        9223372036854775808,
    ),
    gpu_stop: Timestamp(
        9223372036854775808,
    ),
    creator: Some(
        EventID(
            9223372105692741656,
        ),
    ),
    critical: Some(
        EventID(
            9223372105786064901,
        ),
    ),
    fevent: EventID(
        9223372105771384836,
    ),
}

Log file: 0.log

lightsighter · 2025-03-02T20:20:20Z

Do you know if this particular "GPU task" actually ran any kernels on the GPU? Legion is just passing through whatever Realm gives it for the GPU timing profiling response:

https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_profiling.cc?ref_type=heads#L860-861

I suspect that this GPU task doesn't actually launch any kernels and therefore does not incur any GPU work start/stop times, so having INT64_MAX for both timestamps would mean that there was no GPU work for this task. @elliottslaughter can you confirm whether that is the case?

elliottslaughter · 2025-03-02T23:35:10Z

@mariodirenzo will have to confirm, I don't actually know anything about the specific application.

Are Legion timestamps signed? I was under the impression they were unsigned, because we can never produce a value < 0. For the record, a UINT64_MAX value would have triggered an assertion much earlier, because the current parser expects GPU timestamps to be non-optional.

lightsighter · 2025-03-03T06:24:20Z

> Are Legion timestamps signed?

Both Realm and Legion timestamps are signed. I don't know why they are that way. Legion is just consistent with Realm (I need to make this relationship more explicit).
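For reference on the signedness question, here is a minimal Rust sketch of how the value in the record above reads under each interpretation. It assumes the profiler holds the raw timestamp as a `u64` (the printed value 9223372036854775808 does not fit in an `i64`); `as_signed` is a hypothetical helper, not part of the profiler.

```rust
/// Hypothetical helper: reinterpret the raw 64 bits of a logged timestamp
/// as a signed value. The `as` cast between u64 and i64 preserves the bits.
fn as_signed(raw: u64) -> i64 {
    raw as i64
}

fn main() {
    // gpu_start / gpu_stop exactly as printed in the record above.
    let logged: u64 = 9223372036854775808;

    // The same value written in hex: only the top bit is set.
    assert_eq!(logged, 0x8000_0000_0000_0000);

    // Read as a signed 64-bit integer, that bit pattern is the minimum value.
    assert_eq!(as_signed(logged), i64::MIN);

    // The signed maximum is one less than the logged value.
    assert_eq!(i64::MAX as u64, logged - 1);

    println!("unsigned: {logged}, signed: {}", as_signed(logged));
}
```

Under these assumptions the logged bit pattern is one past the signed 64-bit maximum and, read as a signed value, equals the signed 64-bit minimum.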

lightsighter · 2025-03-03T06:27:58Z

> @mariodirenzo will have to confirm, I don't actually know anything about the specific application.

@elliottslaughter Can you fish the name of the task out of the profiler so @mariodirenzo knows which task to look at?

elliottslaughter · 2025-03-03T19:06:41Z

Ok, here are the task and variant:

[src/state.rs:4747:17] state.task_kinds.get(task_id) = Some(
    TaskKind {
        task_id: TaskID(
            454,
        ),
        name: Some(
            "UpdatePrimitivePropertiesFromConserved",
        ),
    },
)
[src/state.rs:4748:17] state.variants.get(&(*task_id, *variant_id)) = Some(
    Variant {
        variant_id: VariantID(
            1192,
        ),
        message: false,
        _ordered_vc: false,
        name: "UpdatePrimitivePropertiesFromConserved",
        task_id: Some(
            TaskID(
                454,
            ),
        ),
        color: None,
    },
)

mariodirenzo · 2025-03-04T22:06:49Z

That task launches multiple kernels. Some of these kernels run in separate streams.

lightsighter · 2025-03-05T08:29:51Z

Some of the kernels or all of the kernels? Also were you running with -cuda:cupti 0 by any chance?

mariodirenzo · 2025-03-05T11:26:43Z

> Some of the kernels or all of the kernels?

Only a few. There is a kernel running on the task stream. An event is recorded after the first kernel. When it is triggered, the task executes a few kernels on parallel streams. When these kernels are done, CUDA events guarantee that the default task kernel waits for the completion of all the task work.

> Also were you running with -cuda:cupti 0 by any chance?

No, I was not.

lightsighter · 2025-03-05T16:16:41Z

@muraj Can you think of a reason that Realm would effectively return "invalid timestamp" (INT64_MAX) values for the GPU work start/stop times even when the task has CUDA activity inside of it? The other strange thing here is that the values being returned are INT64_MAX and not INVALID_TIMESTAMP (LLONG_MIN).
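On the original question of what the profiler could do with such values, one option is to decode GPU timestamps as optional. The sketch below is an illustration under stated assumptions, not the profiler's actual code: `INVALID_TIMESTAMP` and `decode_gpu_timestamp` are hypothetical names, the sentinel is taken to be LLONG_MIN as mentioned in this thread (not verified against Realm), and the raw value is assumed to be read as a `u64`.

```rust
/// Assumed sentinel: LLONG_MIN, as mentioned in this thread
/// (not verified against the Realm sources).
const INVALID_TIMESTAMP: i64 = i64::MIN;

/// Hypothetical decoder: reinterpret the raw bits as signed, then map the
/// sentinel to `None` and every other value to `Some`.
fn decode_gpu_timestamp(raw: u64) -> Option<i64> {
    let signed = raw as i64; // bit-preserving cast
    if signed == INVALID_TIMESTAMP {
        None
    } else {
        Some(signed)
    }
}

fn main() {
    // gpu_start from the record in this issue.
    assert_eq!(decode_gpu_timestamp(9223372036854775808), None);
    // start from the same record, an ordinary timestamp.
    assert_eq!(decode_gpu_timestamp(871771612), Some(871771612));
}
```

With this shape, a task with no recorded GPU interval would carry `None` for both fields instead of tripping a parser assertion. Whether that is the right meaning for these values still depends on what Realm intends to report here.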
