GPU start/stop being logged as INT64_MAX #1838

Open
elliottslaughter opened this issue Feb 28, 2025 · 9 comments

@elliottslaughter
Contributor

I have a profile where GPU task start/stop times are being logged as 0x8000000000000000 (9223372036854775808, i.e., 2^63, one past INT64_MAX).

What is this supposed to mean exactly? What is the profiler supposed to do with this?

[src/state.rs:4746:17] &record = GPUTaskInfo {
    op_id: OpID(
        460,
    ),
    task_id: TaskID(
        454,
    ),
    variant_id: VariantID(
        1192,
    ),
    proc_id: ProcID(
        2089670227099910148,
    ),
    create: Timestamp(
        553760398,
    ),
    ready: Timestamp(
        871627394,
    ),
    start: Timestamp(
        871771612,
    ),
    stop: Timestamp(
        871937060,
    ),
    gpu_start: Timestamp(
        9223372036854775808,
    ),
    gpu_stop: Timestamp(
        9223372036854775808,
    ),
    creator: Some(
        EventID(
            9223372105692741656,
        ),
    ),
    critical: Some(
        EventID(
            9223372105786064901,
        ),
    ),
    fevent: EventID(
        9223372105771384836,
    ),
}

Log file: 0.log

@lightsighter
Contributor

Do you know if this particular "GPU task" actually ran any kernels on the GPU? Legion is just passing through whatever Realm gives it for the GPU timing profiling response:

https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/legion/legion_profiling.cc?ref_type=heads#L860-861

I suspect that this GPU task doesn't actually launch any kernels and therefore incurs no GPU work start/stop times, so INT64_MAX for both timestamps would mean that there was no GPU work for this task. @elliottslaughter can you confirm whether that is the case?

@elliottslaughter
Contributor Author

@mariodirenzo will have to confirm, I don't actually know anything about the specific application.

Are Legion timestamps signed? I was under the impression they were unsigned, because we can never produce a value < 0. For the record, a UINT64_MAX value would have triggered an assertion much earlier, because the current parser expects GPU timestamps to be non-optional.

@lightsighter
Contributor

> Are Legion timestamps signed?

Both Realm and Legion timestamps are signed. I don't know why they are that way. Legion is just consistent with Realm (I need to make this relationship more explicit).
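
Because the timestamps are signed, the logged value can be read two ways. The sketch below assumes the profiler holds the raw field as a `u64` (an assumption based on the debug output above, not something confirmed in this thread) and reinterprets the same bits as an `i64`. The names `LOGGED` and `as_signed` are illustrative only.

```rust
/// Value printed for `gpu_start` and `gpu_stop` in the record above.
const LOGGED: u64 = 9223372036854775808;

/// Reinterpret an unsigned 64-bit pattern as a signed 64-bit value
/// (two's complement; no bits change).
fn as_signed(raw: u64) -> i64 {
    raw as i64
}

fn main() {
    println!("hex:       {:#x}", LOGGED);
    println!("as signed: {}", as_signed(LOGGED));
    println!("i64::MAX:  {}", i64::MAX);
    println!("i64::MIN:  {}", i64::MIN);
}
```

Read as unsigned, the value is 2^63, one past `i64::MAX` (9223372036854775807). Read as signed, the same bits are `i64::MIN`.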

@lightsighter
Contributor

> @mariodirenzo will have to confirm, I don't actually know anything about the specific application.

@elliottslaughter Can you fish the name of the task out of the profiler so @mariodirenzo knows which task to look at?

@elliottslaughter
Contributor Author

Ok, here are the task and variant:

[src/state.rs:4747:17] state.task_kinds.get(task_id) = Some(
    TaskKind {
        task_id: TaskID(
            454,
        ),
        name: Some(
            "UpdatePrimitivePropertiesFromConserved",
        ),
    },
)
[src/state.rs:4748:17] state.variants.get(&(*task_id, *variant_id)) = Some(
    Variant {
        variant_id: VariantID(
            1192,
        ),
        message: false,
        _ordered_vc: false,
        name: "UpdatePrimitivePropertiesFromConserved",
        task_id: Some(
            TaskID(
                454,
            ),
        ),
        color: None,
    },
)

@mariodirenzo

That task launches multiple kernels. Some of these kernels run in separate streams.

@lightsighter
Contributor

lightsighter commented Mar 5, 2025

Some of the kernels or all of the kernels? Also were you running with -cuda:cupti 0 by any chance?

@mariodirenzo

> Some of the kernels or all of the kernels?

Only a few. There is a kernel running on the task stream. An event is recorded after the first kernel. When it is triggered, the task executes a few kernels on parallel streams. When these kernels are done, CUDA events guarantee that the default task kernel waits for the completion of all the task work.

> Also were you running with -cuda:cupti 0 by any chance?

No, I was not.

@lightsighter
Contributor

@muraj Can you think of a reason that Realm would effectively return "invalid timestamp" (INT64_MAX) values for the GPU work start/stop times even when the task has CUDA activity inside of it? The other strange thing here is that the values being returned are INT64_MAX and not INVALID_TIMESTAMP (LLONG_MIN).
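
One way a consumer of these records could make the "no GPU timing" case explicit is to map the sentinel to `None` instead of carrying it as a real time. The sketch below is hypothetical: `INVALID_TIMESTAMP` is taken to be `LLONG_MIN` as stated above, while `gpu_timestamp` and `gpu_interval` are illustrative names, not the profiler's or Realm's actual API.

```rust
/// Sentinel for "no timestamp", assumed equal to LLONG_MIN as described above.
const INVALID_TIMESTAMP: i64 = i64::MIN;

/// Convert a raw signed timestamp into an optional non-negative time.
/// Returns `None` for the sentinel and for any other negative value.
fn gpu_timestamp(raw: i64) -> Option<u64> {
    if raw == INVALID_TIMESTAMP || raw < 0 {
        None
    } else {
        Some(raw as u64)
    }
}

/// Pair start/stop only when both are valid and correctly ordered.
fn gpu_interval(start: i64, stop: i64) -> Option<(u64, u64)> {
    let (s, e) = (gpu_timestamp(start)?, gpu_timestamp(stop)?);
    if s <= e { Some((s, e)) } else { None }
}

fn main() {
    // The logged value 9223372036854775808, reinterpreted as signed.
    let raw = 9223372036854775808u64 as i64;
    println!("{:?}", gpu_interval(raw, raw));
    // An ordinary pair of timestamps (the CPU start/stop from the record).
    println!("{:?}", gpu_interval(871771612, 871937060));
}
```

Note that 9223372036854775808 read as a signed 64-bit integer is `LLONG_MIN`, so under this sketch the logged record would map to `None`. Whether that matches what Realm intends is exactly the open question here.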
