Avoid lazy init of Blas handles, fix for non-canonical dots #93

Merged
merged 2 commits into from
Jan 23, 2025


@pemeliya pemeliya commented Jan 20, 2025


i-chaochen commented Jan 20, 2025

For https://ontrack-internal.amd.com/browse/SWDEV-482895 we need to backport to 0.4.30

@pemeliya
Author

It looks like we have a serious RCCL problem on the CI nodes. Several multi-GPU tests (e.g. functional_hlo_runner_test) fail with:

19:21:31  E0000 00:00:1737397209.417337   83920 pjrt_stream_executor_client.cc:3085] Execution of replica 0 failed: INTERNAL: xla/service/gpu/runtime/nccl_api.cc:515: NCCL operation ncclGroupEnd() failed: internal error - please report this issue to the NCCL developers. Last NCCL warning(error) log entry (may be unrelated) 'Failed to find reverse path from remNode 0/11000 nlinks 2 to node 0/ae000'.
19:21:31  E0000 00:00:1737397209.417412   83923 pjrt_stream_executor_client.cc:3085] Execution of replica 0 failed: INTERNAL: xla/service/gpu/runtime/nccl_api.cc:515: NCCL operation ncclGroupEnd() failed: internal error - please report this issue to the NCCL developers. Last NCCL warning(error) log entry (may be unrelated) 'Failed to find reverse path from remNode 0/11000 nlinks 2 to node 0/ae000'.

@i-chaochen

I think @hsharsha mentioned that this is because a few nodes are broken.

@hsharsha hsharsha merged commit 0cdfcdb into rocm-jaxlib-v0.4.35-qa Jan 23, 2025
6 of 9 checks passed
