BUG: deadlock in transfer actor in the case of GPU #488
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #488      +/-   ##
===========================================
- Coverage   93.52%   67.93%   -25.59%
===========================================
  Files        1026     1026
  Lines       79516    79532      +16
  Branches    16475    16479       +4
===========================================
- Hits        74365    54031   -20334
- Misses       3468    23289   +19821
- Partials     1683     2212     +529
Fixed by #788.
What do these changes do?
The current implementation of the transfer function leads to a deadlock when executing Xorbits on multiple GPUs. The issue arises from the StorageHandlerActor.fetch_batch function, which invokes SenderManagerActor.send_batch_data, which in turn calls StorageHandlerActor.request_quota_with_spill. Due to the locking mechanism within the StorageHandlerActor method call, the nested call blocks and a deadlock arises.

UPDATE: with ucx enabled, performance does not improve on multiple GPUs.

Related issue number
Check code requirements
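
For readers unfamiliar with this failure mode, the call chain described under "What do these changes do?" can be reproduced in miniature. The sketch below is not Xorbits code: the classes are hypothetical stand-ins that only borrow the method names from the description, and it assumes that each handler method call is serialized by a single non-reentrant asyncio.Lock. A timeout is used so the script reports the hang instead of blocking forever.

```python
import asyncio


class StorageHandler:
    """Hypothetical stand-in: every method call is guarded by one lock."""

    def __init__(self):
        self._lock = asyncio.Lock()  # asyncio.Lock is not reentrant
        self.sender = None

    async def fetch_batch(self):
        async with self._lock:
            # The lock is still held while we wait on the sender.
            return await self.sender.send_batch_data()

    async def request_quota_with_spill(self):
        # Needs the same lock that fetch_batch is holding.
        async with self._lock:
            return "quota granted"


class Sender:
    """Hypothetical stand-in for the sending side."""

    def __init__(self, storage_handler):
        self.storage_handler = storage_handler

    async def send_batch_data(self):
        # Calls back into the handler that is currently waiting on us.
        return await self.storage_handler.request_quota_with_spill()


async def run_fetch(timeout=0.2):
    handler = StorageHandler()
    handler.sender = Sender(handler)
    try:
        return await asyncio.wait_for(handler.fetch_batch(), timeout)
    except asyncio.TimeoutError:
        # The nested call never acquired the lock, so fetch_batch never returned.
        return "deadlock"


result = asyncio.run(run_fetch())
print(result)  # prints "deadlock"
```

In this toy version the cycle is fetch_batch (holds the lock) → send_batch_data → request_quota_with_spill (waits for the same lock). Any change that avoids holding the lock across the outgoing call, or that avoids the callback into the same handler, would break the cycle here; which approach the actual fix takes is not stated in this description.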