Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: dead lock in transfer actor in the case of GPU #488

Closed
wants to merge 10 commits into from

Conversation

UranusSeven
Copy link
Contributor

@UranusSeven UranusSeven commented Jun 1, 2023

What do these changes do?

The current implementation of the transfer function leads to a deadlock when executing Xorbits on multiple GPUs. The issue arises from the StorageHandlerActor.fetch_batch function, which invokes SenderManagerActor.send_batch_data and subsequently calls StorageHandlerActor.request_quota_with_spill. Due to the locking mechanism within the StorageHandlerActor method call, a deadlock arises.

UPDATE:

  • fix dead lock on GPU
  • fix GPU buffer size 0 issue
  • known issue: when ucx is enabled, performance is not improved on multi GPUs.
  • known issue: too many actor calls in this implement. May lead to performance down.

Related issue number

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass

@XprobeBot XprobeBot added this to the v0.3.2 milestone Jun 1, 2023
@codecov
Copy link

codecov bot commented Jun 1, 2023

Codecov Report

Attention: Patch coverage is 78.26087% with 20 lines in your changes missing coverage. Please review.

Project coverage is 67.93%. Comparing base (b4d70cc) to head (8fea60f).
Report is 62 commits behind head on main.

Files with missing lines Patch % Lines
python/xorbits/_mars/services/storage/handler.py 81.94% 10 Missing and 3 partials ⚠️
python/xorbits/_mars/services/storage/transfer.py 61.11% 5 Missing and 2 partials ⚠️

❗ There is a different number of reports uploaded between BASE (b4d70cc) and HEAD (8fea60f). Click for more details.

HEAD has 13 uploads less than BASE
Flag BASE (b4d70cc) HEAD (8fea60f)
unittests 24 11
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #488       +/-   ##
===========================================
- Coverage   93.52%   67.93%   -25.59%     
===========================================
  Files        1026     1026               
  Lines       79516    79532       +16     
  Branches    16475    16479        +4     
===========================================
- Hits        74365    54031    -20334     
- Misses       3468    23289    +19821     
- Partials     1683     2212      +529     
Flag Coverage Δ
unittests 67.85% <78.26%> (-25.56%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@UranusSeven UranusSeven marked this pull request as draft June 2, 2023 06:20
@XprobeBot XprobeBot modified the milestones: v0.3.2, v0.4.1 Jul 1, 2023
@XprobeBot XprobeBot modified the milestones: v0.4.1, v0.4.3 Jul 11, 2023
@XprobeBot XprobeBot modified the milestones: v0.4.3, v0.4.4, v0.4.5 Jul 20, 2023
@XprobeBot XprobeBot modified the milestones: v0.5.0, v0.5.1 Jul 28, 2023
@ChengjieLi28 ChengjieLi28 marked this pull request as ready for review August 7, 2023 06:43
@ChengjieLi28 ChengjieLi28 changed the title REF: refactor transfer to avoid deadlocks BUG: dead lock in transfer actor in the case of GPU Aug 7, 2023
@XprobeBot XprobeBot added bug Something isn't working and removed refactor labels Aug 7, 2023
@ChengjieLi28 ChengjieLi28 force-pushed the ref/transfer branch 2 times, most recently from 762cd55 to 8a58fe2 Compare August 7, 2023 10:56
@XprobeBot XprobeBot modified the milestones: v0.5.1, v0.5.2 Aug 14, 2023
@XprobeBot XprobeBot modified the milestones: v0.5.2, v0.6.0, v0.6.1 Sep 8, 2023
@XprobeBot XprobeBot modified the milestones: v0.6.1, v0.6.2 Sep 15, 2023
@XprobeBot XprobeBot modified the milestones: v0.6.2, v0.6.3 Sep 20, 2023
@XprobeBot XprobeBot modified the milestones: v0.6.3, v0.7.0 Sep 25, 2023
@XprobeBot XprobeBot modified the milestones: v0.7.0, v0.7.1 Oct 23, 2023
@XprobeBot XprobeBot modified the milestones: v0.7.1, v0.7.2 Nov 21, 2023
@XprobeBot XprobeBot modified the milestones: v0.7.2, v0.7.3 Jan 5, 2024
@XprobeBot XprobeBot modified the milestones: v0.7.3, v0.7.4 Aug 22, 2024
@luweizheng
Copy link
Collaborator

Fix by #788

@luweizheng luweizheng closed this Dec 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working gpu
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants