
Revert "Set MALLOC_TRIM_THRESHOLD_ before interpreter start" #6777

Merged
merged 1 commit into from
Jul 22, 2022

Conversation

@gjoseph92 (Collaborator)

Reverts #6681

Closes #6749

cc @pentschev. I know you've found a workaround for dask-cuda in rapidsai/dask-cuda#955, but I'm kind of inclined to revert this anyway and then spend a little time implementing a better solution. Given the unexpected impact this had on dask-cuda, I'm worried it might break things for others too. We always knew the current implementation was a bit hacky anyway.

@github-actions bot (Contributor) commented Jul 21, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files (±0), 15 suites (±0), 6h 16m 53s duration (-6m 4s)
2 976 tests (-1): 2 884 passed (-3), 88 skipped (±0), 3 failed (+1), 1 errored (+1)
22 064 runs (-8): 21 029 passed (-6), 1 029 skipped (-6), 5 failed (+3), 1 errored (+1)

For more details on these failures and errors, see this check.

Results for commit 277a397. Comparison against base commit add3663.

♻️ This comment has been updated with latest results.

@pentschev (Member) left a comment

Sounds good to me, thanks @gjoseph92 !

@crusaderky (Collaborator)

I'm against this revert. The number of users benefiting from the automatic MALLOC_TRIM_THRESHOLD_ vastly, vastly outnumbers the number of users who need different env variables in different workers.

@pentschev (Member)

The number of users benefiting from the automatic MALLOC_TRIM_THRESHOLD_ vastly, vastly outnumbers the number of users who need different env variables in different workers.

Can you provide evidence to support this claim? For example, I've never heard of anyone complaining about the deallocation issues that MALLOC_TRIM_THRESHOLD_ is trying to work around for Dask-CUDA, even though all Dask-CUDA users rely on environment variables that differ among workers. Also, feel free to correct me if I'm wrong, but if the user simply runs MALLOC_TRIM_THRESHOLD_=65535 python (or whatever the launched process is, if not python), that would have the same effect without any other side effects, am I correct?

In particular, I dislike the claim that it's fine to keep things we know are broken because the change provides an automatic benefit for some users who could achieve the same behavior with one small change to their workflows (provided my understanding above is indeed correct).
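
(For illustration only: a minimal sketch of the kind of manual workaround being discussed here, i.e. putting MALLOC_TRIM_THRESHOLD_ into the worker process's environment before its interpreter starts. The dask-worker command, scheduler address, and threshold value are placeholders for whatever the user actually runs; this is not the mechanism #6681 used.)

```python
import os
import subprocess

# Copy the current environment and add the glibc trim threshold. glibc reads
# this variable when the process initializes malloc, so it must already be in
# place before the worker's Python interpreter launches; setting os.environ
# inside an already-running worker has no effect on glibc's behavior.
env = dict(os.environ)
env["MALLOC_TRIM_THRESHOLD_"] = "65535"  # value from the comment above, illustrative only

# Placeholder launch command; the scheduler address and worker flags are assumptions.
subprocess.run(["dask-worker", "tcp://scheduler-host:8786"], env=env, check=True)
```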

@crusaderky (Collaborator) commented Jul 22, 2022

For example, I've never heard of anyone complaining about the deallocation issues

This is because it is frequently an issue of partial deallocations; in other words, it will "solve itself" once the worker has spilled everything to disk, which costs a two-order-of-magnitude slowdown.

Also, there's the fundamental problem of visibility. In case of OOM (past the terminate threshold), the user will receive a nebulous KilledWorker, with no immediate indication that the culprit was memory usage (this issue is already discussed elsewhere). Finally, I suspect that only a fraction of the users suffering from memory problems will spot that there are vast amounts of unmanaged memory, and only an even smaller fraction will be able to pin the cause to trimming. A lot more people will stop at "dask is very memory hungry".

feel free to correct me if I'm wrong, but if the user simply specifies MALLOC_TRIM_THRESHOLD_=65535 python (or whatever process other than python), that would have the same effect without any other side effects, am I correct?

Correct. This requires the user to investigate the problem and find the right page in the documentation. Yes, we're linking it from the OOM warnings in the log, but that still requires fishing the right lines out of the logs, which is a simple activity for you and me but not for someone who just wants to use dask as a tool and rightfully expects it to just work.

TL;DR: most users are not power users.

@pentschev (Member)

Correct. This requires the user to investigate the problem and find the right page in the documentation. Yes, we're linking it from the OOM warnings in the log, but that still requires fishing the right lines out of the logs, which is a simple activity for you and me but not for someone who just wants to use dask as a tool and rightfully expects it to just work.

TL;DR: most users are not power users.

I agree that this can be painful for users, and to be clear, we're not saying we should never do it. Our only ask is that we do so in a manner that isn't problematic for existing use cases; reverting now, before the release, buys us time to work on a solution that works for everyone. In other words, I still hold the opinion that we shouldn't accept changes that break known existing use cases in favor of usability changes; we should combine both. Given this is a new feature, I would argue for keeping what used to work working until we find a way to add the new feature without breaking previously functional use cases.

@jrbourbeau (Member) left a comment

@gjoseph92 could you fix the linting issue here?

@jakirkham (Member)

The lint failure appears to be unrelated to the files touched:

distributed/shuffle/shuffle_extension.py:534: error: Unused "type: ignore" comment
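
(For context, a rough sketch of what produces that mypy message, assuming the CI lint job runs mypy with unused-ignore warnings enabled; the function below is made up purely to illustrate.)

```python
# With mypy's --warn-unused-ignores (or warn_unused_ignores = True in the config),
# a "type: ignore" comment on a line that no longer suppresses any error is
# itself reported, which is the failure quoted above.

def add(a: int, b: int) -> int:
    return a + b

# This line type-checks cleanly, so the ignore is redundant and mypy reports:
#   error: Unused "type: ignore" comment
result: int = add(1, 2)  # type: ignore
```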

@jakirkham mentioned this pull request Jul 22, 2022
@jakirkham (Member)

Submitted PR ( #6779 ) to fix it. Should simplify reverting the revert later 🙂

@jakirkham (Member) commented Jul 22, 2022

Re-running the lint now that PR ( #6779 ) is merged.

Edit: Looks like the merge ref is out-of-date on CI. So this isn't going to work without closing/reopening. Idk if it is worth that level of testing here (given that it will restart all CI jobs).

@jakirkham requested a review from jrbourbeau July 22, 2022 19:32
@jrbourbeau (Member) left a comment

Based on the conversation here and in the original issue (#6749), I'm going to merge this PR in order to get dask-cuda working again. This is just a temporary revert and, as @gjoseph92 mentioned (#6749 (comment)), we should restore a version of #6681 that addresses the concerns that have been brought up here. I totally acknowledge that there's no ideal solution here and reverting is frustrating, but there seems to be broad consensus around merging this PR and then quickly getting #6681 restored.

Edit: Looks like the merge ref is out-of-date on CI. So this isn't going to work without closing/reopening. Idk if it is worth that level of testing here (given that it will restart all CI jobs).

Yeah, agreed we don't need to worry about the linting build here given the changes in #6779

@jrbourbeau (Member)

Opened up #6780 so we don't lose track of getting #6681 back into main.

@jacobtomlinson deleted the revert-6681-nanny_env_variables branch July 25, 2022 10:14
gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022
Linked issue: WorkerProcess leaks environment variables to parent process (#6749)