-
-
Notifications
You must be signed in to change notification settings - Fork 723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert "Set MALLOC_TRIM_THRESHOLD_ before interpreter start" #6777
Conversation
This reverts commit 2fcd520.
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 15 files ±0 15 suites ±0 6h 16m 53s ⏱️ - 6m 4s For more details on these failures and errors, see this check. Results for commit 277a397. ± Comparison against base commit add3663. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, thanks @gjoseph92 !
I'm against this revert. The amount of users benefiting from the automatic MALLOC_TRIM_THRESHOLD_ vastly, vastly outnumbers the number of users that need different env variables in different workers. |
Can you provide evidence to sustain this claim? For example, I've never heard of anyone complaining about the deallocation issues that Particularly, I dislike the claim that it's fine to keep things broken that we know are broken because it now provides an automatic benefit for some users who could achieve the same behavior by one small change to their workflows (provided my understanding above is indeed correct). |
This is because, frequently, this is an issue of partial deallocations; in other words it will "solve itself" once the worker has spilled everything to disk - which will cost a 2 order-of-magnitude slowdown. Also there's the fundamental problem of visibility. In case of OOM (pass the terminate threshold), the user will receive a nebulous KilledWorker, with no immediate indication that the culprit was memory usage (this issue is already discussed elsewhere). Finally, I suspect the that only a fraction of the users suffering from memory problems will spot that there are vast amounts of unmanaged memory, and only an even smaller fraction will be able to pin the cause to trimming. A lot more people will stop at "dask is very memory hungry".
Correct. This requires the user to investigate the problem and find the right page on the documentation. Yes we're linking it from the OOM warnings in the log, but that still requires fishing the right lines from the logs, which is a simple activity for you and me but not for someone who just wants to use dask as a tool and has a rightful expectation for it to just work. TL;DR; most users are not power users. |
I agree that this can be painful for users, and to be clear we're not saying we shouldn't ever do it. Our only ask is that we do so in a manner that will not be problematic for existing use cases, so reverting now before the release would buy us time to work on a solution that would work for everyone. IOW, I still hold my opinion that we shouldn't be ok with changes that break known existing use cases in favor usability changes, we should combine both, and given this is a new feature I would argue in favor of keeping what used to work still working until we come up with a solution to add the new feature that won't break previously functional use cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gjoseph92 could you fix the linting issue here?
The lint appears to be unrelated to the files touched
|
Submitted PR ( #6779 ) to fix it. Should simplify reverting the revert later 🙂 |
Re-running the lint now that PR ( #6779 ) is merged. Edit: Looks like the merge ref is out-of-date on CI. So this isn't going to work without closing/reopening. Idk if it is worth that level of testing here (given that will restart all CI jobs). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the conversation here an in the original issue (#6749) I'm going to merge this PR in order to get dask-cuda
working again. This is just a temporary revert and, as @gjoseph92 mentioned (#6749 (comment)), we should restore a version of #6681 that addresses the concerns that have been brought up here. Totally acknowledge that there's not an ideal solution here and reverting is frustrating, but there seems to be broader consensus around merging this PR and then quickly getting #6681 restored.
Edit: Looks like the merge ref is out-of-date on CI. So this isn't going to work without closing/reopening. Idk if it is worth that level of testing here (given that will restart all CI jobs).
Yeah, agreed we don't need to worry about the linting build here given the changes in #6779
This reverts commit 2fcd520.
Reverts #6681
Closes #6749
cc @pentschev. I know you've found a workaround for dask-cuda in rapidsai/dask-cuda#955, but I'm kind of inclined to revert this anyway and then spend a little time implementing a better solution. Given the unexpected impact this had on dask-cuda, I'm worried it might break things for others too. We always knew the current implementation was a bit hacky anyway.