
Revert "Set MALLOC_TRIM_THRESHOLD_ before interpreter start" #6777

Merged
merged 1 commit into from
Jul 22, 2022

Conversation

@gjoseph92 (Collaborator)

Reverts #6681

Closes #6749

cc @pentschev. I know you've found a workaround for dask-cuda in rapidsai/dask-cuda#955, but I'm kind of inclined to revert this anyway and then spend a little time implementing a better solution. Given the unexpected impact this had on dask-cuda, I'm worried it might break things for others too. We always knew the current implementation was a bit hacky anyway.

@github-actions bot (Contributor) commented Jul 21, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files (±0), 15 suites (±0), 6h 16m 53s duration (-6m 4s)
2 976 tests (-1): 2 884 passed (-3), 88 skipped (±0), 3 failed (+1), 1 errored (+1)
22 064 runs (-8): 21 029 passed (-6), 1 029 skipped (-6), 5 failed (+3), 1 errored (+1)

For more details on these failures and errors, see this check.

Results for commit 277a397. Comparison against base commit add3663.

♻️ This comment has been updated with latest results.

@pentschev (Member) left a comment

Sounds good to me, thanks @gjoseph92 !

@crusaderky (Collaborator)

I'm against this revert. The number of users benefiting from the automatic MALLOC_TRIM_THRESHOLD_ vastly, vastly outnumbers the number of users who need different env variables in different workers.

@pentschev (Member)

The number of users benefiting from the automatic MALLOC_TRIM_THRESHOLD_ vastly, vastly outnumbers the number of users who need different env variables in different workers.

Can you provide evidence to support this claim? For example, I've never heard of anyone complaining about the deallocation issues that MALLOC_TRIM_THRESHOLD_ is trying to work around for Dask-CUDA, even though all Dask-CUDA users rely on environment variables that differ among workers. Also, feel free to correct me if I'm wrong, but if the user simply runs MALLOC_TRIM_THRESHOLD_=65535 python (or whatever the launched process is, if not python), that would have the same effect without any other side effects, am I correct?

In particular, I dislike the claim that it's fine to keep things we know are broken because the change provides an automatic benefit for some users who could achieve the same behavior with one small change to their workflows (provided my understanding above is indeed correct).
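
(For illustration only: a minimal sketch of the kind of manual workaround being discussed here, i.e. putting MALLOC_TRIM_THRESHOLD_ into the worker process's environment before its interpreter starts. The dask-worker command, scheduler address, and threshold value are placeholders for whatever the user actually runs; this is not the mechanism #6681 used.)

```python
import os
import subprocess

# Copy the current environment and add the glibc trim threshold. glibc reads
# this variable when the process initializes malloc, so it must already be in
# place before the worker's Python interpreter launches; setting os.environ
# inside an already-running worker has no effect on glibc's behavior.
env = dict(os.environ)
env["MALLOC_TRIM_THRESHOLD_"] = "65535"  # value from the comment above, illustrative only

# Placeholder launch command; the scheduler address and worker flags are assumptions.
subprocess.run(["dask-worker", "tcp://scheduler-host:8786"], env=env, check=True)
```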

@crusaderky (Collaborator) commented Jul 22, 2022

For example, I've never heard of anyone complaining about the deallocation issues

This is because it is frequently an issue of partial deallocations; in other words, it will "solve itself" once the worker has spilled everything to disk, which costs a two-order-of-magnitude slowdown.

Also, there's the fundamental problem of visibility. In case of OOM (past the terminate threshold), the user will receive a nebulous KilledWorker, with no immediate indication that the culprit was memory usage (this issue is already discussed elsewhere). Finally, I suspect that only a fraction of the users suffering from memory problems will spot that there are vast amounts of unmanaged memory, and only an even smaller fraction will be able to pin the cause to trimming. A lot more people will stop at "dask is very memory hungry".

feel free to correct me if I'm wrong, but if the user simply specifies MALLOC_TRIM_THRESHOLD_=65535 python (or whatever process other than python), that would have the same effect without any other side effects, am I correct?

Correct. This requires the user to investigate the problem and find the right page in the documentation. Yes, we're linking it from the OOM warnings in the log, but that still requires fishing the right lines out of the logs, which is a simple activity for you and me but not for someone who just wants to use dask as a tool and rightfully expects it to just work.

TL;DR: most users are not power users.

@pentschev (Member)

Correct. This requires the user to investigate the problem and find the right page in the documentation. Yes, we're linking it from the OOM warnings in the log, but that still requires fishing the right lines out of the logs, which is a simple activity for you and me but not for someone who just wants to use dask as a tool and rightfully expects it to just work.

TL;DR: most users are not power users.

I agree that this can be painful for users, and to be clear, we're not saying we should never do it. Our only ask is that we do so in a manner that isn't problematic for existing use cases; reverting now, before the release, buys us time to work on a solution that works for everyone. In other words, I still hold the opinion that we shouldn't accept changes that break known existing use cases in favor of usability changes; we should combine both. Given this is a new feature, I would argue for keeping what used to work working until we find a way to add the new feature without breaking previously functional use cases.

@jrbourbeau (Member) left a comment

@gjoseph92 could you fix the linting issue here?

@jakirkham (Member)

The lint failure appears to be unrelated to the files touched:

distributed/shuffle/shuffle_extension.py:534: error: Unused "type: ignore" comment
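
(For context, a rough sketch of what produces that mypy message, assuming the CI lint job runs mypy with unused-ignore warnings enabled; the function below is made up purely to illustrate.)

```python
# With mypy's --warn-unused-ignores (or warn_unused_ignores = True in the config),
# a "type: ignore" comment on a line that no longer suppresses any error is
# itself reported, which is the failure quoted above.

def add(a: int, b: int) -> int:
    return a + b

# This line type-checks cleanly, so the ignore is redundant and mypy reports:
#   error: Unused "type: ignore" comment
result: int = add(1, 2)  # type: ignore
```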

@jakirkham mentioned this pull request Jul 22, 2022
@jakirkham (Member)

Submitted PR ( #6779 ) to fix it. Should simplify reverting the revert later 🙂

@jakirkham (Member) commented Jul 22, 2022

Re-running the lint now that PR ( #6779 ) is merged.

Edit: Looks like the merge ref is out-of-date on CI. So this isn't going to work without closing/reopening. Idk if it is worth that level of testing here (given that it will restart all CI jobs).

@jakirkham requested a review from jrbourbeau July 22, 2022 19:32
@jrbourbeau (Member) left a comment

Based on the conversation here and in the original issue (#6749), I'm going to merge this PR in order to get dask-cuda working again. This is just a temporary revert and, as @gjoseph92 mentioned (#6749 (comment)), we should restore a version of #6681 that addresses the concerns that have been brought up here. I totally acknowledge that there's no ideal solution here and reverting is frustrating, but there seems to be broad consensus around merging this PR and then quickly getting #6681 restored.

Edit: Looks like the merge ref is out-of-date on CI. So this isn't going to work without closing/reopening. Idk if it is worth that level of testing here (given that it will restart all CI jobs).

Yeah, agreed we don't need to worry about the linting build here given the changes in #6779

@jrbourbeau (Member)

Opened up #6780 so we don't lose track of getting #6681 back into main.

@jacobtomlinson deleted the revert-6681-nanny_env_variables branch July 25, 2022 10:14
gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022
Linked issue: WorkerProcess leaks environment variables to parent process (#6749)