WorkerProcess leaks environment variables to parent process #6749
Comments
@pentschev for those not familiar, what's the actual impact of As you mention, it's difficult to do this properly. Just trying to understand the tradeoff right now being |
This will overwrite what the user specifies for the Dask client. For example, one may launch something like this:

$ CUDA_VISIBLE_DEVICES=1 python
>>> from dask_cuda import LocalCUDACluster
>>> from dask.distributed import Client
>>> cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=[0,1])
>>> client = Client(cluster)  # This is expected to run on CUDA_VISIBLE_DEVICES=1 as set when launching the process, but is now random

The problem in the above is that we now leave non-deterministic, impossible-to-ensure behavior for the user, unless the user overwrites that before launching the
Because this will definitely break some users' code in the wild, I'm inclined to say this is a critical issue, especially given that we're approaching the RAPIDS code freeze and release as well, and I'm not certain we have time to wait for another Dask release. For the pytests I can temporarily work around the issue with a fixture; I might be able to do so with
I'm also interested in hearing folks' opinions about the lock idea, which is perhaps not too difficult to implement. |
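For readers less familiar with the mechanics, here is a minimal, self-contained sketch of the behavior being described; this is plain multiprocessing rather than Dask code, and the variable choice only mirrors the example above:

```python
import multiprocessing as mp
import os


def worker():
    # The child sees whatever was in the parent's environment at spawn time.
    print("child sees:", os.environ.get("CUDA_VISIBLE_DEVICES"))


if __name__ == "__main__":
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # what the user set when launching

    ctx = mp.get_context("spawn")
    for dev in ("0", "3"):
        # Mutating the parent so the spawned child inherits the value
        # (roughly the pattern under discussion) and never reverting it.
        os.environ["CUDA_VISIBLE_DEVICES"] = dev
        p = ctx.Process(target=worker)
        p.start()
        p.join()

    # The user's original setting is gone: the parent now reports "3".
    print("parent now sees:", os.environ.get("CUDA_VISIBLE_DEVICES"))
```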
I played a bit with a (hacky) lock implementation, and it seems like the overall time increases dramatically for Dask-CUDA, from ~5 seconds up to 22 seconds on a machine with 8 workers (i.e., GPUs):

Without Lock

In [1]: from dask_cuda import LocalCUDACluster
In [2]: %time cluster = LocalCUDACluster()
2022-07-20 13:35:28,329 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:28,329 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:35:28,342 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:28,342 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:35:28,343 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:28,343 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:35:28,347 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:28,347 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:35:28,364 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:28,364 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:35:28,365 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:28,365 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:35:28,367 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:28,367 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:35:28,369 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:28,369 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
CPU times: user 409 ms, sys: 170 ms, total: 579 ms
Wall time: 5.05 s With LockIn [1]: from dask_cuda import LocalCUDACluster
In [2]: %time cluster = LocalCUDACluster()
2022-07-20 13:35:49,058 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:49,058 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:35:52,016 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:52,016 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:35:54,712 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:54,712 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:35:57,360 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:35:57,360 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:36:00,034 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:36:00,034 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:36:02,640 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:36:02,640 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:36:05,372 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:36:05,372 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-07-20 13:36:08,073 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2022-07-20 13:36:08,074 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
CPU times: user 19.7 s, sys: 1.95 s, total: 21.6 s
Wall time: 22.2 s With the above, it's unlikely this idea is feasible in practice as well. I'm not really sure what else could be done, perhaps we have to store |
@pentschev the lock doesn't seem like a great idea to me. I'd probably look into doing a double-fork/double-spawn (maybe ideally fork, then spawn, for speed?), setting the vars in the first process, then using the second process as the actual worker. (There are some fun nuances there with orphaned processes in POSIX as well.) Unfortunately, I don't think multiprocessing has any hooks that let you run some code after the
Either way though:
|
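A hedged sketch of the double-spawn idea above (names are placeholders): an intermediate process mutates its own os.environ, leaving the parent untouched, and then launches the real worker, which inherits the variables. The orphan/lifetime handling mentioned above is glossed over here.

```python
import multiprocessing as mp
import os


def _worker():
    print("worker sees:", os.environ.get("CUDA_VISIBLE_DEVICES"))


def _intermediate(env):
    # This runs in a separate process, so updating os.environ here does not
    # leak into the original parent.
    os.environ.update(env)
    child = mp.get_context("spawn").Process(target=_worker)
    child.start()
    child.join()  # real code would need to manage the child's lifetime/orphaning


if __name__ == "__main__":
    outer = mp.get_context("spawn").Process(
        target=_intermediate, args=({"CUDA_VISIBLE_DEVICES": "3"},)
    )
    outer.start()
    outer.join()
    # The parent's environment is untouched (prints None unless set at launch).
    print("parent sees:", os.environ.get("CUDA_VISIBLE_DEVICES"))
```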
I've gone ahead and surfaced this issue in the release planning issue ( dask/community#263 (comment) ) |
Thanks John, I'm still trying to get confirmation from folks on whether this is a critical issue or not. In the meantime I'm trying to work around it in rapidsai/dask-cuda#955, but unsuccessfully so far. If it's something we can't really fix, maybe it would be best to revert #6681 so we can buy a bit more time to find a proper fix or alternative. |
Wonder if we can just add an option to the config to enable/disable setting environment variables before process spawning and check for it here. Alternatively, we could have a special set of environment variables that we only set this way (as opposed to treating them all this way).

distributed/distributed/nanny.py, lines 674 to 676 in add3663
|
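As a sketch of what such an opt-out could look like (the config key below is invented purely for illustration and is not an existing Dask setting):

```python
import os

import dask.config


def maybe_set_pre_spawn_environ(env: dict) -> None:
    # Hypothetical setting: when disabled, skip mutating the parent's
    # environment and leave it to the worker to apply `env` after it starts.
    if dask.config.get("distributed.nanny.pre-spawn-environ-enabled", default=True):
        os.environ.update(env)  # current behavior: child inherits at spawn time
```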
I think I've now been able to work around that in rapidsai/dask-cuda#955. I'll wait for confirmation until tomorrow morning, but unless some other problem emerges, we should be fine with the release going out as is. |
Thanks for the update Peter! 🙏 |
@pentschev I've also opened #6777 to revert. As I said over there, I'm kind of inclined to revert even though you have a fix—it just seems like bad behavior, and a weird thing to have to work around. |
I'm against the revert. The number of users benefiting from the automatic MALLOC_TRIM_THRESHOLD_ vastly, vastly outnumbers the number of users that need different env variables in different workers. I think the ugly but easy way forward is to have two dicts of env variables, one set before fork and another set afterwards. I understand this would fix the OP's issue, relying on the fact that the dask-cuda worker first calls |
Just for completeness, I left my two cents regarding this particular claim in #6777 (comment).
This causes a race condition if you have multiple workers starting up at the same time, so you would still need to do it in a central place, like |
Actually, let me clarify this comment a bit. The ideal fix would be to avoid the race condition entirely, which would require some sort of lock, as I mentioned trying out in #6749 (comment), but that dramatically increases cluster start time. The workaround in rapidsai/dask-cuda#955 only reverts the leaked |
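For illustration only, reverting leaked variables after startup can be done with a small helper like the one below. This is a sketch of the general approach, not the actual dask-cuda patch, and it does not remove the startup race itself:

```python
import contextlib
import os


@contextlib.contextmanager
def preserve_environ(*names):
    """Snapshot selected environment variables and restore them on exit."""
    saved = {name: os.environ.get(name) for name in names}
    try:
        yield
    finally:
        for name, value in saved.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value


# Example usage: any CUDA_VISIBLE_DEVICES value leaked into the parent while
# the cluster spins up is reverted once the block exits.
# with preserve_environ("CUDA_VISIBLE_DEVICES"):
#     cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=[0, 1])
```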
Let me rephrase my suggestion. I apologise in advance for the lack of examples and links, but I'm writing this from my phone in a hotel room. There would be two dicts in distributed.yaml,
|
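To make the shape of that proposal concrete, here is a very rough sketch in which the dict names are placeholders rather than final config keys: variables in pre_spawn_environ are set in the parent so the spawned interpreter inherits them early (e.g. MALLOC_TRIM_THRESHOLD_), while post_spawn_environ is applied inside the child after it starts, so per-worker values like CUDA_VISIBLE_DEVICES never touch the parent.

```python
import multiprocessing as mp
import os


def _child_entry(post_spawn_environ):
    # Applied inside the child only; the parent's os.environ never sees these.
    os.environ.update(post_spawn_environ)
    print("child:", os.environ.get("MALLOC_TRIM_THRESHOLD_"),
          os.environ.get("CUDA_VISIBLE_DEVICES"))


def start_worker(pre_spawn_environ, post_spawn_environ):
    # Pre-spawn variables must live in the parent's environment so the spawned
    # interpreter inherits them before any allocations happen. This still
    # touches the parent, which is the accepted compromise for shared variables.
    os.environ.update(pre_spawn_environ)
    proc = mp.get_context("spawn").Process(
        target=_child_entry, args=(post_spawn_environ,)
    )
    proc.start()
    return proc


if __name__ == "__main__":
    start_worker(
        {"MALLOC_TRIM_THRESHOLD_": "65536"},
        {"CUDA_VISIBLE_DEVICES": "2"},
    ).join()
```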
No need to apologize, I appreciate you taking the time to discuss this issue. I would be fine with the proposed solution provided no leakage occurs, which seems feasible. If we can demonstrate that this doesn't leak environment variables, that would be awesome. However, I'm not sure if there's time to get that in before the release, so if there's no objection it would be great to merge #6777 and revert the changes so that the release can go smoothly, and then work on this new proposal. Is that a reasonable tradeoff? |
To be clear, I'm definitely not suggesting reverting and leaving it reverted. I was only suggesting reverting for today to make the release, then in the next week or two adding a different solution we're all happy with (like @crusaderky's proposal). It feels like a safer path to me, since we know it won't break things for other users using env vars in similar ways, even though it would delay getting |
Appreciate everyone taking the time to follow up here 🙂
Indeed this is one of the alternatives I was looking at above ( #6749 (comment) ). Another would just be a flag to enable/disable pre-spawn env var setting. No strong feelings on either. Back to Gabe's point though, we are releasing today, so unless this is getting fixed today, it makes sense to revert and try again post-release. Having been on the other side of this before, I can totally appreciate that this can be frustrating, though I think everyone here is interested in finding a better solution we all agree on in the medium term 👍 |
Re-adding this change is now tracked in issue #6780.
Since #6681, WorkerProcess leaks the environment specified via the env kwarg, for example the CUDA_VISIBLE_DEVICES variable we use in Dask-CUDA.

Before #6681

After #6681

What happens now is that os.environ.update(self.env) is called from the parent process and never reverted. One of the issues this causes is leaking environment variables between pytests. Furthermore, if multiple workers are created they may overwrite each other's variables (I'm not sure if a cluster can create WorkerProcesses with different environment variables, so this may be a non-issue).

This problem has been discussed at length in the past in #3682; it is a difficult problem to tackle from Python, given that any newly-spawned process must inherit environment variables from the parent process. One of the suggestions in #3682 (comment) was to create a lock to ensure multiple workers don't spawn simultaneously, which would likely increase the spawn time a bit but seems to be the only safe option in that situation.
Any thoughts here @crusaderky (original author of #6681)?
cc'ing @quasiben @kkraus14 @mrocklin for vis as well, who were active on the #3682 discussion.