Performance regression in cluster memory usage #6833

Closed

ian-r-rose opened this issue Aug 4, 2022 · 3 comments · Fixed by #6841

Comments

@ian-r-rose (Collaborator) commented Aug 4, 2022

Recently, some of us have started tracking performance metrics for Dask clusters under a variety of usage patterns. The idea is to identify performance regressions before they are released (especially ones at scale that might not show up in a unit-test context).

An example of these metrics is at this static site. It's only been collecting results for a few days, but already we seem to have come across a significant regression in cluster memory usage. Here is a test which measures array rechunking:

import dask.array as da


def test_rechunk_in_memory(small_client):
    # `small_client` is a pytest fixture that provides a Dask client for the benchmark cluster
    x = da.random.random((50000, 50000))
    x.rechunk((50000, 20)).rechunk((20, 50000)).sum().compute()

and a screenshot of average cluster memory usage for that operation over the last week+:

[Screenshot: average cluster memory usage for test_rechunk_in_memory over the past week, jumping sharply around July 26]

(I encourage folks to click through; this same behavior appears on a lot of tests around July 26.)

The above is based on a Coiled cluster, but I've reproduced it using a LocalCluster with the following procedure:

  1. Create a software environment with nightly dask versions from the dask conda channel:
    conda create -n memory-regression python=3.9 dask distributed numpy
    conda activate memory-regression
    # Install nightly from July 22nd
    conda install https://conda.anaconda.org/dask/label/dev/noarch/dask-2022.7.1a220722-py_ga55bfd36_21.tar.bz2 https://conda.anaconda.org/dask/label/dev/noarch/distributed-2022.7.1a220722-py_ga55bfd36_21.tar.bz2
    # Or install nightly from July 25th
    conda install https://conda.anaconda.org/dask/label/dev/noarch/dask-2022.7.2a220725-py_g55cc1a50_1.tar.bz2 https://conda.anaconda.org/dask/label/dev/noarch/distributed-2022.7.2a220725-py_g55cc1a50_1.tar.bz2
  2. Run the following script:
import ctypes
import uuid

import dask.array as da
import distributed


sampler = distributed.diagnostics.MemorySampler()


def trim_memory() -> int:
    """Ask glibc to return freed memory to the OS (run on each worker)."""
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)


if __name__ == "__main__":
    print(distributed.__version__)
    client = distributed.Client()
    mems = []
    for i in range(20):
        label = str(uuid.uuid4())

        # Sample process memory across the cluster while the workload runs
        with sampler.sample(label=label, client=client, measure="process"):
            x = da.random.random((20000, 20000))
            x.rechunk((20000, 20)).rechunk((20, 20000)).sum().compute()

        # Record the mean cluster memory for this iteration
        df = sampler.to_pandas()
        mems.append(df[label].mean())

        # Release freed memory and restart workers so iterations are independent
        client.run(trim_memory)
        client.restart()

    print(mems)

This produces results like the following:

[Plot: mean cluster memory per iteration, noticeably higher on the July 25th nightly than on the July 22nd nightly]
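
(For anyone reproducing this locally, here is a minimal sketch of one way to compare the two runs; it is not part of the original procedure. It assumes the `mems` list printed by the script was dumped to a JSON file in each environment; the file names below are hypothetical.)

import json

import matplotlib.pyplot as plt

# Hypothetical file names: one JSON list of per-iteration mean memory per
# environment, e.g. produced by adding a `json.dump(mems, ...)` call to the
# end of the script above.
with open("mems-2022.7.1a220722.json") as f:
    mems_0722 = json.load(f)
with open("mems-2022.7.2a220725.json") as f:
    mems_0725 = json.load(f)

plt.plot(mems_0722, marker="o", label="nightly 2022-07-22")
plt.plot(mems_0725, marker="o", label="nightly 2022-07-25")
plt.xlabel("iteration")
plt.ylabel("mean cluster memory (bytes)")
plt.legend()
plt.show()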

Timing-wise, this suggests to me that #6728 might have had some unintended side effects on cluster memory usage, but I have not verified that, nor do I know how the effect could be so drastic.
Edit: see below.

@ian-r-rose (Collaborator, Author) commented Aug 4, 2022

Oh, right, it's clearly #6777.

Nice to see a consistent story, I guess (I did a bisect to confirm).
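
(For readers who want to reproduce the bisect, here is a rough sketch of one way to do it, not the exact commands from this thread. It assumes the reproduction script above is saved as repro.py and modified to exit nonzero when the mean memory exceeds a chosen threshold; the commit placeholders need to be filled in.)

git clone https://github.com/dask/distributed.git
cd distributed
git bisect start
git bisect bad <first-commit-showing-the-regression>    # e.g. the July 25th nightly
git bisect good <last-known-good-commit>                # e.g. the July 22nd nightly
git bisect run sh -c "python -m pip install -e . --quiet && python ../repro.py"
git bisect reset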

@gjoseph92 (Collaborator) commented

Indeed, #6777 is a pretty clear culprit!

Nice fire drill for the benchmarking setup. Clearly it works! Now we just need alerts.

@ian-r-rose (Collaborator, Author) commented

Coming back down after #6841:
[Screenshot: average cluster memory usage returning to its previous level after #6841]
