Performance regression in cluster memory usage #6833

Closed

ian-r-rose opened this issue Aug 4, 2022 · 3 comments · Fixed by #6841

Comments

@ian-r-rose (Collaborator) commented Aug 4, 2022

Recently, some of us have started tracking performance metrics for Dask clusters under a variety of usage patterns. The idea is to identify performance regressions before they are released (especially ones at scale that might not show up in a unit-test context).

An example of these metrics is at this static site. It's only been collecting results for a few days, but already we seem to have come across a significant regression in cluster memory usage. Here is a test which measures array rechunking:

import dask.array as da


def test_rechunk_in_memory(small_client):
    # `small_client` is a pytest fixture that provides a Dask client for the benchmark cluster
    x = da.random.random((50000, 50000))
    x.rechunk((50000, 20)).rechunk((20, 50000)).sum().compute()

and a screenshot of average cluster memory usage for that operation over the last week+:

[Screenshot: average cluster memory usage for test_rechunk_in_memory over the past week, jumping sharply around July 26]

(I encourage folks to click through; this same behavior appears on a lot of tests around July 26.)

The above is based on a Coiled cluster, but I've reproduced it using a LocalCluster with the following procedure:

  1. Create a software environment with nightly dask versions from the dask conda channel:
    conda create -n memory-regression python=3.9 dask distributed numpy
    conda activate memory-regression
    # Install nightly from July 22nd
    conda install https://conda.anaconda.org/dask/label/dev/noarch/dask-2022.7.1a220722-py_ga55bfd36_21.tar.bz2 https://conda.anaconda.org/dask/label/dev/noarch/distributed-2022.7.1a220722-py_ga55bfd36_21.tar.bz2
    # Or install nightly from July 25th
    conda install https://conda.anaconda.org/dask/label/dev/noarch/dask-2022.7.2a220725-py_g55cc1a50_1.tar.bz2 https://conda.anaconda.org/dask/label/dev/noarch/distributed-2022.7.2a220725-py_g55cc1a50_1.tar.bz2
  2. Run the following script:
import ctypes
import uuid

import dask.array as da
import distributed


sampler = distributed.diagnostics.MemorySampler()


def trim_memory() -> int:
    """Ask glibc to return freed memory to the OS (run on each worker)."""
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)


if __name__ == "__main__":
    print(distributed.__version__)
    client = distributed.Client()
    mems = []
    for i in range(20):
        label = str(uuid.uuid4())

        # Sample process memory across the cluster while the workload runs
        with sampler.sample(label=label, client=client, measure="process"):
            x = da.random.random((20000, 20000))
            x.rechunk((20000, 20)).rechunk((20, 20000)).sum().compute()

        # Record the mean cluster memory for this iteration
        df = sampler.to_pandas()
        mems.append(df[label].mean())

        # Release freed memory and restart workers so iterations are independent
        client.run(trim_memory)
        client.restart()

    print(mems)

This produces results like the following:

[Plot: mean cluster memory per iteration, noticeably higher on the July 25th nightly than on the July 22nd nightly]
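
(For anyone reproducing this locally, here is a minimal sketch of one way to compare the two runs; it is not part of the original procedure. It assumes the `mems` list printed by the script was dumped to a JSON file in each environment; the file names below are hypothetical.)

import json

import matplotlib.pyplot as plt

# Hypothetical file names: one JSON list of per-iteration mean memory per
# environment, e.g. produced by adding a `json.dump(mems, ...)` call to the
# end of the script above.
with open("mems-2022.7.1a220722.json") as f:
    mems_0722 = json.load(f)
with open("mems-2022.7.2a220725.json") as f:
    mems_0725 = json.load(f)

plt.plot(mems_0722, marker="o", label="nightly 2022-07-22")
plt.plot(mems_0725, marker="o", label="nightly 2022-07-25")
plt.xlabel("iteration")
plt.ylabel("mean cluster memory (bytes)")
plt.legend()
plt.show()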

Timing-wise, this suggests to me that #6728 might have had some unintended side effects on cluster memory usage, but I have not verified that, nor do I know how the effect could be so drastic.
Edit: see below.

@ian-r-rose (Collaborator, Author) commented Aug 4, 2022

Oh, right, it's clearly #6777.

Nice to see a consistent story, I guess (I did a bisect to confirm).
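
(For readers who want to reproduce the bisect, here is a rough sketch of one way to do it, not the exact commands from this thread. It assumes the reproduction script above is saved as repro.py and modified to exit nonzero when the mean memory exceeds a chosen threshold; the commit placeholders need to be filled in.)

git clone https://github.com/dask/distributed.git
cd distributed
git bisect start
git bisect bad <first-commit-showing-the-regression>    # e.g. the July 25th nightly
git bisect good <last-known-good-commit>                # e.g. the July 22nd nightly
git bisect run sh -c "python -m pip install -e . --quiet && python ../repro.py"
git bisect reset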

@gjoseph92 (Collaborator) commented

Indeed, #6777 is a pretty clear culprit!

Nice fire drill for the benchmarking setup. Clearly it works! Now we just need alerts.

@ian-r-rose (Collaborator, Author) commented

Coming back down after #6841:
[Screenshot: average cluster memory usage returning to its previous level after #6841]
