Large DataFrame in WASM causes infinite loop #3599

Open
jhgjeraker opened this issue Jan 28, 2025 · 11 comments
Labels
bug Something isn't working

Comments

@jhgjeraker

Describe the bug

I encountered some unexpected behavior while attempting to upsample a DataFrame by a large amount. I can reliably reproduce the behavior in WebAssembly notebooks, but never in non-WebAssembly notebooks. I have verified that this occurs both in Community Cloud notebooks and in notebooks generated using the "Create WebAssembly link" functionality.

Problem Description
When upsampling a polars DataFrame from ~8,000 rows to ~32,000,000 rows in a WASM notebook, the cell usually runs fine the first time, but after rerunning it a few times it suddenly gets caught in an infinite loop and never completes.

Reproducibility
I've included reproducible code below, but here is also a permalink to a notebook that reproduces the problem.

https://marimo.app/l/yl88gn

Try re-running the last cell a few times to trigger the bug. Note that reducing the target sample rate, e.g. using every="15m" instead, reduces and at some point eliminates the problem.
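For a sense of scale, here is a rough row-count estimate. It is a back-of-the-envelope sketch that assumes the 2024-01-01 to 2025-01-01 hourly range used in the repro code below (2024 is a leap year, so 366 days), not output from the notebook itself.

days = 366
hourly_rows = days * 24 + 1          # ~8.8k rows in the source frame ("60m")
per_second_rows = days * 86_400 + 1  # ~31.6M rows after upsampling to "1s"
per_15m_rows = days * 96 + 1         # ~35k rows with every="15m"
print(hourly_rows, per_second_rows, per_15m_rows)  # 8785 31622401 35137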

Environment

{
  "marimo": "0.10.17",
  "OS": "Darwin",
  "OS Version": "24.2.0",
  "Processor": "arm",
  "Python Version": "3.13.1",
  "Binaries": {
    "Browser": "132.0.6834.111",
    "Node": "v14.21.3"
  },
  "Dependencies": {
    "click": "8.1.8",
    "docutils": "0.21.2",
    "itsdangerous": "2.2.0",
    "jedi": "0.19.2",
    "markdown": "3.7",
    "narwhals": "1.24.0",
    "packaging": "24.2",
    "psutil": "6.1.1",
    "pygments": "2.19.1",
    "pymdown-extensions": "10.14.1",
    "pyyaml": "6.0.2",
    "ruff": "0.9.3",
    "starlette": "0.45.3",
    "tomlkit": "0.13.2",
    "typing-extensions": "missing",
    "uvicorn": "0.34.0",
    "websockets": "14.2"
  },
  "Optional Dependencies": {
    "polars": "1.21.0"
  },
  "Experimental Flags": {}
}

Code to reproduce

import marimo

__generated_with = "0.10.12"
app = marimo.App()


@app.cell
def _():
    import datetime
    import tzdata

    import marimo as mo
    import polars as pl
    return datetime, mo, pl, tzdata


@app.cell(hide_code=True)
def _(mo):
    mo.md("""# Generate test data""")
    return


@app.cell
def _(datetime, pl):
    def gen_test_df():
        timestamps = pl.datetime_range(
            start=datetime.datetime(2024, 1, 1, tzinfo=datetime.UTC),
            end=datetime.datetime(2025, 1, 1, tzinfo=datetime.UTC),
            interval="60m",
            eager=True,
        )
        return pl.DataFrame(
            data={
                "timestamp": timestamps,
                "value": pl.Series([i % 2 for i in range(len(timestamps))]),
            },
        )


    df = gen_test_df()
    print(df)
    return df, gen_test_df


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        """
        # Resample DataFrame
        In a WASM notebook, run the following cell multiple times. It should run fine a few times, but then suddenly hang in an infinite loop.
        """
    )
    return


@app.cell
def _(df, pl):
    def resample(df: pl.DataFrame, every: str) -> pl.DataFrame:
        return (
            df.group_by_dynamic(
                index_column="timestamp",
                every=every,
            )
            .agg(
                pl.col("value").last(),
            )
            .upsample(time_column="timestamp", every=every)
            .fill_null(strategy="forward")
        )


    res_df = resample(df, every="1s")
    print(res_df)
    return res_df, resample


if __name__ == "__main__":
    app.run()
jhgjeraker added the bug label Jan 28, 2025
@mscolnick
Contributor

Thanks for the example. It does render the first time, but fails for me even on the first re-run.
It seems to fail to allocate 126489604 bytes (~126 MB). That should be well under the WASM memory limit browsers provide (2 GB).

[screenshot: memory allocation failure traceback]

I'll continue to investigate, but tagging the pyodide maintainers if they have ideas (@hoodmane, @ryanking13, @agriyakhetarpal)

@hoodmane

Are you loading Pyodide from jsdelivr? Can you use the debug build so we can get symbols in the traceback?

@hoodmane

It is of course the prerogative of V8 isolates to say no to allocations of any size. I'm not personally familiar with the chromium code that determines how much memory a webpage is allowed to allocate but I think it's a bit complicated.

@mscolnick
Contributor

@hoodmane, yeah, from jsDelivr. Is there a debug build hosted on jsDelivr? Is that with the dev version or something else?

@hoodmane

@mscolnick
Contributor

@hoodmane, I ran this with the debug build locally and did not get additional logging or info.

[screenshot: traceback from the debug build]

@mscolnick
Contributor

mscolnick commented Jan 28, 2025

If it helps, I can reproduce this in the Pyodide REPL as well:

First paste this code in:

import datetime
import tzdata

import polars as pl

def gen_test_df():
    timestamps = pl.datetime_range(
        start=datetime.datetime(2024, 1, 1, tzinfo=datetime.UTC),
        end=datetime.datetime(2025, 1, 1, tzinfo=datetime.UTC),
        interval="60m",
        eager=True,
    )
    return pl.DataFrame(
        data={
            "timestamp": timestamps,
            "value": pl.Series([i % 2 for i in range(len(timestamps))]),
        },
    )


df = gen_test_df()
print(df)

def resample(df: pl.DataFrame, every: str) -> pl.DataFrame:
    return (
        df.group_by_dynamic(
            index_column="timestamp",
            every=every,
        )
        .agg(
            pl.col("value").last(),
        )
        .upsample(time_column="timestamp", every=every)
        .fill_null(strategy="forward")
    )


res_df = resample(df, every="1s")
print(res_df)

Then run resample(df, every="1s") one to a few more times.

@hoodmane

Right, this is a problem that bites us occasionally: we only use debug symbols for the Python interpreter, not for packages. The traceback you have is in polars frames, so we would need a debug build of polars. But the Emscripten polars is built against a fork of LLVM, so I'm not even really sure how to make one myself. Annoying.

But presumably we should be able to set a breakpoint in the sbrk call and see that the Memory.grow() call is being declined by Chromium. Then, when sbrk returns ENOMEM, Rust decides to abort. I think these failures can in effect be reduced to a series of Memory.grow() calls that eventually raise a RangeError. If we can reproduce it like that, it essentially proves that the problem is with the browser and not with Pyodide. If I can reproduce it, I'll make a build with instrumentation around sbrk and check whether this is the case.
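A minimal sketch of that reduction, assuming it runs inside Pyodide (so the js module and pyodide.ffi are available): it grows a standalone WebAssembly.Memory one 64 KiB page at a time until the engine declines the allocation.

import js
from pyodide.ffi import JsException, to_js

# Fresh WebAssembly.Memory starting at a single 64 KiB page, no maximum.
mem = js.WebAssembly.Memory.new(
    to_js({"initial": 1}, dict_converter=js.Object.fromEntries)
)
pages = 1
try:
    while True:
        mem.grow(1)  # ask the engine for one more page
        pages += 1
except JsException as exc:  # a JS RangeError surfaces as JsException in Python
    print(f"grow() declined after {pages} pages (~{pages * 64 // 1024} MiB): {exc}")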

@ryanking13

ryanking13 commented Jan 29, 2025

> If it helps, I can reproduce this in the Pyodide REPL as well:

This code doesn't break in my browser (Chrome 131, Mac M1 Max), but I noticed that it consumes a lot of memory (~3GB).

[screenshot: memory usage around 3 GB]

@ryanking13

Never mind. I could reproduce it by calling resample(df, every="1s") again.

@jhgjeraker
Author

I have tested some more and am able to reproduce the problem with a small DataFrame of only ~8k rows (an estimated 0.13 MB), where it consistently fails on the 8th run instead of the 2nd (counting the initial run).

To reproduce the issue with a small DataFrame, use the originally included code, but comment out upsample and fill_null in the resample() function so that you get the following.

def resample(df: pl.DataFrame, every: str) -> pl.DataFrame:
    return (
        df.group_by_dynamic(
            index_column="timestamp",
            every=every,
        )
        .agg(
            pl.col("value").last(),
        )
        # .upsample(time_column="timestamp", every=every)
        # .fill_null(strategy="forward")
    )

Running resample(df, every="1s") now consistently fails for me on the 8th run. The DataFrame never changes size from ~8k rows. The memory allocation error in the console appears to be the same.
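A small loop makes it easier to count runs than re-running the cell by hand. This is just a sketch to paste after the definitions above; whether a single-cell loop behaves identically to manual cell re-runs is an assumption I have not verified.

# Call resample() repeatedly and watch for the run that hangs or aborts.
for i in range(10):
    res = resample(df, every="1s")
    print(f"run {i + 1}: shape={res.shape}")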
