Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: service stopped when pivot a 1125138913x5 matrix into 4000 columns on a 160U-4096GBmem machine #741

Closed
huboqiang opened this issue Oct 13, 2023 · 1 comment
Labels
bug Something isn't working
Milestone

Comments

@huboqiang
Copy link

Describe the bug

Running pivot and service stopped:
image

To Reproduce

To help us to reproduce this bug, please provide information below:
image

  1. Your Python version 3.11.4
  2. The version of Xorbits you use 0.5.1
  3. Versions of crucial packages, such as numpy(1.25.2), scipy(1.11.2) and pandas(2.0.3)
  4. Full stack of the error.
……
……
……
/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/dataframe/base/pivot.py:72: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  input_data[dtype] = np.nan
/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/dataframe/base/pivot.py:72: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  input_data[dtype] = np.nan
/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/dataframe/base/pivot.py:72: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  input_data[dtype] = np.nan
2023-10-13 19:20:03,359 xorbits._mars.services.scheduling.worker.execution 337003 ERROR    Failed to run subtask n6Qo8tTl8A6Pn7uCwJHxGAEK on band numa-0
Traceback (most recent call last):
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 445, in _run_subtask_once
    return await asyncio.shield(aiotask)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/subtask/api.py", line 70, in run_subtask_in_slot
    return await ref.run_subtask.options(profiling_context=profiling_context).send(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 226, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
           ^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/core.py", line 84, in _listen
    raise ServerClosed(
xoscar.errors.ServerClosed: Remote server unixsocket:///361854215913472 closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 402, in internal_run_subtask
    subtask_info.result = await self._retry_run_subtask(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 513, in _retry_run_subtask
    return await _retry_run(subtask, subtask_info, _run_subtask_once)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 95, in _retry_run
    raise ex
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 73, in _retry_run
    return await target_async_func(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 491, in _run_subtask_once
    raise ex
xoscar.errors.ServerClosed: unexpectedly terminated process (Remote server unixsocket:///361854215913472 closed) with address 127.0.0.1:39737, which is highly suspected to be caused by an Out-of-Memory (OOM) problem
2023-10-13 19:20:03,454 xorbits._mars.services.task.execution.mars.stage 337003 ERROR    Subtask n6Qo8tTl8A6Pn7uCwJHxGAEK errored
Traceback (most recent call last):
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/scheduling/worker/execution.py", line 445, in _run_subtask_once
    return await asyncio.shield(aiotask)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xorbits/_mars/services/subtask/api.py", line 70, in run_subtask_in_slot
    return await ref.run_subtask.options(profiling_context=profiling_context).send(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 226, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
           ^^^^^^^^^^^^
  File "/cluster/home/bqhu_jh/share/miniconda3/envs/bigdata/lib/python3.11/site-packages/xoscar/backends/core.py", line 84, in _listen
    raise ServerClosed(
xoscar.errors.ServerClosed: Remote server unixsocket:///361854215913472 closed
……
……
……
  1. Minimized code to reproduce the error.
df = xpd.read_parquet("./merged.pq")
print(df.shape)   # (1125138913, 5)
df_pvt = df.pivot(index=[0,1,2], columns=[4], values=[3])
df_pvt

Expected behavior

A clear and concise description of what you expected to happen.

I think there were some bug for large scale data and the server closed.

Additional context

Add any other context about the problem here. No.

@XprobeBot XprobeBot added the bug Something isn't working label Oct 13, 2023
@XprobeBot XprobeBot added this to the v0.7.0 milestone Oct 13, 2023
@huboqiang huboqiang changed the title BUG: BUG: service stopped when pivot a 1125138913x5 matrix into 4000 columns on a 160U-4096GBmem machine Oct 13, 2023
@huboqiang
Copy link
Author

Using rechunk(chunk_size=10_000) can adjust partrition policy like spark repartrition.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants