Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF-#4777: Don't use copy=True parameter for concat calls inside to_pandas #4778

Merged
merged 4 commits into from
Nov 20, 2023

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Aug 4, 2022

Signed-off-by: Myachev [email protected]

What do these changes do?

  • commit message follows format outlined here
  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves PERF: don't use copy=True parameter for concat calls inside to_pandas #4777
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@codecov
Copy link

codecov bot commented Aug 4, 2022

Codecov Report

Merging #4778 (5aae430) into master (eee5f43) will increase coverage by 5.23%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4778      +/-   ##
==========================================
+ Coverage   84.63%   89.86%   +5.23%     
==========================================
  Files         265      266       +1     
  Lines       19626    19909     +283     
==========================================
+ Hits        16610    17892    +1282     
+ Misses       3016     2017     -999     
Impacted Files Coverage Δ
...dataframe/pandas/partitioning/partition_manager.py 89.25% <ø> (+3.88%) ⬆️
modin/core/dataframe/pandas/utils.py 100.00% <100.00%> (ø)
modin/logging/config.py 94.59% <0.00%> (-1.30%) ⬇️
modin/experimental/batch/test/test_pipeline.py 92.75% <0.00%> (ø)
modin/pandas/groupby.py 93.77% <0.00%> (+0.23%) ⬆️
modin/pandas/series.py 94.23% <0.00%> (+0.24%) ⬆️
modin/pandas/series_utils.py 99.43% <0.00%> (+0.56%) ⬆️
...odin/core/storage_formats/pandas/query_compiler.py 96.59% <0.00%> (+0.56%) ⬆️
modin/core/io/text/excel_dispatcher.py 94.01% <0.00%> (+0.85%) ⬆️
modin/experimental/xgboost/xgboost.py 87.61% <0.00%> (+0.95%) ⬆️
... and 45 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jbrockmendel
Copy link
Collaborator

Nice!

@jbrockmendel
Copy link
Collaborator

LGTM

@anmyachev
Copy link
Collaborator Author

LGTM

Happy to merge but there are the following issues in our CI: ValueError: buffer source array is read-only. I don't have fix for this for now.

@jbrockmendel
Copy link
Collaborator

CI: ValueError: buffer source array is read-only

does this happen on pandas main too?

@anmyachev
Copy link
Collaborator Author

CI: ValueError: buffer source array is read-only

does this happen on pandas main too?

I don’t know, I don’t have the right environment for this at hand for now. However, we can definitely try to make at least one copy instead of two (if we can’t get rid of them at all)

@anmyachev anmyachev marked this pull request as ready for review August 31, 2022 19:51
@anmyachev anmyachev requested a review from a team as a code owner August 31, 2022 19:51
@@ -45,4 +45,5 @@ def concatenate(dfs):
df.iloc[:, i] = pandas.Categorical(
df.iloc[:, i], categories=union.categories
)
return pandas.concat(dfs)
# `ValueError: buffer source array is read-only` if copy==False
Copy link
Collaborator

@mvashishtha mvashishtha Aug 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to use the test failure of test_merge_asof[non_unique] here to get a small reproducer using just ray:

import ray
import pandas as pd

original_left_df = pd.DataFrame({"a": [1, 5, 10], "left_val": ["a", "b", "c"]})
original_right_df = pd.DataFrame({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]}, index=[0] * 5)

# This works
print(pd.merge_asof(original_left_df, original_right_df, on='a'))

left_oid = ray.put(original_left_df)
right_oid = ray.put(original_right_df)
left_row = pd.concat([ray.get(left_oid)], axis=1, copy=False)
left_df = pd.concat([left_row], axis=0, copy=False)
right_row = pd.concat([ray.get(right_oid)], axis=1, copy=False)
right_df = pd.concat([right_row], axis=0, copy=False)

# this fails
print(pd.merge_asof(left_df, right_df, on='a'))

I am on Python 3.10.4, ray 2.0.0, on macOS Monterey 12.4.

Stack trace
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 print(pd.merge_asof(left_df, right_df, on='a'))

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:598, in merge_asof(left, right, on, left_on, right_on, left_index, right_index, by, left_by, right_by, suffixes, tolerance, allow_exact_matches, direction)
    342 """
    343 Perform a merge by key distance.
    344
   (...)
    579 4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN
    580 """
    581 op = _AsOfMerge(
    582     left,
    583     right,
   (...)
    596     direction=direction,
    597 )
--> 598 return op.get_result()

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:1646, in _OrderedMerge.get_result(self)
   1645 def get_result(self) -> DataFrame:
-> 1646     join_index, left_indexer, right_indexer = self._get_join_info()
   1648     llabels, rlabels = _items_overlap_with_suffix(
   1649         self.left._info_axis, self.right._info_axis, self.suffixes
   1650     )
   1652     left_join_indexer: np.ndarray | None

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:967, in _MergeOperation._get_join_info(self)
    963     join_index, right_indexer, left_indexer = _left_join_on_index(
    964         right_ax, left_ax, self.right_join_keys, sort=self.sort
    965     )
    966 else:
--> 967     (left_indexer, right_indexer) = self._get_join_indexers()
    969     if self.right_index:
    970         if len(self.left) > 0:

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:1997, in _AsOfMerge._get_join_indexers(self)
   1994 else:
   1995     # choose appropriate function by type
   1996     func = _asof_function(self.direction)
-> 1997     return func(left_values, right_values, self.allow_exact_matches, tolerance)

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/_libs/join.pyx:868, in pandas._libs.join.asof_join_backward()

File stringsource:658, in View.MemoryView.memoryview_cwrapper()

File stringsource:349, in View.MemoryView.memoryview.__cinit__()

ValueError: buffer source array is read-only

@anmyachev could you look at this some more? cc @jbrockmendel

I know that pandas dataframes that we pull from the object store are immutable, as in ray-project/ray#369. My guess is we are currently relying on the implicit copy in the concat to make the dataframe mutable for merge_asof. It seems like a pandas bug that merge_asof assumes that both dataframes are mutable. On the other hand, we really don't want the result of to_pandas to be immutable.

P.S. In the reproducer, doing just one copy=False concat (only one dimension, either dimension) on either frame still gives the error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in pandas._libs.join.asof_join_backward_on_X_by_Y the arguments arguments have types numeric_t[:] and by_t[:], which causes this to raise when passed a readonly ndarray. this would be fixed by changing the declarations to const foo_t[:], which is not supported in released cython, but i think is supposed to be supported in cython3

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the newest cython 0.29.x supports const foo_t[:] so this might be addressable

@vnlitvinov
Copy link
Collaborator

Any progress here?

@anmyachev
Copy link
Collaborator Author

Any progress here?

Unfortunately no progress

@anmyachev
Copy link
Collaborator Author

I don't know what else I can do here

@anmyachev
Copy link
Collaborator Author

anmyachev commented Nov 5, 2023

Blocked by #6706

@anmyachev anmyachev added Blocked ❌ A pull request that is blocked and removed Blocked ❌ A pull request that is blocked labels Nov 5, 2023
@anmyachev anmyachev merged commit adfd2bb into modin-project:master Nov 20, 2023
36 checks passed
@anmyachev anmyachev deleted the issue4777 branch November 21, 2023 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: don't use copy=True parameter for concat calls inside to_pandas
5 participants