-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF-#4777: Don't use copy=True
parameter for concat
calls inside to_pandas
#4778
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4778 +/- ##
==========================================
+ Coverage 84.63% 89.86% +5.23%
==========================================
Files 265 266 +1
Lines 19626 19909 +283
==========================================
+ Hits 16610 17892 +1282
+ Misses 3016 2017 -999
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Nice! |
LGTM |
Happy to merge but there are the following issues in our CI: |
does this happen on pandas main too? |
I don’t know, I don’t have the right environment for this at hand for now. However, we can definitely try to make at least one copy instead of two (if we can’t get rid of them at all) |
@@ -45,4 +45,5 @@ def concatenate(dfs): | |||
df.iloc[:, i] = pandas.Categorical( | |||
df.iloc[:, i], categories=union.categories | |||
) | |||
return pandas.concat(dfs) | |||
# `ValueError: buffer source array is read-only` if copy==False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to use the test failure of test_merge_asof[non_unique]
here to get a small reproducer using just ray:
import ray
import pandas as pd
original_left_df = pd.DataFrame({"a": [1, 5, 10], "left_val": ["a", "b", "c"]})
original_right_df = pd.DataFrame({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]}, index=[0] * 5)
# This works
print(pd.merge_asof(original_left_df, original_right_df, on='a'))
left_oid = ray.put(original_left_df)
right_oid = ray.put(original_right_df)
left_row = pd.concat([ray.get(left_oid)], axis=1, copy=False)
left_df = pd.concat([left_row], axis=0, copy=False)
right_row = pd.concat([ray.get(right_oid)], axis=1, copy=False)
right_df = pd.concat([right_row], axis=0, copy=False)
# this fails
print(pd.merge_asof(left_df, right_df, on='a'))
I am on Python 3.10.4, ray 2.0.0, on macOS Monterey 12.4.
Stack trace
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 print(pd.merge_asof(left_df, right_df, on='a'))
File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:598, in merge_asof(left, right, on, left_on, right_on, left_index, right_index, by, left_by, right_by, suffixes, tolerance, allow_exact_matches, direction)
342 """
343 Perform a merge by key distance.
344
(...)
579 4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
580 """
581 op = _AsOfMerge(
582 left,
583 right,
(...)
596 direction=direction,
597 )
--> 598 return op.get_result()
File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:1646, in _OrderedMerge.get_result(self)
1645 def get_result(self) -> DataFrame:
-> 1646 join_index, left_indexer, right_indexer = self._get_join_info()
1648 llabels, rlabels = _items_overlap_with_suffix(
1649 self.left._info_axis, self.right._info_axis, self.suffixes
1650 )
1652 left_join_indexer: np.ndarray | None
File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:967, in _MergeOperation._get_join_info(self)
963 join_index, right_indexer, left_indexer = _left_join_on_index(
964 right_ax, left_ax, self.right_join_keys, sort=self.sort
965 )
966 else:
--> 967 (left_indexer, right_indexer) = self._get_join_indexers()
969 if self.right_index:
970 if len(self.left) > 0:
File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:1997, in _AsOfMerge._get_join_indexers(self)
1994 else:
1995 # choose appropriate function by type
1996 func = _asof_function(self.direction)
-> 1997 return func(left_values, right_values, self.allow_exact_matches, tolerance)
File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/_libs/join.pyx:868, in pandas._libs.join.asof_join_backward()
File stringsource:658, in View.MemoryView.memoryview_cwrapper()
File stringsource:349, in View.MemoryView.memoryview.__cinit__()
ValueError: buffer source array is read-only
@anmyachev could you look at this some more? cc @jbrockmendel
I know that pandas dataframes that we pull from the object store are immutable, as in ray-project/ray#369. My guess is we are currently relying on the implicit copy in the concat
to make the dataframe mutable for merge_asof
. It seems like a pandas bug that merge_asof
assumes that both dataframes are mutable. On the other hand, we really don't want the result of to_pandas
to be immutable.
P.S. In the reproducer, doing just one copy=False
concat (only one dimension, either dimension) on either frame still gives the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in pandas._libs.join.asof_join_backward_on_X_by_Y
the arguments arguments have types numeric_t[:]
and by_t[:]
, which causes this to raise when passed a readonly ndarray. this would be fixed by changing the declarations to const foo_t[:]
, which is not supported in released cython, but i think is supposed to be supported in cython3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the newest cython 0.29.x supports const foo_t[:]
so this might be addressable
Any progress here? |
Unfortunately no progress |
I don't know what else I can do here |
|
… calls inside 'to_pandas' Signed-off-by: Myachev <[email protected]>
Signed-off-by: Myachev <[email protected]>
Signed-off-by: Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Myachev [email protected]
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
copy=True
parameter forconcat
calls insideto_pandas
#4777docs/development/architecture.rst
is up-to-date