-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF-#4804: Preserve lengths/widths caches in broadcast_apply_full_axis
#6760
Conversation
broadcast_apply_full_axis
@@ -3348,6 +3348,25 @@ def broadcast_apply_full_axis( | |||
kw["column_widths"] = self._column_widths_cache | |||
elif len(new_columns) == 1 and new_partitions.shape[1] == 1: | |||
kw["column_widths"] = [1] | |||
else: | |||
if ( | |||
kw["row_lengths"] is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kw["row_lengths"] is None | |
axis == 0 and | |
kw["row_lengths"] is None |
It seems that we can preserve lengths only for one axis (for the one, that we're applying a full-axis function). Consider this case, where kernel modifies the shapes of partitions without modifying the total length:
from modin.test.storage_formats.pandas.test_internals import construct_modin_df_by_scheme
import pandas
md_df = construct_modin_df_by_scheme(pandas.DataFrame({"a": [1, 2, 3, 4], "b": [3, 4, 5, 6], "c": [6, 7, 8, 9], "d": [0, 1, 2, 3]}), {"row_lengths": [2, 2], "column_widths": [2, 2]})._query_compiler._modin_frame
def func(df):
if df.iloc[0, 0] == 1:
return pandas.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4], "c": [4, 2, 1], "d": [3, 2, 1]})
else:
return pandas.DataFrame({"a": [3], "b": [4], "c": [1], "d": [1]})
res = md_df.apply_full_axis(func=func, axis=1, new_index=[0, 1, 2, 3], new_columns=["a", "b", "c", "d"], keep_partitioning=True)
print(res._row_lengths_cache) # [2, 2]
print(res._column_widths_cache) # [2, 2]
actual_row_lengths = [part.length() for part in res._partitions[:, 0]]
print(actual_row_lengths) # [3, 1]
bef002d
to
5777085
Compare
and len(broadcastable_by) == 1 | ||
and broadcastable_by[0].has_materialized_dtypes | ||
): | ||
new_index = ModinIndex( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev How is it supposed to take the length from this object in broadcast_apply_full_axis
function if it refers to the old modin frame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, right, I didn't think that the new_index
might be used before being passed to the new frame's constructor where the value will be updated...
We can add a logic, that won't use new_index/new_columns
for lengths/widths inferring if they return False for .is_materialized
. From my side, I'll make a PR passing here None
as a value, so in case an index is being used preliminary, it would raise an error, rather than giving an incorrect result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev sounds great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev ready for review
…_apply_full_axis' Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
This reverts commit 5777085.
Signed-off-by: Anatoly Myachev <[email protected]>
d287c10
to
1f01445
Compare
# to avoid problems that may arise when filtering empty dataframes | ||
and all(r != 0 for r in self._row_lengths_cache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what kind of problems do you mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for filtering empty partitions is too unpredictable at first glance, so I don’t want to enable length calculations in this mode for now.
Co-authored-by: Dmitry Chigarev <[email protected]>
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
broadcast_apply_full_axis
can be implemented to compute caches in some cases #4804docs/development/architecture.rst
is up-to-date