Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF-#4804: Preserve lengths/widths caches in broadcast_apply_full_axis #6760

Merged
merged 6 commits into from
Dec 5, 2023

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Nov 20, 2023

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves PERF: broadcast_apply_full_axis can be implemented to compute caches in some cases #4804
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@anmyachev anmyachev changed the title PERF-#4804: Preserve lengths/widths caches in 'broadcast_apply_full_a… PERF-#4804: Preserve lengths/widths caches in broadcast_apply_full_axis Nov 20, 2023
@anmyachev anmyachev marked this pull request as ready for review November 20, 2023 19:20
@@ -3348,6 +3348,25 @@ def broadcast_apply_full_axis(
kw["column_widths"] = self._column_widths_cache
elif len(new_columns) == 1 and new_partitions.shape[1] == 1:
kw["column_widths"] = [1]
else:
if (
kw["row_lengths"] is None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kw["row_lengths"] is None
axis == 0 and
kw["row_lengths"] is None

It seems that we can preserve lengths only for one axis (for the one, that we're applying a full-axis function). Consider this case, where kernel modifies the shapes of partitions without modifying the total length:

from modin.test.storage_formats.pandas.test_internals import construct_modin_df_by_scheme
import pandas

md_df = construct_modin_df_by_scheme(pandas.DataFrame({"a": [1, 2, 3, 4], "b": [3, 4, 5, 6], "c": [6, 7, 8, 9], "d": [0, 1, 2, 3]}), {"row_lengths": [2, 2], "column_widths": [2, 2]})._query_compiler._modin_frame

def func(df):
    if df.iloc[0, 0] == 1:
        return pandas.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4], "c": [4, 2, 1], "d": [3, 2, 1]})
    else:
        return pandas.DataFrame({"a": [3], "b": [4], "c": [1], "d": [1]})

res = md_df.apply_full_axis(func=func, axis=1, new_index=[0, 1, 2, 3], new_columns=["a", "b", "c", "d"], keep_partitioning=True)
print(res._row_lengths_cache) # [2, 2]
print(res._column_widths_cache) # [2, 2]

actual_row_lengths = [part.length() for part in res._partitions[:, 0]]
print(actual_row_lengths) # [3, 1]

@anmyachev anmyachev marked this pull request as draft November 22, 2023 12:01
@anmyachev anmyachev force-pushed the issue4804_2 branch 2 times, most recently from bef002d to 5777085 Compare December 4, 2023 01:14
and len(broadcastable_by) == 1
and broadcastable_by[0].has_materialized_dtypes
):
new_index = ModinIndex(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchigarev How is it supposed to take the length from this object in broadcast_apply_full_axis function if it refers to the old modin frame?

Copy link
Collaborator

@dchigarev dchigarev Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, right, I didn't think that the new_index might be used before being passed to the new frame's constructor where the value will be updated...

We can add a logic, that won't use new_index/new_columns for lengths/widths inferring if they return False for .is_materialized. From my side, I'll make a PR passing here None as a value, so in case an index is being used preliminary, it would raise an error, rather than giving an incorrect result

#6800

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchigarev sounds great!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dchigarev ready for review

Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
This reverts commit 5777085.
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev anmyachev marked this pull request as ready for review December 4, 2023 20:40
Comment on lines +3282 to +3283
# to avoid problems that may arise when filtering empty dataframes
and all(r != 0 for r in self._row_lengths_cache)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what kind of problems do you mean here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for filtering empty partitions is too unpredictable at first glance, so I don’t want to enable length calculations in this mode for now.

@dchigarev dchigarev merged commit 6843954 into modin-project:master Dec 5, 2023
37 checks passed
@anmyachev anmyachev deleted the issue4804_2 branch December 5, 2023 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: broadcast_apply_full_axis can be implemented to compute caches in some cases
2 participants