Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#6831: Implement read_parquet_glob and to_parquet_glob #6854

Merged
merged 7 commits into from
Jan 13, 2024

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Jan 11, 2024

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Implement read_parquet_glob #6831
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@YarShev
Copy link
Collaborator

YarShev commented Jan 11, 2024

Does this PR also resolve #5723?

@@ -3187,4 +3187,4 @@ def __reduce__(self):
# Persistance support methods - END

# Namespace for experimental functions
modin = CachedAccessor("modin", ExperimentalFunctions)
modin: ExperimentalFunctions = CachedAccessor("modin", ExperimentalFunctions)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For IDE hints to work.

@anmyachev
Copy link
Collaborator Author

Does this PR also resolve #5723?

Yes and no. In a broad sense, yes, several files are read in one function call, but it is supposed to implement reading a list of files, and not files defined by glob syntax. It is clear that the difference is only from an interface point of view, but it is inconvenient to reuse experimental functionality in the core module.

@anmyachev anmyachev force-pushed the issue6831 branch 2 times, most recently from da71bf0 to c4009ce Compare January 12, 2024 01:01
@anmyachev anmyachev marked this pull request as ready for review January 12, 2024 01:17
Signed-off-by: Anatoly Myachev <[email protected]>
modin/pandas/accessor.py Outdated Show resolved Hide resolved
modin/pandas/accessor.py Show resolved Hide resolved
modin/experimental/pandas/test/test_io_exp.py Outdated Show resolved Hide resolved
modin/experimental/pandas/io.py Show resolved Hide resolved
@doc(_doc_parse_func, parameters=_doc_parse_parameters_common)
def parse(fname, **kwargs):
warnings.filterwarnings("ignore")
num_splits = 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 1?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each file is equal to one partition.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably change this in a separate PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an opportunity for further optimization, so if necessary, yes. However it's more important to add support for different formats for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File an issue for further optimization please.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modin/experimental/core/io/glob/glob_dispatcher.py Outdated Show resolved Hide resolved
modin/experimental/core/io/glob/glob_dispatcher.py Outdated Show resolved Hide resolved
@YarShev YarShev changed the title FEAT-#6831: Implement read_parquet_glob FEAT-#6831: Implement read_parquet_glob and to_parquet_glob Jan 12, 2024
anmyachev and others added 2 commits January 12, 2024 18:04
Co-authored-by: Iaroslav Igoshev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
modin/pandas/accessor.py Outdated Show resolved Hide resolved
ExperimentalPandasPickleParser, ExperimentalPickleDispatcher
ExperimentalPandasPickleParser, ExperimentalGlobDispatcher
)
to_pickle_distributed = __make_write(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change read_pickle_distributed and to_pickle_distributed to read_pickle_glob and to_pickle_glob (separate issue)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like it for consistency.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File an issue for that please.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@YarShev YarShev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some more comments. Otherwise, LGTM.

anmyachev and others added 3 commits January 12, 2024 19:37
Co-authored-by: Iaroslav Igoshev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
Copy link
Collaborator

@YarShev YarShev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@YarShev YarShev merged commit f6b31d6 into modin-project:master Jan 13, 2024
53 checks passed
@anmyachev anmyachev deleted the issue6831 branch January 13, 2024 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement read_parquet_glob
2 participants