FEAT-#6831: Implement read_parquet_glob and to_parquet_glob #6854
@@ -114,6 +114,24 @@ def parse(fname, **kwargs):
        return _split_result_for_readers(1, num_splits, df) + [length, width]


@doc(_doc_pandas_parser_class, data_type="parquet files")
class ExperimentalPandasParquetParser(PandasParser):
    @staticmethod
    @doc(_doc_parse_func, parameters=_doc_parse_parameters_common)
    def parse(fname, **kwargs):
        warnings.filterwarnings("ignore")
        num_splits = 1
Why 1?
Each file is equal to one partition.
We should probably change this in a separate PR.
This is an opportunity for further optimization, so if necessary, yes. However it's more important to add support for different formats for now.
File an issue for further optimization please.

        single_worker_read = kwargs.pop("single_worker_read", None)
        df = pandas.read_parquet(fname, **kwargs)
        if single_worker_read:
            return df

        length = len(df)
        width = len(df.columns)

        return _split_result_for_readers(1, num_splits, df) + [length, width]


@doc(_doc_pandas_parser_class, data_type="custom text")
class ExperimentalCustomTextParser(PandasParser):
    @staticmethod

Should we change `read_pickle_distributed` and `to_pickle_distributed` to `read_pickle_glob` and `to_pickle_glob` (separate issue)?
I'd like it for consistency.
File an issue for that please.
#6856
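
For readers following the naming discussion, here is a minimal sketch of the "one file per partition" idea behind the `*_glob` names. It is not Modin's implementation: the helpers `write_partitions` and `read_partitions` are hypothetical, partitions are plain Python lists rather than DataFrames, and only the standard library (`glob`, `pickle`) is used. The zero-padded index is also an assumption, chosen so that sorting the matched paths restores partition order.

```python
import glob
import os
import pickle
import tempfile


def write_partitions(partitions, pattern):
    """Write each partition to its own file (hypothetical helper).

    The '*' in ``pattern`` is replaced by a zero-padded partition index.
    """
    if "*" not in pattern:
        raise ValueError("pattern must contain '*'")
    paths = []
    for index, partition in enumerate(partitions):
        path = pattern.replace("*", f"{index:04d}")
        with open(path, "wb") as handle:
            pickle.dump(partition, handle)
        paths.append(path)
    return paths


def read_partitions(pattern):
    """Read every file matching ``pattern``; one file yields one partition."""
    partitions = []
    # Sorting relies on the zero-padded index to recover the write order.
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as handle:
            partitions.append(pickle.load(handle))
    return partitions


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pattern = os.path.join(tmp, "part_*.pkl")
    original = [[(1, "a"), (2, "b")], [(3, "c")], [(4, "d"), (5, "e")]]
    written = write_partitions(original, pattern)
    restored = read_partitions(pattern)
```

Because every file maps to exactly one partition, the reader never has to split a file further, which mirrors the `num_splits = 1` choice discussed in the first thread.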