Search before asking
I searched the issues and found no similar issues.
Component
Other
Feature
In RAG or fine-tuning applications of DPK, where the number of data files is relatively small (compared to pre-training on a very large number of files), we want two new capabilities:
Instead of processing one full parquet file at a time, divide the parquet input files into batches (based on Arrow tables?) given a batch size, and process at the batch level.
The processing will run as a pipeline of transforms entirely in memory, i.e., with no reads from or writes to storage/disk in the intermediate steps. We will revisit PR pipeline transform #602 to do this.
The inspiration comes from a DataTrove (DT) vs. DPK comparison that Santosh did:
DPK:
- Loads the whole parquet file into memory before starting processing
- Keeps a copy of the processed parquet in memory and writes it only at the end
DT:
- Documents (rows in a parquet file) are read in batches, sent for processing, and written as soon as they are processed
- There is no need to hold the whole file in memory between reading and writing
Are you willing to submit a PR?
Yes I am willing to submit a PR!