Memory Consumption and Batch Processing in DPK (Medium Priority) #883

shahrokhDaijavad · 2024-12-16T17:33:10Z

Search before asking

I searched the issues and found no similar issues.

Component

Other

Feature

In RAG or fine-tuning applications of DPK, where the number of data files is relatively small (compared with pre-training, which involves a very large number of files), we want two new capabilities:

1. Instead of processing one full parquet file at a time, split the input parquet files (based on Arrow tables?) into batches of a given batch size and process them one batch at a time.
2. Run the processing as a pipeline of transforms in memory, i.e., with no reads from or writes to storage/disk in the intermediate steps. We will revisit the pipeline transform PR #602 to do this.
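To make the two capabilities concrete, here is a minimal sketch in plain Python. It is not DPK's transform API: `iter_batches`, `run_pipeline`, the `Transform` signature, and the two toy transforms are all hypothetical names chosen for illustration, and rows are modeled as dicts rather than Arrow record batches.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

Row = dict
Batch = list[Row]
# Hypothetical transform signature (an assumption, not DPK's actual interface).
Transform = Callable[[Batch], Batch]


def iter_batches(rows: Iterable[Row], batch_size: int) -> Iterator[Batch]:
    """Yield consecutive batches of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while batch := list(islice(it, batch_size)):
        yield batch


def run_pipeline(batches: Iterable[Batch], transforms: list[Transform]) -> Iterator[Batch]:
    """Apply every transform to each batch in memory, with no intermediate storage."""
    for batch in batches:
        for transform in transforms:
            batch = transform(batch)
        if batch:  # a transform may filter out every row in the batch
            yield batch


# Two toy transforms standing in for real pipeline steps.
def drop_empty(batch: Batch) -> Batch:
    return [row for row in batch if row["text"].strip()]


def add_length(batch: Batch) -> Batch:
    return [{**row, "length": len(row["text"])} for row in batch]


rows = [{"text": "alpha"}, {"text": " "}, {"text": "beta"}, {"text": "gamma"}, {"text": ""}]
results = list(run_pipeline(iter_batches(rows, batch_size=2), [drop_empty, add_length]))
```

Because both functions are generators, only one batch is alive at a time, and each batch passes through every transform before the next one is read. That is the property the two items above ask for.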

The inspiration for this is the DataTrove (DT) vs. DPK comparison that Santosh did, summarized as follows:

DPK:
- Loads the whole parquet file into memory before processing starts
- Keeps a copy of the processed parquet in memory and sends it for writing at the end

DT:
- Documents (rows in the parquet file) are read in batches, sent for processing, and written as soon as they are processed
- Does not need to hold the whole file in memory between reading and writing

Are you willing to submit a PR?

Yes I am willing to submit a PR!

shahrokhDaijavad added the enhancement (New feature or request) label Dec 16, 2024

shahrokhDaijavad assigned touma-I Dec 16, 2024
