Memory Consumption and Batch Processing in DPK (Medium Priority) #883

Open
1 of 2 tasks
shahrokhDaijavad opened this issue Dec 16, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@shahrokhDaijavad
Member

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

Feature

In RAG or fine-tuning applications of DPK, where the number of data files is relatively small (compared with pre-training, which involves a very large number of files), we want two new capabilities:

  1. Instead of processing one full parquet file at a time, we could split each input parquet file into batches (based on Arrow tables?) of a given batch size and process the data one batch at a time.
  2. The processing would run as a pipeline of transforms in memory, i.e., with no reads from or writes to storage/disk in the intermediate steps. We will revisit PR pipeline transform #602 to do this.
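
To make the two capabilities concrete, here is a minimal sketch in plain Python, not DPK code. The names `iter_batches`, `run_pipeline`, `drop_short`, and `add_length` are hypothetical, and rows are modeled as plain dicts rather than Arrow tables. It only illustrates the shape of the idea: data is cut into fixed-size batches, and each batch flows through every transform in memory before the next batch is touched.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Assumption: a "batch" is a list of row dicts; a real implementation
# would more likely pass Arrow record batches or tables.
Row = Dict[str, object]
Batch = List[Row]
Transform = Callable[[Batch], Batch]


def iter_batches(rows: Iterable[Row], batch_size: int) -> Iterator[Batch]:
    """Yield consecutive batches of at most `batch_size` rows."""
    batch: Batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def run_pipeline(batches: Iterable[Batch], transforms: List[Transform]) -> Iterator[Batch]:
    """Apply every transform to one batch in memory, then move to the next."""
    for batch in batches:
        for transform in transforms:
            batch = transform(batch)  # no intermediate read/write to disk
        if batch:  # skip batches that were filtered down to nothing
            yield batch


# Two hypothetical transforms, for illustration only.
def drop_short(batch: Batch) -> Batch:
    return [row for row in batch if len(row["text"]) >= 5]


def add_length(batch: Batch) -> Batch:
    return [{**row, "length": len(row["text"])} for row in batch]


rows = [{"text": "x" * n} for n in range(10)]
results = list(run_pipeline(iter_batches(rows, 4), [drop_short, add_length]))
```

Because both functions are generators, only one batch is alive at a time until the caller collects the results; the `list(...)` call here is just for the demonstration.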

The inspiration for this is the DataTrove (DT) vs. DPK comparison that Santosh did, as follows:
DPK:

  • Loads the whole parquet file into memory before starting processing
  • Keeps a copy of the processed parquet file in memory and sends it for writing at the end

DT:

  • Documents (rows in a parquet file) are read in batches, sent for processing, and written as soon as they are processed
  • Does not need to hold the whole parquet file in memory after reading and before writing

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@shahrokhDaijavad shahrokhDaijavad added the enhancement New feature or request label Dec 16, 2024