Allow multiprocessing when preparing ICL dataset #1276
Comments
@sanjari-orb sure! My only hesitation in doing this is that we've observed occasional hangs when using hf datasets and multiprocessing (huggingface/datasets#6393), but should be fine, especially if we keep it single process by default. Would be happy to accept a PR adding the arg. |
Actually we ended up seeing the same problem of the |
Unfortunately I have never managed to fully root cause this issue (feel free to comment on the datasets issue, as I don't think they have been able to fix it either). However, I believe it has something to do with multiple processes processing the same data at the same time. As a result, in the main dataloader we have local rank 0 go first, so that all the other ranks are just reading data cached on disk. We could probably apply the same logic in the ICL classes. |
Could you give me a pointer to where this is being handled? |
Ah yeah sorry, meant to include the link. llm-foundry/llmfoundry/data/finetuning/tasks.py Lines 831 to 837 in 2196d07
llm-foundry/llmfoundry/data/finetuning/tasks.py Lines 945 to 956 in 2196d07
|
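For readers following along, the "rank 0 goes first" ordering described above can be sketched roughly as below. This is only an illustration, not the code in tasks.py: the helper name rank_zero_first is hypothetical, and the barrier argument is assumed to be any callable that blocks until every local rank has called it (in a real job that would likely be something like torch.distributed.barrier). Threads stand in for ranks so the sketch runs with the standard library alone.

```python
import threading
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def rank_zero_first(local_rank: int, barrier: Callable[[], None]) -> Iterator[None]:
    """Hypothetical helper: local rank 0 runs the body before the other ranks.

    Assumption: `barrier` blocks until every local rank has called it once.
    """
    if local_rank != 0:
        barrier()  # non-zero ranks wait until rank 0 has finished the body
    try:
        yield
    finally:
        if local_rank == 0:
            barrier()  # rank 0 is done (or failed); release the waiting ranks


# Demo: three threads play the role of three local ranks.
order: List[int] = []
order_lock = threading.Lock()
sync = threading.Barrier(3)


def prepare(rank: int) -> None:
    with rank_zero_first(rank, sync.wait):
        # Placeholder for the dataset load/map; rank 0 would populate the
        # on-disk cache here and the other ranks would then read from it.
        with order_lock:
            order.append(rank)


threads = [threading.Thread(target=prepare, args=(r,)) for r in (2, 1, 0)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each rank calls the barrier exactly once, so rank 0 always finishes its work before ranks 1 and 2 start theirs, regardless of which thread is scheduled first.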
We are already doing that here though right? llm-foundry/llmfoundry/eval/datasets/in_context_learning_evaluation.py Lines 265 to 268 in 2196d07
|
not quite. in the code I linked we have rank 0 go first for the dataset load. In the code you linked, we have only rank 0 download the file, but then all ranks would call |
Ah gotcha. Okay let me try this. Thanks! |
🚀 Feature Request
Allow passing a num_proc/num_workers parameter in InContextLearningDataset so that dataset preparation can use more than one process.
Motivation
When loading bigger ICL eval datasets, it is desirable to pass num_proc > 1 to the following map function, which preps each example in the dataset:
llm-foundry/llmfoundry/eval/datasets/in_context_learning_evaluation.py
Lines 173 to 181 in 5571101
Can we introduce a num_proc parameter in the InContextLearningDataset constructors so that example preparation can pass it to this map call? This greatly increases the speed of loading larger datasets.
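A minimal sketch of what the request amounts to, under stated assumptions: the class ICLDatasetSketch and the prep_fn argument are hypothetical stand-ins, not llm-foundry's actual InContextLearningDataset, and the dataset object is assumed to expose a Hugging Face datasets style map(function, num_proc=...) method. The default of None keeps the single-process behaviour.

```python
from typing import Any, Callable, Dict, Optional


class ICLDatasetSketch:
    """Hypothetical stand-in for InContextLearningDataset (illustration only)."""

    def __init__(
        self,
        dataset: Any,
        prep_fn: Callable[[Dict[str, Any]], Dict[str, Any]],
        num_proc: Optional[int] = None,
    ) -> None:
        # None leaves the map call single-process, matching current behaviour.
        self.num_proc = num_proc
        # Assumption: `dataset.map` accepts `num_proc`, as the map method of
        # Hugging Face `datasets.Dataset` does.
        self.dataset = dataset.map(prep_fn, num_proc=self.num_proc)
```

Callers that want parallel preparation would pass, for example, num_proc=4; everyone else is unaffected.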