RuntimeError: DataLoader worker (pid(s) 69269) exited unexpectedly #232

Closed
devprofession opened this issue Jul 17, 2023 · 1 comment

devprofession commented Jul 17, 2023

I tried to fine-tune the Llama model in Google Colab Pro on an A100:
model = BaseModel.create("llama_lora_int8")

During the second epoch, training stopped and the following error appeared:

Loading checkpoint shards: 100% 33/33 [01:35<00:00, 3.03s/it]

INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name          | Type      | Params
0 | pytorch_model | LoraModel | 6.7 B

4.2 M      Trainable params
6.7 B      Non-trainable params
6.7 B      Total params
26,970.440 Total estimated model params size (MB)

trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199

Epoch 0: 0% 0/3515 [00:00<?, ?it/s]


Empty Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
1131 try:
-> 1132 data = self._data_queue.get(timeout=timeout)
1133 return (True, data)

20 frames

Empty:

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
1143 if len(failed_workers) > 0:
1144 pids_str = ', '.join(str(w.pid) for w in failed_workers)
-> 1145 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
1146 if isinstance(e, queue.Empty):
1147 return (False, None)

RuntimeError: DataLoader worker (pid(s) 69269) exited unexpectedly
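
This error means a worker process died without reporting a Python exception of its own. One common cause is the operating system killing the worker when host RAM or shared memory (`/dev/shm`) runs out, though that is only a guess for this run. Below is a minimal sketch for checking those resources, assuming a Linux host such as the Colab VM. `host_memory_report` is a hypothetical helper written for this example, not part of PyTorch or any other library.

```python
import os
import shutil


def host_memory_report(shm_path="/dev/shm"):
    """Return total host RAM and shared-memory usage in GiB.

    Hypothetical helper; assumes a POSIX system that exposes
    SC_PAGE_SIZE and SC_PHYS_PAGES (true on Linux).
    """
    gib = 1024 ** 3
    report = {"cpu_count": os.cpu_count()}

    # Total physical RAM = page size * number of physical pages.
    page_size = os.sysconf("SC_PAGE_SIZE")
    phys_pages = os.sysconf("SC_PHYS_PAGES")
    report["ram_total_gib"] = page_size * phys_pages / gib

    # DataLoader workers hand batches back through shared memory,
    # so a small or full /dev/shm can take a worker down.
    if os.path.isdir(shm_path):
        usage = shutil.disk_usage(shm_path)
        report["shm_total_gib"] = usage.total / gib
        report["shm_free_gib"] = usage.free / gib

    return report


if __name__ == "__main__":
    for key, value in host_memory_report().items():
        print(f"{key}: {value}")
```

If resources look tight, setting the DataLoader's `num_workers` to 0 moves data loading into the main process, which usually surfaces the underlying failure directly instead of this generic message.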

devprofession (Author) commented

Google Colab Pro gives you only 16 GB of GPU memory; you should upgrade to Pro+.
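
For a rough sense of scale, the parameter counts in the training log above can be turned into weight sizes. This is only back-of-the-envelope arithmetic: the bytes-per-parameter values are assumptions, and the result ignores activations, optimizer state, and CUDA overhead. At 4 bytes per parameter it reproduces the 26,970.440 MB figure in the model summary.

```python
# Parameter counts copied from the training log above.
TOTAL_PARAMS = 6_742_609_920
TRAINABLE_PARAMS = 4_194_304

# Assumed storage widths in bytes per parameter.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(n_params, bytes_per_param):
    """Size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return n_params * bytes_per_param / 1e6


for name, width in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_size_mb(TOTAL_PARAMS, width):,.3f} MB")

# Share of parameters that the LoRA adapters actually train.
trainable_pct = 100 * TRAINABLE_PARAMS / TOTAL_PARAMS
print(f"trainable%: {trainable_pct:.4f}")
```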
