I tried to fine-tune the Llama model in Google Colab Pro on an A100:
model = BaseModel.create("llama_lora_int8")
During the second epoch, training stopped and the following error appeared:
Loading checkpoint shards: 100% 33/33 [01:35<00:00, 3.03s/it]
INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
| Name | Type | Params
0 | pytorch_model | LoraModel | 6.7 B
4.2 M Trainable params
6.7 B Non-trainable params
6.7 B Total params
26,970.440 Total estimated model params size (MB)
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Epoch 0: 0% 0/3515 [00:00<?, ?it/s]
Empty Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
1131 try:
-> 1132 data = self._data_queue.get(timeout=timeout)
1133 return (True, data)
20 frames
Empty:
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
1143 if len(failed_workers) > 0:
1144 pids_str = ', '.join(str(w.pid) for w in failed_workers)
-> 1145 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
1146 if isinstance(e, queue.Empty):
1147 return (False, None)
RuntimeError: DataLoader worker (pid(s) 69269) exited unexpectedly
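To make the chained traceback easier to read, here is a minimal sketch of the check it shows: the loader waits on its data queue, and if the wait times out (`queue.Empty`) while a worker process is no longer alive, it raises the `RuntimeError` with the `Empty` as its cause. This is a simplified paraphrase of the lines visible in the traceback, not PyTorch's actual implementation; `FakeWorker` and `try_get_data` are hypothetical names used only for illustration.

```python
import queue
from dataclasses import dataclass


@dataclass
class FakeWorker:
    """Hypothetical stand-in for a worker process (not a PyTorch class)."""
    pid: int
    alive: bool = True

    def is_alive(self) -> bool:
        return self.alive


def try_get_data(data_queue, workers, timeout=0.05):
    """Simplified paraphrase of the check shown in the traceback."""
    try:
        # Wait for the next batch from the workers.
        return (True, data_queue.get(timeout=timeout))
    except queue.Empty as e:
        # Nothing arrived in time: see whether any worker has died.
        failed_workers = [w for w in workers if not w.is_alive()]
        if len(failed_workers) > 0:
            pids_str = ", ".join(str(w.pid) for w in failed_workers)
            raise RuntimeError(
                "DataLoader worker (pid(s) {}) exited unexpectedly".format(pids_str)
            ) from e
        # Workers are alive but slow: report "no data yet".
        return (False, None)


if __name__ == "__main__":
    try:
        try_get_data(queue.Queue(), [FakeWorker(pid=69269, alive=False)])
    except RuntimeError as err:
        print(err)
```

The `RuntimeError` only reports that a worker process died; it does not say why. In plain PyTorch, constructing the `DataLoader` with `num_workers=0` loads batches in the main process, which often surfaces the underlying error directly. Whether the training wrapper used here exposes that setting is not something this report confirms.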