
RuntimeError: Lightning can't create new processes if CUDA is already initialized. #231

Closed
christina-nasika-edo opened this issue Jul 13, 2023 · 3 comments

christina-nasika-edo commented Jul 13, 2023

I am getting this error:

─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:10 │
│ │
│ 7 # Initializes the model │
│ 8 model = BaseModel.create("llama_lora_int8") │
│ 9 # Finetuned the model │
│ ❱ 10 model.finetune(dataset=instruction_dataset) │
│ 11 │
│ │
│ /opt/conda/envs/venv/lib/python3.10/site-packages/xturing/models/causal.py:113 in finetune │
│ │
│ 110 │ │ │ "instruction_dataset", │
│ 111 │ │ ], "Please make sure the dataset_type is text_dataset or instruction_dataset" │
│ 112 │ │ trainer = self._make_trainer(dataset, logger) │
│ ❱ 113 │ │ trainer.fit() │
│ 114 │ │
│ 115 │ def evaluate(self, dataset: Union[TextDataset, InstructionDataset]): │
│ 116 │ │ pass │
│ │
│ /opt/conda/envs/venv/lib/python3.10/site-packages/xturing/trainers/lightning_trainer.py:190 in │
│ fit │
│ │
│ 187 │ │ │ ) │
│ 188 │ │
│ 189 │ def fit(self): │
│ ❱ 190 │ │ self.trainer.fit(self.lightning_model) │
│ 191 │ │ if self.trainer.checkpoint_callback is not None: │
│ 192 │ │ │ self.trainer.checkpoint_callback.best_model_path │
│ 193 │
│ │
│ /opt/conda/envs/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:529 in │
│ fit │
│ │
│ 526 │ │ """ │
│ 527 │ │ model = _maybe_unwrap_optimized(model) │
│ 528 │ │ self.strategy._lightning_module = model │
│ ❱ 529 │ │ call._call_and_handle_interrupt( │
│ 530 │ │ │ self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, │
│ 531 │ │ ) │
│ 532 │
│ │
│ /opt/conda/envs/venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:41 in │
│ _call_and_handle_interrupt │
│ │
│ 38 │ """ │
│ 39 │ try: │
│ 40 │ │ if trainer.strategy.launcher is not None: │
│ ❱ 41 │ │ │ return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, │
│ 42 │ │ return trainer_fn(*args, **kwargs) │
│ 43 │ │
│ 44 │ except _TunerExitException: │
│ │
│ /opt/conda/envs/venv/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multipr │
│ ocessing.py:99 in launch │
│ │
│ 96 │ │ """ │
│ 97 │ │ self._check_torchdistx_support() │
│ 98 │ │ if self._start_method in ("fork", "forkserver"): │
│ ❱ 99 │ │ │ _check_bad_cuda_fork() │
│ 100 │ │ │
│ 101 │ │ # The default cluster environment in Lightning chooses a random free port number │
│ 102 │ │ # This needs to be done in the main process here before starting processes to en │
│ │
│ /opt/conda/envs/venv/lib/python3.10/site-packages/lightning_fabric/strategies/launchers/multipro │
│ cessing.py:189 in _check_bad_cuda_fork │
│ │
│ 186 │ ) │
│ 187 │ if _IS_INTERACTIVE: │
│ 188 │ │ message += " You will have to restart the Python kernel." │
│ ❱ 189 │ raise RuntimeError(message) │
│ 190 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call
torch.cuda.* functions, have moved the model to the device, or allocated memory on the GPU any other way? Please
remove any such calls, or change the selected strategy. You will have to restart the Python kernel.
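For reference, a minimal way to check whether the CUDA context has already been initialized in the current process before finetune() is called (a sketch assuming only a standard PyTorch install; torch.cuda.is_initialized() itself does not initialize CUDA):

import torch

# Returns True once any torch.cuda.* call has created the CUDA context.
# Lightning's fork-based launcher refuses to start worker processes
# in a process where this is already True.
print(torch.cuda.is_initialized())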

All I did was run the beginning of the lora-llama-int8 tutorial

import gc

from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel

instruction_dataset = InstructionDataset("./xturing_data")

# Initializes the model
model = BaseModel.create("llama_lora_int8")

# Finetuned the model
model.finetune(dataset=instruction_dataset)

Do you know what might be the issue?

tushar2407 (Contributor) commented

Can you run the script again after making sure your GPU is empty? You can check with the command nvidia-smi.
Also, instead of using interactive mode, run the script directly with the command python llama_lora_int8.py.
Finally, make sure to update xturing to the latest version with pip install xturing --upgrade.
Let us know if the error persists.
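As a rough sketch of such a standalone script (the filename llama_lora_int8.py only matches the command above, and the body simply mirrors the tutorial code from this issue):

# llama_lora_int8.py -- run with: python llama_lora_int8.py
from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel

def main():
    # Load the instruction dataset and fine-tune the int8 LoRA LLaMA model.
    instruction_dataset = InstructionDataset("./xturing_data")
    model = BaseModel.create("llama_lora_int8")
    model.finetune(dataset=instruction_dataset)

if __name__ == "__main__":
    # The __main__ guard keeps multiprocessing launchers from re-running
    # the fine-tuning code when they import this module in child processes.
    main()

Running it as a fresh process, rather than in a notebook kernel that may already have touched CUDA, avoids the fork-after-CUDA-initialization problem shown in the traceback above.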

christina-nasika-edo (Author) commented

Thank you @tushar2407, I got the script running by following your advice.

Is there a way to tell how the fine-tuning is progressing?
It has been stuck at the same message (Epoch 0: 100%) for about a day.

tushar2407 (Contributor) commented Jul 25, 2023

Hey @christi7,
I am glad it works. The functionality is not in the library yet, but you can contribute it! Here is the contribution guide. You will have to add a class here.
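Since the trainer wraps PyTorch Lightning, the class to contribute would roughly take the shape of a Lightning Callback. A minimal sketch (the hook names come from Lightning's public Callback API; the class name and print format are just illustrative):

from pytorch_lightning.callbacks import Callback

class ProgressReport(Callback):
    """Prints step and epoch counters so long runs show visible progress."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % 100 == 0:
            print(f"epoch {trainer.current_epoch} - global step {trainer.global_step}")

    def on_train_epoch_end(self, trainer, pl_module):
        print(f"finished epoch {trainer.current_epoch}")

Such a callback would then have to be passed to the Trainer (via its callbacks argument) inside xTuring's lightning_trainer.py.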
