[BUG]Autotrain dgx not working for DPO and ORPO #815

Closed
2 tasks done
jmparejaz opened this issue Nov 26, 2024 · 12 comments
Labels
bug Something isn't working stale

Comments

@jmparejaz

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Hugging Face Space/Endpoints

Interface Used

UI

CLI Command

No response

UI Screenshots & Parameters

image

Error Logs

error_dgx.txt

Additional Information

This is my second issue about DGX fine-tuning not working for alignment trainers.
Initially only ORPO failed; now DPO does not work either.
Please help. AutoTrain on DGX is awesome, but it has not been working recently.

@jmparejaz jmparejaz added the bug Something isn't working label Nov 26, 2024
@abhishekkrthakur
Member

Could you please paste the error?

@jmparejaz
Author

The error logs are attached as a txt file,
but here is the relevant part:

INFO     | 2024-11-26 07:10:01 | autotrain.trainers.clm.train_clm_dpo:train:82 - model dtype: torch.float16
INFO     | 2024-11-26 07:10:01 | autotrain.trainers.clm.train_clm_dpo:train:99 - creating trainer
ERROR    | 2024-11-26 07:10:01 | autotrain.trainers.common:wrapper:215 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/app/env/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/app/src/autotrain/trainers/common.py", line 212, in wrapper
    return func(*args, **kwargs)
  File "/app/src/autotrain/trainers/clm/__main__.py", line 38, in train
    train_dpo(config)
  File "/app/src/autotrain/trainers/clm/train_clm_dpo.py", line 100, in train
    callbacks = utils.get_callbacks(config)
  File "/app/src/autotrain/trainers/clm/utils.py", line 816, in get_callbacks
    callbacks = [UploadLogs(config=config), LossLoggingCallback(), TrainStartCallback()]
  File "/app/src/autotrain/trainers/common.py", line 314, in __init__
    self.api.create_repo(
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3531, in create_repo
    hf_raise_for_status(r)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
    raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-67457449-19b50db2252aed4622c8b06a;42883cd8-6a5e-414d-b965-e18158e72060)
You already created this model repo
ERROR    | 2024-11-26 07:10:01 | autotrain.trainers.common:wrapper:216 - 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-67457449-19b50db2252aed4622c8b06a;42883cd8-6a5e-414d-b965-e18158e72060)
You already created this model repo
[rank3]:[W1126 07:19:58.100088970 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank2]:[W1126 07:20:01.283553003 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank5]:[W1126 07:20:01.486946432 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank7]:[W1126 07:20:01.504053966 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank6]:[W1126 07:20:01.519639703 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank4]:[W1126 07:20:01.524438380 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[rank1]:[W1126 07:20:01.568074344 socket.cpp:428] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
INFO     | 2024-11-26 07:20:24 | autotrain.app.utils:get_running_jobs:40 - Killing PID: 126
INFO     | 2024-11-26 07:20:24 | autotrain.app.utils:kill_process_by_pid:90 - Sent SIGTERM to process with PID 126
INFO     | 2024-11-26 07:20:24 | autotrain.app.training_api:run_main:56 - No running jobs found. Shutting down the server.
INFO     | 2024-11-26 07:20:24 | autotrain.app.training_api:graceful_exit:35 - SIGTERM received. Performing cleanup...
ERROR:    Traceback (most recent call last):
  File "/app/env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/app/env/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/app/src/autotrain/app/training_api.py", line 57, in run_main
    kill_process_by_pid(os.getpid())
  File "/app/src/autotrain/app/utils.py", line 89, in kill_process_by_pid
    os.kill(pid, signal.SIGTERM)
  File "/app/src/autotrain/app/training_api.py", line 36, in graceful_exit
    sys.exit(0)
SystemExit: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 700, in lifespan
    await receive()
  File "/app/env/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
  File "/app/env/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError
Task exception was never retrieved
future: <Task finished name='Task-3' coro=<BackgroundRunner.run_main() done, defined at /app/src/autotrain/app/training_api.py:52> exception=SystemExit(0)>
Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/uvicorn/main.py", line 412, in main
    run(
  File "/app/env/lib/python3.10/site-packages/uvicorn/main.py", line 579, in run
    server.run()
  File "/app/env/lib/python3.10/site-packages/uvicorn/server.py", line 65, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/app/env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/app/env/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/app/src/autotrain/app/training_api.py", line 57, in run_main
    kill_process_by_pid(os.getpid())
  File "/app/src/autotrain/app/utils.py", line 89, in kill_process_by_pid
    os.kill(pid, signal.SIGTERM)
  File "/app/src/autotrain/app/training_api.py", line 36, in graceful_exit
    sys.exit(0)
SystemExit: 0

@abhishekkrthakur
Member

The error says you are creating a project with a name that matches an existing repo in your HF account. Please use a unique name :)
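
One way to rule out a plain name clash before launching is to pick a project name that is known to be free. The sketch below is only an illustration: `pick_free_name` is a hypothetical helper, not part of AutoTrain, and the existence check is passed in as a function so the logic can be tried offline. With `huggingface_hub`, that check could plausibly be built on `HfApi.repo_exists`, but that wiring is an assumption and is not shown in this thread.

```python
from typing import Callable


def pick_free_name(base: str, exists: Callable[[str], bool], max_tries: int = 100) -> str:
    """Return `base` if it is free, else the first free `base-N` (N = 2, 3, ...).

    `exists` is any function that says whether a project name is already taken.
    Hypothetical wiring (an assumption, not AutoTrain code):
        exists = lambda name: HfApi().repo_exists(f"{username}/{name}")
    """
    if not exists(base):
        return base
    for n in range(2, max_tries + 1):
        candidate = f"{base}-{n}"
        if not exists(candidate):
            return candidate
    raise RuntimeError(f"no free name found for {base!r} after {max_tries} tries")


# Offline demonstration with an in-memory set standing in for the Hub.
taken = {"dpolong22", "dpolong22-2"}
print(pick_free_name("dpolong22", taken.__contains__))
```

If a name chosen this way still produces a 409, the clash is unlikely to come from a repo that existed before the job was launched.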

@jmparejaz
Author

jmparejaz commented Nov 26, 2024

Yes, I know, but that is not the case: the repo name was completely new.
The training was running fine, but then it suddenly broke. Here is the commit history:
image
I tested it multiple times, and the same error happened no matter how I named the repo.

It ran for only 30 minutes before breaking.

image

@abhishekkrthakur
Member

INFO     | 2024-11-26 07:10:01 | autotrain.trainers.clm.train_clm_dpo:train:82 - model dtype: torch.float16
INFO     | 2024-11-26 07:10:01 | autotrain.trainers.clm.train_clm_dpo:train:99 - creating trainer
ERROR    | 2024-11-26 07:10:01 | autotrain.trainers.common:wrapper:215 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/app/env/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/app/src/autotrain/trainers/common.py", line 212, in wrapper
    return func(*args, **kwargs)
  File "/app/src/autotrain/trainers/clm/__main__.py", line 38, in train
    train_dpo(config)
  File "/app/src/autotrain/trainers/clm/train_clm_dpo.py", line 100, in train
    callbacks = utils.get_callbacks(config)
  File "/app/src/autotrain/trainers/clm/utils.py", line 816, in get_callbacks
    callbacks = [UploadLogs(config=config), LossLoggingCallback(), TrainStartCallback()]
  File "/app/src/autotrain/trainers/common.py", line 314, in __init__
    self.api.create_repo(
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3531, in create_repo
    hf_raise_for_status(r)
  File "/app/env/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
    raise _format(HfHubHTTPError, str(e), response) from e
huggingface_hub.errors.HfHubHTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-67457449-19b50db2252aed4622c8b06a;42883cd8-6a5e-414d-b965-e18158e72060)
You already created this model repo

Here it just says it's creating a repo that already exists.

Are you talking about another project that failed in the middle of training, unrelated to the error you posted above?

@jmparejaz
Author

Yes, I know it says that, but it doesn't reflect reality.
Check the txt file with the complete log history.
It starts with the project name dpolong22.
It starts training, but then it suddenly breaks with an error saying the repo already exists. That is why I am creating this issue: I can't understand why it is happening, and it doesn't make sense.

1-26 07:08:51 | autotrain.app.training_api:<module>:95 - AUTOTRAIN_USERNAME: growth-cadet
INFO     | 2024-11-26 07:08:54 | autotrain.app.training_api:<module>:96 - PROJECT_NAME: dpolong22
INFO     | 2024-11-26 07:08:54 | autotrain.app.training_api:<module>:97 - TASK_ID: 9
INFO     | 2024-11-26 07:08:54 | autotrain.app.training_api:<module>:98 - DATA_PATH: growth-cadet/jobpost-2-signals_orpo_alignment_completion
INFO     | 2024-11-26 07:08:54 | autotrain.app.training_api:<module>:99 - MODEL: growth-cadet/qwen2-7b-signals-department-TO-JSON07-2
INFO:     Started server process [60]
INFO:     Waiting for application startup.
INFO     | 2024-11-26 07:08:54 | autotrain.commands:launch_command:523 - ['accelerate', 'launch', '--multi_gpu', '--num_machines', '1', '--num_processes', '8', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'dpolong22/training_params.json']
INFO     | 2024-11-26 07:08:54 | autotrain.commands:launch_command:524 - {'model': 'growth-cadet/qwen2-7b-signals-department-TO-JSON07-2', 'project_name': 'dpolong22', 'data_path': 'growth-cadet/jobpost-2-signals_orpo_alignment_completion', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 2048, 'model_max_length': 8192, 'padding': 'right', 'trainer': 'dpo', 'use_flash_attention_2': True, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 3e-05, 'epochs': 10, 'batch_size': 4, 'warmup_ratio': 0.05, 'gradient_accumulation': 2, 'optimizer': 'adamw_bnb_8bit', 'scheduler': 'linear', 'weight_decay': 0.05, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 'quantization': 'int4', 'target_modules': 'q_proj, o_proj, k_proj,v_proj', 'merge_adapter': True, 'peft': True, 'lora_r': 8, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 8192, 'max_completion_length': 4096, 'prompt_text_column': 'prompt', 'text_column': 'chosen', 'rejected_text_column': 'rejected', 'push_to_hub': True, 'username': 'growth-cadet', 'token': '*****', 'unsloth': True, 'distributed_backend': None}
INFO     | 2024-11-26 07:08:54 | autotrain.app.training_api:lifespan:82 - Started training with PID 126
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)


I can create a new training run with a new name and it would happen again.
As I mentioned, I tested it multiple times before creating the issue.
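
The launch command in the log above starts 8 processes (`--num_processes 8`). If every process runs the same setup code, a one-time action such as creating a repo would be attempted once per process, and only the first attempt could succeed. Whether that is what happens here is not confirmed in this thread. The sketch below shows the general pattern of restricting a one-time action to the main process, assuming the launcher exports a `RANK` environment variable as torchrun-style launchers do. `is_main_process` and `run_once_on_main` are hypothetical names, not AutoTrain functions.

```python
import os
from typing import Callable, Mapping, Optional


def is_main_process(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when RANK is unset (single-process run) or equal to 0."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def run_once_on_main(action: Callable[[], None],
                     env: Optional[Mapping[str, str]] = None) -> bool:
    """Run `action` only on the main process; return whether it ran."""
    if is_main_process(env):
        action()
        return True
    return False


# Offline demonstration: simulate 8 ranks, count how often the action runs.
calls = []
for rank in range(8):
    run_once_on_main(lambda: calls.append("create_repo"), env={"RANK": str(rank)})
print(len(calls))
```

In a real job the `env` argument would be left out so the helper reads `os.environ`; passing it explicitly here only makes the behaviour easy to check without a multi-GPU launch.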

@abhishekkrthakur
Member

Okay. Let me take a look and get back to you.

@jmparejaz
Author

Hi @abhishekkrthakur,
any update on this issue? I noticed that there have been a couple of new AutoTrain releases,
but the problem keeps happening.
Fine-tuning always starts well, and after some commits it breaks.
I have tested about 15 or 20 times and only one run completed (it was with 1 epoch).

The error message is always the same (huggingface_hub.errors.HfHubHTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-6749c7c9-5d23398514fc94d343d3929c;793eafad-5bea-4aaa-b337-3f078eaee34a) You already created this model repo ERROR | 2024-11-29 13:55:21 | autotrain.trainers.common:wrapper:216 - 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-6749c7c9-5d23398514fc94d343d3929c;793eafad-5bea-4aaa-b337-3f078eaee34a) You already created this model repo)

But it doesn't make sense, since the AutoTrain UI creates a brand-new repo.
I think the problem is in the code, which tries to create the repo again in the middle of the fine-tuning process.
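
If the repo is indeed being created a second time, the usual way to make that call harmless is to make it idempotent. `huggingface_hub.HfApi.create_repo` accepts an `exist_ok` argument for this purpose. The sketch below is not the AutoTrain code: `ensure_model_repo` is a hypothetical wrapper, and `FakeApi` is a stand-in that only imitates the conflict behaviour so the example runs without network access or a token.

```python
from typing import Any


def ensure_model_repo(api: Any, repo_id: str, private: bool = True) -> None:
    """Create the model repo if it is missing; do nothing if it already exists.

    `api` is expected to behave like `huggingface_hub.HfApi`.
    """
    api.create_repo(repo_id=repo_id, repo_type="model", private=private, exist_ok=True)


class FakeApi:
    """Offline stand-in: raises on a duplicate create unless exist_ok is set."""

    def __init__(self) -> None:
        self.repos = set()

    def create_repo(self, repo_id: str, repo_type: str, private: bool,
                    exist_ok: bool = False) -> None:
        if repo_id in self.repos and not exist_ok:
            raise RuntimeError("409 Conflict: You already created this model repo")
        self.repos.add(repo_id)


api = FakeApi()
ensure_model_repo(api, "growth-cadet/dpolong22")
ensure_model_repo(api, "growth-cadet/dpolong22")  # second call does not raise
print(sorted(api.repos))
```

With a real `HfApi` instance in place of `FakeApi`, a repeated call would return the existing repo instead of raising a 409, so it would not matter which part of the job tried to create the repo first.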

@abhishekkrthakur
Member

I've asked the NVIDIA team for the logs. Still waiting for their response.

@abhishekkrthakur
Member

You could also use Spaces hardware instead of DGX Cloud to check whether it's a DGX Cloud issue.


This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Dec 29, 2024

This issue was closed because it has been inactive for 20 days since being marked as stale.
