Unable to export a fine-tuned Phi-3.5-mini-instruct model #1591

Closed

huanji-sun-007 opened this issue Jan 31, 2025 · 1 comment

Comments

huanji-sun-007 commented Jan 31, 2025

Describe the bug
Hi,
I'm trying to use Olive to quantize and export a fine-tuned Phi-3.5-mini-instruct model.
olive quantize and olive auto-opt run successfully on the base Phi-3.5-mini-instruct model and produce an ONNX model.
However, running the same olive quantize and olive auto-opt commands on our fine-tuned Phi-3.5-mini-instruct model fails with the following error:

Loading HuggingFace model from /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/INPUT_pytorch_model_folder
[2025-01-31 04:56:32,283] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2025-01-31 04:56:32,315] [INFO] [cache.py:138:__init__] Using cache directory: /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/exe/wd/.olive-cache/default_workflow
[2025-01-31 04:56:32,317] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2025-01-31 04:56:32,317] [INFO] [engine.py:246:run] Running Olive on accelerator: cpu-cpu
[2025-01-31 04:56:32,317] [INFO] [engine.py:888:_create_system] Creating target system ...
[2025-01-31 04:56:32,318] [INFO] [engine.py:891:_create_system] Target system created in 0.000502 seconds
[2025-01-31 04:56:32,318] [INFO] [engine.py:902:_create_system] Creating host system ...
[2025-01-31 04:56:32,318] [INFO] [engine.py:905:_create_system] Host system created in 0.000074 seconds
[2025-01-31 04:57:00,923] [INFO] [engine.py:709:_run_pass] Running pass awq:AutoAWQQuantizer {}

Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.51it/s]
Repo card metadata block was not found. Setting CardData to empty.

Generating validation split: 100%|██████████| 214670/214670 [00:13<00:00, 15679.17 examples/s]
[2025-01-31 04:57:23,711] [WARNING] [utils.py:295:get_attr] Attribute ['model', 'rotary_emb'] not found.
[2025-01-31 04:57:24,081] [WARNING] [utils.py:295:get_attr] Attribute ['model', 'rotary_emb'] not found.

You are not running the flash-attention implementation, expect numerical differences.

AWQ: 100%|██████████| 32/32 [07:03<00:00, 13.24s/it]
[2025-01-31 05:04:28,058] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
WARNING:root:Cannot import JIT optimized kernels. CUDA extension will be disabled.
Note that `shard_checkpoint` is deprecated and will be removed in v4.44. We recommend you using split_torch_state_dict_into_shards from huggingface_hub library
[2025-01-31 05:04:31,244] [INFO] [engine.py:781:_run_pass] Pass awq:AutoAWQQuantizer finished in 450.320374 seconds
[2025-01-31 05:04:31,245] [INFO] [cache.py:193:load_model] Loading model 2e34ed27 from cache.
[2025-01-31 05:04:32,306] [INFO] [engine.py:426:run_no_search] Saved output model to /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder/olive-cli-tmp-kzq62dpf/output_model
[2025-01-31 05:04:32,308] [INFO] [engine.py:338:run_accelerator] Save footprint to /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder/olive-cli-tmp-kzq62dpf/footprints.json.
[2025-01-31 05:04:32,308] [INFO] [engine.py:265:run] Run history for cpu-cpu:
[2025-01-31 05:04:32,309] [INFO] [engine.py:517:dump_run_history] run history:
+------------+-------------------+------------------+----------------+-----------+
| model_id   | parent_model_id   | from_pass        |   duration_sec | metrics   |
+============+===================+==================+================+===========+
| 0861c2d8   |                   |                  |                |           |
+------------+-------------------+------------------+----------------+-----------+
| 2e34ed27   | 0861c2d8          | AutoAWQQuantizer |         450.32 |           |
+------------+-------------------+------------------+----------------+-----------+
Command succeeded. Output model saved to /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder
Debug file copy start
Debug file copy end
Loaded previous command output of type hfmodel from /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder
[2025-01-31 05:04:35,686] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2025-01-31 05:04:35,720] [INFO] [cache.py:138:__init__] Using cache directory: /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/exe/wd/.olive-cache/default_workflow
[2025-01-31 05:04:35,725] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2025-01-31 05:04:35,727] [INFO] [engine.py:246:run] Running Olive on accelerator: cpu-cpu
[2025-01-31 05:04:35,727] [INFO] [engine.py:888:_create_system] Creating target system ...
[2025-01-31 05:04:35,727] [INFO] [engine.py:891:_create_system] Target system created in 0.000088 seconds
[2025-01-31 05:04:35,727] [INFO] [engine.py:902:_create_system] Creating host system ...
[2025-01-31 05:04:35,727] [INFO] [engine.py:905:_create_system] Host system created in 0.000144 seconds
[2025-01-31 05:04:37,329] [INFO] [engine.py:709:_run_pass] Running pass conversion:OnnxConversion {}
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Traceback (most recent call last):
  File "/opt/conda/envs/ptca/bin/olive", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/cli/launcher.py", line 62, in main
    service.run()
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/cli/auto_opt.py", line 183, in run
    olive_run(run_config)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/workflows/run/run.py", line 317, in run
    return run_engine(package_config, run_config)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/workflows/run/run.py", line 259, in run_engine
    engine.run(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 252, in run
    run_result = self.run_accelerator(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 330, in run_accelerator
    output_footprint = self.run_no_search(input_model_config, input_model_id, accelerator_spec, output_dir)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 400, in run_no_search
    should_prune, signal, model_ids = self._run_passes(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 664, in _run_passes
    model_config, model_id = self._run_pass(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 764, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, output_model_path, pass_search_point)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/systems/local.py", line 30, in run_pass
    output_model = the_pass.run(model, output_model_path, point)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/olive_pass.py", line 245, in run
    output_model = self._run_for_config(model, config, output_model_path)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 114, in _run_for_config
    output_model = self._run_for_config_internal(model, config, output_model_path)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 151, in _run_for_config_internal
    return self._convert_model_on_device(model, config, output_model_path, device, torch_dtype)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 375, in _convert_model_on_device
    converted_onnx_model = OnnxConversion._export_pytorch_model(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 252, in _export_pytorch_model
    torch.onnx.export(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 516, in export
    _export(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 1612, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 1134, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 1010, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 914, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/jit/_trace.py", line 1310, in _get_trace_graph
    outs = ONNXTracedModule(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/jit/_trace.py", line 138, in forward
    graph, out = torch._C._create_graph_by_tracing(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/jit/_trace.py", line 129, in wrapper
    outs.append(self.inner(*trace_inputs))
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/models/phi3/modeling_phi3.py", line 1247, in forward
    outputs = self.model(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/models/phi3/modeling_phi3.py", line 1017, in forward
    past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
AttributeError: 'list' object has no attribute 'get_seq_length'
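For context on the final AttributeError: modeling_phi3.py expects past_key_values to be a Cache object from the transformers Cache API (which exposes get_seq_length()), but during the ONNX export trace it receives a plain Python list, which has no such method. A minimal sketch of that API mismatch, assuming transformers 4.44; the variable names are illustrative:

      # Hedged sketch: Phi3Model.forward calls past_key_values.get_seq_length(),
      # a method of the Cache/DynamicCache API, not of the legacy list-of-tuples cache.
      from transformers.cache_utils import DynamicCache

      legacy_cache = []           # legacy format: a plain list of (key, value) tuples
      new_cache = DynamicCache()  # Cache API object used by modeling_phi3.py

      print(hasattr(new_cache, "get_seq_length"))     # True  -> works inside forward()
      print(hasattr(legacy_cache, "get_seq_length"))  # False -> AttributeError as in the traceback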

To Reproduce

      olive quantize \
        --model_name_or_path ${{inputs.fine_tuned_model_path}} \
        --data_files ${{inputs.validation_file_path}} \
        --algorithm awq \
        --output_path ${{outputs.awq_model_folder}} \
        --log_level 1

      olive auto-opt \
        --model_name_or_path ${{outputs.awq_model_folder}} \
        --output_path ${{outputs.onnx_awq_model_folder}} \
        --device cpu \
        --provider CPUExecutionProvider \
        --use_ort_genai \
        --log_level 1

Expected behavior
The fine-tuned Phi-3.5-mini-instruct model should export to ONNX successfully, just as the base model does.

Olive config

      olive auto-opt \
        --model_name_or_path ${{outputs.awq_model_folder}} \
        --output_path ${{outputs.onnx_awq_model_folder}} \
        --device cpu \
        --provider CPUExecutionProvider \
        --use_ort_genai \
        --log_level 1

Other information

  • OS: Azure AML Environment based on mcr.microsoft.com/azureml/curated/acpt-pytorch-2.2-cuda12.1:21 running on Linux (Docker Container)
  • Olive version: 0.7.1.1
  • ONNXRuntime package and version: onnxruntime-genai-cuda = "0.5.0"
  • Transformers package version: 4.44.2
huanji-sun-007 (Author) commented:

This issue was solved by changing the use_cache field to true in our config.json file.
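
For reference, a minimal sketch of applying that fix programmatically before running olive quantize / olive auto-opt, assuming a standard Hugging Face model folder; the path is a placeholder for the fine-tuned model directory:

      # Hedged sketch: enable use_cache in the fine-tuned model's config.json.
      # The path below is a placeholder; point it at the model folder passed to `olive quantize`.
      import json
      from pathlib import Path

      config_path = Path("/path/to/fine_tuned_phi35_mini_instruct/config.json")
      config = json.loads(config_path.read_text())
      config["use_cache"] = True  # the field the fix above refers to
      config_path.write_text(json.dumps(config, indent=2))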
