Unable to export a fine-tuned Phi-3.5-mini-instruct model #1591

Closed

huanji-sun-007 opened this issue Jan 31, 2025 · 1 comment

Comments

huanji-sun-007 commented Jan 31, 2025

Describe the bug
Hi,
I'm trying to use Olive to quantize and export a fine-tuned Phi-3.5-mini-instruct model.
olive quantize and olive auto-opt run successfully on the base Phi-3.5-mini-instruct model and produce an ONNX model.
However, running the same olive quantize and olive auto-opt commands on our fine-tuned Phi-3.5-mini-instruct model fails with the following error:

Loading HuggingFace model from /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/INPUT_pytorch_model_folder
[2025-01-31 04:56:32,283] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2025-01-31 04:56:32,315] [INFO] [cache.py:138:__init__] Using cache directory: /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/exe/wd/.olive-cache/default_workflow
[2025-01-31 04:56:32,317] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2025-01-31 04:56:32,317] [INFO] [engine.py:246:run] Running Olive on accelerator: cpu-cpu
[2025-01-31 04:56:32,317] [INFO] [engine.py:888:_create_system] Creating target system ...
[2025-01-31 04:56:32,318] [INFO] [engine.py:891:_create_system] Target system created in 0.000502 seconds
[2025-01-31 04:56:32,318] [INFO] [engine.py:902:_create_system] Creating host system ...
[2025-01-31 04:56:32,318] [INFO] [engine.py:905:_create_system] Host system created in 0.000074 seconds
[2025-01-31 04:57:00,923] [INFO] [engine.py:709:_run_pass] Running pass awq:AutoAWQQuantizer {}

Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.51it/s]
Repo card metadata block was not found. Setting CardData to empty.

Generating validation split: 100%|██████████| 214670/214670 [00:13<00:00, 15679.17 examples/s]
[2025-01-31 04:57:23,711] [WARNING] [utils.py:295:get_attr] Attribute ['model', 'rotary_emb'] not found.
[2025-01-31 04:57:24,081] [WARNING] [utils.py:295:get_attr] Attribute ['model', 'rotary_emb'] not found.

You are not running the flash-attention implementation, expect numerical differences.

AWQ: 100%|██████████| 32/32 [07:03<00:00, 13.24s/it]
[2025-01-31 05:04:28,058] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
WARNING:root:Cannot import JIT optimized kernels. CUDA extension will be disabled.
Note that `shard_checkpoint` is deprecated and will be removed in v4.44. We recommend you using split_torch_state_dict_into_shards from huggingface_hub library
[2025-01-31 05:04:31,244] [INFO] [engine.py:781:_run_pass] Pass awq:AutoAWQQuantizer finished in 450.320374 seconds
[2025-01-31 05:04:31,245] [INFO] [cache.py:193:load_model] Loading model 2e34ed27 from cache.
[2025-01-31 05:04:32,306] [INFO] [engine.py:426:run_no_search] Saved output model to /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder/olive-cli-tmp-kzq62dpf/output_model
[2025-01-31 05:04:32,308] [INFO] [engine.py:338:run_accelerator] Save footprint to /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder/olive-cli-tmp-kzq62dpf/footprints.json.
[2025-01-31 05:04:32,308] [INFO] [engine.py:265:run] Run history for cpu-cpu:
[2025-01-31 05:04:32,309] [INFO] [engine.py:517:dump_run_history] run history:
+------------+-------------------+------------------+----------------+-----------+
| model_id   | parent_model_id   | from_pass        |   duration_sec | metrics   |
+============+===================+==================+================+===========+
| 0861c2d8   |                   |                  |                |           |
+------------+-------------------+------------------+----------------+-----------+
| 2e34ed27   | 0861c2d8          | AutoAWQQuantizer |         450.32 |           |
+------------+-------------------+------------------+----------------+-----------+
Command succeeded. Output model saved to /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder
Debug file copy start
Debug file copy end
Loaded previous command output of type hfmodel from /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder
[2025-01-31 05:04:35,686] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2025-01-31 05:04:35,720] [INFO] [cache.py:138:__init__] Using cache directory: /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/exe/wd/.olive-cache/default_workflow
[2025-01-31 05:04:35,725] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2025-01-31 05:04:35,727] [INFO] [engine.py:246:run] Running Olive on accelerator: cpu-cpu
[2025-01-31 05:04:35,727] [INFO] [engine.py:888:_create_system] Creating target system ...
[2025-01-31 05:04:35,727] [INFO] [engine.py:891:_create_system] Target system created in 0.000088 seconds
[2025-01-31 05:04:35,727] [INFO] [engine.py:902:_create_system] Creating host system ...
[2025-01-31 05:04:35,727] [INFO] [engine.py:905:_create_system] Host system created in 0.000144 seconds
[2025-01-31 05:04:37,329] [INFO] [engine.py:709:_run_pass] Running pass conversion:OnnxConversion {}
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Traceback (most recent call last):
  File "/opt/conda/envs/ptca/bin/olive", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/cli/launcher.py", line 62, in main
    service.run()
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/cli/auto_opt.py", line 183, in run
    olive_run(run_config)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/workflows/run/run.py", line 317, in run
    return run_engine(package_config, run_config)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/workflows/run/run.py", line 259, in run_engine
    engine.run(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 252, in run
    run_result = self.run_accelerator(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 330, in run_accelerator
    output_footprint = self.run_no_search(input_model_config, input_model_id, accelerator_spec, output_dir)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 400, in run_no_search
    should_prune, signal, model_ids = self._run_passes(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 664, in _run_passes
    model_config, model_id = self._run_pass(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 764, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, output_model_path, pass_search_point)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/systems/local.py", line 30, in run_pass
    output_model = the_pass.run(model, output_model_path, point)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/olive_pass.py", line 245, in run
    output_model = self._run_for_config(model, config, output_model_path)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 114, in _run_for_config
    output_model = self._run_for_config_internal(model, config, output_model_path)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 151, in _run_for_config_internal
    return self._convert_model_on_device(model, config, output_model_path, device, torch_dtype)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 375, in _convert_model_on_device
    converted_onnx_model = OnnxConversion._export_pytorch_model(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 252, in _export_pytorch_model
    torch.onnx.export(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 516, in export
    _export(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 1612, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 1134, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 1010, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 914, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/jit/_trace.py", line 1310, in _get_trace_graph
    outs = ONNXTracedModule(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/jit/_trace.py", line 138, in forward
    graph, out = torch._C._create_graph_by_tracing(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/jit/_trace.py", line 129, in wrapper
    outs.append(self.inner(*trace_inputs))
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/models/phi3/modeling_phi3.py", line 1247, in forward
    outputs = self.model(
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/models/phi3/modeling_phi3.py", line 1017, in forward
    past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
AttributeError: 'list' object has no attribute 'get_seq_length'
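For context on the final AttributeError: modeling_phi3.py expects past_key_values to be a Cache object from the transformers Cache API (which exposes get_seq_length()), but during the ONNX export trace it receives a plain Python list, which has no such method. A minimal sketch of that API mismatch, assuming transformers 4.44; the variable names are illustrative:

      # Hedged sketch: Phi3Model.forward calls past_key_values.get_seq_length(),
      # a method of the Cache/DynamicCache API, not of the legacy list-of-tuples cache.
      from transformers.cache_utils import DynamicCache

      legacy_cache = []           # legacy format: a plain list of (key, value) tuples
      new_cache = DynamicCache()  # Cache API object used by modeling_phi3.py

      print(hasattr(new_cache, "get_seq_length"))     # True  -> works inside forward()
      print(hasattr(legacy_cache, "get_seq_length"))  # False -> AttributeError as in the traceback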

To Reproduce

      olive quantize \
        --model_name_or_path ${{inputs.fine_tuned_model_path}} \
        --data_files ${{inputs.validation_file_path}} \
        --algorithm awq \
        --output_path ${{outputs.awq_model_folder}} \
        --log_level 1

      olive auto-opt \
        --model_name_or_path ${{outputs.awq_model_folder}} \
        --output_path ${{outputs.onnx_awq_model_folder}} \
        --device cpu \
        --provider CPUExecutionProvider \
        --use_ort_genai \
        --log_level 1

Expected behavior
The fine-tuned Phi-3.5-mini-instruct model should export to ONNX successfully, just as the base model does.

Olive config

      olive auto-opt \
        --model_name_or_path ${{outputs.awq_model_folder}} \
        --output_path ${{outputs.onnx_awq_model_folder}} \
        --device cpu \
        --provider CPUExecutionProvider \
        --use_ort_genai \
        --log_level 1

Other information

  • OS: Azure AML Environment based on mcr.microsoft.com/azureml/curated/acpt-pytorch-2.2-cuda12.1:21 running on Linux (Docker Container)
  • Olive version: 0.7.1.1
  • ONNXRuntime package and version: onnxruntime-genai-cuda = "0.5.0"
  • Transformers package version: 4.44.2
huanji-sun-007 (Author) commented:

This issue was solved by changing the use_cache field to true in our config.json file.
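
For reference, a minimal sketch of applying that fix programmatically before running olive quantize / olive auto-opt, assuming a standard Hugging Face model folder; the path is a placeholder for the fine-tuned model directory:

      # Hedged sketch: enable use_cache in the fine-tuned model's config.json.
      # The path below is a placeholder; point it at the model folder passed to `olive quantize`.
      import json
      from pathlib import Path

      config_path = Path("/path/to/fine_tuned_phi35_mini_instruct/config.json")
      config = json.loads(config_path.read_text())
      config["use_cache"] = True  # the field the fix above refers to
      config_path.write_text(json.dumps(config, indent=2))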
