Describe the bug
Hi,
I'm trying to use Olive to quantize and export a fine-tuned Phi-3.5-mini-instruct model.
I can successfully run `olive quantize` and `olive auto-opt` on the base Phi-3.5-mini-instruct model and export the ONNX model.
However, when I try to run `olive quantize` and `olive auto-opt` on our fine-tuned Phi-3.5-mini-instruct model, I get the following error:
Loading HuggingFace model from /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/INPUT_pytorch_model_folder
[2025-01-31 04:56:32,283] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2025-01-31 04:56:32,315] [INFO] [cache.py:138:__init__] Using cache directory: /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/exe/wd/.olive-cache/default_workflow
[2025-01-31 04:56:32,317] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2025-01-31 04:56:32,317] [INFO] [engine.py:246:run] Running Olive on accelerator: cpu-cpu
[2025-01-31 04:56:32,317] [INFO] [engine.py:888:_create_system] Creating target system ...
[2025-01-31 04:56:32,318] [INFO] [engine.py:891:_create_system] Target system created in 0.000502 seconds
[2025-01-31 04:56:32,318] [INFO] [engine.py:902:_create_system] Creating host system ...
[2025-01-31 04:56:32,318] [INFO] [engine.py:905:_create_system] Host system created in 0.000074 seconds
[2025-01-31 04:57:00,923] [INFO] [engine.py:709:_run_pass] Running pass awq:AutoAWQQuantizer {}
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 2.51it/s]
Repo card metadata block was not found. Setting CardData to empty.
Generating validation split: 100%|██████████| 214670/214670 [00:13<00:00, 15679.17 examples/s]
[2025-01-31 04:57:23,711] [WARNING] [utils.py:295:get_attr] Attribute ['model', 'rotary_emb'] not found.
[2025-01-31 04:57:24,081] [WARNING] [utils.py:295:get_attr] Attribute ['model', 'rotary_emb'] not found.
You are not running the flash-attention implementation, expect numerical differences.
AWQ: 100%|██████████| 32/32 [07:03<00:00, 13.24s/it]
[2025-01-31 05:04:28,058] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
WARNING:root:Cannot import JIT optimized kernels. CUDA extension will be disabled.
Note that `shard_checkpoint` is deprecated and will be removed in v4.44. We recommend you using split_torch_state_dict_into_shards from huggingface_hub library
[2025-01-31 05:04:31,244] [INFO] [engine.py:781:_run_pass] Pass awq:AutoAWQQuantizer finished in 450.320374 seconds
[2025-01-31 05:04:31,245] [INFO] [cache.py:193:load_model] Loading model 2e34ed27 from cache.
[2025-01-31 05:04:32,306] [INFO] [engine.py:426:run_no_search] Saved output model to /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder/olive-cli-tmp-kzq62dpf/output_model
[2025-01-31 05:04:32,308] [INFO] [engine.py:338:run_accelerator] Save footprint to /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder/olive-cli-tmp-kzq62dpf/footprints.json.
[2025-01-31 05:04:32,308] [INFO] [engine.py:265:run] Run history for cpu-cpu:
[2025-01-31 05:04:32,309] [INFO] [engine.py:517:dump_run_history] run history:
+------------+-------------------+------------------+----------------+-----------+
| model_id | parent_model_id | from_pass | duration_sec | metrics |
+============+===================+==================+================+===========+
| 0861c2d8 | | | | |
+------------+-------------------+------------------+----------------+-----------+
| 2e34ed27 | 0861c2d8 | AutoAWQQuantizer | 450.32 | |
+------------+-------------------+------------------+----------------+-----------+
Command succeeded. Output model saved to /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder
Debug file copy start
Debug file copy end
Loaded previous command output of type hfmodel from /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/cap/data-capability/wd/pytorch_awq_model_folder
[2025-01-31 05:04:35,686] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2025-01-31 05:04:35,720] [INFO] [cache.py:138:__init__] Using cache directory: /mnt/azureml/cr/j/e137575ac9a141abb1b4357a07c1a204/exe/wd/.olive-cache/default_workflow
[2025-01-31 05:04:35,725] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2025-01-31 05:04:35,727] [INFO] [engine.py:246:run] Running Olive on accelerator: cpu-cpu
[2025-01-31 05:04:35,727] [INFO] [engine.py:888:_create_system] Creating target system ...
[2025-01-31 05:04:35,727] [INFO] [engine.py:891:_create_system] Target system created in 0.000088 seconds
[2025-01-31 05:04:35,727] [INFO] [engine.py:902:_create_system] Creating host system ...
[2025-01-31 05:04:35,727] [INFO] [engine.py:905:_create_system] Host system created in 0.000144 seconds
[2025-01-31 05:04:37,329] [INFO] [engine.py:709:_run_pass] Running pass conversion:OnnxConversion {}
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Traceback (most recent call last):
File "/opt/conda/envs/ptca/bin/olive", line 8, in <module>
sys.exit(main())
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/cli/launcher.py", line 62, in main
service.run()
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/cli/auto_opt.py", line 183, in run
olive_run(run_config)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/workflows/run/run.py", line 317, in run
return run_engine(package_config, run_config)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/workflows/run/run.py", line 259, in run_engine
engine.run(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 252, in run
run_result = self.run_accelerator(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 330, in run_accelerator
output_footprint = self.run_no_search(input_model_config, input_model_id, accelerator_spec, output_dir)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 400, in run_no_search
should_prune, signal, model_ids = self._run_passes(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 664, in _run_passes
model_config, model_id = self._run_pass(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/engine/engine.py", line 764, in _run_pass
output_model_config = host.run_pass(p, input_model_config, output_model_path, pass_search_point)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/systems/local.py", line 30, in run_pass
output_model = the_pass.run(model, output_model_path, point)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/olive_pass.py", line 245, in run
output_model = self._run_for_config(model, config, output_model_path)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 114, in _run_for_config
output_model = self._run_for_config_internal(model, config, output_model_path)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 151, in _run_for_config_internal
return self._convert_model_on_device(model, config, output_model_path, device, torch_dtype)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 375, in _convert_model_on_device
converted_onnx_model = OnnxConversion._export_pytorch_model(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/olive/passes/onnx/conversion.py", line 252, in _export_pytorch_model
torch.onnx.export(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 516, in export
_export(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 1612, in _export
graph, params_dict, torch_out = _model_to_graph(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 1134, in _model_to_graph
graph, params, torch_out, module = _create_jit_graph(model, args)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 1010, in _create_jit_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/onnx/utils.py", line 914, in _trace_and_get_graph_from_model
trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/jit/_trace.py", line 1310, in _get_trace_graph
outs = ONNXTracedModule(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/jit/_trace.py", line 138, in forward
graph, out = torch._C._create_graph_by_tracing(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/jit/_trace.py", line 129, in wrapper
outs.append(self.inner(*trace_inputs))
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/models/phi3/modeling_phi3.py", line 1247, in forward
outputs = self.model(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1522, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/models/phi3/modeling_phi3.py", line 1017, in forward
past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
AttributeError: 'list' object has no attribute 'get_seq_length'
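For context (my reading of the traceback, not a confirmed diagnosis): `modeling_phi3.py` in transformers 4.44.2 assumes `past_key_values` is a `Cache` object and calls `get_seq_length()` on it, while the `OnnxConversion` pass traces the model with the legacy KV-cache format, a plain Python list of `(key, value)` tensor pairs, hence the `AttributeError`. A minimal sketch of the mismatch (shapes are illustrative for Phi-3.5-mini: 32 heads, head dim 96):

```python
import torch
from transformers.cache_utils import DynamicCache

# Legacy KV-cache format: one (key, value) pair per layer. This is what
# the tracer feeds the model's forward() as `past_key_values`.
legacy_cache = [(torch.zeros(1, 32, 8, 96), torch.zeros(1, 32, 8, 96))]

# modeling_phi3.py:1017 effectively does the following, which fails:
#   past_seen_tokens = legacy_cache.get_seq_length()
#   -> AttributeError: 'list' object has no attribute 'get_seq_length'

# A Cache object does support the call; the legacy list can be wrapped:
cache = DynamicCache.from_legacy_cache(legacy_cache)
print(cache.get_seq_length())  # 8
```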
To Reproduce
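The exact commands aren't captured in the log above; a sketch of the two CLI steps (paths are placeholders and flags are assumptions — the AWQ algorithm matches the `AutoAWQQuantizer` pass and the `cpu-cpu` accelerator in the log):

```bash
# 1. AWQ-quantize the fine-tuned PyTorch model (hypothetical paths/flags)
olive quantize \
  --model_name_or_path <path/to/finetuned-Phi-3.5-mini-instruct> \
  --algorithm awq \
  --output_path <pytorch_awq_model_folder>

# 2. Convert the quantized model to ONNX and optimize it
olive auto-opt \
  --model_name_or_path <pytorch_awq_model_folder> \
  --device cpu \
  --provider CPUExecutionProvider \
  --output_path <onnx_model_folder>
```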
Expected behavior
Be able to export a fine-tuned Phi-3.5-mini-instruct model successfully.
Olive config
No standalone config; the workflow was driven by the Olive CLI (`olive/cli/auto_opt.py` appears in the traceback).
Other information
Olive: 0.7.1.1
onnxruntime-genai-cuda: 0.5.0
Transformers: 4.44.2