Description
When deploying an ONNX model with the Triton Inference Server's ONNX Runtime backend, CPU inference is noticeably slower than running the same model directly through the ONNX Runtime Python API. The discrepancy is observed under identical conditions: the same hardware, model, and input data.
Triton Information
TRITON_VERSION <= 24.09
To Reproduce
model used:
Triton server (ONNX runtime)
config.pbtxt
Python clients
Triton client
results:
473 ms ± 87.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
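The Triton client snippet itself is not reproduced above; the following is a minimal sketch of how such a timed request might look with the tritonclient HTTP API. The model name my_onnx_model, the input name input, and the input shape are placeholders, not taken from the issue.

```python
import time

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder model/input names and shape -- adjust to the actual model.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# Time a single inference round trip to the server.
start = time.perf_counter()
response = client.infer(model_name="my_onnx_model", inputs=[infer_input])
elapsed = time.perf_counter() - start
print(f"Triton inference: {elapsed * 1000:.1f} ms")
```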
ONNX Runtime
results:
159 ms ± 23.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
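For comparison, a sketch of the equivalent direct ONNX Runtime measurement; the model path and input shape are again placeholders.

```python
import time

import numpy as np
import onnxruntime as ort

# Load the same ONNX model directly, restricted to the CPU provider.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_name = session.get_inputs()[0].name

# Time a single in-process inference call.
start = time.perf_counter()
outputs = session.run(None, {input_name: batch})
elapsed = time.perf_counter() - start
print(f"ONNX Runtime inference: {elapsed * 1000:.1f} ms")
```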
Comparing the performance of the two approaches with a single run is not really a fair comparison. Why not set up a benchmark with a larger number of samples and more than one client?
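One way to set up such a benchmark is to drive the server with several concurrent clients, each sending many requests, and aggregate the latencies. A sketch using the same placeholder model and input names is below; Triton's perf_analyzer tool can also generate this kind of load and report latency and throughput statistics.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient

URL = "localhost:8000"
MODEL = "my_onnx_model"   # placeholder model name
N_CLIENTS = 4             # number of concurrent clients
N_REQUESTS = 50           # requests per client


def one_client(_):
    # Each worker uses its own client connection.
    client = httpclient.InferenceServerClient(url=URL)
    batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
    latencies = []
    for _ in range(N_REQUESTS):
        infer_input = httpclient.InferInput("input", list(batch.shape), "FP32")
        infer_input.set_data_from_numpy(batch)
        start = time.perf_counter()
        client.infer(MODEL, inputs=[infer_input])
        latencies.append(time.perf_counter() - start)
    return latencies


with ThreadPoolExecutor(max_workers=N_CLIENTS) as pool:
    all_latencies = [t for lats in pool.map(one_client, range(N_CLIENTS)) for t in lats]

print(f"mean latency: {1000 * sum(all_latencies) / len(all_latencies):.1f} ms "
      f"over {len(all_latencies)} requests from {N_CLIENTS} clients")
```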