
Warnings When Converting PTH Model to ONNX and slow TensorRT inference #153

mirza298 opened this issue Jan 27, 2025 · 3 comments

mirza298 commented Jan 27, 2025

Hello,

I'm on a mobile RTX 4060; the installed versions are TensorRT 10.4.0, onnx 1.17.0, onnxruntime 1.20.1, onnxsim 0.4.36, and torch 2.5.1.

I have fine-tuned the D‑FINE‑S Objects365+COCO model with the following configuration file: configs/dfine/custom/objects365/dfine_hgnetv2_s_obj2custom.yml; inside dataloader.yml, only the batch size was changed (to 8).

When converting the PyTorch (.pth) model to ONNX, I encountered the following warnings in the output:

python tools/deployment/export_onnx.py -c configs/dfine/custom/objects365/dfine_hgnetv2_s_obj2custom.yml -r best_stg2.pth
/workspace/D-FINE/tools/deployment/export_onnx.py:28: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(args.resume, map_location='cpu')
/workspace/D-FINE/tools/deployment/../../src/zoo/dfine/dfine_decoder.py:642: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if memory.shape[0] > 1:
/workspace/D-FINE/tools/deployment/../../src/zoo/dfine/dfine_decoder.py:129: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if reference_points.shape[-1] == 2:
/workspace/D-FINE/tools/deployment/../../src/zoo/dfine/dfine_decoder.py:133: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
elif reference_points.shape[-1] == 4:
/usr/local/lib/python3.10/dist-packages/torch/onnx/_internal/jit_utils.py:308: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ../torch/csrc/jit/passes/onnx/constant_fold.cpp:178.)
_C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)
/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py:663: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ../torch/csrc/jit/passes/onnx/constant_fold.cpp:178.)
_C._jit_pass_onnx_graph_shape_type_inference(
/usr/local/lib/python3.10/dist-packages/torch/onnx/utils.py:1186: UserWarning: Constant folding - Only steps=1 can be constant folded for opset >= 10 onnx::Slice op. Constant folding not applied. (Triggered internally at ../torch/csrc/jit/passes/onnx/constant_fold.cpp:178.)
_C._jit_pass_onnx_graph_shape_type_inference(
Check export onnx model done...
Simplify onnx model True...

The model converts successfully in the end, but I’m concerned about the warnings. Do these warnings have a significant impact on the performance or behavior of the model?
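For what it's worth, the torch.load FutureWarning concerns the checkpoint-loading API rather than the exported graph. A minimal sketch of the fix, assuming the same checkpoint file as above (not the repository's code):

import torch

# Hedged sketch: weights_only=True opts in to the safer loader announced by
# the FutureWarning. If the checkpoint stores objects beyond plain tensors
# and state dicts, they must first be allowlisted via
# torch.serialization.add_safe_globals.
checkpoint = torch.load("best_stg2.pth", map_location="cpu", weights_only=True)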

Running the ONNX model on a sample image results in a latency of around 60 ms. After converting the ONNX model to TensorRT, I get a latency of 70–80 ms with both FP16 and FP32 precision. I used the repository code for conversion and inference but changed opset_version to 17. Using opset_version=17 improved conversion and FP16 TensorRT accuracy; with opset_version=16, the model accuracy was significantly degraded. So besides the warnings in the .pth-to-ONNX conversion, what could be the reason for the slow TRT inference? To convert the ONNX model to TRT I used this command: trtexec --onnx="best_stg2.onnx" --saveEngine="model.engine" --fp16; the output is:

trtexec --onnx=best_stg2.onnx --saveEngine=model.engine --fp16
[01/27/2025-12:10:09] [I] === Model Options ===
[01/27/2025-12:10:09] [I] Format: ONNX
[01/27/2025-12:10:09] [I] Model: best_stg2.onnx
[01/27/2025-12:10:09] [I] Output:
[01/27/2025-12:10:09] [I] === Build Options ===
[01/27/2025-12:10:09] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[01/27/2025-12:10:09] [I] avgTiming: 8
[01/27/2025-12:10:09] [I] Precision: FP32+FP16
[01/27/2025-12:10:09] [I] LayerPrecisions:
[01/27/2025-12:10:09] [I] Layer Device Types:
[01/27/2025-12:10:09] [I] Calibration:
[01/27/2025-12:10:09] [I] Refit: Disabled
[01/27/2025-12:10:09] [I] Strip weights: Disabled
[01/27/2025-12:10:09] [I] Version Compatible: Disabled
[01/27/2025-12:10:09] [I] ONNX Plugin InstanceNorm: Disabled
[01/27/2025-12:10:09] [I] TensorRT runtime: full
[01/27/2025-12:10:09] [I] Lean DLL Path:
[01/27/2025-12:10:09] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[01/27/2025-12:10:09] [I] Exclude Lean Runtime: Disabled
[01/27/2025-12:10:09] [I] Sparsity: Disabled
[01/27/2025-12:10:09] [I] Safe mode: Disabled
[01/27/2025-12:10:09] [I] Build DLA standalone loadable: Disabled
[01/27/2025-12:10:09] [I] Allow GPU fallback for DLA: Disabled
[01/27/2025-12:10:09] [I] DirectIO mode: Disabled
[01/27/2025-12:10:09] [I] Restricted mode: Disabled
[01/27/2025-12:10:09] [I] Skip inference: Disabled
[01/27/2025-12:10:09] [I] Save engine: model.engine
[01/27/2025-12:10:09] [I] Load engine:
[01/27/2025-12:10:09] [I] Profiling verbosity: 0
[01/27/2025-12:10:09] [I] Tactic sources: Using default tactic sources
[01/27/2025-12:10:09] [I] timingCacheMode: local
[01/27/2025-12:10:09] [I] timingCacheFile:
[01/27/2025-12:10:09] [I] Enable Compilation Cache: Enabled
[01/27/2025-12:10:09] [I] errorOnTimingCacheMiss: Disabled
[01/27/2025-12:10:09] [I] Preview Features: Use default preview flags.
[01/27/2025-12:10:09] [I] MaxAuxStreams: -1
[01/27/2025-12:10:09] [I] BuilderOptimizationLevel: -1
[01/27/2025-12:10:09] [I] MaxTactics: -1
[01/27/2025-12:10:09] [I] Calibration Profile Index: 0
[01/27/2025-12:10:09] [I] Weight Streaming: Disabled
[01/27/2025-12:10:09] [I] Runtime Platform: Same As Build
[01/27/2025-12:10:09] [I] Debug Tensors:
[01/27/2025-12:10:09] [I] Input(s)s format: fp32:CHW
[01/27/2025-12:10:09] [I] Output(s)s format: fp32:CHW
[01/27/2025-12:10:09] [I] Input build shapes: model
[01/27/2025-12:10:09] [I] Input calibration shapes: model
[01/27/2025-12:10:09] [I] === System Options ===
[01/27/2025-12:10:09] [I] Device: 0
[01/27/2025-12:10:09] [I] DLACore:
[01/27/2025-12:10:09] [I] Plugins:
[01/27/2025-12:10:09] [I] setPluginsToSerialize:
[01/27/2025-12:10:09] [I] dynamicPlugins:
[01/27/2025-12:10:09] [I] ignoreParsedPluginLibs: 0
[01/27/2025-12:10:09] [I]
[01/27/2025-12:10:09] [I] === Inference Options ===
[01/27/2025-12:10:09] [I] Batch: Explicit
[01/27/2025-12:10:09] [I] Input inference shapes: model
[01/27/2025-12:10:09] [I] Iterations: 10
[01/27/2025-12:10:09] [I] Duration: 3s (+ 200ms warm up)
[01/27/2025-12:10:09] [I] Sleep time: 0ms
[01/27/2025-12:10:09] [I] Idle time: 0ms
[01/27/2025-12:10:09] [I] Inference Streams: 1
[01/27/2025-12:10:09] [I] ExposeDMA: Disabled
[01/27/2025-12:10:09] [I] Data transfers: Enabled
[01/27/2025-12:10:09] [I] Spin-wait: Disabled
[01/27/2025-12:10:09] [I] Multithreading: Disabled
[01/27/2025-12:10:09] [I] CUDA Graph: Disabled
[01/27/2025-12:10:09] [I] Separate profiling: Disabled
[01/27/2025-12:10:09] [I] Time Deserialize: Disabled
[01/27/2025-12:10:09] [I] Time Refit: Disabled
[01/27/2025-12:10:09] [I] NVTX verbosity: 0
[01/27/2025-12:10:09] [I] Persistent Cache Ratio: 0
[01/27/2025-12:10:09] [I] Optimization Profile Index: 0
[01/27/2025-12:10:09] [I] Weight Streaming Budget: 100.000000%
[01/27/2025-12:10:09] [I] Inputs:
[01/27/2025-12:10:09] [I] Debug Tensor Save Destinations:
[01/27/2025-12:10:09] [I] === Reporting Options ===
[01/27/2025-12:10:09] [I] Verbose: Disabled
[01/27/2025-12:10:09] [I] Averages: 10 inferences
[01/27/2025-12:10:09] [I] Percentiles: 90,95,99
[01/27/2025-12:10:09] [I] Dump refittable layers:Disabled
[01/27/2025-12:10:09] [I] Dump output: Disabled
[01/27/2025-12:10:09] [I] Profile: Disabled
[01/27/2025-12:10:09] [I] Export timing to JSON file:
[01/27/2025-12:10:09] [I] Export output to JSON file:
[01/27/2025-12:10:09] [I] Export profile to JSON file:
[01/27/2025-12:10:09] [I]
[01/27/2025-12:10:09] [I] === Device Information ===
[01/27/2025-12:10:09] [I] Available Devices:
[01/27/2025-12:10:09] [I] Device 0: "NVIDIA GeForce RTX 4060 Laptop GPU" UUID: GPU-2bafcf9a-afff-446a-5971-587fb11aef33
[01/27/2025-12:10:09] [I] Selected Device: NVIDIA GeForce RTX 4060 Laptop GPU
[01/27/2025-12:10:09] [I] Selected Device ID: 0
[01/27/2025-12:10:09] [I] Selected Device UUID: GPU-2bafcf9a-afff-446a-5971-587fb11aef33
[01/27/2025-12:10:09] [I] Compute Capability: 8.9
[01/27/2025-12:10:09] [I] SMs: 24
[01/27/2025-12:10:09] [I] Device Global Memory: 7940 MiB
[01/27/2025-12:10:09] [I] Shared Memory per SM: 100 KiB
[01/27/2025-12:10:09] [I] Memory Bus Width: 128 bits (ECC disabled)
[01/27/2025-12:10:09] [I] Application Compute Clock Rate: 1.89 GHz
[01/27/2025-12:10:09] [I] Application Memory Clock Rate: 8.001 GHz
[01/27/2025-12:10:09] [I]
[01/27/2025-12:10:09] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[01/27/2025-12:10:09] [I]
[01/27/2025-12:10:09] [I] TensorRT version: 10.4.0
[01/27/2025-12:10:09] [I] Loading standard plugins
[01/27/2025-12:10:09] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 16, GPU 184 (MiB)
[01/27/2025-12:10:11] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +2220, GPU +428, now: CPU 2391, GPU 612 (MiB)
[01/27/2025-12:10:11] [I] Start parsing network model.
[01/27/2025-12:10:11] [I] [TRT] ----------------------------------------------------------------
[01/27/2025-12:10:11] [I] [TRT] Input filename: best_stg2.onnx
[01/27/2025-12:10:11] [I] [TRT] ONNX IR version: 0.0.8
[01/27/2025-12:10:11] [I] [TRT] Opset version: 17
[01/27/2025-12:10:11] [I] [TRT] Producer name: pytorch
[01/27/2025-12:10:11] [I] [TRT] Producer version: 2.5.1
[01/27/2025-12:10:11] [I] [TRT] Domain:
[01/27/2025-12:10:11] [I] [TRT] Model version: 0
[01/27/2025-12:10:11] [I] [TRT] Doc string:
[01/27/2025-12:10:11] [I] [TRT] ----------------------------------------------------------------
[01/27/2025-12:10:11] [W] [TRT] ModelImporter.cpp:420: Make sure input orig_target_sizes has Int64 binding.
[01/27/2025-12:10:11] [W] [TRT] ModelImporter.cpp:797: Make sure output labels has Int64 binding.
[01/27/2025-12:10:11] [I] Finished parsing network model. Parse time: 0.0462787
[01/27/2025-12:10:11] [W] Dynamic dimensions required for input: images, but no shapes were provided. Automatically overriding shape to: 1x3x640x640
[01/27/2025-12:10:11] [I] Set shape of input tensor images for optimization profile 0 to: MIN=1x3x640x640 OPT=1x3x640x640 MAX=1x3x640x640
[01/27/2025-12:10:11] [W] Dynamic dimensions required for input: orig_target_sizes, but no shapes were provided. Automatically overriding shape to: 1x2
[01/27/2025-12:10:11] [I] Set shape of input tensor orig_target_sizes for optimization profile 0 to: MIN=1x2 OPT=1x2 MAX=1x2
[01/27/2025-12:10:11] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[01/27/2025-12:11:14] [I] [TRT] Compiler backend is used during engine build.
[01/27/2025-12:13:23] [I] [TRT] Detected 2 inputs and 3 output network tensors.
[01/27/2025-12:13:24] [I] [TRT] Total Host Persistent Memory: 537936 bytes
[01/27/2025-12:13:24] [I] [TRT] Total Device Persistent Memory: 0 bytes
[01/27/2025-12:13:24] [I] [TRT] Max Scratch Memory: 16179200 bytes
[01/27/2025-12:13:24] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 135 steps to complete.
[01/27/2025-12:13:24] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 4.75194ms to assign 11 blocks to 135 nodes requiring 29497344 bytes.
[01/27/2025-12:13:24] [I] [TRT] Total Activation Memory: 29497344 bytes
[01/27/2025-12:13:25] [I] [TRT] Total Weights Memory: 20680832 bytes
[01/27/2025-12:13:25] [I] [TRT] Compiler backend is used during engine execution.
[01/27/2025-12:13:25] [I] [TRT] Engine generation completed in 193.831 seconds.
[01/27/2025-12:13:25] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 9 MiB, GPU 64 MiB
[01/27/2025-12:13:25] [I] [TRT] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 4810 MiB
[01/27/2025-12:13:25] [I] Engine built in 193.902 sec.
[01/27/2025-12:13:25] [I] Created engine with size: 25.7951 MiB
[01/27/2025-12:13:25] [I] [TRT] Loaded engine size: 25 MiB
[01/27/2025-12:13:25] [I] Engine deserialized in 0.0265883 sec.
[01/27/2025-12:13:25] [I] [TRT] [MS] Running engine with multi stream info
[01/27/2025-12:13:25] [I] [TRT] [MS] Number of aux streams is 2
[01/27/2025-12:13:25] [I] [TRT] [MS] Number of total worker streams is 3
[01/27/2025-12:13:25] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[01/27/2025-12:13:25] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +28, now: CPU 0, GPU 47 (MiB)
[01/27/2025-12:13:25] [I] Setting persistentCacheLimit to 0 bytes.
[01/27/2025-12:13:25] [I] Created execution context with device memory size: 28.1309 MiB
[01/27/2025-12:13:25] [I] Using random values for input images
[01/27/2025-12:13:25] [I] Input binding for images with dimensions 1x3x640x640 is created.
[01/27/2025-12:13:25] [I] Using random values for input orig_target_sizes
[01/27/2025-12:13:25] [I] Input binding for orig_target_sizes with dimensions 1x2 is created.
[01/27/2025-12:13:25] [I] Output binding for labels with dimensions 1x300 is created.
[01/27/2025-12:13:25] [I] Output binding for boxes with dimensions 1x300x4 is created.
[01/27/2025-12:13:25] [I] Output binding for scores with dimensions 1x300 is created.
[01/27/2025-12:13:25] [I] Starting inference
[01/27/2025-12:13:28] [I] Warmup completed 84 queries over 200 ms
[01/27/2025-12:13:28] [I] Timing trace has 1223 queries over 3.00785 s
[01/27/2025-12:13:28] [I]
[01/27/2025-12:13:28] [I] === Trace details ===
[01/27/2025-12:13:28] [I] Trace averages of 10 runs:
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39852 ms - Host latency: 2.84209 ms (enqueue 1.0648 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.40742 ms - Host latency: 2.84457 ms (enqueue 0.966858 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.43026 ms - Host latency: 2.86991 ms (enqueue 1.08384 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.38275 ms - Host latency: 2.81925 ms (enqueue 0.985062 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39626 ms - Host latency: 2.83794 ms (enqueue 1.01142 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.3935 ms - Host latency: 2.83042 ms (enqueue 1.01015 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39574 ms - Host latency: 2.83244 ms (enqueue 1.07661 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.42708 ms - Host latency: 2.86924 ms (enqueue 1.09278 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.38858 ms - Host latency: 2.83555 ms (enqueue 1.16845 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.40036 ms - Host latency: 2.84028 ms (enqueue 1.11482 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.42627 ms - Host latency: 2.86852 ms (enqueue 1.08424 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.38561 ms - Host latency: 2.82693 ms (enqueue 1.08053 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39257 ms - Host latency: 2.83228 ms (enqueue 1.05145 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39534 ms - Host latency: 2.83307 ms (enqueue 1.08961 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.40958 ms - Host latency: 2.8475 ms (enqueue 1.071 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39913 ms - Host latency: 2.83663 ms (enqueue 1.0578 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.40024 ms - Host latency: 2.83559 ms (enqueue 0.989362 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.41581 ms - Host latency: 2.85283 ms (enqueue 1.04617 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39615 ms - Host latency: 2.83293 ms (enqueue 1.06779 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39871 ms - Host latency: 2.83483 ms (enqueue 1.0278 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.41827 ms - Host latency: 2.85928 ms (enqueue 1.07579 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.40127 ms - Host latency: 2.83878 ms (enqueue 1.03175 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.40997 ms - Host latency: 2.84782 ms (enqueue 1.04777 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.40651 ms - Host latency: 2.84379 ms (enqueue 1.08071 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.38635 ms - Host latency: 2.83957 ms (enqueue 1.0084 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39913 ms - Host latency: 2.83853 ms (enqueue 1.04483 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.40711 ms - Host latency: 2.87655 ms (enqueue 1.58329 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45914 ms - Host latency: 2.90551 ms (enqueue 1.21445 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.4707 ms - Host latency: 2.91226 ms (enqueue 1.07881 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.4783 ms - Host latency: 2.91619 ms (enqueue 1.05862 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.46547 ms - Host latency: 2.92039 ms (enqueue 1.30129 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45986 ms - Host latency: 2.92138 ms (enqueue 1.54501 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.47256 ms - Host latency: 2.93599 ms (enqueue 1.46199 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.50665 ms - Host latency: 2.9473 ms (enqueue 1.1169 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.49322 ms - Host latency: 2.93321 ms (enqueue 1.01987 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.49949 ms - Host latency: 2.93723 ms (enqueue 1.07106 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.52159 ms - Host latency: 2.96528 ms (enqueue 1.10598 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.5345 ms - Host latency: 2.98315 ms (enqueue 1.20037 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.517 ms - Host latency: 2.97622 ms (enqueue 1.12841 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.5259 ms - Host latency: 2.97935 ms (enqueue 1.0635 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.44572 ms - Host latency: 2.91278 ms (enqueue 1.04468 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45557 ms - Host latency: 2.92477 ms (enqueue 1.33063 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.43414 ms - Host latency: 2.89436 ms (enqueue 1.46586 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.46837 ms - Host latency: 2.91868 ms (enqueue 1.26537 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.42257 ms - Host latency: 2.85909 ms (enqueue 1.02791 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.4574 ms - Host latency: 2.89585 ms (enqueue 1.04021 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.43467 ms - Host latency: 2.88516 ms (enqueue 1.16014 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.46071 ms - Host latency: 2.91073 ms (enqueue 1.04517 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.54589 ms - Host latency: 2.99913 ms (enqueue 1.25188 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.47831 ms - Host latency: 2.94247 ms (enqueue 1.45754 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39749 ms - Host latency: 2.85688 ms (enqueue 1.43905 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39841 ms - Host latency: 2.83337 ms (enqueue 0.966309 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.40446 ms - Host latency: 2.84076 ms (enqueue 1.00397 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39534 ms - Host latency: 2.83424 ms (enqueue 0.976282 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.42656 ms - Host latency: 2.88403 ms (enqueue 1.35331 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.43351 ms - Host latency: 2.8938 ms (enqueue 1.27959 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.43118 ms - Host latency: 2.90161 ms (enqueue 1.22269 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.53274 ms - Host latency: 2.99763 ms (enqueue 1.44738 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.43221 ms - Host latency: 2.88912 ms (enqueue 1.3955 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.4438 ms - Host latency: 2.92119 ms (enqueue 1.58049 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.50666 ms - Host latency: 2.99526 ms (enqueue 1.38043 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.48199 ms - Host latency: 2.94413 ms (enqueue 1.15358 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.49241 ms - Host latency: 2.9631 ms (enqueue 1.58268 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.5436 ms - Host latency: 3.00521 ms (enqueue 1.44675 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.46536 ms - Host latency: 2.94985 ms (enqueue 1.61121 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.53501 ms - Host latency: 3.0243 ms (enqueue 1.64032 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.53431 ms - Host latency: 2.99885 ms (enqueue 1.47567 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45299 ms - Host latency: 2.89542 ms (enqueue 1.15087 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45841 ms - Host latency: 2.90017 ms (enqueue 1.14436 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45443 ms - Host latency: 2.89382 ms (enqueue 1.04814 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45957 ms - Host latency: 2.89821 ms (enqueue 1.09092 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45956 ms - Host latency: 2.90031 ms (enqueue 1.12499 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.47478 ms - Host latency: 2.92344 ms (enqueue 1.1077 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.54667 ms - Host latency: 2.99797 ms (enqueue 1.15862 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.56058 ms - Host latency: 3.02078 ms (enqueue 1.09144 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.54299 ms - Host latency: 3.0079 ms (enqueue 1.52184 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.53242 ms - Host latency: 3.00073 ms (enqueue 1.95933 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.5394 ms - Host latency: 2.998 ms (enqueue 1.38574 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.52205 ms - Host latency: 2.9864 ms (enqueue 1.15305 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.53791 ms - Host latency: 2.99973 ms (enqueue 1.73311 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.51033 ms - Host latency: 2.9679 ms (enqueue 1.33455 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.43577 ms - Host latency: 2.90293 ms (enqueue 1.56077 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.47031 ms - Host latency: 2.9113 ms (enqueue 1.12178 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.4241 ms - Host latency: 2.87881 ms (enqueue 1.07261 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45388 ms - Host latency: 2.89707 ms (enqueue 1.15454 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.4376 ms - Host latency: 2.88997 ms (enqueue 1.15464 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.44521 ms - Host latency: 2.88582 ms (enqueue 1.09399 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.43013 ms - Host latency: 2.86858 ms (enqueue 1.08057 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.42517 ms - Host latency: 2.87947 ms (enqueue 1.151 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.50789 ms - Host latency: 2.95813 ms (enqueue 1.05593 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.51184 ms - Host latency: 2.96042 ms (enqueue 0.966772 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.54517 ms - Host latency: 2.99482 ms (enqueue 0.950171 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.46548 ms - Host latency: 2.92188 ms (enqueue 1.15049 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.50234 ms - Host latency: 2.96084 ms (enqueue 1.17869 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.47703 ms - Host latency: 2.92698 ms (enqueue 1.08486 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.52351 ms - Host latency: 2.96289 ms (enqueue 1.1031 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.4832 ms - Host latency: 2.92888 ms (enqueue 1.14043 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.52422 ms - Host latency: 2.96179 ms (enqueue 1.078 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.49028 ms - Host latency: 2.95364 ms (enqueue 1.16362 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.46348 ms - Host latency: 2.90212 ms (enqueue 1.1689 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.38997 ms - Host latency: 2.84583 ms (enqueue 1.0521 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.4074 ms - Host latency: 2.852 ms (enqueue 1.07219 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.40476 ms - Host latency: 2.84126 ms (enqueue 1.03877 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45146 ms - Host latency: 2.89766 ms (enqueue 1.08599 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.4428 ms - Host latency: 2.88433 ms (enqueue 1.11426 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.43555 ms - Host latency: 2.87715 ms (enqueue 1.10444 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.44993 ms - Host latency: 2.88916 ms (enqueue 1.0842 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.4501 ms - Host latency: 2.88989 ms (enqueue 1.06694 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.451 ms - Host latency: 2.89333 ms (enqueue 1.15081 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.46697 ms - Host latency: 2.90608 ms (enqueue 1.11941 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.4731 ms - Host latency: 2.91565 ms (enqueue 1.11868 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.46716 ms - Host latency: 2.90879 ms (enqueue 1.08152 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.47622 ms - Host latency: 2.91526 ms (enqueue 1.05051 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.47371 ms - Host latency: 2.91438 ms (enqueue 1.08623 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45842 ms - Host latency: 2.89189 ms (enqueue 1.02375 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45869 ms - Host latency: 2.90232 ms (enqueue 1.09861 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.45388 ms - Host latency: 2.90442 ms (enqueue 1.0698 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.44053 ms - Host latency: 2.88828 ms (enqueue 1.12231 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.44163 ms - Host latency: 2.88831 ms (enqueue 1.10674 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.44106 ms - Host latency: 2.89414 ms (enqueue 1.11782 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.44404 ms - Host latency: 2.89646 ms (enqueue 1.13303 ms)
[01/27/2025-12:13:28] [I] Average on 10 runs - GPU latency: 2.39832 ms - Host latency: 2.8406 ms (enqueue 1.09514 ms)
[01/27/2025-12:13:28] [I]
[01/27/2025-12:13:28] [I] === Performance summary ===
[01/27/2025-12:13:28] [I] Throughput: 406.603 qps
[01/27/2025-12:13:28] [I] Latency: min = 2.76614 ms, max = 3.34424 ms, mean = 2.90436 ms, median = 2.86865 ms, percentile(90%) = 3.06628 ms, percentile(95%) = 3.12378 ms, percentile(99%) = 3.26904 ms
[01/27/2025-12:13:28] [I] Enqueue Time: min = 0.578979 ms, max = 2.71729 ms, mean = 1.1707 ms, median = 1.10303 ms, percentile(90%) = 1.53485 ms, percentile(95%) = 1.67151 ms, percentile(99%) = 2.08105 ms
[01/27/2025-12:13:28] [I] H2D Latency: min = 0.405273 ms, max = 0.628296 ms, mean = 0.441148 ms, median = 0.434814 ms, percentile(90%) = 0.462158 ms, percentile(95%) = 0.475708 ms, percentile(99%) = 0.517334 ms
[01/27/2025-12:13:28] [I] GPU Compute Time: min = 2.31937 ms, max = 2.89893 ms, mean = 2.45541 ms, median = 2.42188 ms, percentile(90%) = 2.61023 ms, percentile(95%) = 2.66455 ms, percentile(99%) = 2.80688 ms
[01/27/2025-12:13:28] [I] D2H Latency: min = 0.00488281 ms, max = 0.123291 ms, mean = 0.00779505 ms, median = 0.00732422 ms, percentile(90%) = 0.00854492 ms, percentile(95%) = 0.00915527 ms, percentile(99%) = 0.0319824 ms
[01/27/2025-12:13:28] [I] Total Host Walltime: 3.00785 s
[01/27/2025-12:13:28] [I] Total GPU Compute Time: 3.00297 s
[01/27/2025-12:13:28] [W] * GPU compute time is unstable, with coefficient of variance = 4.06414%.
[01/27/2025-12:13:28] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[01/27/2025-12:13:28] [I] Explanations of the performance metrics are printed in the verbose logs.
[01/27/2025-12:13:28] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100400] [b26] # trtexec --onnx=best_stg2.onnx --saveEngine=model.engine --fp16

@KozlovKY

Running the ONNX model on a sample image results in a latency of around 60 ms. After converting the ONNX model to TensorRT, I get a latency of 70–80 ms with both FP16 and FP32 precision. I used the repository code for conversion and inference but changed opset_version to 17. Using opset_version=17 improved conversion and FP16 TensorRT accuracy; with opset_version=16, the model accuracy was significantly degraded. So besides the warnings in the .pth-to-ONNX conversion, what could be the reason for the slow TRT inference?

Yes, it could be because LayerNorm is only supported natively with opset >= 17; I also got a warning when converting with a lower opset.
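For context, the change being discussed is the opset passed to torch.onnx.export in tools/deployment/export_onnx.py. A hedged sketch (variable names are placeholders; input and output names follow the trtexec log in this thread):

import torch

# Hedged sketch: ONNX added a native LayerNormalization op in opset 17; at
# lower opsets the export decomposes LayerNorm into primitive ops, which can
# hurt FP16 accuracy in TensorRT. 'model', 'images' and 'orig_target_sizes'
# are placeholders for the deploy-mode model and its dummy inputs.
torch.onnx.export(
    model,
    (images, orig_target_sizes),
    "best_stg2.onnx",
    input_names=["images", "orig_target_sizes"],
    output_names=["labels", "boxes", "scores"],
    opset_version=17,  # 16 caused the accuracy degradation described above
)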

mirza298 commented Jan 27, 2025

@KozlovKY
Thank you for your response.

I'm still struggling to reach the reported 2–3 ms latency for the small model with TensorRT. My TensorRT inference time remains around 80 ms.

I've found code from a related issue on this repo that converts the model to ONNX without the previous warnings. However, the resulting TensorRT model still doesn't provide the desired inference speed. Did you manage to get TensorRT inference working properly?
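For comparison, the trtexec summary above reports a mean GPU compute time of about 2.5 ms, so one thing worth ruling out is measurement methodology. A minimal ONNX Runtime timing sketch that excludes session creation and warm-up (shapes and dtypes follow the bindings in the trtexec log; the Int64 input is required per the parser warning):

import time
import numpy as np
import onnxruntime as ort

# Hedged sketch: measure steady-state latency only. The first runs pay for
# CUDA context setup and kernel autotuning and should not be timed.
sess = ort.InferenceSession("best_stg2.onnx", providers=["CUDAExecutionProvider"])
feeds = {
    "images": np.random.rand(1, 3, 640, 640).astype(np.float32),
    "orig_target_sizes": np.array([[640, 640]], dtype=np.int64),
}
for _ in range(10):  # warm-up, not timed
    sess.run(None, feeds)
n = 100
t0 = time.perf_counter()
for _ in range(n):
    sess.run(None, feeds)
print(f"mean latency: {(time.perf_counter() - t0) / n * 1e3:.2f} ms")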

KozlovKY commented Jan 28, 2025

@KozlovKY Thank you for your response.

I'm still struggling to reach the reported 2–3 ms latency for the small model with TensorRT. My TensorRT inference time remains around 80 ms.

I've found code from a related issue on this repo that converts the model to ONNX without the previous warnings. However, the resulting TensorRT model still doesn't provide the desired inference speed. Did you manage to get TensorRT inference working properly?

I've only checked inference speed, so there may be a problem with the outputs. You can check my code for converting the ONNX model to a TRT engine:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def convert_onnx_to_trt(onnx_path, engine_path, batch_size=1, precision="fp16"):
    # Create the builder and an explicit-batch network definition
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

    # Parse the ONNX model and surface any parser errors
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, 'rb') as model:
        if not parser.parse(model.read()):
            for idx in range(parser.num_errors):
                print(parser.get_error(idx))
            raise RuntimeError(f"Failed to parse ONNX model: {onnx_path}")

    # Optimization profile for the two dynamic inputs (exported at 960x960 here)
    profile = builder.create_optimization_profile()
    profile.set_shape(
        "images",
        min=(1, 3, 960, 960),
        opt=(batch_size, 3, 960, 960),
        max=(batch_size, 3, 960, 960)
    )
    profile.set_shape(
        "orig_target_sizes",
        min=(1, 2),
        opt=(batch_size, 2),
        max=(batch_size, 2)
    )

    # Configure the builder (the ONNX model was exported with opset 17)
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
    config.add_optimization_profile(profile)
    if precision == "fp16":
        config.set_flag(trt.BuilderFlag.FP16)
        # Prefer (rather than strictly obey) per-layer precision constraints,
        # then pin Normalization layers to FP32 to protect LayerNorm accuracy
        config.clear_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
        config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
        for layer_idx in range(network.num_layers):
            layer = network.get_layer(layer_idx)
            if layer.type == trt.LayerType.NORMALIZATION:
                layer.precision = trt.float32
                layer.set_output_type(0, trt.float32)

    # Build the serialized engine and save it to disk
    engine = builder.build_serialized_network(network, config)
    if engine is None:
        raise RuntimeError("Engine build failed")

    with open(engine_path, 'wb') as f:
        f.write(engine)
    print(f"Successfully converted model to TensorRT engine: {engine_path}")

if __name__ == "__main__":
    # Example usage
    convert_onnx_to_trt(
        onnx_path="path.onnx",
        engine_path="path.engine",
        batch_size=1,
        precision="fp16"
    )
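(The Normalization-layer loop is the main design choice here: it keeps LayerNorm in FP32 inside an otherwise FP16 engine, and PREFER_PRECISION_CONSTRAINTS tells the builder to honor those per-layer settings where it can, rather than failing the build as OBEY_PRECISION_CONSTRAINTS would.)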

I tested this with onnx==1.17.0 and tensorrt==10.1.0 (CUDA 12).
