Segfault/Coredump in grpc::ModelInferHandler::InferResponseComplete #7877

andyblackheel · 2024-12-12T20:46:38Z

Description
We have ensemble ASR model, it gets crashed with segfault after 3-10 minutes of load with 15 generator threads at single L40s GPU.

Triton Information
We've tested vanilla r24.08, r24.10 and debug build of r24.10. We run it in docker container.

To Reproduce
Well, actually it is some sort of private ASR model setup, I'm not sure we can share it.

Model config is attached. Models inside ensemble models are mostly use onnxruntime, except preprocessing which uses python backend.

Expected behavior
Well, expected behavior is not to crash.

We've get several core dumps, so I've attached backtraces from it.

Also, I've seen some very promising commit several days ago: f5e4f69#diff-78246a41bf5fef27235af811675a07f2262046074eae8da3985e98ae68602065. Probably we should try build this version of code and try setting some of TRITONSERVER_DELAY_* variable?
config.txt
stacktrace_01.txt
stacktrace_02.txt
stacktrace_03.txt

The text was updated successfully, but these errors were encountered:

a-beloglazov-aiforia · 2024-12-23T01:08:41Z

The issue was reproduced on Triton Inference Server version r24.11.

zhuichao001 · 2025-01-13T07:01:09Z

Can you display the error stack of such coredumps?

a-beloglazov-aiforia · 2025-01-13T11:27:22Z

@zhuichao001 Are you interested in this error stack?

Signal (11) received.
 0# 0x000062831709552D in tritonserver
 1# 0x0000798D93B32520 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# pthread_mutex_lock in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# 0x00006283170F942D in tritonserver
 4# 0x0000798D925A8390 in /opt/tritonserver/bin/../lib/libtritonserver.so
 5# 0x0000798D925A84DA in /opt/tritonserver/bin/../lib/libtritonserver.so
 6# 0x0000798D925998DE in /opt/tritonserver/bin/../lib/libtritonserver.so
 7# 0x0000798D925390BA in /opt/tritonserver/bin/../lib/libtritonserver.so
 8# 0x0000798D9253D169 in /opt/tritonserver/bin/../lib/libtritonserver.so
 9# 0x0000798D9253F056 in /opt/tritonserver/bin/../lib/libtritonserver.so
10# 0x0000798D925A8390 in /opt/tritonserver/bin/../lib/libtritonserver.so
11# TRITONBACKEND_ResponseSend in /opt/tritonserver/bin/../lib/libtritonserver.so
12# 0x0000798D8677C168 in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
13# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
14# 0x0000798D925070B4 in /opt/tritonserver/bin/../lib/libtritonserver.so
15# 0x0000798D9250742B in /opt/tritonserver/bin/../lib/libtritonserver.so
16# 0x0000798D92625CCD in /opt/tritonserver/bin/../lib/libtritonserver.so
17# 0x0000798D9250B864 in /opt/tritonserver/bin/../lib/libtritonserver.so
18# 0x0000798D94239253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
19# 0x0000798D93B84AC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
20# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

zhuichao001 · 2025-01-16T06:27:25Z

It appears that the root cause is due to TRITONBACKEND_ResponseSend. Could we compile with the -g option or run the program under gdb to check if the arguments being passed at the time of the coredump are valid and effective?

rmccorm4 added grpc Related to the GRPC server crash Related to server crashes, segfaults, etc. labels Jan 9, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segfault/Coredump in grpc::ModelInferHandler::InferResponseComplete #7877

Segfault/Coredump in grpc::ModelInferHandler::InferResponseComplete #7877

andyblackheel commented Dec 12, 2024

a-beloglazov-aiforia commented Dec 23, 2024

zhuichao001 commented Jan 13, 2025

a-beloglazov-aiforia commented Jan 13, 2025 •

edited

Loading

zhuichao001 commented Jan 16, 2025

Segfault/Coredump in grpc::ModelInferHandler::InferResponseComplete #7877

Segfault/Coredump in grpc::ModelInferHandler::InferResponseComplete #7877

Comments

andyblackheel commented Dec 12, 2024

a-beloglazov-aiforia commented Dec 23, 2024

zhuichao001 commented Jan 13, 2025

a-beloglazov-aiforia commented Jan 13, 2025 • edited Loading

zhuichao001 commented Jan 16, 2025

a-beloglazov-aiforia commented Jan 13, 2025 •

edited

Loading