[CI][Failure] shared_ptr_base.h:199:9: runtime error: member call on address which does not point to an object of type 'std::_Sp_counted_base<>' #3192

Open
junliume opened this issue Aug 10, 2024 · 10 comments

Comments

@junliume
Contributor

Another byproduct of #3181

LastTest.log

The error message:

/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:199:9: runtime error: member call on address 0x00000b9e6590 which does not point to an object of type 'std::_Sp_counted_base<>'
0x00000b9e6590: note: object has invalid vptr
 00 00 00 00  d8 c0 dd 8e 53 7f 00 00  00 00 00 00 02 00 00 00  d9 01 00 00 00 00 00 00  30 d5 ac 10
              ^~~~~~~~~~~~~~~~~~~~~~~
              invalid vptr
    #0 0x7f53852a1bc7  (/data/MIOpen/build/lib/libMIOpen.so.1+0x29a5ebc7)
    #1 0x7f538e2c87cb  (/data/MIOpen/build/lib/libMIOpen.so.1+0x32a857cb)
    #2 0x7f530f9c2d9e  (/lib/x86_64-linux-gnu/libc.so.6+0x45d9e) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
    #3 0x7f530f9c25c8  (/lib/x86_64-linux-gnu/libc.so.6+0x455c8) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
    #4 0x7f530f9c260f  (/lib/x86_64-linux-gnu/libc.so.6+0x4560f) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
    #5 0x7f530f9a6d96  (/lib/x86_64-linux-gnu/libc.so.6+0x29d96) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
    #6 0x7f530f9a6e3f  (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
    #7 0x249004  (/data/MIOpen/build/bin/test_find_db+0x249004)

[How to reproduce]:

cmake command:

CXX=/opt/rocm/llvm/bin/clang++ CXXFLAGS='-Werror'  cmake -DMIOPEN_TEST_FLAGS=' --disable-verification-cache ' -DCMAKE_BUILD_TYPE=debug -DCMAKE_CXX_FLAGS_DEBUG='-g -fno-omit-frame-pointer -fsanitize=undefined -fno-sanitize-recover=undefined -Wno-option-ignored ' -DBUILD_DEV=Off -DMIOPEN_USE_MLIR=ON -DMIOPEN_GPU_SYNC=Off  -DCMAKE_PREFIX_PATH=/opt/rocm    ..

and then

LLVM_PATH=/opt/rocm/llvm CTEST_PARALLEL_LEVEL=4  make -j$(nproc) install  check MIOpenDriver
@junliume
Contributor Author

junliume commented Aug 10, 2024

@BrianHarrisonAMD @atamazov I suspect -fsanitize=undefined but need more investigation.

It must be one of these:

-DCMAKE_CXX_FLAGS_DEBUG='-g -fno-omit-frame-pointer -fsanitize=undefined -fno-sanitize-recover=undefined -Wno-option-ignored '

Update: confirmed it is due to -fsanitize=undefined

@atamazov
Contributor

@junliume @amberhassaan @DrizztDoUrden AFAICS, UB is related to hipFree. I recommend checking if reverting #2524 resolves the issue.

@junliume
Contributor Author

junliume commented Aug 10, 2024

> @junliume @amberhassaan @DrizztDoUrden AFAICS, UB is related to hipFree. I recommend checking if reverting #2524 resolves the issue.

Unfortunately, in my short experiment, reverting #2524 did not resolve this issue.

We do see lots of warning messages like:

Warning [hip_mem_get_info_wrapper] hipMemGetInfo error, status: 1

@atamazov
Contributor

atamazov commented Aug 10, 2024

@junliume

> We do see lots of warning messages like:
>
> Warning [hip_mem_get_info_wrapper] hipMemGetInfo error, status: 1

IIRC sometimes we need to know the amount of free GPU memory and use hipMemGetInfo to query this info. But in some cases, this HIP function does fail, and I have no idea why. The workaround (which issues a warning and simply returns some fixed value) was introduced in #2333, 6477e68

I suspect that the cause of the HIP runtime failure is a combination of a severely outdated base driver, a new ROCm in Docker, and some target ASICs. I think we need some assistance from the HIP runtime team.
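For readers unfamiliar with that workaround, here is a minimal sketch of the "warn and return a fixed value" pattern described above. It is not the code from #2333: the `MemQuery` signature, the function name `FreeMemoryOrFallback`, and the fallback value are all hypothetical, and the HIP call is replaced by an injected callable so the example builds without a GPU runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iostream>

// Hypothetical stand-in for a memory query such as hipMemGetInfo:
// returns 0 on success and writes the free byte count to *free_bytes.
using MemQuery = std::function<int(std::size_t* free_bytes)>;

// Returns the queried free memory, or fallback_bytes (with a warning on
// stderr) when the query reports a non-zero status.
std::size_t FreeMemoryOrFallback(const MemQuery& query, std::size_t fallback_bytes)
{
    std::size_t free_bytes = 0;
    const int status       = query(&free_bytes);
    if(status != 0)
    {
        std::cerr << "Warning [FreeMemoryOrFallback] query error, status: " << status << '\n';
        return fallback_bytes;
    }
    return free_bytes;
}

// Small usage example: one query that succeeds and one that fails.
void FreeMemoryExample()
{
    const MemQuery ok     = [](std::size_t* f) { *f = 1024; return 0; };
    const MemQuery broken = [](std::size_t*) { return 1; };
    assert(FreeMemoryOrFallback(ok, 64) == 1024);
    assert(FreeMemoryOrFallback(broken, 64) == 64);
}
```

The trade-off of this design is that callers keep working when the query fails, but the returned number no longer reflects the device, so the warning is the only signal that something is wrong.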

@atamazov
Contributor

@junliume ...but I do not think this is related to this specific issue with UB.

@BrianHarrisonAMD
Contributor

Not sure if this was already known, but I tracked it down to the test_find_db testsuite, and it appears to be from calling the following in solver_finders.cpp:

    std::transform(
        finders.begin(), finders.end(), std::inserter(solutions, solutions.end()), [&](auto&& f) {
            return std::make_pair(f->GetAlgorithmName(problem),
                                  f->Find(ctx, problem, invoke_ctx, parameters, options));
        });

It seems that calling Find on the finders is what triggers this issue in the test.
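As an aside, the shape of that snippet can be tried in isolation. The sketch below is a self-contained toy, not MIOpen code: `ToyFinder` and `CollectResults` are hypothetical names, and the real `Find` takes a context, problem, and other arguments that are omitted here. It only shows how `std::transform` with `std::inserter` fills a map with one (algorithm name, result) pair per finder.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a finder; the real interface has more arguments.
struct ToyFinder
{
    std::string name;
    int result;
    std::string GetAlgorithmName() const { return name; }
    int Find() const { return result; }
};

// Builds a map from algorithm name to Find() result, one entry per finder.
// If two finders share a name, the map keeps the first inserted entry.
std::map<std::string, int> CollectResults(const std::vector<std::unique_ptr<ToyFinder>>& finders)
{
    std::map<std::string, int> solutions;
    std::transform(finders.begin(),
                   finders.end(),
                   std::inserter(solutions, solutions.end()),
                   [](const auto& f) { return std::make_pair(f->GetAlgorithmName(), f->Find()); });
    return solutions;
}

// Small usage example with two toy finders.
void CollectResultsExample()
{
    std::vector<std::unique_ptr<ToyFinder>> finders;
    finders.push_back(std::make_unique<ToyFinder>(ToyFinder{"direct", 1}));
    finders.push_back(std::make_unique<ToyFinder>(ToyFinder{"igemm", 2}));
    const auto solutions = CollectResults(finders);
    assert(solutions.size() == 2);
    assert(solutions.at("igemm") == 2);
}
```

When hunting for the one finder that misbehaves, replacing the `std::transform` call with a plain loop makes it easier to log or skip individual finders.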

@BrianHarrisonAMD
Contributor

BrianHarrisonAMD commented Aug 12, 2024

Update: for the find_db.cpp test, I changed it to run only the forward test and narrowed the issue down to miopen::solver::conv::ConvMlirIgemmFwdXdlops.

This change to mlo_dir_conv.cpp fixes the forward test for me:

static auto GetImplicitGemmSolvers()
{
    return miopen::solver::SolverContainer<
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R5Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R4Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R4Xdlops_Padded_Gemm,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV4R1Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV1R1Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmV4R1Fwd,
        miopen::solver::conv::ConvHipImplicitGemmV4R4Fwd,
        // miopen::solver::conv::ConvMlirIgemmFwdXdlops,
        miopen::solver::conv::ConvMlirIgemmFwd,
        miopen::solver::conv::ConvMlirIgemmBwdXdlops,
        miopen::solver::conv::ConvMlirIgemmBwd,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV1R1,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV4R1,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicFwd_1x1,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicFwd,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicBwd,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdXdlops,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicBwdXdlops,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdXdlopsNHWC,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicBwdXdlopsNHWC,
        miopen::solver::conv::ConvCkIgemmFwdV6r1DlopsNchw,
#if MIOPEN_BACKEND_HIP && MIOPEN_USE_COMPOSABLEKERNEL
        miopen::solver::conv::ConvHipImplicitGemmFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmGroupFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmGroupBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemm3DGroupFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemm3DGroupBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmF16F8F16FwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmF16F8F16BwdXdlops,
#endif // MIOPEN_BACKEND_HIP && MIOPEN_USE_COMPOSABLEKERNEL
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdDlopsNCHWC>{};
}

Going to dig a bit deeper to see what's the issue with that one solver.

Edit: It looks like the issue happens for me if I call any of the miir APIs, and goes away if I prevent those calls from happening.

This line is enough for it to trigger the issue for me:

miirCreateHandle(params.c_str());

Looks like it's due to the params the handle is created with, but not sure yet what caused this to be an issue now.

@BrianHarrisonAMD
Contributor

BrianHarrisonAMD commented Aug 13, 2024

Adding a branch to suppress the ubsan errors, since they come from MLIR handle creation and our options are limited because we are using an older version.

A PR with the suppression changes is up: #3198

@amberhassaan
Contributor

@BrianHarrisonAMD , @junliume : Do we know what causes the error? It can't be that shared_ptr_base.h is the culprit. Could we be ignoring some problem in our code by suppressing these errors?

@BrianHarrisonAMD
Contributor

> @BrianHarrisonAMD , @junliume : Do we know what causes the error? It can't be that shared_ptr_base.h is the culprit. Could we be ignoring some problem in our code by suppressing these errors?

@amberhassaan shared_ptr_base.h isn't the issue, but it's where the ubsan error is reported during teardown of the application, and suppressing it there is the only way I could find to silence the error. The issue can be narrowed down to just creating an MLIR handle with nothing else happening (I made a reproducer for that), and it appears to be caused by something in MLIR cleaning up static memory during exit.
