Memory profiling causes rocmIsEnabled to segfault #47450

Open
iarspider opened this issue Feb 25, 2025 · 28 comments

Comments

@iarspider
Contributor

iarspider commented Feb 25, 2025

Output of LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so gdb --args rocmIsEnabled (these libraries are preloaded if --maxmem_profile is passed to cmsDriver):

#0  0x00001555538a8a4c in _int_free () from /lib64/libc.so.6
#1  0x00001555555441ac in operator delete(void*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-23-0000/lib/el8_amd64_gcc12/libPerfToolsAllocMonitorPreload.so
#2  0x000015554f8dd24d in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::updateDepth(llvm::GenericCycle<llvm::GenericSSAContext<llvm::Function> >*) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#3  0x000015554f8ddd93 in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::run(llvm::BasicBlock*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#4  0x000015554f8dfc22 in llvm::CycleInfoWrapperPass::runOnFunction(llvm::Function&) [clone .localalias.5] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#5  0x0000155550035c69 in llvm::FPPassManager::runOnFunction(llvm::Function&) [clone .localalias.4] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#6  0x0000155550035db1 in llvm::FPPassManager::runOnModule(llvm::Module&) [clone .localalias.54] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#7  0x0000155550036a7f in llvm::legacy::PassManagerImpl::run(llvm::Module&) [clone .localalias.36] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#8  0x000015554c0695ac in clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >, clang::BackendConsumer*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#9  0x000015554c0454a1 in clang::CodeGenAction::ExecuteAction() [clone .localalias.40] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#10 0x000015554db23851 in clang::FrontendAction::Execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#11 0x000015554daaf2fa in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) [clone .localalias.2] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#12 0x000015554bb9e673 in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#13 0x000015554af7fd6b in COMGR::AMDGPUCompiler::executeInProcessDriver(llvm::ArrayRef<char const*>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#14 0x000015554af81fdc in COMGR::AMDGPUCompiler::processFile(char const*, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#15 0x000015554af82618 in COMGR::AMDGPUCompiler::processFiles(amd_comgr_data_kind_s, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#16 0x000015554af9356d in amd_comgr_do_action () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#17 0x0000155553e8c205 in amd::device::Program::compileAndLinkExecutable(amd_comgr_data_set_s, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, amd::option::Options*, char**, unsigned long*, amd::device::Program::file_type_t) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#18 0x0000155553e8ebd4 in amd::device::Program::linkImplLC(amd::option::Options*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#19 0x0000155553e8b141 in amd::device::Program::build(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, amd::option::Options*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#20 0x0000155553eb4b26 in amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool, bool) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#21 0x0000155553e85ded in amd::Device::BlitProgram::create(amd::Device*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#22 0x0000155553ec2edb in amd::roc::Device::createBlitProgram() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#23 0x0000155553f06aa8 in amd::roc::KernelBlitManager::createProgram(amd::roc::Device&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#24 0x0000155553edb4cd in amd::roc::VirtualGPU::create() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#25 0x0000155553ebce08 in amd::roc::Device::createVirtualDevice(amd::CommandQueue*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#26 0x0000155553ea9f74 in amd::HostQueue::HostQueue(amd::Context&, amd::Device&, unsigned long, unsigned int, amd::CommandQueue::Priority, std::vector<unsigned int, std::allocator<unsigned int> > const&) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#27 0x0000155553e05e19 in hip::Stream::Stream(hip::Device*, hip::Stream::Priority, unsigned int, bool, std::vector<unsigned int, std::allocator<unsigned int> > const&, hipStreamCaptureStatus) ()
   from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#28 0x0000155553ca1c94 in hip::Device::NullStream(bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#29 0x0000155553d4dba7 in hip::ihipMemset(void*, long, unsigned long, unsigned long, ihipStream_t*, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#30 0x0000155553d7ed2c in hip::hipMemset(void*, int, unsigned long) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week0/el8_amd64_gcc12/cms/cmssw-patch/CMSSW_15_1_X_2025-02-24-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#31 0x00000000004012fd in isRocmDeviceSupported(int) ()
#32 0x0000000000401190 in main ()
@iarspider
Contributor Author

assign core,heterogeneous

@cmsbuild
Contributor

New categories assigned: core,heterogeneous

@Dr15Jones,@fwyzard,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @iarspider.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@iarspider
Contributor Author

This was discovered when preparing cms-sw/cms-bot#2418.

@Dr15Jones
Contributor

@iarspider what was the command you issued to start gdb? I do not understand what you meant when you wrote "if --maxmem_profile is passed to cmsRun" as cmsRun does not accept --maxmem_profile as a command line argument. I could believe a configuration file passed to cmsRun would take that argument.

@makortel
Contributor

I do not understand what you meant when you wrote "if --maxmem_profile is passed to cmsRun" as cmsRun does not accept --maxmem_profile as a command line argument. I could believe a configuration file passed to cmsRun would take that argument.

It seems to be the argument for cmsDriver.py. IIRC its impact is just the LD_PRELOAD.

@Dr15Jones
Contributor

So I looked at the output of one of the failing RelVals in the PR in question. The log contains

Starting env LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so  cmsRun -j JobReport3.xml  step3_RAW2DIGI_RECO_DQM.py
Memory Report: total memory requested: 2227
Memory Report:  max memory used: 2280
Memory Report:  presently used: 0
Memory Report:  # allocations calls:   13
Memory Report:  # deallocations calls: 16
----- Begin Fatal Exception 24-Feb-2025 15:11:42 EET-----------------------
An exception of category 'ConfigFileReadError' occurred while
   [0] Processing the python configuration file named step3_RAW2DIGI_RECO_DQM.py
Exception Message:
 unknown python problem occurred.
ValueError: -11 is not a valid PlatformStatus

At:
  /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/external/python3/3.9.14-ccc34bac15aa449b4c76ba24d02d2fd7/lib/python3.9/enum.py(713): __new__
  /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/external/python3/3.9.14-ccc34bac15aa449b4c76ba24d02d2fd7/lib/python3.9/enum.py(384): __call__
  /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-23-0000/src/HeterogeneousCore/ROCmCore/python/ProcessAcceleratorROCm.py(19): enabledLabels
  /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-23-0000/src/FWCore/ParameterSet/python/Config.py(1535): handleProcessAccelerators
  /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02878/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-23-0000/src/FWCore/ParameterSet/python/Config.py(1468): fillProcessDesc
  <string>(2): <module>

----- End Fatal Exception -------------------------------------------------

So cmsRun starts up and, when running the Python configuration to determine which hardware accelerators are available, it tries to use the ProcessAcceleratorROCm Python module. It appears that the crash happens when that module is used.
The call to enabledLabels seen in the Python stack has

status = PlatformStatus(os.waitstatus_to_exitcode(os.system("rocmIsEnabled")))

so that is the origin of the call to the standalone binary rocmIsEnabled mentioned in the description of the issue.
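
To illustrate where the `-11` in the `ValueError` comes from: `os.waitstatus_to_exitcode` returns the negated signal number when the child process is killed by a signal, and SIGSEGV is signal 11 on Linux. The following is a minimal sketch, not the CMSSW code. The enum member names are hypothetical (the thread only indicates that 0, 1 and 2 are the expected values), and `describe_exit` is an illustrative helper.

```python
import enum
import signal


class PlatformStatus(enum.IntEnum):
    # Hypothetical member names: only the values 0, 1 and 2 come from the thread.
    STATUS_0 = 0
    STATUS_1 = 1
    STATUS_2 = 2


def describe_exit(code: int) -> str:
    """Turn the result of os.waitstatus_to_exitcode() into a readable message."""
    if code < 0:
        # A negative value means the child was terminated by signal number -code.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = f"signal {-code}"
        return f"killed by {name}"
    try:
        return f"exited with {PlatformStatus(code).name}"
    except ValueError:
        # Non-negative, but not one of the expected status values.
        return f"unexpected exit code {code}"
```

With this helper, `describe_exit(-11)` reports a SIGSEGV instead of letting the enum constructor raise on a value it does not know.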

@makortel
Contributor

We have seen this behavior also before #45964 (comment)

@Dr15Jones
Contributor

So I ran

LD_PRELOAD="libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so" rocmIsEnabled; echo $?

a dozen or so times at FNAL using a CMSSW_15_1_RNTUPLE_X_2025-02-16-2300 work area (since I had it handy) and I saw no problems.

What is needed to get a consistent (or at least a probable) crash?
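
One way to quantify how often the crash occurs on a given node is to repeat the command and tally the return codes. This is a minimal sketch; `tally_exit_codes` is a hypothetical helper, not part of CMSSW or cms-bot.

```python
import collections
import os
import subprocess


def tally_exit_codes(argv, preload, runs=12):
    """Run argv `runs` times with LD_PRELOAD set and count each return code."""
    env = dict(os.environ, LD_PRELOAD=preload)
    counts = collections.Counter()
    for _ in range(runs):
        # subprocess reports death by signal N as return code -N,
        # so a segmentation fault shows up as -11 on Linux.
        result = subprocess.run(
            argv,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        counts[result.returncode] += 1
    return counts
```

In a CMSSW environment this could be called as `tally_exit_codes(["rocmIsEnabled"], "libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so")`.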

@iarspider
Contributor Author

@Dr15Jones I got this crash on a node (LUMI) with ROCm-enabled GPU using CMSSW_15_1_X_2025-02-23-0000 (but I think only the first part is important).

@makortel
Contributor

It seems to be technically possible to avoid passing the LD_PRELOAD (or filtering out these libraries if there is something else) where ProcessAcceleratorROCm calls the rocmIsEnabled.
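
A minimal sketch of the filtering variant, assuming the libraries to drop are identified by file name. `filter_preload` is a hypothetical helper, and `libother.so` in the usage note is a made-up name.

```python
import re


def filter_preload(value, blocked):
    """Return LD_PRELOAD `value` without entries whose file name is in `blocked`."""
    # The dynamic linker accepts both colons and spaces as separators.
    entries = [entry for entry in re.split(r"[:\s]+", value) if entry]
    # Compare on the file name so that absolute paths are matched too.
    kept = [entry for entry in entries if entry.rsplit("/", 1)[-1] not in blocked]
    return ":".join(kept)
```

For example, filtering `libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so:libother.so` with the two PerfTools libraries blocked leaves `libother.so`.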

@Dr15Jones
Contributor

Of course avoiding the crash in rocmIsEnabled when doing the LD_PRELOAD in cmsRun is great, but is this just a canary in the coal mine where we will crash in cmsRun itself when we try to use the rocm based GPU?

@makortel
Contributor

I vaguely recall the MaxMemoryPreload was supposed to be run only on select IB flavors (I can't find the discussion though, I did find the cms-bot PR adding the use of --maxmem_profile cms-sw/cms-bot#2202).

@makortel
Contributor

@gartung mentioned he experienced a crash in rocmIsEnabled (on a machine without an AMD GPU) also when preloading other (profiling) libraries. So maybe we should drop the LD_PRELOAD from the environment when calling rocmIsEnabled (to not cause problems on non-AMD-GPU machines).

@fwyzard
Contributor

fwyzard commented Feb 25, 2025

@gartung this actually works for me on a bare metal node, with a Radeon Pro W7800:

$ cd /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_15_0_0_pre3

$ cmsenv

$ LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmIsEnabled
Memory Report: total memory requested: 231066386
Memory Report:  max memory used: 14799096
Memory Report:  presently used: 8
Memory Report:  # allocations calls:   732952
Memory Report:  # deallocations calls: 741971

$ LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmComputeCapabilities
   0     gfx1100    AMD Radeon PRO W7800
Memory Report: total memory requested: 231094400
Memory Report:  max memory used: 14799304
Memory Report:  presently used: 8
Memory Report:  # allocations calls:   733031
Memory Report:  # deallocations calls: 743723

Update: on this node it also works within an Alma 8 or Alma 9 container:

Singularity> rocmIsEnabled

Singularity> LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmComputeCapabilities
   0     gfx1100    AMD Radeon Graphics
Memory Report: total memory requested: 231103916
Memory Report:  max memory used: 14799200
Memory Report:  presently used: 8
Memory Report:  # allocations calls:   733223
Memory Report:  # deallocations calls: 743916

@fwyzard
Contributor

fwyzard commented Feb 25, 2025

I confirm that it does fail on LUMI, in an Alma 8 container:

$ rocmComputeCapabilities
   0    gfx90a:sramecc+:xnack-    AMD Instinct MI250X

$ LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmComputeCapabilities
Segmentation fault

@gartung
Member

gartung commented Feb 25, 2025

I was using export LD_PRELOAD=libprofiler.so and then running cmsRun config.py. The os.system('rocmIsEnabled') returned -11 instead of the expected 0, 1 or 2.

@Dr15Jones
Contributor

@fwyzard
Would it be at all possible to use a DEBUG version of CMSSW on LUMI, use gdb and get a full backtrace?

@makortel
Contributor

So maybe we should drop the LD_PRELOAD from the environment when calling rocmIsEnabled (to not cause problems on non-AMD-GPU machines)

FWIW I opened a draft PR to do that #47452
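
For illustration only, and not necessarily what the draft PR does: a minimal sketch of running a helper binary with LD_PRELOAD removed from the child's environment. `run_without_preload` is a hypothetical name.

```python
import os
import subprocess


def run_without_preload(argv):
    """Run argv with LD_PRELOAD removed from the child's environment."""
    # Copy the current environment, leaving out LD_PRELOAD only.
    env = {key: value for key, value in os.environ.items() if key != "LD_PRELOAD"}
    # As with os.waitstatus_to_exitcode, death by signal N is reported as -N.
    return subprocess.run(argv, env=env).returncode
```

The parent process keeps its own LD_PRELOAD, so the profiling of cmsRun itself would be unaffected; only the child started through this helper runs without the preloaded libraries.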

@fwyzard
Contributor

fwyzard commented Feb 25, 2025

@Dr15Jones

Would it be at all possible to use a DEBUG version of CMSSW on LUMI, use gdb and get a full backtrace?

ehr... with a debug build (I used CMSSW_15_1_DBG_X_2025-02-19-2300) the LD_PRELOAD command does not crash:

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_DBG_X_2025-02-19-2300$ cmsenv

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_DBG_X_2025-02-19-2300$ LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmIsEnabled 
Memory Report: total memory requested: 248269565
Memory Report:  max memory used: 14760840
Memory Report:  presently used: 16
Memory Report:  # allocations calls:   860831
Memory Report:  # deallocations calls: 869850

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_DBG_X_2025-02-19-2300$ echo $?
0

@fwyzard
Contributor

fwyzard commented Feb 25, 2025

Actually it also works with the same non-DEBUG IB:

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-19-2300$ cmsenv

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-19-2300$ LD_PRELOAD=libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so rocmIsEnabled 
Memory Report: total memory requested: 248052703
Memory Report:  max memory used: 14760776
Memory Report:  presently used: 8
Memory Report:  # allocations calls:   860799
Memory Report:  # deallocations calls: 869818

andbocci@nid007977:/cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-19-2300$ echo $?
0

@fwyzard
Contributor

fwyzard commented Feb 25, 2025

It actually works on all releases earlier than CMSSW_15_1_X_2025-02-21-2300, and fails on that release and all more recent ones.
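
Finding the first failing release in an ordered list of IBs can be done with a binary search. This is a minimal sketch and not a description of how the result above was obtained. `first_failing` and the `fails` predicate are hypothetical, and the search assumes that once a release fails, every later one fails too.

```python
def first_failing(releases, fails):
    """Return the index of the first release for which fails(release) is true.

    `releases` must be ordered oldest to newest. Returns len(releases)
    if no release fails.
    """
    lo, hi = 0, len(releases)
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(releases[mid]):
            # The first failure is at mid or earlier.
            hi = mid
        else:
            # Everything up to and including mid works.
            lo = mid + 1
    return lo
```

Here `fails` would wrap whatever reproduces the problem for one release, for instance running rocmIsEnabled with the preload libraries in that release's environment and checking for a negative return code.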

@smuzaffar
Contributor

TBB was updated (to version v2022.0.0) in CMSSW_15_1_X_2025-02-21-2300 and later.

@fwyzard
Contributor

fwyzard commented Feb 25, 2025

Mhm 🤔

@fwyzard
Contributor

fwyzard commented Feb 25, 2025

Anyway, here is the GDB stack trace from rocmIsEnabled in CMSSW_15_1_X_2025-02-21-2300:

$ gdb -ex 'set pagination off' -ex 'set environment LD_PRELOAD libPerfToolsAllocMonitorPreload.so:libPerfToolsMaxMemoryPreload.so' -ex r -ex bt rocmIsEnabled

...

Thread 1 "rocmIsEnabled" received signal SIGSEGV, Segmentation fault.
0x00001555538a8a4c in _int_free () from /lib64/libc.so.6
#0  0x00001555538a8a4c in _int_free () from /lib64/libc.so.6
#1  0x00001555555441ac in operator delete(void*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/lib/el8_amd64_gcc12/libPerfToolsAllocMonitorPreload.so
#2  0x000015554f8dd24d in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::updateDepth(llvm::GenericCycle<llvm::GenericSSAContext<llvm::Function> >*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#3  0x000015554f8ddd93 in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::run(llvm::BasicBlock*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#4  0x000015554f8dfc22 in llvm::CycleInfoWrapperPass::runOnFunction(llvm::Function&) [clone .localalias.5] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#5  0x0000155550035c69 in llvm::FPPassManager::runOnFunction(llvm::Function&) [clone .localalias.4] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#6  0x0000155550035db1 in llvm::FPPassManager::runOnModule(llvm::Module&) [clone .localalias.54] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#7  0x0000155550036a7f in llvm::legacy::PassManagerImpl::run(llvm::Module&) [clone .localalias.36] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#8  0x000015554c0695ac in clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >, clang::BackendConsumer*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#9  0x000015554c0454a1 in clang::CodeGenAction::ExecuteAction() [clone .localalias.40] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#10 0x000015554db23851 in clang::FrontendAction::Execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#11 0x000015554daaf2fa in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) [clone .localalias.2] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#12 0x000015554bb9e673 in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#13 0x000015554af7fd6b in COMGR::AMDGPUCompiler::executeInProcessDriver(llvm::ArrayRef<char const*>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#14 0x000015554af81fdc in COMGR::AMDGPUCompiler::processFile(char const*, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#15 0x000015554af82618 in COMGR::AMDGPUCompiler::processFiles(amd_comgr_data_kind_s, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#16 0x000015554af9356d in amd_comgr_do_action () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#17 0x0000155553e8c205 in amd::device::Program::compileAndLinkExecutable(amd_comgr_data_set_s, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, amd::option::Options*, char**, unsigned long*, amd::device::Program::file_type_t) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#18 0x0000155553e8ebd4 in amd::device::Program::linkImplLC(amd::option::Options*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#19 0x0000155553e8b141 in amd::device::Program::build(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, amd::option::Options*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#20 0x0000155553eb4b26 in amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#21 0x0000155553e85ded in amd::Device::BlitProgram::create(amd::Device*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#22 0x0000155553ec2edb in amd::roc::Device::createBlitProgram() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#23 0x0000155553f06aa8 in amd::roc::KernelBlitManager::createProgram(amd::roc::Device&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#24 0x0000155553edb4cd in amd::roc::VirtualGPU::create() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#25 0x0000155553ebce08 in amd::roc::Device::createVirtualDevice(amd::CommandQueue*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#26 0x0000155553ea9f74 in amd::HostQueue::HostQueue(amd::Context&, amd::Device&, unsigned long, unsigned int, amd::CommandQueue::Priority, std::vector<unsigned int, std::allocator<unsigned int> > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#27 0x0000155553e05e19 in hip::Stream::Stream(hip::Device*, hip::Stream::Priority, unsigned int, bool, std::vector<unsigned int, std::allocator<unsigned int> > const&, hipStreamCaptureStatus) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#28 0x0000155553ca1c94 in hip::Device::NullStream(bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#29 0x0000155553d4dba7 in hip::ihipMemset(void*, long, unsigned long, unsigned long, ihipStream_t*, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#30 0x0000155553d7ed2c in hip::hipMemset(void*, int, unsigned long) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#31 0x00000000004012fd in isRocmDeviceSupported(int) ()
#32 0x0000000000401190 in main ()

@fwyzard
Contributor

fwyzard commented Feb 25, 2025

Playing with breakpoints, I can get a similar stack trace with CMSSW_15_1_X_2025-02-21-1100:

#0  0x00001555538a82a0 in _int_free () from /lib64/libc.so.6
#1  0x00001555555441ac in operator delete(void*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/lib/el8_amd64_gcc12/libPerfToolsAllocMonitorPreload.so
#2  0x000015554f8dd1fb in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::updateDepth(llvm::GenericCycle<llvm::GenericSSAContext<llvm::Function> >*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#3  0x000015554f8ddd93 in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::run(llvm::BasicBlock*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#4  0x000015554f8dfc22 in llvm::CycleInfoWrapperPass::runOnFunction(llvm::Function&) [clone .localalias.5] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#5  0x0000155550035c69 in llvm::FPPassManager::runOnFunction(llvm::Function&) [clone .localalias.4] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#6  0x0000155550035db1 in llvm::FPPassManager::runOnModule(llvm::Module&) [clone .localalias.54] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#7  0x0000155550036a7f in llvm::legacy::PassManagerImpl::run(llvm::Module&) [clone .localalias.36] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#8  0x000015554c0695ac in clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >, clang::BackendConsumer*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#9  0x000015554c0454a1 in clang::CodeGenAction::ExecuteAction() [clone .localalias.40] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#10 0x000015554db23851 in clang::FrontendAction::Execute() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#11 0x000015554daaf2fa in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) [clone .localalias.2] () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#12 0x000015554bb9e673 in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#13 0x000015554af7fd6b in COMGR::AMDGPUCompiler::executeInProcessDriver(llvm::ArrayRef<char const*>) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#14 0x000015554af81fdc in COMGR::AMDGPUCompiler::processFile(char const*, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#15 0x000015554af82618 in COMGR::AMDGPUCompiler::processFiles(amd_comgr_data_kind_s, char const*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#16 0x000015554af9356d in amd_comgr_do_action () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
#17 0x0000155553e8c205 in amd::device::Program::compileAndLinkExecutable(amd_comgr_data_set_s, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, amd::option::Options*, char**, unsigned long*, amd::device::Program::file_type_t) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#18 0x0000155553e8ebd4 in amd::device::Program::linkImplLC(amd::option::Options*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#19 0x0000155553e8b141 in amd::device::Program::build(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, amd::option::Options*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#20 0x0000155553eb4b26 in amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#21 0x0000155553e85ded in amd::Device::BlitProgram::create(amd::Device*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#22 0x0000155553ec2edb in amd::roc::Device::createBlitProgram() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#23 0x0000155553f06aa8 in amd::roc::KernelBlitManager::createProgram(amd::roc::Device&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#24 0x0000155553edb4cd in amd::roc::VirtualGPU::create() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#25 0x0000155553ebce08 in amd::roc::Device::createVirtualDevice(amd::CommandQueue*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#26 0x0000155553ea9f74 in amd::HostQueue::HostQueue(amd::Context&, amd::Device&, unsigned long, unsigned int, amd::CommandQueue::Priority, std::vector<unsigned int, std::allocator<unsigned int> > const&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#27 0x0000155553e05e19 in hip::Stream::Stream(hip::Device*, hip::Stream::Priority, unsigned int, bool, std::vector<unsigned int, std::allocator<unsigned int> > const&, hipStreamCaptureStatus) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#28 0x0000155553ca1c94 in hip::Device::NullStream(bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#29 0x0000155553d4dba7 in hip::ihipMemset(void*, long, unsigned long, unsigned long, ihipStream_t*, bool) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#30 0x0000155553d7ed2c in hip::hipMemset(void*, int, unsigned long) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02877/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-1100/external/el8_amd64_gcc12/lib/libamdhip64.so.6
#31 0x00000000004012fd in isRocmDeviceSupported(int) ()
#32 0x0000000000401190 in main ()

but then if I continue, it works:

(gdb) disable  
(gdb) c
Continuing.
[Thread 0x155443dff700 (LWP 5789) exited]
[Thread 0x1555493ff700 (LWP 5787) exited]
Memory Report: total memory requested: 248083365
Memory Report:  max memory used: 14764880
Memory Report:  presently used: 0
Memory Report:  # allocations calls:   860865
Memory Report:  # deallocations calls: 869884
[Inferior 1 (process 5786) exited normally]

fwyzard commented Feb 25, 2025

And here are the innermost frames of the stack trace with CMSSW_15_1_X_2025-02-21-2300, after rebuilding the relevant packages with debug symbols:

#0  0x00001555538a8a4c in _int_free () from /lib64/libc.so.6
#1  0x00001555555441ac in operator()<void*> (ptr=0xcf04c0, __closure=<synthetic pointer>) at src/PerfTools/AllocMonitorPreload/src/memory_proxies.cc:326
#2  cms::perftools::AllocMonitorRegistry::deallocCalled<operator delete(void*)::<lambda(auto:26)>, operator delete(void*)::<lambda(auto:27)> > (iDealloc=..., iGetActual=..., iPtr=0xcf04c0, this=0x1555555320c0 <cms::perftools::AllocMonitorRegistry::instance()::s_registry>) at /cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/src/PerfTools/AllocMonitor/interface/AllocMonitorRegistry.h:133
#3  operator delete (ptr=0xcf04c0) at src/PerfTools/AllocMonitorPreload/src/memory_proxies.cc:326
#4  operator delete (ptr=0xcf04c0) at src/PerfTools/AllocMonitorPreload/src/memory_proxies.cc:318
#5  0x000015554f8dd24d in llvm::GenericCycleInfoCompute<llvm::GenericSSAContext<llvm::Function> >::updateDepth(llvm::GenericCycle<llvm::GenericSSAContext<llvm::Function> >*) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/week1/el8_amd64_gcc12/cms/cmssw/CMSSW_15_1_X_2025-02-21-2300/external/el8_amd64_gcc12/lib/libamd_comgr.so.2
...
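
For readers unfamiliar with how a preloaded allocation monitor ends up in frames #1 to #4, here is a minimal sketch of replacing the global `operator new` and `operator delete` and counting the calls. This is not the code in `memory_proxies.cc` or `AllocMonitorRegistry.h`; the names `g_allocs`, `g_deallocs`, `Counts`, `snapshot` and `count_one_pair` are hypothetical, and forwarding to `std::malloc`/`std::free` is an assumption made to keep the example self-contained.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

namespace {
  // Hypothetical counters, loosely analogous to the "# allocations calls" and
  // "# deallocations calls" lines of the Memory Report.
  std::atomic<std::size_t> g_allocs{0};
  std::atomic<std::size_t> g_deallocs{0};
}  // namespace

// Replacement for the global allocation function: count the call, then
// forward to malloc (an assumption of this sketch).
void* operator new(std::size_t size) {
  if (size == 0) {
    size = 1;  // operator new must return a unique non-null pointer
  }
  void* ptr = std::malloc(size);
  if (ptr == nullptr) {
    throw std::bad_alloc{};
  }
  g_allocs.fetch_add(1, std::memory_order_relaxed);
  return ptr;
}

// Replacement for the global deallocation function: count the call, then
// forward to free. Any pointer passed here must have come from the matching
// operator new above, otherwise free() receives memory it does not own.
void operator delete(void* ptr) noexcept {
  if (ptr == nullptr) {
    return;
  }
  g_deallocs.fetch_add(1, std::memory_order_relaxed);
  std::free(ptr);
}

// Sized variant, forwarded to the unsized one.
void operator delete(void* ptr, std::size_t) noexcept { operator delete(ptr); }

struct Counts {
  std::size_t allocs;
  std::size_t deallocs;
};

Counts snapshot() {
  return {g_allocs.load(std::memory_order_relaxed), g_deallocs.load(std::memory_order_relaxed)};
}

// Performs one direct allocation/deallocation pair and returns how much each
// counter moved while doing so.
Counts count_one_pair() {
  Counts before = snapshot();
  void* ptr = ::operator new(32);
  assert(ptr != nullptr);
  ::operator delete(ptr);
  Counts after = snapshot();
  return {after.allocs - before.allocs, after.deallocs - before.deallocs};
}
```

Calling `count_one_pair()` from a `main` function should report one allocation and one deallocation. In a sketch like this the two counters only stay balanced when every pointer freed through the replaced `operator delete` was also obtained through the replaced `operator new`.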
