Framework: Some PR build errors showing up as strange 'cmake --build . --config Release -- -j29 -k 0' errors #10823

Closed
bartlettroscoe opened this issue Aug 2, 2022 · 18 comments
Labels
PA: Framework Issues that fall under the Trilinos Framework Product Area type: bug The primary issue is a bug in Trilinos code or tests Waiting Waiting for some external team to do something before this can be completed

Comments

@bartlettroscoe
Member

bartlettroscoe commented Aug 2, 2022

Bug Report

@trilinos/framework

Next Action Status

This is due to a defect in CTest introduced in CMake 3.18. The fix for this is in CMake 3.24.3 (released 2022-11-01). (See SNL Kitware #209.) Next: Install CMake 3.24.3 everywhere and use it with Trilinos PR builds ...

Internal issues

Description

As shown in this query:

image

the new Trilinos Framework GenConfig build rhel7_sems-cuda-11.4.2-sems-gnu-10.1.0-sems-openmpi-4.0.5_release_static_Volta70_no-asan_complex_no-fpic_mpi_pt_no-rdc_no-uvm_deprecated-on_no-package-enables is failing with a build error reported in the Zoltan2Sphynx package showing:

"/projects/sems/install/rhel7-x86_64/sems/v2/utility/cmake/3.21.1/gcc/7.3.0/mxfpluq/bin/cmake" "--build" "." "--config" "Release" "--" "-j29" "-k" "0"

returning error code 1.

As you can see, this is currently failing in the "Master Merge" builds for promotion PRs #10820 and #10797, so this error has nothing to do with any given PR branch; it is impacting 'develop' and will impact everyone's PRs. I noticed it because it took out my last PR iteration #10813 (comment) for PR #10813.

Steps to Reproduce

Run a PR build.

@bartlettroscoe
Member Author

CC: @tcclevenger

And this error just took out a PR testing iteration for PR #10751 shown in the build:

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 3, 2022

CC: @fryeguy52

FYI: I searched the Trilinos 'develop' branch as of commit 1cd5bae:

* 1cd5bae44bc "Merge pull request #10809 from cgcgcg/cxxStandard"
| Author: Christian Glusa <[email protected]>
| Date:   Wed Aug 3 08:34:02 2022 -0600 (63 minutes ago)
| 
| M     cmake/ctest/drivers/enigma/TrilinosCTestDriverCore.enigma.gcc.cmake
| M     cmake/ctest/drivers/geminga/TrilinosCTestDriverCore.geminga.gcc-cuda.cmake
| M     cmake/ctest/drivers/geminga/TrilinosCTestDriverCore.geminga.gcc.cmake
| M     cmake/ctest/drivers/lightsaber/TrilinosCTestDriverCore.lightsaber.gcc.cmake
| M     cmake/ctest/drivers/rocketman/TrilinosCTestDriverCore.rocketman.gcc.cmake
| M     cmake/ctest/drivers/trappist/TrilinosCTestDriverCore.trappist.clang.cmake
| M     cmake/ctest/drivers/trappist/TrilinosCTestDriverCore.trappist.gcc.cmake

and I did a search to try to find the code that generates this command:

cmake --build . --config Release -- -j29 -k 0

by running:

$ cd Trilinos/

$ find . -type f -exec grep -nH "[-][-]build [.]" {} \; | grep -v /TriBITS/ | grep -v cmake/tribits/
./kokkos/appveyor.yml:9:    cmake --build . --target install &&
./seacas/.appveyor.yml:83:  - cmd: cmake --build . --config %configuration% -- /maxcpucount:4
./packages/kokkos/appveyor.yml:9:    cmake --build . --target install &&
./packages/sacado/test/GTestSuite/googletest/appveyor.yml:111:    & cmake --build . --config $env:configuration -- $cmake_parallel
./packages/sacado/test/GTestSuite/googletest/googletest/README.md:106:execute_process(COMMAND ${CMAKE_COMMAND} --build .

The closest match above is:

./packages/sacado/test/GTestSuite/googletest/appveyor.yml:111:    & cmake --build . --config $env:configuration -- $cmake_parallel

That name appveyor gets mentioned in:

$ cd packages/sacado/test/GTestSuite/googletest/

$ find . -type f -exec grep -nH appveyor {} \;
./appveyor.yml:63:        appveyor DownloadFile https://github.com/bazelbuild/bazel/releases/download/0.28.1/bazel-0.28.1-windows-x86_64.exe -FileName bazel.exe
./README.md:6:[![Build status](https://ci.appveyor.com/api/projects/status/4o38plt0xbo1ubc8/branch/master?svg=true)](https://ci.appveyor.com/project/GoogleTestAppVeyor/googletest/branch/master)

But looking at the line from appveyor.yml, it shows:

    $cmake_parallel = if ($env:generator -eq "MinGW Makefiles") {"-j2"} else  {"/m"}
    & cmake --build . --config $env:configuration -- $cmake_parallel

Well, that does not match the signature:

cmake --build . --config Release -- -j29 -k 0

It seems that the file appveyor.yml is a configuration file for the tool appveyor, which is used as part of a CI/CD system called AppVeyor. The main website https://www.appveyor.com/ shows that Google is one of their customers, so I can't see how that would get run in Trilinos PR testing.

So I am stumped how this command is getting run as part of Trilinos PR testing.

I will see if I can reproduce these errors myself (word is that we should be able to, which I will try now).
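
CDash reports the failing command as a quoted argument list, while the search above uses the plain command line. The following is a minimal Python sketch, not part of any Trilinos tooling, showing that the two forms are the same command once the quoting and the install prefix are stripped. The helper name `normalize_cdash_command` is hypothetical.

```python
import shlex
from pathlib import PurePosixPath


def normalize_cdash_command(quoted: str) -> str:
    """Turn a CDash-style quoted argv string into a plain command line."""
    argv = shlex.split(quoted)
    # Keep only the executable name so site-specific install prefixes do not matter.
    argv[0] = PurePosixPath(argv[0]).name
    return shlex.join(argv)


# The command exactly as reported for the failing build in this issue.
reported = (
    '"/projects/sems/install/rhel7-x86_64/sems/v2/utility/cmake/3.21.1/'
    'gcc/7.3.0/mxfpluq/bin/cmake" "--build" "." "--config" "Release" '
    '"--" "-j29" "-k" "0"'
)

print(normalize_cdash_command(reported))
```

The normalized string is what a text search over the repository would need to match, which is why none of the appveyor.yml hits above line up with it.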

@bartlettroscoe bartlettroscoe added the PA: Framework Issues that fall under the Trilinos Framework Product Area label Aug 5, 2022
@bartlettroscoe
Member Author

bartlettroscoe commented Aug 5, 2022

CC: @csiefer2, @e10harvey

NOTE: The builds that show these errors are similar to those that show "6" build errors reported in #10836 in that they show zero tests "Not Run", "Fail" and "Pass". These are a little harder to search for on CDash, but this query seems to select them.

Looking over this set of builds, we see different types of errors reported for the command:

"<base-dir>/cmake" "--build" "." "--config" "Debug" "--" "-j20" "-k" "0"

These look like real build errors in Trilinos but they are not being reported correctly with each package. Instead, they are just reported for the outer cmake --build . command.

Here are some examples of different build errors reported:

1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h :

/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/thyra/adapters/epetra/test/EpetraOperatorWrapper/EpetraOperatorWrapper_UnitTests.cpp:54:10: fatal error: Trilinos_Util_CrsMatrixGallery.h: No such file or directory
 #include "Trilinos_Util_CrsMatrixGallery.h"
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

NOTE: All of the examples above are from the builds rhel7_sems-gnu-7.2.0 or rhel7_sems-gnu-8.3.0!

2. MueLu_Test_ETI.hpp ISO C++ forbids declaration of ‘type name’ with no type:

In file included from /scratch/trilinos/jenkins/ascic166/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/muelu/test/structured/Driver_Structured.cpp:437:0: /scratch/trilinos/jenkins/ascic166/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/muelu/test/structured/../unit_tests/MueLu_Test_ETI.hpp: In function ‘bool Automatic_Test_ETI(int, char**)’:
/scratch/trilinos/jenkins/ascic166/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/muelu/test/structured/../unit_tests/MueLu_Test_ETI.hpp:91:31: error: ISO C++ forbids declaration of ‘type name’ with no type [-fpermissive]
   Teuchos::RCP<const Teuchos::MpiComm<int> > comm = Teuchos::rcp_dynamic_cast<const Teuchos::MpiComm<int> >(Teuchos::DefaultComm<int>::getComm());
                               ^~~~~~~

NOTE: All of the examples above are from the builds rhel7_sems-gnu-7.2.0!

3. ninja: error: loading 'build.ninja': No such file or directory:

ninja: error: loading 'build.ninja': No such file or directory

4. No error output:

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 5, 2022

Note, we see the error:

1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h

above being cleanly reported on the 'vortex' builds with the Thyra package shown here impacting PRs #10834, #10802, #10801, and #10751.

What I think is happening is that the same build error for the 'ascic' builds with the 'gnu-7.2.0' and 'gnu-8.3.0' builds is getting reported through the command cmake --build . --config Release -- -j29 -k 0 in a way that I don't understand.

We need to see if we can reproduce this build error on one of the 'ascic' builds locally.

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 5, 2022

FYI: I am trying to reproduce the 1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h error for the build:

  • rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables

on the machine 'hpws055'.

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 5, 2022

FYI: I tried to reproduce the build error 1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h for the build:

  • rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables

from the machine 'hpws055' and I was not successful in doing so. All of Thyra built just fine, including the executable Thyra_EpetraOperatorWrapper_UnitTests. However, the tests all crash showing:

... lookup error: /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/openmpi/1.10.1/lib/libmca_common_verbs.so.7: undefined symbol: ompi_common_verbs_usnic_register_fake_drivers

It appears you can't reproduce Trilinos PR builds on HPWS machines at SNL :-(

I will try reproducing on a real 'ascicgpu' machine.

Attempt to reproduce 'EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h' build error with 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1' build on 'hpws055':

Trying to reproduce the build error "EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h" on the machine 'hpws055'.

The repo version is:

$ ssh hpws055

$ cd /fgs/rabartl/Trilinos.base/Trilinos/

$ gitdist-status --dist-repos=.
----------------------------------------------------------------
| ID | Repo Dir        | Branch  | Tracking Branch | C | M | ? |
|----|-----------------|---------|-----------------|---|---|---|
|  0 | Trilinos (Base) | develop | github/develop  |   |   |   |
----------------------------------------------------------------

$ gitdist-repo-versions --dist-repos=.
*** Base Git Repo: Trilinos
7256b6e3b61225859d96d22aed7757b446144861 [Fri Aug 5 15:08:43 2022 -0600] <[email protected]>
Merge pull request #10814 from iyamazaki/amesos2-pardiso

Doing the configure, build, and test with:

$ ssh hpws055

$ cd /fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/

$ cat load-env-and-cmake-frag-file.sh
if [[ -e GenConfigSettings.cmake ]] ; then
  echo "Remvoing existing file GenConfigSettings.cmake ..."
  rm GenConfigSettings.cmake
fi
source /fgs/rabartl/Trilinos.base/Trilinos/packages/framework/GenConfig/gen-config.sh \
--cmake-fragment GenConfigSettings.cmake \
rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables \
--force \
"$@"

$ cat do-configure
if [[ -e CMakeCache.txt ]] ; then
  echo "Removing CMakeCache.txt ..."
  rm CMakeCache.txt
fi
if [[ -d CMakeFiles ]] ; then
  echo "Removing CMakeFiles ..."
  rm -r CMakeFiles
fi
cmake \
-G Ninja \
-C GenConfigSettings.cmake \
-D Trilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=OFF \
-D Trilinos_ENABLE_TESTS=ON \
-D Trilinos_TRACE_ADD_TEST=ON \
"$@" \
/fgs/rabartl/Trilinos.base/Trilinos

$ . load-env-and-cmake-frag-file.sh
Setting system to 'rhel7' based on specification in build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched environment name 'sems-gnu-7.2.0-openmpi-1.10.1-serial' in build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched complete configuration 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'
  for build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
* CMake fragment file written to: /fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/GenConfigSettings.cmake

$ time ./do-configure -DTrilinos_ENABLE_Thyra=ON &> configure.out && time ninja -j14 &> make.out && time ctest -j14 &> ctest.out

real    0m13.211s
user    0m6.981s
sys     0m5.088s

real    0m0.215s
user    0m0.049s
sys     0m0.043s

real    0m4.059s
user    0m14.974s
sys     0m9.000s

$ grep "failed out of" ctest.out 
2% tests passed, 80 tests failed out of 82

Well, everything built but I got a bunch of test failures. It seems the problem is:

$ ctest -VV -R "^ThyraCore_Simple2DModelEvaluatorUnitTests_MPI_1$"

...

test 1
    Start 1: ThyraCore_Simple2DModelEvaluatorUnitTests_MPI_1

1: Test command: /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/openmpi/1.10.1/bin/mpirun "--bind-to" "none" "-np" "1" "/fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/packages/thyra/core/test/nonlinear/models/UnitTests/ThyraCore_Simple2DModelEvaluatorUnitTests.exe"
1: Test timeout computed to be: 1500
1: /fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/packages/thyra/core/test/nonlinear/models/UnitTests/ThyraCore_Simple2DModelEvaluatorUnitTests.exe: symbol lookup error: /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/openmpi/1.10.1/lib/libmca_common_verbs.so.7: undefined symbol: ompi_common_verbs_usnic_register_fake_drivers
1: -------------------------------------------------------
1: Primary job  terminated normally, but 1 process returned
1: a non-zero exit code.. Per user-direction, the job has been aborted.
1: -------------------------------------------------------
1: --------------------------------------------------------------------------
1: mpirun detected that one or more processes exited with non-zero status, thus causing
1: the job to be terminated. The first process to do so was:
1: 
1:   Process name: [[42265,1],0]
1:   Exit code:    127
1: --------------------------------------------------------------------------
1/1 Test #1: ThyraCore_Simple2DModelEvaluatorUnitTests_MPI_1 ...***Failed  Required regular expression not found. Regex=[End Result: TEST PASSED
]  0.26 sec

...

Hum, it seems you can't reproduce Trilinos PR build test results from an HPWS machine :-(
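
To separate an environment problem like this from genuine test failures, saved `ctest -VV` output can be scanned for the loader message. The following is a minimal Python sketch; the function name is hypothetical and the sample text is shortened from the output above, with placeholder paths.

```python
import re

SUMMARY_RE = re.compile(r"(\d+)% tests passed, (\d+) tests failed out of (\d+)")
LOOKUP_RE = re.compile(r"symbol lookup error: (\S+): undefined symbol: (\S+)")


def summarize_ctest_output(text: str) -> dict:
    """Extract the pass/fail summary and any distinct symbol lookup errors."""
    summary = SUMMARY_RE.search(text)
    return {
        "failed": int(summary.group(2)) if summary else None,
        "total": int(summary.group(3)) if summary else None,
        # Each entry is a (library, missing symbol) pair.
        "lookup_errors": sorted(set(LOOKUP_RE.findall(text))),
    }


# Shortened sample; the paths are placeholders, not the real install locations.
sample = (
    "1: /path/to/Test.exe: symbol lookup error: "
    "/path/to/libmca_common_verbs.so.7: undefined symbol: "
    "ompi_common_verbs_usnic_register_fake_drivers\n"
    "2% tests passed, 80 tests failed out of 82\n"
)

result = summarize_ctest_output(sample)
print(result["failed"], "of", result["total"], "failed")
for library, symbol in result["lookup_errors"]:
    print("missing symbol", symbol, "in", library)
```

If nearly every failing test reports the same missing symbol, the machine's MPI environment is the likely culprit rather than the Trilinos build.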

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 6, 2022

FYI: I tried to reproduce the build error 1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h for the build:

  • rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables

from the machine 'ascicgpu17' and I was not successful in doing so. All of Thyra built just fine, including the executable Thyra_EpetraOperatorWrapper_UnitTests, and all of the tests ran successfully. That build was submitted to CDash here and showed all 82 Thyra tests passing.

Attempt to reproduce 'EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h' build error with 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1' build on 'ascicgpu17':

Trying to reproduce the build error "EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h" on the machine 'ascicgpu17'.

The repo version is:

$ ssh ascicgpu17

$ cd /fgs/rabartl/Trilinos.base/Trilinos/

$ gitdist-status --dist-repos=.
----------------------------------------------------------------
| ID | Repo Dir        | Branch  | Tracking Branch | C | M | ? |
|----|-----------------|---------|-----------------|---|---|---|
|  0 | Trilinos (Base) | develop | github/develop  |   |   |   |
----------------------------------------------------------------

$ gitdist-repo-versions --dist-repos=.
*** Base Git Repo: Trilinos
7256b6e3b61225859d96d22aed7757b446144861 [Fri Aug 5 15:08:43 2022 -0600] <[email protected]>
Merge pull request #10814 from iyamazaki/amesos2-pardiso

Doing the configure, build, and test with:

$ ssh ascicgpu

$ cd /fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/

$ cat load-env-and-cmake-frag-file.sh
if [[ -e GenConfigSettings.cmake ]] ; then
  echo "Remvoing existing file GenConfigSettings.cmake ..."
  rm GenConfigSettings.cmake
fi
source /fgs/rabartl/Trilinos.base/Trilinos/packages/framework/GenConfig/gen-config.sh \
--cmake-fragment GenConfigSettings.cmake \
rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables \
--force \
"$@"

$ cat do-configure
if [[ -e CMakeCache.txt ]] ; then
  echo "Removing CMakeCache.txt ..."
  rm CMakeCache.txt
fi
if [[ -d CMakeFiles ]] ; then
  echo "Removing CMakeFiles ..."
  rm -r CMakeFiles
fi
cmake \
-G Ninja \
-C GenConfigSettings.cmake \
-D Trilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES=OFF \
-D Trilinos_ENABLE_TESTS=ON \
-D Trilinos_TRACE_ADD_TEST=ON \
"$@" \
/fgs/rabartl/Trilinos.base/Trilinos

$ script load-env-and-cmake-frag-file.out
Script started, file is load-env-and-cmake-frag-file.out
[rabartl@ascicgpu17 rhel7_sems-gnu-7.2.0-openmpi-1.10.1]$ . load-env-and-cmake-frag-file.sh
Remvoing existing file GenConfigSettings.cmake ...
Using system 'rhel7' based on matching hostname 'ascicgpu17'.
Overriding system to 'rhel7' based on specification in build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched environment name 'sems-gnu-7.2.0-openmpi-1.10.1-serial' in build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
Matched complete configuration 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'
  for build name 'rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables'.
* CMake fragment file written to: /fgs/rabartl/Trilinos.base/BUILDS/PR/rhel7_sems-gnu-7.2.0-openmpi-1.10.1/GenConfigSettings.cmake

./do-configure -DTrilinos_ENABLE_Thyra=ON &> configure.out && time make dashboard.out &> make.dashboard.out

real    0m17.319s
user    0m6.755s
sys     0m7.218s

real    3m53.695s
user    29m12.527s
sys     5m55.026s

$ grep "failed out of" make.dashboard.out 
100% tests passed, 0 tests failed out of 82

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 8, 2022

FYI: There is independent confirmation in new issue #10842 of the error 1. EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h. I will move my analysis of this error over to that issue.

NOTE: My current hypothesis is that an older version of Trilinos from a couple of weeks ago showed this error but has since been fixed on 'develop'. I will test that hypothesis out and document findings in #10842.

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 8, 2022

FYI: There is another clue in #10842 (comment). It seems that you might see the error 1. EpetraOperatorWrapper_UnitTests.cpp missing Trilinos_Util_CrsMatrixGallery.h when running out of disk space.

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 8, 2022

FYI: I made my last very careful effort to reproduce the EpetraOperatorWrapper_UnitTests.cpp cannot open Trilinos_Util_CrsMatrixGallery.h error in #10842 (comment) for the 'vortex' build for PR #10808 and I was not able to do so (i.e. it passed the build).

@jhux2
Member

jhux2 commented Aug 24, 2022

Note that this issue is also tracking what was reported in #10906.

"In some PR testing, compile failures are erroneously showing up under the subproject Zoltan2Sphyx."

@bartlettroscoe bartlettroscoe changed the title Framework: New ascicgpu CUDA PR build failing with strange 'cmake --build . --config Release -- -j29 -k 0' error Framework: Some PR build errors showing up as strange 'cmake --build . --config Release -- -j29 -k 0' errors Aug 24, 2022
@bartlettroscoe
Member Author

FYI: Still no XML files being archived in the Jenkins jobs to allow us to debug what is causing this behavior. See TRILINOSHD-188.

@bartlettroscoe
Member Author

FYI: We are still seeing a bunch of these cases where errors are reported to Zoltan2Sphynx as seen here over the last 2 days with 7 PR iterations showing failures:

image

@bartlettroscoe
Member Author

FYI: See #10836 (comment) and #10836 (comment).

@bartlettroscoe
Member Author

bartlettroscoe commented Sep 7, 2022

CC: @e10harvey, @zackgalbreath

FYI: The problem of reporting the global cmake --build . [other arguments] command does not seem to be solved. In the build PR-10962-test-rhel7_sems-gnu-7.2.0-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_no-mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1025 that just ran an hour ago, it shows the build errors:

image

which shows a build error in the example object file:

packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o

Why is that build error not being reported along with the Compadre package?

The Build.xml file archived in:

shown here is given below.

What is strange about these two build errors is that they are for the same Compadre build error:

/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp: In function ‘int main(int, char**)’:
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp:503:9: error: iteration 2147483647 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
         for (int j=0; j<dimension-1; ++j) {
         ^~~
[CTest: warning matched] /scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp:503:24: note: within this loop
         for (int j=0; j<dimension-1; ++j) {
                       ~^~~~~~~~~~~~
[CTest: warning matched] cc1plus: all warnings being treated as errors

and the Build.xml file shows two entries for the same build error. It is almost like the ctest -S process is running the build twice: once with launchers turned on and a follow up build with launchers turned off.

The second failure for the global cmake --build command entry in the XML file shows:

		<Failure type="Error">
			<!-- Meta-information about the build action -->
			<Action/>
			<!-- Details of command -->
			<Command>
				<WorkingDirectory>/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/pull_request_test</WorkingDirectory>
				<Argument>/projects/sems/install/rhel7-x86_64/sems/utility/cmake/3.19.1/bin/cmake</Argument>
				<Argument>--build</Argument>
				<Argument>.</Argument>
				<Argument>--config</Argument>
				<Argument>Debug</Argument>
				<Argument>--</Argument>
				<Argument>-j20</Argument>
				<Argument>-k</Argument>
				<Argument>0</Argument>
			</Command>
			<!-- Result of command -->
			<Result>
				<StdOut>[1/13366] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_NumericTraits.cpp.o
[2/13366] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o
[3/13366] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_MemorySpace.cpp.o
[4/13366] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_MemoryPool.cpp.o
[5/13366] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Spinwait.cpp.o
...
FAILED: packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o 
"/projects/sems/install/rhel7-x86_64/sems/utility/cmake/3.19.1/bin/ctest" --launch --target-name Compadre_GMLS_Manifold_Test --build-dir /scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/pull_request_test/packages/compadre/examples --output packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o --source /scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp --language CXX --filter-prefix "" -- /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/base/bin/g++  -I. -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples -Ipackages/compadre/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/src/basis -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/src/constraints -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/src/tpl -Ipackages/kokkos/core/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos/core/src -Ipackages/kokkos -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos/core/src/../../tpls/desul/include -Ipackages/kokkos/containers/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos/containers/src -Ipackages/kokkos/algorithms/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos/algorithms/src -Ipackages/kokkos-kernels/src -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src 
-I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/impl -Ipackages/kokkos-kernels/src/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/impl/tpls -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/blas -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/blas/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/sparse -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/sparse/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/graph -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/graph/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/batched -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/batched/dense -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/batched/dense/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/batched/sparse -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/batched/sparse/impl -I/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/kokkos-kernels/src/common -isystem /projects/sems/install/rhel7-x86_64/sems/tpl/superlu/4.3/gcc/7.2.0/base/include -pedantic -Wall -Wno-long-long -Wwrite-strings -Wall -Wno-clobbered -Wno-vla -Wno-pragmas -Wno-unknown-pragmas -Wno-unused-local-typedefs 
-Wno-literal-suffix -Wno-deprecated-declarations -Wno-misleading-indentation -Wno-int-in-bool-context -Wno-maybe-uninitialized -Wno-nonnull-compare -Wno-address -Wno-inline -Wno-unused-but-set-variable -Wno-unused-variable -Wno-unused-label -Werror -DTRILINOS_HIDE_DEPRECATED_HEADER_WARNINGS   -O3 -DNDEBUG -std=c++14 -MD -MT packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o -MF packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o.d -o packages/compadre/examples/CMakeFiles/Compadre_GMLS_Manifold_Test.dir/GMLS_Manifold.cpp.o -c /scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp: In function ‘int main(int, char**)’:
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp:503:9: error: iteration 2147483647 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
         for (int j=0; j&lt;dimension-1; ++j) {
         ^~~
[CTest: warning matched] /scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-serial/Trilinos/packages/compadre/examples/GMLS_Manifold.cpp:503:24: note: within this loop
         for (int j=0; j&lt;dimension-1; ++j) {
                       ~^~~~~~~~~~~~
[CTest: warning matched] cc1plus: all warnings being treated as errors
[8514/13366] Building CXX object packages/stk/stk_util/stk_util/util/CMakeFiles/stk_util_util.dir/tokenize.cpp.o
[8515/13366] Building CXX object packages/stk/stk_util/stk_util/util/CMakeFiles/stk_util_util.dir/human_bytes.cpp.o
[8516/13366] Building CXX object packages/stk/stk_util/stk_util/environment/CMakeFiles/stk_util_env.dir/CPUTime.cpp.o
...
[13363/13366] Linking CXX executable packages/piro/test/Piro_ThyraSolver.exe
[13364/13366] Building CXX object packages/trilinoscouplings/examples/scaling/CMakeFiles/TrilinosCouplings_Example_Poisson_NoFE_Tpetra.dir/example_Poisson_NoFE_Tpetra.cpp.o
[13365/13366] Linking CXX executable packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson_NoFE_Tpetra.exe
ninja: build stopped: cannot make progress due to previous errors.</StdOut>
				<StdErr/>
				<ExitCondition>1</ExitCondition>
			</Result>
		</Failure>

This is so strange.
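
One way to spot this signature in an archived Build.xml is to separate `<Failure>` entries whose command is the outer `cmake --build .` call from those that name a real compile or link command. The following is a minimal Python sketch using the standard library; the function names are hypothetical, and the sample XML is a simplified stand-in with placeholder paths rather than a real CTest file.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Simplified, hypothetical stand-in for the <Failure> entries in a Build.xml file.
SAMPLE_BUILD_XML = """\
<Build>
  <Failure type="Error">
    <Command>
      <Argument>/usr/bin/g++</Argument>
      <Argument>-c</Argument>
      <Argument>GMLS_Manifold.cpp</Argument>
    </Command>
    <Result><ExitCondition>1</ExitCondition></Result>
  </Failure>
  <Failure type="Error">
    <Command>
      <Argument>/usr/bin/cmake</Argument>
      <Argument>--build</Argument>
      <Argument>.</Argument>
      <Argument>--config</Argument>
      <Argument>Debug</Argument>
    </Command>
    <Result><ExitCondition>1</ExitCondition></Result>
  </Failure>
</Build>
"""


def is_outer_build_command(argv):
    """True when argv looks like the top-level `cmake --build .` invocation."""
    return (
        len(argv) >= 3
        and PurePosixPath(argv[0]).name == "cmake"
        and argv[1:3] == ["--build", "."]
    )


def classify_failures(build_xml_text):
    """Split <Failure> commands into (outer build commands, per-target commands)."""
    root = ET.fromstring(build_xml_text)
    outer, per_target = [], []
    for failure in root.iter("Failure"):
        argv = [arg.text or "" for arg in failure.findall("./Command/Argument")]
        (outer if is_outer_build_command(argv) else per_target).append(argv)
    return outer, per_target


outer, per_target = classify_failures(SAMPLE_BUILD_XML)
print("outer build failures:", len(outer))
print("per-target failures:", len(per_target))
```

A build that has any entries in the first list shows the duplicate reporting described above; a build with only per-target entries is reporting errors the way it should.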

@bartlettroscoe
Member Author

bartlettroscoe commented Sep 21, 2022

FYI: The behavior described above turns out to be a CTest defect. For details and to follow the fix, see:

Unfortunately, I think that means we will need to upgrade CMake/CTest on all client machines to fix this which will require waiting for CMake 3.25.0 in Jan 2023 (or perhaps a patch release of CMake 3.24).

Update: The fix is going to come out in CMake 3.24.3!

@bartlettroscoe bartlettroscoe added the Waiting Waiting for some external team to do something before this can be completed label Oct 19, 2022
@bartlettroscoe
Member Author

FYI: The fix for this is in CMake 3.24.3 (released 2022-11-01). (See SNL Kitware #209.) Next: Install CMake 3.24.3 everywhere and use it with Trilinos PR builds ...

@bartlettroscoe
Member Author

With the upgrade of CMake 3.24.3 for all of the Trilinos PR builds yesterday, this should be resolved (see TRILINOSHD-228). For example, we are only seeing build errors for actual targets in the PR builds over the last day shown here and we see just the build error for the target:

Error building packages/seacas/libraries/ioss/src/exodus/CMakeFiles/Ioex.dir/Ioex_ParallelDatabaseIO.C.o

in the build PR-11309-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1484.

Using a version of CMake between versions 3.19 and 3.24.2 (inclusive), we would have seen that same error showing up along with the entire ninja build output for all targets (including all of the warnings, which were the cause of #10836).
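
A quick way to check whether a given client machine is still exposed is to compare its CMake version against the affected range. The following is a minimal Python sketch with hypothetical names. It assumes plain numeric versions, takes 3.24.3 as the first fixed release, and uses 3.18 from the status note at the top of this issue as the lower bound; since this comment says 3.19, treat that bound as adjustable.

```python
FIRST_AFFECTED = (3, 18, 0)  # assumption: lower bound from the status note in this issue
FIRST_FIXED = (3, 24, 3)     # release that carries the CTest fix


def parse_version(text: str) -> tuple:
    """Parse a plain numeric version such as '3.21.1' into a 3-tuple of ints."""
    parts = tuple(int(part) for part in text.split("."))
    # Pad short versions such as '3.19' so comparisons use three components.
    return parts + (0,) * (3 - len(parts))


def reports_outer_build_error(cmake_version: str) -> bool:
    """True when the version falls in the range expected to show the defect."""
    return FIRST_AFFECTED <= parse_version(cmake_version) < FIRST_FIXED


for version in ("3.19.1", "3.21.1", "3.24.2", "3.24.3"):
    print(version, reports_outer_build_error(version))
```

The versions 3.19.1 and 3.21.1 in the loop are the ones that appear in the failing command lines quoted earlier in this issue.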

Closing this as complete.

Boy, that was a hard one to diagnose. But the fact that Kitware was willing to patch CMake 3.24.3, SEMS was willing to install CMake 3.24.3, and the Trilinos Framework team was willing and able to upgrade all of the PR builds is what allowed this to be fixed relatively quickly.
