ATDM Trilinos 'sems-rhel7' configuration broken on new CEE hpwsXYZ and cee-buildXYZ machines #10022
Comments
@trilinos/framework, my understanding is that CEE is pushing everyone to move to these HPWS machines and is eliminating the RWS and EWS machines. I know the current ATDM Trilinos 'sems-rhel7' builds are running on ascicgpu machines, but most Trilinos developers can't get access to those machines. (For example, I don't think I currently have access to any ascicgpu machines.) I need a working set of Trilinos builds in order to test TriBITS work associated with TriBITSPub/TriBITS#367. Otherwise, I just have to do my best to test locally, cross my fingers, and merge to Trilinos 'develop'. But I would rather run a more comprehensive set of builds before merging future TriBITS changes.
@bartlettroscoe: I don't see this build error showing up in the PrimaryATDM or SecondaryATDM builds. I will raise this GitHub issue during our meeting tomorrow. CC: @jwillenbring, @ZUUL42
@e10harvey, I suspect that is because the builds posting to CDash are running on ASCICGPU machines, not HPWS machines. The latter are fairly new. See my comment above.
FYI: As a basis of comparison, I ran the exact same Teuchos 'sems-rhel7' builds with the exact same version of Trilinos on the CEE build machine 'cee-build015' and I got all passing builds and tests (except for 3 failing tests from the CUDA build because this machine does not have a GPU). This shows the problem is with the SEMS modules and/or the ATDM Trilinos env and/or the HPWS machines themselves.
Details of Teuchos 'sems-rhel7' builds on 'cee-build015' and results on CDash:
Now to run the full set of supported builds for Teuchos on the machine 'cee-build015':
posted to: with only 3 failing tests in the CUDA build: and all 3 because this machine does not have a GPU and displays the error:
as shown in this query:
FYI: The RHEL7 version is newer on the HPWS machine 'hpws055' than on the older CEE machine 'cee-build015'. The details are given in TRILINOSHD-59. It may just be that the SEMS modules are broken for the newer OS. Hopefully the newer SEMS modules built with Spack will work okay.
@fryeguy52: Would you please look into the module issues on the HPWS machine?
FYI: I got a rude reminder this is still broken as documented in #10836 (comment).
And I also hit this as described in #10823 (comment). I keep forgetting that the SEMS modules are broken on the CEE High Performance Work Station (HPWS) machines :-(
It seems this same problem is impacting the new 'ascic0xy' machines as well. See #10999.
Well shoot, this same error is occurring on the 'cee-build030' machine as well :-( I am trying to run full Trilinos PR builds as part of testing a CMake upgrade for #10355 on the beefy machine 'cee-build030' and I can't successfully load a GenConfig env. Specifically, for the env load script
When I source it, it produces:
And then when I try to configure and build Trilinos, the mpiexec command used by Trilinos produces errors like:
While this may be a problem with the SEMS modules, it seems that GenConfig can't figure out if it loaded the env correctly or not.
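To make the "did the env actually load?" question concrete, here is a minimal sketch of a post-load sanity check. It is not how GenConfig works; the function name `env_looks_healthy`, the marker strings, and the idea of using a cheap probe command are all assumptions for illustration. The check runs a probe command in the already-loaded environment and treats a non-zero exit status, or dynamic-loader error text such as the `undefined symbol` message reported in this issue, as a broken env.

```python
import subprocess

# Assumption: these substrings are enough to recognize a dynamic-loader failure.
# "undefined symbol" is the text reported in this issue; the second is a common
# glibc loader message. Neither list entry comes from GenConfig itself.
BAD_MARKERS = ("undefined symbol", "error while loading shared libraries")


def env_looks_healthy(probe_cmd, timeout=60):
    """Run probe_cmd in the current environment and return (ok, reason).

    probe_cmd is a list such as ["mpiexec", "--version"] (hypothetical choice).
    """
    try:
        result = subprocess.run(
            probe_cmd, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Covers a missing executable as well as a probe that hangs.
        return False, f"could not run probe: {exc}"

    combined = result.stdout + result.stderr
    for marker in BAD_MARKERS:
        if marker in combined:
            return False, f"probe output contains '{marker}'"

    if result.returncode != 0:
        return False, f"probe exited with status {result.returncode}"

    return True, "ok"
```

A load script could call something like `env_looks_healthy(["mpiexec", "--version"])` right after loading modules and refuse to continue when it returns `False`. Whether `mpiexec --version` alone would trigger the failing library load on these machines is not confirmed here; a small MPI test program may be a more reliable probe.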
See #10999 (comment)
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
This issue was closed due to inactivity for 395 days.
@trilinos/framework
Description
It seems the SEMS modules are broken on the new CEE 'hpwsXYZ' and 'cee-buildXYZ' machines, and the Trilinos GenConfig scripts don't detect that they are broken (see below).
Original Description
It appears that the SEMS RHEL7 modules or perhaps just the ATDM Trilinos 'sems-rhel7' configuration is broken on the new CEE HPWS machines. It seems the builds complete but a lot of tests fail with errors like:
all showing `undefined symbol: ompi_common_verbs_usnic_register_fake_drivers`.
To demonstrate this, I used a recent version of Trilinos 'develop' from ATDM Trilinos testing day 2021-12-21 (55091be) as shown by:
that was pretty clean as shown in this query.
To demonstrate the problem, I ran all of the supported ATDM Trilinos 'sems-rhel7' builds for just the Teuchos test suite on my machine 'hpws055' as:
Detailed output from ctest-s-local-test-drivers.sh:
which posted to CDash as shown here:
There are 126 failing tests for each build, 756 failing tests across all of these builds as shown here:
All of these failing tests show the error `undefined symbol: ompi_common_verbs_usnic_register_fake_drivers` as shown in this query:
Internal issues: