Some tests hang on ppc64le and aarch64 #1196

Open
junghans opened this issue Dec 6, 2024 · 26 comments
Labels
bug Something isn't working

Comments

@junghans
Contributor

junghans commented Dec 6, 2024

E.g. see the build here: https://koji.fedoraproject.org/koji/taskinfo?taskID=126510517

@aprokop
Contributor

aprokop commented Dec 6, 2024

@junghans Thank you for the report. I looked through the logs, but can't see it running any tests. Could you please point me to where to look?

@JBludau pointed out that it seems to still show progress in compilation, albeit very slowly.

@junghans
Contributor Author

junghans commented Dec 6, 2024

The end of the aarch64 build (https://kojipkgs.fedoraproject.org//work/tasks/562/126510562/build.log) looks like:

+ /usr/bin/ctest --test-dir aarch64-redhat-linux-gnu-serial --output-on-failure --force-new-ctest-process -j1
Internal ctest changing into directory: /builddir/build/BUILD/ArborX-1.7-build/ArborX-1.7/aarch64-redhat-linux-gnu-serial
Test project /builddir/build/BUILD/ArborX-1.7-build/ArborX-1.7/aarch64-redhat-linux-gnu-serial
      Start  1: ArborX_Test_DetailsUtils
 1/11 Test  #1: ArborX_Test_DetailsUtils .................   Passed    0.05 sec
      Start  2: ArborX_Test_Geometry
 2/11 Test  #2: ArborX_Test_Geometry .....................   Passed    0.01 sec
      Start  3: ArborX_Test_QueryTree
 3/11 Test  #3: ArborX_Test_QueryTree ....................   Passed    0.99 sec
      Start  4: ArborX_Test_DetailsTreeConstruction
 4/11 Test  #4: ArborX_Test_DetailsTreeConstruction ......   Passed    0.02 sec
      Start  5: ArborX_Test_DetailsContainers
 5/11 Test  #5: ArborX_Test_DetailsContainers ............   Passed    0.06 sec
      Start  6: ArborX_Test_DetailsCrsGraphWrapperImpl
 6/11 Test  #6: ArborX_Test_DetailsCrsGraphWrapperImpl ...   Passed    0.01 sec
      Start  7: ArborX_Test_Clustering
 7/11 Test  #7: ArborX_Test_Clustering ...................   Passed    0.07 sec
      Start  8: ArborX_Test_DetailsClusteringHelpers
 8/11 Test  #8: ArborX_Test_DetailsClusteringHelpers .....   Passed    0.04 sec
      Start  9: ArborX_Test_SpecializedTraversals

and it hangs there.

@junghans
Contributor Author

junghans commented Dec 6, 2024

For ppc64le, some tests are actually failing (see https://kojipkgs.fedoraproject.org//work/tasks/563/126510563/build.log):

 7/11 Test  #7: ArborX_Test_Clustering ...................***Failed    9.60 sec
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Running 10 test cases...
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [2]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Connected cores do not belong to the same cluster: 0 [1] -> 1 [2]
Connected cores do not belong to the same cluster: 1 [2] -> 0 [1]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Cluster IDs are not unique
Core point is marked as noise: 12 [-1]
Noise point does not have index -1: 1 [2]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Noise point does not have index -1: 0 [1]
Connected cores do not belong to the same cluster: 0 [1] -> 1 [2]
Connected cores do not belong to the same cluster: 1 [2] -> 0 [1]
....

@aprokop aprokop added the bug Something isn't working label Dec 6, 2024
@aprokop
Contributor

aprokop commented Dec 6, 2024

I tested on a Power 10 system (Mammatus on Franken). Both current master and 1.7 are fine there. I wonder what's going on. I tested both Debug and RelWithDebugInfo builds.
The failure seems to be similar to #1112.

Surprisingly, I can produce a different failure for the clustering helpers test. Some dendrogram tests fail:

$ OMP_NUM_THREADS=30 ./ArborX_Test_Clustering.exe
/home/users/aprokop/code/arborx/test/tstDendrogram.cpp(187): error: in "Dendrogram/dendrogram_boruvka<Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>": check parents_boruvka == parents_union_find has failed
  - mismatch at position 831: [1510 == 1511] is false
  - mismatch at position 1306: [1511 == 1510] is false
  - mismatch at position 1510: [1557 == 2205] is false
  - mismatch at position 1511: [2205 == 1557] is false
  - mismatch at position 5871: [1510 == 1511] is false
  - mismatch at position 5903: [1511 == 1510] is false

*** 1 failure is detected in the test module "Master Test Suite"

Not very surprising because that test is hard to design as dendrograms may shift slightly.
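
One way a dendrogram can "shift slightly" is through tied edge weights: if two edges have equal weight, the order in which they are merged is arbitrary, and two correct algorithms may pick different orders. The sketch below illustrates that idea only. It is not ArborX code, and whether ties are what caused the mismatch above is an assumption. The names `Edge`, `DisjointSets`, and `dendrogram_parents`, and the node numbering (vertices first, then one node per edge), are hypothetical.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical edge type for illustration (not an ArborX type).
struct Edge
{
  int u;
  int v;
  double weight;
};

// Minimal union-find with path halving.
struct DisjointSets
{
  std::vector<int> parent;

  explicit DisjointSets(int n)
      : parent(n)
  {
    std::iota(parent.begin(), parent.end(), 0);
  }

  int find(int x)
  {
    while (parent[x] != x)
    {
      parent[x] = parent[parent[x]];
      x = parent[x];
    }
    return x;
  }
};

// Builds a dendrogram parent array by merging clusters along tree edges in
// the given processing order. Assumed numbering: vertices are nodes
// 0..num_vertices-1, edge k is node num_vertices + k, the root has parent -1.
std::vector<int> dendrogram_parents(int num_vertices,
                                    std::vector<Edge> const &edges,
                                    std::vector<int> const &order)
{
  int const num_edges = static_cast<int>(edges.size());
  assert(static_cast<int>(order.size()) == num_edges);

  std::vector<int> parents(num_vertices + num_edges, -1);
  DisjointSets sets(num_vertices);

  // top[r] is the dendrogram node currently representing the cluster whose
  // union-find root is r.
  std::vector<int> top(num_vertices);
  std::iota(top.begin(), top.end(), 0);

  for (int k : order)
  {
    int const a = sets.find(edges[k].u);
    int const b = sets.find(edges[k].v);
    assert(a != b); // the edges are expected to form a tree

    int const node = num_vertices + k;
    parents[top[a]] = node;
    parents[top[b]] = node;
    sets.parent[b] = a; // merge the two clusters
    top[a] = node;
  }
  return parents;
}
```

For three collinear points with edges (0,1) and (1,2) of equal weight, processing the edges in order {0, 1} and in order {1, 0} gives two different parent arrays, each describing a valid single-linkage hierarchy. An element-wise comparison of the two would report mismatches of the swapped-pair kind seen in the log above.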

@aprokop
Contributor

aprokop commented Dec 6, 2024

I tested on a Power 10 system (Mammatus on Franken). Both current master and 1.7 are fine there. I wonder what's going on. I tested both Debug and RelWithDebugInfo builds.

Never mind. Only Debug passes. RelWithDebInfo (NOT RelWithDebugInfo :|) reproduces the failure. This is very similar to #1113. So maybe it's not just Intel; there's something about optimization that produces a wrong result. I'm really not sure how to start figuring this out.

@junghans
Copy link
Contributor Author

junghans commented Dec 6, 2024

@JBludau pointed out that it seems to still show progress in compilation, albeit very slowly.

Sometimes they run for weeks: https://koji.fedoraproject.org/koji/tasks?owner=junghans&state=active&view=tree&method=all&order=-id

@aprokop
Contributor

aprokop commented Dec 10, 2024

@junghans Is there a way to test current master? I looked further into DBSCAN failures, and while I can reproduce them with 1.7, I can't reproduce them with the current master. I think something in refactoring may have fixed it.

@junghans
Contributor Author

I tested master here: https://koji.fedoraproject.org/koji/taskinfo?taskID=126684443
(Yes, it says v1.7, but it is 37adf1a)

aarch64 passed, but ppc64le still has one failing test:

 8/14 Test  #8: ArborX_Test_Clustering ...................***Failed    0.37 sec
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Running 10 test cases...
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [2]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Connected cores do not belong to the same cluster: 0 [1] -> 1 [2]
Connected cores do not belong to the same cluster: 1 [2] -> 0 [1]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Cluster IDs are not unique
Core point is marked as noise: 12 [-1]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [2]
Noise point does not have index -1: 1 [1]
Noise point does not have index -1: 0 [1]
Connected cores do not belong to the same cluster: 1 [2] -> 0 [1]
Connected cores do not belong to the same cluster: 0 [1] -> 1 [2]
Noise point does not have index -1: 0 [1]
Noise point does not have index -1: 1 [1]
Cluster IDs are not unique
Core point is marked as noise: 12 [-1]
Noise point does not have index -1: 0 [0]
/builddir/build/BUILD/ArborX-1.7-build/ArborX-master/test/tstDBSCAN.cpp(202): error: in "DBSCAN/dbscan<Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>": check verifyDBSCAN(space, points, 2 * r, 2, dbscan(space, points, 2 * r, 2, params)) has failed
*** 1 failure is detected in the test module "Master Test Suite"

@aprokop
Contributor

aprokop commented Jan 5, 2025

@junghans Can you please try the latest master (with #1198 merged)?

@junghans
Contributor Author

junghans commented Jan 5, 2025

@junghans
Contributor Author

junghans commented Jan 6, 2025

x86_64 fails to build for a different reason (some ROCm issue), but aarch64 is also still failing in testing:

10/12 Test #10: ArborX_Test_SpecializedTraversals ........***Failed    0.01 sec
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Running 10 test cases...
/builddir/build/BUILD/ArborX-1.7-build/ArborX-master/test/tstNeighborList.cpp(177): error: in "find_neighbor_list_compare_filtered_tree_traversal<Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>": check Test::buildHalfNeighborListAndExpandToFull(exec_space, points, radius) == Test::compute_reference<MemorySpace>(exec_space, points, radius) has failed
  - mismatch at position 31: [( 34 51 54 73 76 ) == ( 51 76 )] is false
  - mismatch at position 33: [( 34 54 73 76 ) == ( 76 )] is false
  - mismatch at position 34: [( 31 33 54 73 ) == ( )] is false
  - mismatch at position 40: [( 25 54 65 76 ) == ( 25 76 )] is false
  - mismatch at position 54: [( 31 33 34 40 73 76 ) == ( 76 )] is false
  - mismatch at position 65: [( 25 40 ) == ( 25 )] is false
  - mismatch at position 73: [( 31 33 34 54 ) == ( )] is false
*** 1 failure is detected in the test module "Master Test Suite"

ppc64le is fine.

@junghans
Contributor Author

junghans commented Jan 6, 2025

Actually, this is the reverse of before: now aarch64 is failing and ppc64le is fine.

@aprokop
Contributor

aprokop commented Jan 6, 2025

Thanks for testing!

Interesting. So the failing test is actually different from the one before.

@junghans
Contributor Author

junghans commented Jan 6, 2025

Ah you are right, it is indeed a different test!

@junghans
Contributor Author

junghans commented Feb 14, 2025

c3ebac5 on the latest Fedora (F43) with gcc 15.0.1: https://koji.fedoraproject.org/koji/taskinfo?taskID=129212436

aarch64 still fails, now with:

4/14 Test  #4: ArborX_Test_QueryTree ....................***Failed    2.63 sec
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Running 240 test cases...
/builddir/build/BUILD/ArborX-1.7-build/ArborX-master/test/tstQueryTreeComparisonWithBoost.cpp(186): error: in "ComparisonWithBoost/boost_rtree_spatial_predicate<TreeExecutionAndMemorySpaces<ArborX_Legacy_BVH_KDOP18_double_ Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>>": check query(ExecutionSpace{}, tree, intersects_queries) == (query(ExecutionSpace{}, rtree, intersects_queries_host)) has failed
  - mismatch at position 33: [( 884 895 906 907 908 1005 1016 1027 1028 1029 1126 1137 1148 1149 1150 ) == ( 884 885 886 895 896 897 906 907 908 1005 1006 1007 1016 1017 1018 1027 1028 1029 1126 1127 1128 1137 1138 1139 1148 1149 1150 )] is false
*** 1 failure is detected in the test module "Master Test Suite"

x86_64 is fine.

And ppc64le:

4/12 Test  #4: ArborX_Test_QueryTree ....................***Failed    2.95 sec
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Running 240 test cases...
/builddir/build/BUILD/ArborX-1.7-build/ArborX-master/test/tstQueryTreeCallbacks.cpp(322): error: in "Callbacks/callback_with_attachment_spatial_predicate<TreeExecutionAndMemorySpaces<ArborX_Legacy_BVH_KDOP14_double_ Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>>": check query<value_type>(ExecutionSpace{}, tree, (makeIntersectsWithAttachmentQueries<DeviceType, Box, Coordinate>( {bounds}, {delta})), CustomInlineCallbackWithAttachment<decltype(points)>{points}) == (make_compressed_storage(offsets, values)) has failed
  - mismatch at position 0: [( (0,5) (1,6.7320508075688767) (2,8.4641016151377535) (3,10.196152422706632) (4,11.928203230275509) ) == ( (0,5) (1,6.7320508075688767) (2,8.4641016151377535) (3,10.196152422706632) (4,11.928203230275509) (5,13.660254037844387) (6,15.392304845413264) (7,17.124355652982139) (8,18.856406460551018) (9,20.588457268119896) )] is false
/builddir/build/BUILD/ArborX-1.7-build/ArborX-master/test/tstQueryTreeCallbacks.cpp(329): error: in "Callbacks/callback_with_attachment_spatial_predicate<TreeExecutionAndMemorySpaces<ArborX_Legacy_BVH_KDOP14_double_ Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>>": check query<value_type>(ExecutionSpace{}, tree, (makeIntersectsWithAttachmentQueries<DeviceType, Box, Kokkos::Array<Coordinate, 2>>( {bounds}, {{0., delta}})), CustomPostCallbackWithAttachment<decltype(points)>{points}) == (make_compressed_storage(offsets, values)) has failed
  - mismatch at position 0: [( (0,5) (1,6.7320508075688767) (2,8.4641016151377535) (3,10.196152422706632) (4,11.928203230275509) ) == ( (0,5) (1,6.7320508075688767) (2,8.4641016151377535) (3,10.196152422706632) (4,11.928203230275509) (5,13.660254037844387) (6,15.392304845413264) (7,17.124355652982139) (8,18.856406460551018) (9,20.588457268119896) )] is false
*** 2 failures are detected in the test module "Master Test Suite"

@aprokop
Contributor

aprokop commented Feb 14, 2025

Thank you for the update. I haven't seen this before as it now seems to compare results with boost. Can you tell what version of boost is installed there?

/builddir/build/BUILD/ArborX-1.7-build/ArborX-master/test/tstQueryTreeCallbacks.cpp(322): error: in "Callbacks/callback_with_attachment_spatial_predicate<TreeExecutionAndMemorySpaces<ArborX_Legacy_BVH_KDOP14_double_ Kokkos__Device<Kokkos__OpenMP_ Kokkos__HostSpace>>>": check query<value_type>(ExecutionSpace{}, tree, (makeIntersectsWithAttachmentQueries<DeviceType, Box, Coordinate>( {bounds}, {delta})), CustomInlineCallbackWithAttachment<decltype(points)>{points}) == (make_compressed_storage(offsets, values)) has failed
 mismatch at position 0: [
( (0,5) (1,6.7320508075688767) (2,8.4641016151377535) (3,10.196152422706632) (4,11.928203230275509) ) ==
( (0,5) (1,6.7320508075688767) (2,8.4641016151377535) (3,10.196152422706632) (4,11.928203230275509) (5,13.660254037844387) (6,15.392304845413264) (7,17.124355652982139) (8,18.856406460551018) (9,20.588457268119896) )] is false

This does not make any sense to me. The problem creates points (i, i, i) for i = 0, ..., 9. It then constructs a tree on them, grabs the bounding box of that tree, and computes intersections, returning distances to the origin. I have no clue why the query only returns the first 5 points (indices 0 through 4). Could the bounding box be computed wrongly?
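
For reference, the expected side of that comparison can be reproduced by brute force without ArborX. This is only a sketch of the setup as described above. The helper names (`make_diagonal_points`, `bounding_box`, `contains`, `reference_result`) are hypothetical, and the assumption that each value is `delta` plus the distance to the origin, with `delta = 5`, is inferred from the numbers in the log rather than taken from the test source.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

using Point = std::array<double, 3>;

// Axis-aligned box (hypothetical stand-in for the tree's bounding volume).
struct Box
{
  Point min_corner;
  Point max_corner;
};

// Points (i, i, i) for i = 0, ..., n-1.
std::vector<Point> make_diagonal_points(int n)
{
  std::vector<Point> points;
  for (int i = 0; i < n; ++i)
  {
    double const x = static_cast<double>(i);
    points.push_back({x, x, x});
  }
  return points;
}

// Tight axis-aligned bounding box of a non-empty point set.
Box bounding_box(std::vector<Point> const &points)
{
  assert(!points.empty());
  Box box{points.front(), points.front()};
  for (auto const &p : points)
    for (int d = 0; d < 3; ++d)
    {
      box.min_corner[d] = std::fmin(box.min_corner[d], p[d]);
      box.max_corner[d] = std::fmax(box.max_corner[d], p[d]);
    }
  return box;
}

// Closed-interval containment test in every dimension.
bool contains(Box const &box, Point const &p)
{
  for (int d = 0; d < 3; ++d)
    if (p[d] < box.min_corner[d] || p[d] > box.max_corner[d])
      return false;
  return true;
}

// Brute-force reference: (index, delta + distance to origin) for every
// point that lies inside the box. The "delta +" form is an assumption.
std::vector<std::pair<int, double>>
reference_result(std::vector<Point> const &points, Box const &box,
                 double delta)
{
  std::vector<std::pair<int, double>> result;
  for (int i = 0; i < static_cast<int>(points.size()); ++i)
  {
    auto const &p = points[i];
    if (contains(box, p))
      result.emplace_back(
          i, delta + std::sqrt(p[0] * p[0] + p[1] * p[1] + p[2] * p[2]));
  }
  return result;
}
```

Calling `reference_result(points, bounding_box(points), 5.0)` on `make_diagonal_points(10)` yields all ten points, since every point lies inside the box spanning (0, 0, 0) to (9, 9, 9). A result that stops at index 4 is what a box with an upper corner between 4 and 5 would produce, which is consistent with the bounding-box suspicion but does not prove it.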

I need to be able to reproduce this to figure things out. How does one usually reproduce those builds without having access to the machines?

@junghans
Contributor Author

junghans commented Feb 14, 2025

It is boost-devel aarch64 1.83.0-12.fc42.

You would need a ppc64le machine to reproduce that, or on Fedora you could emulate it. Here is a how-to from another project: https://www.votca.org/DEVELOPERS_GUIDE.html#failed-release-builds

The error is on rawhide-ppc64le.

You also need my spec file.

ArborX.spec.txt

@junghans
Contributor Author

Give me a min and I will test it on my Fedora machine.

@aprokop
Contributor

aprokop commented Feb 14, 2025

You would need a ppc64le machine to reproduce that, or on Fedora you could emulate it. Here is a how-to from another project: votca.org/DEVELOPERS_GUIDE.html#failed-release-builds

Would I be able to emulate it on Ubuntu or on Mac through Docker?

@junghans
Contributor Author

You can always run fedora:latest in docker and then go from there.

Here is a rough guide for use within Fedora:

git clone https://github.com/junghans/ArborX.spec
cd ArborX.spec
./get_master.sh
fedpkg srpm
mock -r fedora-rawhide-ppc64le --forcearch ppc64le --init
mock -r fedora-rawhide-ppc64le --forcearch ppc64le --no-clean ArborX-1.7-1.fc43.src.rpm

@aprokop
Contributor

aprokop commented Feb 14, 2025

You can always run fedora:latest in docker and then go from there.

I would still need the proper architecture for it, right? So I would need to run Docker on a machine with ppc or aarch64? I could try on my Mac for aarch64, but I don't have powerpc with docker.

@junghans
Contributor Author

You can always run fedora:latest in docker and then go from there.

I would still need the proper architecture for it, right? So I would need to run Docker on a machine with ppc or aarch64? I could try on my Mac for aarch64, but I don't have powerpc with docker.

No, mock uses qemu to spin up the other architecture.

@aprokop
Contributor

aprokop commented Feb 15, 2025

Here is a rough guide for use within Fedora:

I'm not familiar with it and am getting an error:

$ fedpkg srpm
sources file doesn't exist. Source files download skipped.
Could not execute srpm: Unable to find rawhide target

@junghans
Contributor Author

junghans commented Feb 15, 2025

Just use rpmbuild -D"_sourcedir ${PWD}" -D"_srcrpmdir ${PWD}" -bs ArborX.spec instead of fedpkg.

@junghans
Contributor Author

OK, if I run the fedora:latest container in privileged mode (to allow docker in docker) and, inside it, run:

dnf -y install fedpkg wget
git clone https://github.com/junghans/ArborX.spec
cd ArborX.spec/
./get_master.sh 
fedpkg srpm
mock -r fedora-rawhide-ppc64le --forcearch ppc64le --init
mock -r fedora-rawhide-ppc64le --forcearch ppc64le --no-clean ArborX-1.7-1.fc43.src.rpm

that seems to build (still ongoing)

@junghans
Contributor Author

ppc64le is still building ;-)

And for aarch64, I set up a workflow here: https://github.com/junghans/ArborX.spec/actions/runs/13356584474/job/37300198190
(part of the ArborX.spec repo)
