
feat: MpiWrapper::allReduce overload for arrays. #3446

Open · wants to merge 17 commits into develop
Conversation

CusiniM
Collaborator

@CusiniM CusiniM commented Nov 15, 2024

We had a bug because two arrays of different sizes were exchanged by grabbing their buffers directly. This overload prevents that from happening.

I had to add a small helper because LvArray containers expose ValueType instead of value_type like the std containers.
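
For readers following the discussion, here is a minimal, hypothetical sketch of the two pieces described above: a trait that reconciles value_type (std containers) with ValueType (LvArray containers), and a container-level allReduce that derives the count from the container itself. Names, helpers, and exact signatures are illustrative and not necessarily what the PR adds.

#include <mpi.h>
#include <cassert>
#include <type_traits>

// Trait: prefer C::value_type, fall back to C::ValueType when only that exists.
template< typename C, typename = void >
struct container_value_type { using type = typename C::value_type; };

template< typename C >
struct container_value_type< C, std::void_t< typename C::ValueType > >
{ using type = typename C::ValueType; };

// Map a couple of element types to MPI datatypes (illustrative only).
inline MPI_Datatype mpiType( double ) { return MPI_DOUBLE; }
inline MPI_Datatype mpiType( int )    { return MPI_INT; }

// Container-level allReduce: the count comes from the containers themselves,
// so a caller can no longer pass a mismatched size.
template< typename CONTAINER >
int allReduceContainer( CONTAINER const & src, CONTAINER & dst, MPI_Op op, MPI_Comm comm )
{
  using T = typename container_value_type< CONTAINER >::type;
  assert( src.size() == dst.size() );
  return MPI_Allreduce( src.data(), dst.data(), static_cast< int >( src.size() ),
                        mpiType( T{} ), op, comm );
}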

@CusiniM CusiniM self-assigned this Nov 15, 2024
@CusiniM CusiniM added the flag: no rebaseline (does not require rebaseline) label Nov 15, 2024
@CusiniM
Collaborator Author

CusiniM commented Dec 5, 2024

@rrsettgast, @corbett5, @wrtobin this is ready for review.

src/coreComponents/common/TypesHelpers.hpp (outdated, resolved)
Contributor

@MelReyCG MelReyCG left a comment


At this point, can we drop or at least set to private the following signature?

template< typename T >
  static int allReduce( T const * sendbuf, T * recvbuf, int count, MPI_Op op, MPI_Comm comm = MPI_COMM_GEOS );

All external calls seem to be translatable to Span<T[]>.
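
To illustrate the suggestion (hypothetical call sites; exact overload names and container types depend on the PR): with a container/span-based overload available, a raw-pointer call such as

MpiWrapper::allReduce( values.data(), summed.data(), values.size(), MPI_SUM, MPI_COMM_GEOS );

could become

MpiWrapper::allReduce( values.toViewConst(), summed.toView(), MpiWrapper::Reduction::Sum, MPI_COMM_GEOS );

so the count is taken from the containers and the raw-pointer signature no longer needs to be part of the public interface.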

@CusiniM
Collaborator Author

CusiniM commented Dec 7, 2024

At this point, can we drop or at least set to private the following signature?

template< typename T >
  static int allReduce( T const * sendbuf, T * recvbuf, int count, MPI_Op op, MPI_Comm comm = MPI_COMM_GEOS );

All external calls seem to be translatable to Span<T[]>.

Yes, you are right, let me try to do this.

@CusiniM
Collaborator Author

CusiniM commented Jan 2, 2025

At this point, can we drop or at least set to private the following signature?

template< typename T >
  static int allReduce( T const * sendbuf, T * recvbuf, int count, MPI_Op op, MPI_Comm comm = MPI_COMM_GEOS );

All external calls seem to be translatable to Span<T[]>.

done.

@CusiniM CusiniM added the ci: run CUDA builds (allows triggering costly CUDA jobs) and ci: run integrated tests (allows running the integrated tests in GEOS CI) labels Jan 3, 2025
@CusiniM
Collaborator Author

CusiniM commented Jan 3, 2025

@sframba or @acitrain, can you please take a quick look at the changes to the wave solver files and approve this?

Comment on lines 1016 to 1018
MpiWrapper::allReduce( dasReceivers,
dasReceivers,
MpiWrapper::Reduction::Sum,
Collaborator Author


@sframba @acitrain this change triggers these diffs:

INFO: Total number of log files processed: 1537

WARNING: Found unfiltered diff in: /Users/cusini1/Downloads/integratedTests/TestResults/test_data/wavePropagation/elas3D_DAS_smoke_08/1497.python3_elas3D_DAS_smoke_08_2_restartcheck_.log
INFO: Details of diffs:   ********************************************************************************

  Error: /Problem/Solvers/elasticSolver/dasSignalNp1AtReceivers

  	Arrays of types float32 and float32 have 402 values of which 200 fail both the relative and absolute tests.

  		Max absolute difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0

  		Max relative difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0

  	Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 200

  		max = 2000.0001, mean = 500.0, std = 646.787

  ********************************************************************************

  (the same error block is repeated six more times)

The only explanation I can find is that, in the previous code, m_linearDASGeometry.size( 0 ) was not the size of the array. The fact that arrayView2d< real32 > const dasReceivers = m_dasSignalNp1AtReceivers.toView(); makes me wonder why m_linearDASGeometry.size( 0 ) was passed as the size. I am not sure how these two objects are related...

Contributor

@sframba sframba Jan 13, 2025


Hi @CusiniM, I think that is exactly the reason: we have an extra receiver that simply tracks time:

m_dasSignalNp1AtReceivers.resize( m_nsamplesSeismoTrace, numReceiversGlobal + 1 );

This receiver does not need to be summed across ranks. Therefore only the first dasReceivers.size() - 1 (that is, m_linearDASGeometry.size( 0 )) arrays need to be summed, not the last one. Does this break your new structure, or can it be accommodated?
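
To make the layout concrete (shapes taken from the resize call above; the comments are illustrative, not code from the solver):

// m_dasSignalNp1AtReceivers has shape ( m_nsamplesSeismoTrace, numReceiversGlobal + 1 ):
//   columns 0 .. numReceiversGlobal - 1 : DAS receiver traces, to be summed across ranks
//   column  numReceiversGlobal          : the extra "time" receiver, which should not be summed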

Collaborator Author


Yeah, it does break it, in the sense that the overload I put in place for arrays was written with the idea of summing the full array across ranks. Basically I wanted to ensure that one cannot provide the wrong size (which caused a couple of bugs in the past). We could add an overload that allows the reduce operation to occur on a specified size <= array.size().
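
A hypothetical sketch of such an overload (illustrative; GEOS's GEOS_ERROR_IF macro, the Reduction enum, and a getMpiOp helper are assumed to be available as in the existing wrapper):

template< typename CONTAINER >
static int allReduce( CONTAINER const & src,
                      CONTAINER & dst,
                      int const count,
                      Reduction const op,
                      MPI_Comm const comm = MPI_COMM_GEOS )
{
  // Reduce only the first `count` entries, but still refuse a count that
  // exceeds either container, which is the failure mode this PR guards against.
  GEOS_ERROR_IF( count > src.size() || count > dst.size(),
                 "allReduce: count exceeds the size of the provided containers." );
  return allReduce( src.data(), dst.data(), count, getMpiOp( op ), comm );
}

A call site could then opt in to a partial reduction explicitly instead of falling back to the raw-pointer overload.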

Member


@sframba Is there a reason that this all has to be in one array? It seems that the array is holding different quantity types. If so, can we just create a separate object to hold the last value, which is of a different quantity type?

Contributor

@sframba sframba Jan 20, 2025


@rrsettgast as discussed together, no, there is no reason why this should all be one array. We could (should) separate out the extra receiver so that the arrays have homogeneous dimensions. However, this requires some changes in our python code as well. Could we maybe:

  • Implement the overload suggested by @CusiniM for now, so we unlock this PR, then
  • When the "strong unit typing" work starts officially, add an item about wave solvers' receiver uniformity and assign it to us, so we can take the time to tidy up this problem properly?

What do you guys think?

@CusiniM CusiniM added the ci: run code coverage (enables running the code coverage CI jobs) label Jan 13, 2025

codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 58.97436% with 16 lines in your changes missing coverage. Please review.

Project coverage is 56.89%. Comparing base (18e0c4e) to head (c18b631).

Files with missing lines Patch % Lines
...olvers/solidMechanics/SolidMechanicsStatistics.cpp 0.00% 4 Missing ⚠️
src/coreComponents/common/MpiWrapper.hpp 84.61% 2 Missing ⚠️
...onents/physicsSolvers/PhysicsSolverBaseKernels.hpp 33.33% 2 Missing ⚠️
...hysicsSolvers/solidMechanics/SolidMechanicsMPM.cpp 0.00% 2 Missing ⚠️
src/coreComponents/mesh/ParticleManager.cpp 0.00% 1 Missing ⚠️
...nents/physicsSolvers/contact/ContactSolverBase.cpp 0.00% 1 Missing ⚠️
...olvers/contact/SolidMechanicsEmbeddedFractures.cpp 0.00% 1 Missing ⚠️
...hysicsSolvers/multiphysics/HydrofractureSolver.cpp 0.00% 1 Missing ⚠️
...ers/solidMechanics/SolidMechanicsLagrangianFEM.cpp 0.00% 1 Missing ⚠️
...sicsSolvers/surfaceGeneration/SurfaceGenerator.cpp 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3446      +/-   ##
===========================================
- Coverage    56.89%   56.89%   -0.01%     
===========================================
  Files         1157     1157              
  Lines       101214   101227      +13     
===========================================
+ Hits         57582    57589       +7     
- Misses       43632    43638       +6     


@CusiniM
Collaborator Author

CusiniM commented Jan 15, 2025

@sframba please have a look now and approve.
