feat: MpiWrapper::allReduce overload for arrays. #3446
base: develop
…OS-DEV/GEOS into feature/cusini/addMpiWrapperForArrays
@rrsettgast, @corbett5, @wrtobin this is ready for review.
At this point, can we drop or at least set to private the following signature?
template< typename T >
static int allReduce( T const * sendbuf, T * recvbuf, int count, MPI_Op op, MPI_Comm comm = MPI_COMM_GEOS );
All external calls seem to be translatable to Span<T[]>.
Yes, you are right, let me try to do this.
Done.
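For readers following the thread, here is a minimal sketch of the idea behind dropping the pointer + count signature: a container-based overload can take the element count from the containers themselves, so a caller cannot pass a wrong size. This is not the GEOS implementation. The names rawAllReduceSingleRank and allReduceSingleRank are hypothetical, std::vector stands in for Span<T[]>, and the raw routine is a single-process stand-in (a sum over one rank is just a copy) so that the example runs without MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the raw pointer + count routine.
// With a single process, a sum-reduction reduces to copying sendbuf into recvbuf.
template< typename T >
void rawAllReduceSingleRank( T const * sendbuf, T * recvbuf, int count )
{
  for( int i = 0; i < count; ++i )
  {
    recvbuf[ i ] = sendbuf[ i ];
  }
}

// Hypothetical container-based overload: the count is never supplied by the
// caller, it is taken from the containers, and mismatched sizes are rejected.
template< typename SRC, typename DST >
void allReduceSingleRank( SRC const & src, DST & dst )
{
  assert( src.size() == dst.size() );
  rawAllReduceSingleRank( src.data(), dst.data(), static_cast< int >( src.size() ) );
}
```

With this shape, the only way to reduce fewer entries than the container holds is to ask for it explicitly, which is what the discussion further down turns to.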
MpiWrapper::allReduce( dasReceivers,
                       dasReceivers,
                       MpiWrapper::Reduction::Sum,
@sframba @acitrain this change triggers these diffs:
INFO: Total number of log files processed: 1537
WARNING: Found unfiltered diff in: /Users/cusini1/Downloads/integratedTests/TestResults/test_data/wavePropagation/elas3D_DAS_smoke_08/1497.python3_elas3D_DAS_smoke_08_2_restartcheck_.log
INFO: Details of diffs: ********************************************************************************
Error: /Problem/Solvers/elasticSolver/dasSignalNp1AtReceivers
Arrays of types float32 and float32 have 402 values of which 200 fail both the relative and absolute tests.
Max absolute difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Max relative difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 200
max = 2000.0001, mean = 500.0, std = 646.787
********************************************************************************
Error: /Problem/Solvers/elasticSolver/dasSignalNp1AtReceivers
Arrays of types float32 and float32 have 402 values of which 200 fail both the relative and absolute tests.
Max absolute difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Max relative difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 200
max = 2000.0001, mean = 500.0, std = 646.787
********************************************************************************
Error: /Problem/Solvers/elasticSolver/dasSignalNp1AtReceivers
Arrays of types float32 and float32 have 402 values of which 200 fail both the relative and absolute tests.
Max absolute difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Max relative difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 200
max = 2000.0001, mean = 500.0, std = 646.787
********************************************************************************
Error: /Problem/Solvers/elasticSolver/dasSignalNp1AtReceivers
Arrays of types float32 and float32 have 402 values of which 200 fail both the relative and absolute tests.
Max absolute difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Max relative difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 200
max = 2000.0001, mean = 500.0, std = 646.787
********************************************************************************
Error: /Problem/Solvers/elasticSolver/dasSignalNp1AtReceivers
Arrays of types float32 and float32 have 402 values of which 200 fail both the relative and absolute tests.
Max absolute difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Max relative difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 200
max = 2000.0001, mean = 500.0, std = 646.787
********************************************************************************
Error: /Problem/Solvers/elasticSolver/dasSignalNp1AtReceivers
Arrays of types float32 and float32 have 402 values of which 200 fail both the relative and absolute tests.
Max absolute difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Max relative difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 200
max = 2000.0001, mean = 500.0, std = 646.787
********************************************************************************
Error: /Problem/Solvers/elasticSolver/dasSignalNp1AtReceivers
Arrays of types float32 and float32 have 402 values of which 200 fail both the relative and absolute tests.
Max absolute difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Max relative difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0
Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 200
max = 2000.0001, mean = 500.0, std = 646.787
The only explanation I can find is that, in the previous code, m_linearDASGeometry.size( 0 ) was not the size of the array. Given that arrayView2d< real32 > const dasReceivers = m_dasSignalNp1AtReceivers.toView(); I wonder why m_linearDASGeometry.size( 0 ) was passed as the size. I am not sure how these two objects are related...
Hi @CusiniM , I think that is exactly the reason: we have an extra receiver that simply tracks time:
Line 221 in fd36d2a
m_dasSignalNp1AtReceivers.resize( m_nsamplesSeismoTrace, numReceiversGlobal + 1 );
This receiver does not need to be summed across ranks. Therefore only the first dasReceivers.size() - 1 (that is, m_linearDASGeometry.size( 0 )) arrays need to be summed, not the last one. Does this break your new structure or can it be accommodated?
Yeah, it does break it, in the sense that the overload I put in place for arrays was written with the idea of summing the full array across ranks. Basically, I wanted to ensure that one cannot provide the wrong size (which has caused a couple of bugs in the past). We could add an overload that allows the reduce operation to occur on a specified size <= array.size().
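A rough sketch of what such a partial reduction does, under stated assumptions: the ranks are simulated as a list of equally sized flat buffers, only the first count entries are summed, and the trailing entries (for example a time-tracking entry) keep the local value. The function name simulatedPartialSum is hypothetical and no MPI is involved. Which entries count as "the first count" in a 2D array depends on its memory layout, which this 1D sketch does not model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums the first `count` entries over all simulated ranks
// and returns the buffer as rank `self` would see it after the reduction.
// Entries at index >= count are left at rank `self`'s local value.
template< typename T >
std::vector< T > simulatedPartialSum( std::vector< std::vector< T > > const & rankBuffers,
                                      std::size_t const count,
                                      std::size_t const self )
{
  assert( self < rankBuffers.size() );
  std::vector< T > result = rankBuffers[ self ];
  assert( count <= result.size() );  // the "specified size <= array.size()" check

  for( std::size_t i = 0; i < count; ++i )
  {
    T sum = T();
    for( auto const & buf : rankBuffers )
    {
      assert( buf.size() == result.size() );
      sum += buf[ i ];
    }
    result[ i ] = sum;
  }
  return result;
}
```

Passing count equal to the full size reproduces the full-array behaviour, in which a trailing entry that every rank already holds would be added once per rank.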
@sframba Is there a reason that this all has to be in one array? It seems that the array is holding different quantity types? If so, can we just create a separate object to hold the last value that is of different quantity type?
@rrsettgast as we discussed, no, there is no reason why this should all be in one array. We could (should) split off the extra array so that the remaining ones are of homogeneous dimension. However, this requires some changes in our Python code as well. Could we maybe:
- Implement the overload suggested by @CusiniM for now, so we unlock this PR, then
- When the "strong unit typing" work starts officially, add an item about wave solvers' receiver uniformity and assign it to us, so we can take the time to tidy up this problem properly?
What do you guys think?
…OS-DEV/GEOS into feature/cusini/addMpiWrapperForArrays
@sframba please have a look now and approve.
We had a bug because two arrays of different size were exchanged by grabbing the buffers directly. This overload will prevent it from happening.
I had to add a small helper coz LvArray has ValueType instead of value_type like std objects.
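One way such a helper can look, as a sketch only: a small trait that yields ValueType when the container defines it and falls back to value_type otherwise. The names get_value_type and MockArray are hypothetical (MockArray merely imitates a container exposing ValueType), and the code assumes C++17 for std::void_t. The actual helper in the PR may be written differently.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical container that, like LvArray types, exposes ValueType
// rather than the std-style value_type.
struct MockArray
{
  using ValueType = float;
};

// Primary template: std-style containers with a nested value_type.
template< typename T, typename = void >
struct get_value_type
{
  using type = typename T::value_type;
};

// Specialization chosen when T::ValueType exists.
template< typename T >
struct get_value_type< T, std::void_t< typename T::ValueType > >
{
  using type = typename T::ValueType;
};

// Compile-time checks for both kinds of container.
static_assert( std::is_same< get_value_type< MockArray >::type, float >::value,
               "ValueType should be picked up" );
static_assert( std::is_same< get_value_type< std::vector< int > >::type, int >::value,
               "value_type should be the fallback" );
```

A generic overload can then use typename get_value_type< T >::type to choose the element type without caring which naming convention the container follows.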