2016 04 26
Wesley Bland edited this page Apr 26, 2016 · 2 revisions
- ORNL - Christian
- UTK - Aurelien
- Argonne - Ken
- Sandia - Keita
- LLNL - Murali
- Ohio State - Sourav
- Wesley - Have a combination of `MPI_COMM_FREE` and `MPI_FINALIZE` where the `MPI_COMM_DISCONNECT` can partially fail, but the overall semantics would still be attempted.
  - Requests would be completed (perhaps with an error)
  - Other parts of the `MPI_COMM_DISCONNECT` may fail too
- Aurelien - If `MPI_COMM_DISCONNECT` is not resilient, messages could still show up from a disconnected process, either because that process doesn't know that the communicator has been disconnected or because the messages are late. Handling this would add software overhead in some situations.
  - This could be solved by making the operation resilient, but that would have other implications.
  - We decided to go with a "best effort" scenario where the implementation will try to clean things up the best it can (implementation dependent). This would still require a software fix as a fallback if there's no other way to prevent late messages.
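One way to picture the software fallback for late messages is a receiver-side check against the set of communicators that are still connected. The sketch below is only an illustration in plain Python, not MPI code: `Message`, `Receiver`, and `comm_id` are hypothetical names, and a real implementation would do this inside its matching logic.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    comm_id: int   # hypothetical id of the communicator the sender used
    source: int    # rank of the sending process
    payload: str


@dataclass
class Receiver:
    active: set = field(default_factory=set)      # communicators still connected
    delivered: list = field(default_factory=list)
    dropped: int = 0

    def connect(self, comm_id):
        self.active.add(comm_id)

    def disconnect(self, comm_id):
        # Best effort: local cleanup always happens, even if a peer never
        # learns that the communicator was disconnected.
        self.active.discard(comm_id)

    def on_message(self, msg):
        # This per-message lookup is the software overhead discussed above.
        if msg.comm_id in self.active:
            self.delivered.append(msg)
            return True
        self.dropped += 1   # late or stale message: discard it
        return False


rx = Receiver()
rx.connect(7)
rx.on_message(Message(7, 1, "before disconnect"))   # delivered
rx.disconnect(7)
rx.on_message(Message(7, 1, "late"))                # dropped
```

The check runs on every incoming message, which is why it only makes sense as a fallback when nothing cheaper prevents late messages.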
- If Sessions expands to also cover windows and files, then we need to modify ULFM to account for that.
- This probably means we need to be able to "repair a set" in the same way we "repair a communicator".
- There are two options:
  - Create a new, shrunken set (static sets)
  - Repair the communicator in place by removing failed processes (dynamic sets)
- We chose the former because we didn't see any major benefits from repairing in place.
- We also talked about whether it would be possible to replace processes inline with these new features.
- This would require repairing, on all processes, every set that the replaced process could be in, which has scalability issues.
- That would require a centralized way of looking up set information.
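The static-set option chosen above can be sketched as follows, assuming (hypothetically) that a set is modelled as an immutable collection of process ids. `ProcessSet` and `shrink` are illustrative names, not part of any Sessions or ULFM interface; the idea is in the same spirit as ULFM's `MPIX_Comm_shrink`, which produces a new communicator rather than modifying the old one.

```python
from typing import FrozenSet, Iterable

# Hypothetical model: a process set is an immutable set of process ids.
ProcessSet = FrozenSet[int]


def shrink(pset: ProcessSet, failed: Iterable[int]) -> ProcessSet:
    """Return a new set without the failed processes; the input is unchanged."""
    return pset - frozenset(failed)


world = frozenset(range(6))
survivors = shrink(world, failed=[2, 5])

print(sorted(world))      # [0, 1, 2, 3, 4, 5]
print(sorted(survivors))  # [0, 1, 3, 4]
```

Because the original set is never mutated, anything still holding `world` sees a consistent (if stale) membership, which is the main practical difference from repairing in place.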