2016 12 20
Wesley Bland edited this page Dec 20, 2016
- Recap Forum Meeting
- Go over existing tickets/work
- ULFM
- #1/#3 - Error Handlers
- Handful of edits requested and made
- Still discussing return codes for error handlers
- #28 - Catastrophic Errors
- More small tweaks requested and made
- Error Handler Greenfielding Work
- Decided to abandon the effort in its current form due to the complexity of the design
- Will pursue error handler inheritance separately
- ULFM
- Recapped discussion about ULFM problems raised during reading.
- When including ULFM, there are four competing models for FT:
- ULFM
- ULFM + auto recovery
- Communicators with holes
- Reinit
- Problem with holes model
- Increases non-failure cost because of algorithm changes
- Problems with ULFM
- Cleaning up MPI state is a large problem (or requires using PMPI) if Reinit-like behavior is desired.
- Revoking globally is not currently possible
- Error handlers are still insufficient for fault tolerance
- Communicators may be too small for specifying a recovery model
- C-style exception handling is insufficient, but it cannot easily be extended while remaining compatible with Fortran.
- Flesh out the three main options (excluding holes) and discuss how they would work together.
- ULFM: Aurelien
- ULFM + auto recovery: Wesley
- Reinit: Ignacio
- Once we have a good idea of some more details of each proposal, consider how they would interact and overlap.
- Is there anything in one proposal that would prevent another?
- For now, don't allow more than one model in a single MPI instance.
- This might be relaxed with sessions to allow each session to choose an FT model.
- Work through the exercise of a sessions-style app and library that use different FT models.
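
As a starting point for that exercise, the rule discussed above (one FT model per MPI instance for now, possibly relaxed per session later) can be sketched as a toy model. This is only an illustration under stated assumptions: `MPIInstance`, `open_session`, `FTModel`, and `FTModelConflict` are hypothetical names invented here, not part of MPI, ULFM, or any Sessions proposal, and no real MPI calls are made.

```python
from enum import Enum


class FTModel(Enum):
    """The three models being fleshed out (holes excluded)."""
    ULFM = "ULFM"
    ULFM_AUTO_RECOVERY = "ULFM + auto recovery"
    REINIT = "Reinit"


class FTModelConflict(Exception):
    """Raised when a second FT model is requested where only one is allowed."""


class MPIInstance:
    """Hypothetical stand-in for one MPI instance; not a real MPI binding."""

    def __init__(self, per_session_models=False):
        # False: the "for now" rule, a single FT model for the whole instance.
        # True: the possible relaxation, each session chooses its own model.
        self.per_session_models = per_session_models
        self.sessions = {}  # session name -> FTModel

    def open_session(self, name, model):
        if name in self.sessions:
            raise ValueError(f"session {name!r} already exists")
        in_use = set(self.sessions.values())
        if not self.per_session_models and in_use and model not in in_use:
            raise FTModelConflict(
                f"{name!r} requested {model.value}, "
                f"but this instance already uses {next(iter(in_use)).value}"
            )
        self.sessions[name] = model
        return model


if __name__ == "__main__":
    # An app using Reinit and a library using ULFM in the same instance.
    strict = MPIInstance()
    strict.open_session("app", FTModel.REINIT)
    try:
        strict.open_session("lib", FTModel.ULFM)
    except FTModelConflict as err:
        print("rejected:", err)

    relaxed = MPIInstance(per_session_models=True)
    relaxed.open_session("app", FTModel.REINIT)
    relaxed.open_session("lib", FTModel.ULFM)
    print("accepted:", {k: v.value for k, v in relaxed.sessions.items()})
```

The sketch only captures the admission rule. It says nothing about how the models would interact once a failure actually occurs, which is the open question for the three proposals.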