2016 12 20

Wesley Bland edited this page Dec 20, 2016 · 1 revision

FTWG Con Call

Agenda

  1. Recap Forum Meeting
  2. Go over existing tickets/work
  3. ULFM

Notes

  • #1/#3 - Error Handlers
    • Handful of edits requested and made
    • Still discussing return codes for error handlers
  • #28 - Catastrophic Errors
    • More small tweaks requested and made
  • Error Handler Greenfielding Work
    • Decided to abandon the effort in its current form due to the complexity of the design
    • Will pursue error handler inheritance separately
  • ULFM
    • Recapped the discussion of ULFM problems raised during the reading.
    • When including ULFM, there are four competing models for FT:
      1. ULFM
      2. ULFM + auto recovery
      3. Communicators with holes
      4. Reinit
    • Problem with holes model
      • Increases non-failure cost because of algorithm changes
    • Problems with ULFM
      • If reinit-like behavior is desired, cleaning up MPI state is a large problem (or requires using PMPI).
      • Revoking globally is not currently possible
      • Error handlers are still insufficient for fault tolerance
        • Communicators may be too small for specifying recovery model
      • C-style exception handling is insufficient, but it cannot easily be extended while remaining compatible with Fortran.

Next Steps

  • Flesh out the three main options (excluding holes) and discuss how they would work together.
    • ULFM: Aurelien
    • ULFM + auto recovery: Wesley
    • Reinit: Ignacio
  • Once we have a good idea of some more details of each proposal, consider how they would interact and overlap.
    • Would anything in one proposal prevent another?
  • For now, don't allow more than one model in a single MPI instance.
    • This might be relaxed with sessions to allow each session to choose an FT model.
    • Work through the exercise of a sessions-style app and library which use different FT models.