2017 12 05
How does the user pick a model? How do we tell the user what was picked and/or is available?
Requested / Provided
Positives:
- Requesting an FT model is atomic.
- The user will always get an answer on the first try.
- If someone else has already "requested" a different model, that model is what would be "provided".
Negatives:
- FT models don't "fall back" on each other the way threading models do.
- The user might not get the second choice that they want (if there are more than two).
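A minimal sketch of the requested/provided style, loosely modeled on how `MPI_Init_thread` reports a provided thread level. The `ft_model_t` enum and `ft_request_model` are hypothetical names invented for illustration; no such API exists in MPI, and the library state is simulated here with a static variable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical FT models; the names are illustrative only. */
typedef enum {
    FT_MODEL_NONE = 0,
    FT_MODEL_ULFM,
    FT_MODEL_REINIT
} ft_model_t;

/* Stand-in for library state: the model already in effect, if any. */
static ft_model_t active_model = FT_MODEL_NONE;

/* Requested/provided selection (assumed semantics): the call always
 * succeeds and reports in *provided the model that is actually in
 * effect. The first request wins; later callers are told what was
 * already provided. */
void ft_request_model(ft_model_t requested, ft_model_t *provided)
{
    assert(provided != NULL);
    if (active_model == FT_MODEL_NONE) {
        active_model = requested;
    }
    *provided = active_model;
}
```

Because FT models have no natural ordering, a second caller in this sketch simply receives whatever was selected first. That is the drawback noted above: there is nothing to "fall back" to.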
Request + Yes / No
Positives:
- User has complete control over which model they pick.
- Doesn't cause an error if the model is not available.
Negatives:
- Could have to iterate over lots of models.
- Can't change your mind after you've picked (could pair with an API to get the list of models).
This seems to be the best option.
- It should be paired with an API to give the list of models that the implementation supports.
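One way the request + yes/no option could look when paired with a query for the supported models. Everything here is an assumption made for illustration: `ft_list_models`, `ft_select_model`, and `ft_pick_first` are hypothetical names, and the "implementation" is a static table that supports a single model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical FT models; the names are illustrative only. */
typedef enum {
    FT_MODEL_NONE = 0,
    FT_MODEL_ULFM,
    FT_MODEL_REINIT
} ft_model_t;

/* Pretend this implementation supports exactly one model. */
static const ft_model_t supported[] = { FT_MODEL_REINIT };
static const size_t n_supported = sizeof supported / sizeof supported[0];
static ft_model_t selected = FT_MODEL_NONE;

/* Query: copy up to max supported models into out and return the
 * total number supported. */
size_t ft_list_models(ft_model_t *out, size_t max)
{
    for (size_t i = 0; i < n_supported && i < max; i++) {
        out[i] = supported[i];
    }
    return n_supported;
}

/* Request + yes/no: returns 1 and selects the model if it is supported
 * and nothing has been selected yet, otherwise returns 0. An
 * unavailable model is a "no", not an error. */
int ft_select_model(ft_model_t model)
{
    if (selected != FT_MODEL_NONE) {
        return 0; /* cannot change your mind after picking */
    }
    for (size_t i = 0; i < n_supported; i++) {
        if (supported[i] == model) {
            selected = model;
            return 1;
        }
    }
    return 0;
}

/* Walk a preference list and return the first model that is accepted,
 * or FT_MODEL_NONE if none is. */
ft_model_t ft_pick_first(const ft_model_t *prefs, size_t n)
{
    assert(prefs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ft_select_model(prefs[i])) {
            return prefs[i];
        }
    }
    return FT_MODEL_NONE;
}
```

With the list query available, a caller could also filter its preferences up front instead of probing each model in turn, which addresses the concern about iterating over many models.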
Request + Error / Success
- Do we want to be able to turn off the FT model?
- For now, let's say that you should chain together multiple applications.
- Later we might allow this.
Write up some text for this and create a separate ticket.
We went over the comments on the data resilience pull request (https://github.com/mpiwg-ft/mpi-standard/pull/4) and decided to pause work on it in the context of ULFM. If work continues, it will be as a separate proposal.
Keita asked about the status of MPI_ISHRINK, but Aurelien was not present to give an update.
What are the parts of the Reinit proposal that might be difficult to standardize?
- We require a function pointer that we can use to long jump.
- This will force the application to use a recent enough version of Fortran at least for this part of the application.
- How do we handle files that are still open when the application jumps back to reinit?
- Do we close the file? Leave it open and transparently deal with it inside MPI?
- If we want to leave the files open, we need to figure out what fopen does and try to do something similar.
- What do we do with allocated memory?
- The original solution was to have lightweight error handlers that decide whether or not to free memory or close files.
- What can you do inside a cleanup handler?
- What can you do inside a cleanup handler?
- How do we handle dynamic processes?
- We probably have to disconnect all dynamic processes.
- What do we do with PVARs? Do they get reset on reinit? How do we handle all of the different kinds of PVARs?
- Same thing with CVARs. These probably need to be reset to their initial values.
- Need to reset attributes and info keys on MPI_COMM_WORLD (and friends).
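The long-jump and cleanup-handler ideas above can be sketched in plain C with `setjmp`/`longjmp`. This is only an illustration under stated assumptions, not the Reinit proposal's interface: `run_with_reinit`, `push_cleanup`, `pop_cleanup`, `simulate_failure`, and `resilient_main` are hypothetical names, the failure is simulated within one process, and the open questions about files, PVARs/CVARs, and dynamic processes are not addressed.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdlib.h>

#define MAX_CLEANUP 8

typedef void (*cleanup_fn)(void *arg);

static jmp_buf restart_point;   /* where a simulated reinit resumes */
static struct { cleanup_fn fn; void *arg; } cleanup_stack[MAX_CLEANUP];
static int n_cleanup = 0;
static int cleanups_run = 0;

/* Register a handler that releases one resource on rollback. */
int push_cleanup(cleanup_fn fn, void *arg)
{
    if (n_cleanup == MAX_CLEANUP) {
        return -1;
    }
    cleanup_stack[n_cleanup].fn = fn;
    cleanup_stack[n_cleanup].arg = arg;
    n_cleanup++;
    return 0;
}

/* Drop the most recent handler once its resource is released normally. */
void pop_cleanup(void)
{
    assert(n_cleanup > 0);
    n_cleanup--;
}

/* Simulated failure: run the handlers in reverse order of
 * registration, then jump back to the restart point. */
void simulate_failure(void)
{
    while (n_cleanup > 0) {
        n_cleanup--;
        cleanup_stack[n_cleanup].fn(cleanup_stack[n_cleanup].arg);
        cleanups_run++;
    }
    longjmp(restart_point, 1);
}

void free_buffer(void *p)
{
    free(p);
}

/* Application entry point: allocates memory, registers its cleanup,
 * and fails on the first attempt only. */
void resilient_main(int attempt)
{
    void *buf = malloc(64);
    if (buf == NULL || push_cleanup(free_buffer, buf) != 0) {
        abort();
    }
    if (attempt == 0) {
        simulate_failure();     /* does not return */
    }
    pop_cleanup();
    free(buf);
}

/* Calls entry through a function pointer and re-enters it after every
 * simulated failure. Returns the number of restarts that were needed. */
int run_with_reinit(void (*entry)(int attempt))
{
    volatile int restarts = 0;  /* volatile: read after longjmp */
    assert(entry != NULL);
    if (setjmp(restart_point) != 0) {
        restarts++;
    }
    entry(restarts);
    return restarts;
}
```

Note that `longjmp` skips every stack frame between the failure and the restart point without releasing anything, which is why memory and files need explicit handlers in a design like this.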
The reading was generally successful. People asked for a few minor changes, which were made. These changes will need a no-no vote at the February meeting.
The sentence about MPI being undefined after an error was removed from this proposal given that the catastrophic error proposal is going to tackle that problem in a different way.
The Forum felt strongly that catastrophic errors should not be detected via an API call; the information should come from the error class itself. The initial concern that not all errors have an error class was dismissed, because you would never check for an error until you had received an error code anyway.
Furthermore, the Forum decided that it would rather remove the notion of catastrophic errors completely and just treat all errors the same, as non-catastrophic errors. It would be up to the user to determine which errors are actually catastrophic and which ones aren't.
This has these main consequences:
- If the MPI library has what it considers a "catastrophic error", it might have to just abort. The set of errors that falls into this category should be very limited, however.
- The user will be responsible for deciding which kinds of errors it wants to handle and which ones it doesn't. This means that we'll need to provide more specific error classes whenever possible. We should look at what kinds of error classes might be useful. One example would be to look at errno for similar errors that we could borrow.
- The proposal should be changed to remove all of the notions of catastrophic errors and just remove the sentence about MPI being undefined after an error.
- Catastrophic (or any other) errors cannot be permanent. If they are, the library is probably in a situation where it just has to abort.
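A sketch of what "the user decides which errors are catastrophic" might look like in application code. The error classes and actions below are hypothetical stand-ins, not MPI constants. In a real program the class would come from `MPI_Error_class` applied to a returned error code, with `MPI_ERRORS_RETURN` set on the communicator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error classes standing in for specific MPI error classes. */
typedef enum {
    ERRCLASS_SUCCESS = 0,
    ERRCLASS_NO_MEM,
    ERRCLASS_PROC_FAILED,
    ERRCLASS_IO,
    ERRCLASS_OTHER
} err_class_t;

typedef enum { ACTION_CONTINUE, ACTION_RECOVER, ACTION_ABORT } action_t;

/* Application policy: the user, not the library, decides which
 * classes are fatal for this particular program. */
action_t classify_error(err_class_t cls)
{
    switch (cls) {
    case ERRCLASS_SUCCESS:
        return ACTION_CONTINUE;
    case ERRCLASS_NO_MEM:
    case ERRCLASS_PROC_FAILED:
        return ACTION_RECOVER;
    case ERRCLASS_IO:
    case ERRCLASS_OTHER:
        return ACTION_ABORT;
    }
    return ACTION_ABORT; /* unknown classes are treated as fatal */
}

/* Apply the policy to a sequence of per-step results. Returns the
 * number of steps completed before the first abort and counts the
 * recoveries in *recovered. */
int run_steps(const err_class_t *results, int n, int *recovered)
{
    assert(results != NULL && recovered != NULL);
    for (int i = 0; i < n; i++) {
        action_t action = classify_error(results[i]);
        if (action == ACTION_ABORT) {
            return i;
        }
        if (action == ACTION_RECOVER) {
            (*recovered)++;
        }
    }
    return n;
}
```

A policy like this is only as useful as the error classes are specific, which is the motivation for adding finer-grained classes, possibly borrowed from errno.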