Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incident Management Doc #67
base: main
Are you sure you want to change the base?
Incident Management Doc #67
Changes from all commits
6aee764
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a hard time understanding what would qualify as an incident under this proposal. Could you come up with some concrete examples?
It seems to be oriented toward running a service where some level of availability is to be expected. That's not really true here. There's stuff running, but not in service to a public group of some kind.
Before commenting further, it would help if we had a sample set of scenarios to use as context for discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just nonticed I'm duplicating some discussion between you and @leseb - sorry about that.
but none of this is visible to the public right now, so I'm not sure a public process like this makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@russellb I take your comment to mean "once our backend services like model training are publicly visible, we should have a process for dealing with incidents"
I agree with you!
However, as we did have a hiccup that caused @jjasghar to open this PR, should we at least have some documented process for dealing with said hiccups amongst the maintainer team? I am sure we have one that is informal and ad hoc, but I'd like to see it written down (for my ignorant self).
I think @jjasghar can maybe just give me a summary of the what happened and how to remediate/deal in this issue. The deep specifics are unlikely to be interesting or useful for GitHub history purposes. :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know anything about the example in question. It would probably help to speak about it more concretely instead of in the hypothetical.
Check failure on line 118 in docs/incident-management.md
GitHub Actions / markdown-lint
Bare URL used
Check failure on line 122 in docs/incident-management.md
GitHub Actions / markdown-lint
Multiple consecutive blank lines