Incident Management Doc #67

File changed: docs/incident-management.md (124 additions, 0 deletions)
# Incident Management for InstructLab

## Attribution

Before we start, this document and process are inspired by and heavily influenced
by [this gist][gist], originally authored by [coderanger][ranger].

## Problem Statement

With the many systems that must come together for the open source InstructLab
project to work, we need a catalogue of major issues and problems so we can track
and investigate critical bugs in the open. Building a formalized incident
management process is the first step in building this knowledge base.

## Proposed Solution

Create a lightweight but formalized incident management process. Whoever calls
the incident becomes the initial "Incident Commander" (IC): they act as the hub of
information, start the conversation in the shared Slack channel,
and maintain a shared timeline (a Google Doc or the like).
After a reasonable amount of time they can hand off the IC role
to someone else, and so on until the incident is resolved.

### What counts as an incident?

Anything with community-visible negative consequences. In most cases this
will be an outage or downtime event, but non-outage incidents include severe
performance degradations and security events.
**Review comment (Member):** I'm having a hard time understanding what would qualify as an incident under this proposal. Could you come up with some concrete examples?

It seems to be oriented toward running a service where some level of availability is to be expected. That's not really true here. There's stuff running, but not in service to a public group of some kind.

Before commenting further, it would help if we had a sample set of scenarios to use as context for discussion.

**Review comment (Member):** I just noticed I'm duplicating some discussion between you and @leseb; sorry about that.

> Ah, see, it turns out we do have a bunch of back-end services required for us to run this project. We have a team of backend engineers ensuring the model training and tuning is done. This requires communication between two different groups of people, and we have no formal way to communicate (hence this process). We had a "major" incident just this last week, and when I asked for some information about it, we noticed we didn't have a plan for these types of situations.

But none of this is visible to the public right now, so I'm not sure a public process like this makes sense.

**Review comment (Member):** @russellb I take your comment to mean "once our backend services like model training are publicly visible, we should have a process for dealing with incidents."

I agree with you!

However, as we did have a hiccup that caused @jjasghar to open this PR, should we at least have some documented process for dealing with said hiccups amongst the maintainer team? I am sure we have one that is informal and ad hoc, but I'd like to see it written down (for my ignorant self).

I think @jjasghar can maybe just give me a summary of what happened and how to remediate/deal with it in this issue. The deep specifics are unlikely to be interesting or useful for GitHub history purposes. :D

**Review comment (Member):** I don't know anything about the example in question. It would probably help to speak about it more concretely instead of in the hypothetical.


## When should I file an incident report?

Preferably after the incident is resolved; at the very least, don't treat the
paperwork as even remotely a priority in the middle of an ongoing issue. If possible,
though, do record timeline information that isn't in Slack or the mailing list and
might otherwise be lost, so that you have it for the report later.

## Incident Template

## [Incident Date and Title]

Incident Commander: [Name of IC]

## Summary

[A quick one or two sentence description of the issue from a high-level view.]

## Duration

[The amount of time of the user impact. This does not include any time spent after restoration of service but before the formal end of the incident.]

## User Impact

[Describe the impact to end users, such as which features or services were unavailable.]

## Timeline

[A timeline of events from the start of the incident until the Incident Commander declares it over. All times should be in UTC.]

* 01:23 - Incident start.
* 23:34 - Incident concluded.

## Proximate Trigger

[The most direct trigger of the incident. Often this will be a human error, such as a code bug missed in review or an operational mistake. Our process is blameless and we document our mistakes so that we can learn from them, but try not to turn this into a personal callout, even of yourself.]

## Root Cause

[Root causes are the underlying, deeper problems that led to the incident. For example, if the proximate trigger was a bug missed in review, then a root cause might be missing static analysis tooling in CI that could have caught it, or missing code review guidelines. Root causes should never be human error; they are systemic issues that create the conditions for human error to become a problem. Root causes are also sometimes slippery: you can trace the chain of events back infinitely far if you try hard enough, but putting "Root Cause: 13 billion years ago a singularity expanded into the Big Bang" is not productive. Look for root causes that help guide our future path rather than documenting every contributing factor for its own sake.]

## Detection

[How was this problem noticed? User reports, automated alerts, etc.]

## Resolution

[What steps were taken to resolve the incident? Try to be specific, such as linking to a commit/PR for code fixes or listing the commands used for an interactive fix; these can help guide future improvements.]

## What Went Well?

[Any notes about things that went well in our process during the handling of the incident.]

## What Went Poorly?

[Like the above, but things that went poorly. This relates only to the process and handling; the incident itself is probably something that went poorly too, but that is discussed above.]

## Where Did We Get Lucky?

[Any places where things went well but more due to happenstance, such as one bug cancelling out another or an issue being noticed early before automated detection warned us.]

## Action Items

[Tasks we should take away from this incident to prevent it from recurring or to improve our handling of similar incidents in the future. Action items can be divided into the five categories listed below. Not every category will necessarily have an action item; they are only there to guide your thinking.]

### Detect

[Ways to improve the detection of problems.]

### Respond

[Ways to get eyeballs on the incident faster.]

### Contain

[Ways to limit the damage of incidents.]

### Prevent

[Ways to reduce the chances of the proximate triggers of this incident.]

### Eliminate

[Ways to solve the root causes of this incident so as to make it structurally impossible.]

_All times in UTC._

## Next steps

1. Create a `[email protected]` mailing list
1. Create an `incidents` directory in the community repository
1. Create an `#incident` channel in Slack
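
The `incidents` directory in the steps above could pair with a small helper for starting a new report from the template. This is a hypothetical sketch, not part of the proposal: the `incidents/` path, the `docs/incident-template.md` file, and the date-plus-slug naming convention are all assumptions.

```shell
# Hypothetical helper: start a new incident report from the template.
# Assumes an incidents/ directory and a docs/incident-template.md file,
# neither of which is mandated by this document.
new_incident() {
  slug="${1:?usage: new_incident short-slug}"
  date_utc="$(date -u +%Y-%m-%d)"              # all times in UTC, per the template
  report="incidents/${date_utc}-${slug}.md"

  mkdir -p incidents
  cp docs/incident-template.md "$report"       # copy the blank template into place
  echo "Created $report"
}
```

For example, `new_incident training-outage` would create `incidents/2024-01-15-training-outage.md` (with the current UTC date) ready for the Incident Commander to fill in.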

[gist]: https://gist.github.com/coderanger/cbf7e80b76a7b2ff284ab592d798de8e
[ranger]: https://github.com/coderanger