PagerDuty Incident Response, by way of 「入門監視」 (the Japanese edition of "Practical Monitoring")


This documentation covers parts of the PagerDuty Incident Response process. It is a cut-down version of our internal documentation, used at PagerDuty for any major incidents, and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after.

This is PagerDuty's internal documentation, and it describes on-call work.

Announce on the call and in Slack that you are the incident commander, and name who you have designated as deputy (usually the backup IC) and as scribe.
Identify if there is an obvious cause to the incident (recent deployment, spike in traffic, etc.), and delegate investigation to relevant experts.


Listen for prompts from your Deputy regarding severity escalations, decide whether we need to announce publicly, and instruct customer liaison accordingly. Announcing publicly is at your discretion as IC. If you are unsure, then announce publicly ("If in doubt, tweet it out").


For every major incident (SEV-2/1), we need to follow up with a post-mortem: a blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring again. The incident response process itself should also be included.

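Dictionaries translate "post mortem" as "autopsy", but in short this part describes how to look back on an incident that has occurred. It is written to be blame-free, that is, to take care not to pin responsibility on anyone.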
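To make the rule concrete for myself, here is a minimal sketch of "SEV-2/1 requires a post-mortem" and of the three things the excerpt says a post-mortem should cover. The names `Severity`, `needs_postmortem`, `POSTMORTEM_SECTIONS`, and `postmortem_skeleton` are my own and hypothetical; none of this is PagerDuty tooling or an API.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    # Assumption: a lower number means a more severe incident,
    # following the SEV-1 / SEV-2 naming in the excerpt.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Section names paraphrase the excerpt; the exact wording is my own.
POSTMORTEM_SECTIONS = (
    "What went wrong (blame-free, detailed)",
    "Steps to prevent a similar incident",
    "Review of the incident response process",
)


def needs_postmortem(severity: Severity) -> bool:
    """Return True for major incidents (SEV-2 or SEV-1)."""
    return severity <= Severity.SEV2


def postmortem_skeleton(title: str, severity: Severity) -> Optional[dict]:
    """Return an empty post-mortem outline, or None if none is required."""
    if not needs_postmortem(severity):
        return None
    return {
        "title": title,
        "severity": severity.name,
        # Each section starts empty and is filled in during the review.
        "sections": {name: "" for name in POSTMORTEM_SECTIONS},
    }
```

For example, `postmortem_skeleton("API outage", Severity.SEV2)` yields an outline with three empty sections, while a `SEV3` incident yields `None`. Keeping "Review of the incident response process" as its own section reflects the last sentence of the excerpt.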

