Playbook for MME Sev 1 Outages

Below is the codified playbook used to respond to MME Sev 1 outages.

For the latest version, refer to the playbook in our Mattermost community instance: https://community.mattermost.com/playbooks/playbooks/9agdqr7jdtda7p4g8dxbppcibw

1 - Escalation

[ ] Create Incident Channel, run MME Sev 1 Playbook

  • Once an MME Sev 1 issue is escalated by a CSM, TAM or CRE, create an incident channel and run the MME Sev 1 Playbook

[ ] Add CSM, TAM & DE to Incident Channel

  • Add the CSM, TAM & DE leaders (@Brent Fox @Stu Doherty @Jason Blais) to the channel so they can add the appropriate staff members. Also add @Ian Tien to view Playbooks in motion for L2 and L1 incidents.

[ ] Start audio & screen share with customer

  • Include on the call a Mattermost engineer & a customer DBA who can run queries to support troubleshooting

[ ] Reply to customer (CEO if MME Sev 1 > 1 hour)

  • If the MME Sev 1 outage lasts more than 1 hour, loop the CEO into communication with the customer
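The one-hour rule that recurs in every "Reply to customer" step can be sketched as a small check. This is only an illustration: the function name `requires_ceo` and the constant are hypothetical, and it assumes the clock starts when the outage is first reported and that "> 1 hour" means strictly longer than 60 minutes.

```python
from datetime import datetime, timedelta

# Threshold taken from the playbook: MME Sev 1 outage > 1 hour.
CEO_ESCALATION_THRESHOLD = timedelta(hours=1)


def requires_ceo(outage_start: datetime, now: datetime) -> bool:
    """Return True once the outage has lasted strictly longer than one hour.

    Assumption: outage_start is the time the outage was first reported.
    """
    return now - outage_start > CEO_ESCALATION_THRESHOLD
```

Whoever sends a customer update could run a check like this with the incident start time to decide whether the CEO must be included in the reply.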

2 - Data gathering

[ ] Share system information

  • Includes relevant system configuration settings, database specs (with CPU, RAM) & application specs (with CPU, RAM)

[ ] Share Grafana screenshots

  • Include DB calls, API latency, Store latency, Top HTTP requests, Top API requests, CPU utilization, memory utilization

[ ] Share output from support bundle

  • Link to relevant docs

[ ] Share output from slow query logs

  • Link to relevant docs

[ ] Pin data to channel

  • Link to relevant docs

3 - Data review

[ ] Review system configuration settings that may impact performance

  • Includes user typing timeout, user typing message, max notifications per channel & db replica lag settings

[ ] Review Grafana screenshots to identify potential issues

  • Includes XXX

[ ] Review support bundle output to identify potential issues

  • Includes XXX

[ ] Review slow query log output to identify potential issues

  • Includes XXX

[ ] Summarize findings from data review

  • Includes XXX

[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

  • If the MME Sev 1 outage lasts more than 1 hour, loop the CEO into communication with the customer. Include a timeline for anticipated resolution.

4 - Code investigation

[ ] Based on findings from data review, identify areas of codebase with potential root cause

  • Includes XXX

[ ] Identify potential root cause based on the code

  • Includes XXX

[ ] Identify solution for root cause

  • Includes XXX

[ ] Submit PR for solution

  • Includes XXX

[ ] Determine whether verification of the fix is required for the release candidate

  • If yes, provide clear step-by-step instructions for QA to verify the fix, including specifications for the test server such as database type (MySQL vs Postgres)

[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

  • If the MME Sev 1 outage lasts more than 1 hour, loop the CEO into communication with the customer. Include a timeline for anticipated resolution.

5 - Release preparation

[ ] Merge PR to master branch

  • Includes XXX

[ ] Cherry pick PR to dot release branch

  • Includes XXX

[ ] Cut dot release candidate

  • Includes XXX

[ ] Verify fix in dot release candidate

  • Includes XXX

[ ] Cut dot release

  • Includes XXX

[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

  • If the MME Sev 1 outage lasts more than 1 hour, loop the CEO into communication with the customer. Include a timeline for anticipated resolution.

6 - Dot release deployment

[ ] Send dot release binary to customer

  • Includes XXX

[ ] Upgrade customer’s dev/staging environment with dot release

  • Includes XXX

[ ] Verify fix in customer’s dev/staging environment

  • Includes XXX

[ ] Upgrade customer’s production environment with dot release

  • Includes XXX

[ ] Verify fix in customer’s production environment

  • Includes XXX

[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

  • If the MME Sev 1 outage lasts more than 1 hour, loop the CEO into communication with the customer. Include a timeline for anticipated resolution.

7 - Resolution

[ ] Monitor fix in customer environment for 24 hours

  • Includes XXX

[ ] Receive confirmation from customer about issue resolution

  • Includes XXX

[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

  • If the MME Sev 1 outage lasted more than 1 hour, loop the CEO into communication with the customer

8 - Retrospective

[ ] Complete incident retrospective within 1 business day from resolution

  • Includes XXX

[ ] Draft incident summary analysis within 2 business days from resolution

  • Includes XXX

[ ] Send completed incident summary analysis to customer within 3 business days

  • Includes XXX

[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

  • If the MME Sev 1 outage lasted more than 1 hour, loop the CEO into communication with the customer
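The retrospective deadlines above are counted in business days, which can be sketched as follows. This is a minimal illustration, not part of the playbook: the helper names are hypothetical, weekends are assumed to be Saturday and Sunday, public holidays are ignored, and the 3-business-day deadline is assumed to be counted from resolution like the other two.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date that falls `days` business days after `start`.

    Assumption: Monday to Friday are business days; holidays are not handled.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def retrospective_deadlines(resolved_on: date) -> dict:
    """Map each retrospective step to its due date, counted from resolution."""
    return {
        "retrospective": add_business_days(resolved_on, 1),
        "summary_draft": add_business_days(resolved_on, 2),
        "summary_sent": add_business_days(resolved_on, 3),
    }
```

For an incident resolved on a Friday, the three deadlines fall on the following Monday, Tuesday and Wednesday.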

Last updated