Playbook for MME Sev 1 Outages

Below is a codified playbook used to respond to MME Sev 1 Outages.
For the latest version, refer to the playbook in our Mattermost community instance: https://community.mattermost.com/playbooks/playbooks/9agdqr7jdtda7p4g8dxbppcibw

1 - Escalation

[ ] Create Incident Channel, run MME Sev 1 Playbook
  • Once a CSM, TAM, or CRE escalates an MME Sev 1 issue, create an incident channel and run the MME Sev 1 Playbook
[ ] Add CSM, TAM & DE to Incident Channel
  • Add the CSM, TAM & DE leaders (@Brent Fox @Stu Doherty @Jason Blais) to the channel so they can add the appropriate staff members. Also add @Ian Tien to view Playbooks in motion for L2 and L1 incidents.
[ ] Start audio & screen share with customer
  • Include a Mattermost engineer & customer DBA on the call who can run queries to support troubleshooting
[ ] Reply to customer (CEO if MME Sev 1 > 1 hour)
  • An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communication
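
The 1-hour CEO rule in the last step recurs at the end of every later stage, so it can help to make the check explicit. Below is a minimal sketch, assuming the outage start time is recorded when the incident channel is created; the function name, the constant, and the sample timestamps are hypothetical and not part of the playbook.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the playbook: outages longer than 1 hour need the CEO looped in.
CEO_LOOP_IN_THRESHOLD = timedelta(hours=1)


def needs_ceo_loop_in(outage_start: datetime, now: datetime) -> bool:
    """Return True once the outage has lasted strictly longer than the threshold."""
    return now - outage_start > CEO_LOOP_IN_THRESHOLD


# Hypothetical outage start, used only for illustration.
start = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)

print(needs_ceo_loop_in(start, start + timedelta(minutes=45)))
print(needs_ceo_loop_in(start, start + timedelta(minutes=61)))
```

The comparison is strict because the playbook states the rule as ">1 hour"; adjust it if your team treats exactly one hour as the trigger.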

2 - Data gathering

[ ] Share system information
  • Includes relevant system configuration settings, database specs (with CPU, RAM) & application specs (with CPU, RAM)
[ ] Share Grafana screenshots
  • Include DB calls, API latency, Store latency, Top HTTP requests, Top API requests, CPU utilization, memory utilization
[ ] Share output from support bundle
  • Link to relevant docs
[ ] Share output from slow query logs
  • Link to relevant docs
[ ] Pin data to channel
  • Link to relevant docs

3 - Data review

[ ] Review system configuration settings that may impact performance
  • Includes user typing timeout, user typing message, max notifications per channel & db replica lag settings
[ ] Review Grafana screenshots to identify potential issues
  • Includes XXX
[ ] Review support bundle output to identify potential issues
  • Includes XXX
[ ] Review slow query log output to identify potential issues
  • Includes XXX
[ ] Summarize findings from data review
  • Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
  • An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communication. Include a timeline for anticipated resolution.
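
For the configuration review step in this section, the settings of interest can be pulled out of the customer's config.json in one pass. The sketch below is illustrative only: the section and key names are assumptions about where these settings live in config.json and should be verified against the configuration reference for the customer's server version. The sample values are made up.

```python
import json

# Assumed config.json locations of the settings named in the review step.
# Verify each (section, key) pair against the customer's server version.
SETTINGS_TO_REVIEW = [
    ("ServiceSettings", "EnableUserTypingMessages"),
    ("ServiceSettings", "TimeBetweenUserTypingUpdatesMilliseconds"),
    ("TeamSettings", "MaxNotificationsPerChannel"),
    ("SqlSettings", "ReplicaLagSettings"),
]


def extract_settings(config: dict) -> dict:
    """Return the value of each setting under review, or None if it is absent."""
    return {
        f"{section}.{key}": (config.get(section) or {}).get(key)
        for section, key in SETTINGS_TO_REVIEW
    }


def load_and_extract(path: str) -> dict:
    """Read a config.json file from disk and extract the settings under review."""
    with open(path, encoding="utf-8") as f:
        return extract_settings(json.load(f))


# Hypothetical values, for illustration only.
sample = {
    "ServiceSettings": {
        "EnableUserTypingMessages": True,
        "TimeBetweenUserTypingUpdatesMilliseconds": 5000,
    },
    "TeamSettings": {"MaxNotificationsPerChannel": 1000},
    "SqlSettings": {"ReplicaLagSettings": []},
}

for name, value in extract_settings(sample).items():
    print(f"{name}: {value}")
```

Missing keys come back as None rather than raising, so a partial or sanitized config still produces a report that can be pinned to the incident channel.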

4 - Code investigation

[ ] Based on findings from data review, identify areas of codebase with potential root cause
  • Includes XXX
[ ] Identify potential root cause based on the code
  • Includes XXX
[ ] Identify solution for root cause
  • Includes XXX
[ ] Submit PR for solution
  • Includes XXX
[ ] Determine whether verification of the fix is required for the release candidate
  • If yes, provide clear step-by-step instructions for QA to verify the fix, including specifications for the test server such as database type (MySQL vs Postgres)
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
  • An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communication. Include a timeline for anticipated resolution.

5 - Release preparation

[ ] Merge PR to master branch
  • Includes XXX
[ ] Cherry pick PR to dot release branch
  • Includes XXX
[ ] Cut dot release candidate
  • Includes XXX
[ ] Verify fix in dot release candidate
  • Includes XXX
[ ] Cut dot release
  • Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
  • An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communication. Include a timeline for anticipated resolution.

6 - Dot release deployment

[ ] Send dot release binary to customer
  • Includes XXX
[ ] Upgrade customer’s dev/staging environment with dot release
  • Includes XXX
[ ] Verify fix in customer’s dev/staging environment
  • Includes XXX
[ ] Upgrade customer’s production environment with dot release
  • Includes XXX
[ ] Verify fix in customer’s production environment
  • Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
  • An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communication. Include a timeline for anticipated resolution.

7 - Resolution

[ ] Monitor fix in customer environment for 24 hours
  • Includes XXX
[ ] Receive confirmation from customer about issue resolution
  • Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
  • An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communication

8 - Retrospective

[ ] Complete incident retrospective within 1 business day from resolution
  • Includes XXX
[ ] Draft incident summary analysis within 2 business days from resolution
  • Includes XXX
[ ] Send completed incident summary analysis to customer within 3 business days
  • Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
  • An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into customer communication
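
The retrospective deadlines above are stated in business days, which are easy to miscount across a weekend. Here is a minimal sketch of the date arithmetic. It assumes all three deadlines are counted from the resolution date and that business days are Monday through Friday with no holiday calendar. The function names and the sample date are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` business days, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def retrospective_deadlines(resolved_on: date) -> dict:
    """Map each retrospective step to its due date (assumed counted from resolution)."""
    return {
        "retrospective": add_business_days(resolved_on, 1),
        "summary_draft": add_business_days(resolved_on, 2),
        "summary_sent": add_business_days(resolved_on, 3),
    }


# Hypothetical resolution date: Thursday, 11 January 2024.
for step, due in retrospective_deadlines(date(2024, 1, 11)).items():
    print(f"{step}: {due.isoformat()}")
```

For a Thursday resolution, the retrospective falls due on Friday and the two summary deadlines roll over the weekend to the following Monday and Tuesday.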