Playbook for MME Sev 1 Outages
Below is a codified playbook used to respond to MME Sev 1 Outages.
For the latest version, refer to the playbook in our Mattermost community instance: https://community.mattermost.com/playbooks/playbooks/9agdqr7jdtda7p4g8dxbppcibw
1 - Escalation
[ ] Create Incident Channel, run MME Sev 1 Playbook
Once an MME Sev 1 issue is escalated by the CSM, TAM or CRE, create an incident channel and run the MME Sev 1 Playbook
[ ] Add CSM, TAM & DE to Incident Channel
Add the CSM, TAM & DE leaders (@Brent Fox @Stu Doherty @Jason Blais) to the channel so they can add the appropriate staff members. Also add @Ian Tien to view Playbooks in motion for L2 and L1 incidents.
[ ] Start audio & screen share with customer
Include on the call a Mattermost engineer and a customer DBA who can run queries to support troubleshooting
[ ] Reply to customer (CEO if MME Sev 1 > 1 hour)
An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into customer communications
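The 1-hour rule above can be expressed as a simple elapsed-time check. The following is a minimal sketch, not part of the official playbook: the function name, the constant, and the sample timestamps are hypothetical, and it assumes "more than 1 hour" means strictly greater than 60 minutes since the outage began.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the playbook text: Sev 1 outages longer than 1 hour involve the CEO.
CEO_ESCALATION_THRESHOLD = timedelta(hours=1)


def ceo_must_be_looped_in(outage_start: datetime, now: datetime) -> bool:
    """Return True once the outage has lasted strictly longer than the threshold."""
    return now - outage_start > CEO_ESCALATION_THRESHOLD


# Hypothetical outage start time, used only for illustration.
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
print(ceo_must_be_looped_in(start, start + timedelta(minutes=45)))
print(ceo_must_be_looped_in(start, start + timedelta(minutes=61)))
```

A check like this could run each time a customer update is drafted, so the reply goes out with the right sender.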
2 - Data gathering
[ ] Share system information
Includes relevant system configuration settings, database specs (with CPU, RAM) & application specs (with CPU, RAM)
[ ] Share Grafana screenshots
Include DB calls, API latency, Store latency, Top HTTP requests, Top API requests, CPU utilization, memory utilization
[ ] Share output from support bundle
Link to relevant docs
[ ] Share output from slow query logs
Link to relevant docs
[ ] Pin data to channel
Link to relevant docs
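When sharing slow query log output, it can help to post only the slowest statements rather than the full log. The sketch below assumes a PostgreSQL server that logs statement durations (for example via `log_min_duration_statement`), producing lines that contain `duration: <n> ms`. The function name, the 1000 ms default threshold, and the sample log lines are hypothetical; MySQL slow query logs use a different layout and would need a different parser.

```python
import re
from typing import Iterable, List, Tuple

# Matches the "duration: 1234.567 ms  statement: ..." portion of a PostgreSQL log line.
DURATION_RE = re.compile(r"duration: (\d+(?:\.\d+)?) ms\s+(.*)")


def slowest_queries(
    lines: Iterable[str], min_ms: float = 1000.0, limit: int = 10
) -> List[Tuple[float, str]]:
    """Return up to `limit` (duration_ms, text) pairs at or above `min_ms`, slowest first."""
    hits = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match and float(match.group(1)) >= min_ms:
            hits.append((float(match.group(1)), match.group(2).strip()))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)[:limit]


# Hypothetical log lines, for illustration only.
sample = [
    "2024-01-01 09:00:01 UTC [101] LOG:  duration: 2350.120 ms  statement: SELECT * FROM Posts WHERE ChannelId = 'abc'",
    "2024-01-01 09:00:02 UTC [102] LOG:  duration: 12.004 ms  statement: SELECT 1",
    "2024-01-01 09:00:03 UTC [103] LOG:  duration: 5410.900 ms  statement: UPDATE Channels SET LastPostAt = 0",
]
for duration_ms, text in slowest_queries(sample):
    print(f"{duration_ms:10.1f} ms  {text}")
```

The resulting short list is easier to pin to the incident channel than a raw log file.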
3 - Data review
[ ] Review system configuration settings that may impact performance
Includes user typing timeout, user typing message, max notifications per channel & db replica lag settings
[ ] Review Grafana screenshots to identify potential issues
Includes XXX
[ ] Review support bundle output to identify potential issues
Includes XXX
[ ] Review slow query log output to identify potential issues
Includes XXX
[ ] Summarize findings from data review
Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into customer communications. Include a timeline for anticipated resolution.
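The configuration review in this section can be made repeatable by extracting the same settings from the customer's `config.json` each time. This is a minimal sketch: the key paths below are assumptions about how the settings named above (user typing timeout, user typing messages, max notifications per channel, replica lag) appear in the server configuration, and they should be confirmed against the customer's server version. The helper names and sample values are hypothetical.

```python
import json
from typing import Any, Dict, List

# Assumed config.json key paths; verify against the deployed server version.
SETTINGS_TO_REVIEW = [
    "ServiceSettings.EnableUserTypingMessages",
    "ServiceSettings.TimeBetweenUserTypingUpdatesMilliseconds",
    "TeamSettings.MaxNotificationsPerChannel",
    "SqlSettings.ReplicaLagSettings",
]


def lookup(config: Dict[str, Any], dotted_path: str) -> Any:
    """Walk a nested dict by a dotted path; return None if any key is missing."""
    node: Any = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def review_report(config: Dict[str, Any]) -> List[str]:
    """Build one 'path = value' line per reviewed setting."""
    report = []
    for path in SETTINGS_TO_REVIEW:
        value = lookup(config, path)
        shown = "<not set>" if value is None else json.dumps(value)
        report.append(f"{path} = {shown}")
    return report


# Hypothetical configuration fragment, for illustration only.
sample_config = {
    "ServiceSettings": {
        "EnableUserTypingMessages": True,
        "TimeBetweenUserTypingUpdatesMilliseconds": 5000,
    },
    "TeamSettings": {"MaxNotificationsPerChannel": 1000},
    "SqlSettings": {},
}
print("\n".join(review_report(sample_config)))
```

In practice the dict would come from `json.load` on the shared configuration file; a setting reported as `<not set>` is a prompt to ask the customer, not proof that a default applies.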
4 - Code investigation
[ ] Based on findings from data review, identify areas of codebase with potential root cause
Includes XXX
[ ] Identify potential root cause based on the code
Includes XXX
[ ] Identify solution for root cause
Includes XXX
[ ] Submit PR for solution
Includes XXX
[ ] Determine whether verification of the fix is required for the release candidate
If yes, provide clear step-by-step instructions for QA to verify the fix, including specifications for the test server such as database type (MySQL vs Postgres)
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into customer communications. Include a timeline for anticipated resolution.
5 - Release preparation
[ ] Merge PR to master branch
Includes XXX
[ ] Cherry pick PR to dot release branch
Includes XXX
[ ] Cut dot release candidate
Includes XXX
[ ] Verify fix in dot release candidate
Includes XXX
[ ] Cut dot release
Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into customer communications. Include a timeline for anticipated resolution.
6 - Dot release deployment
[ ] Send dot release binary to customer
Includes XXX
[ ] Upgrade customer’s dev/staging environment with dot release
Includes XXX
[ ] Verify fix in customer’s dev/staging environment
Includes XXX
[ ] Upgrade customer’s production environment with dot release
Includes XXX
[ ] Verify fix in customer’s production environment
Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into customer communications. Include a timeline for anticipated resolution.
7 - Resolution
[ ] Monitor fix in customer environment for 24 hours
Includes XXX
[ ] Receive confirmation from customer about issue resolution
Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into customer communications
8 - Retrospective
[ ] Complete incident retrospective within 1 business day from resolution
Includes XXX
[ ] Draft incident summary analysis within 2 business days from resolution
Includes XXX
[ ] Send completed incident summary analysis to customer within 3 business days
Includes XXX
[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)
An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into customer communications
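The retrospective deadlines above are counted in business days, which is easy to miscount around weekends. The sketch below computes the three due dates from a resolution date. It assumes business days are Monday through Friday, ignores public holidays, and assumes the 3-business-day deadline is also measured from resolution. The function names and dictionary keys are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays (Mon-Fri); holidays are not considered."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def retrospective_deadlines(resolved_on: date) -> Dict[str, date]:
    """Map each retrospective task to its due date, counted from resolution."""
    return {
        "retrospective": add_business_days(resolved_on, 1),
        "summary_draft": add_business_days(resolved_on, 2),
        "summary_sent": add_business_days(resolved_on, 3),
    }


# Example: an incident resolved on Friday, 1 March 2024.
for task, due in retrospective_deadlines(date(2024, 3, 1)).items():
    print(f"{task}: {due.isoformat()}")
```

For a Friday resolution the due dates fall on the following Monday, Tuesday and Wednesday.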