# Playbook for MME Sev 1 Outages

Below is a codified playbook used to respond to MME Sev 1 Outages.

For the latest version, refer to the playbook in our Mattermost community instance: <https://community.mattermost.com/playbooks/playbooks/9agdqr7jdtda7p4g8dxbppcibw>

## 1 - Escalation

\[ ] Create incident channel, run MME Sev 1 Playbook

* *Once an MME Sev 1 issue is escalated by a CSM, TAM or CRE, create an incident channel and run the MME Sev 1 Playbook*

\[ ] Add CSM, TAM & DE to Incident Channel

* *Add the CSM, TAM & DE leaders (@Brent Fox @Stu Doherty @Jason Blais) to the channel so they can add the appropriate staff members. Also add @Ian Tien to view Playbooks in motion for L2 and L1 incidents.*

\[ ] Start audio & screen share with customer

* *Include a Mattermost engineer & customer DBA on the call who can run queries to support troubleshooting*

\[ ] Reply to customer (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into the customer communication*
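
The one-hour rule above can be checked mechanically. The following is a minimal sketch, assuming the outage start time is recorded as a timezone-aware timestamp; the function name `requires_ceo` and the constant `CEO_THRESHOLD` are hypothetical and not part of any Mattermost tooling.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the playbook: outages longer than 1 hour involve the CEO.
CEO_THRESHOLD = timedelta(hours=1)


def requires_ceo(outage_start: datetime, now: datetime) -> bool:
    """Return True once the outage has lasted strictly more than 1 hour."""
    return (now - outage_start) > CEO_THRESHOLD


# Hypothetical start time, for illustration only.
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
print(requires_ceo(start, start + timedelta(minutes=45)))  # False
print(requires_ceo(start, start + timedelta(minutes=61)))  # True
```

The same check applies to every "Reply to customer" step in the later sections.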

## 2 - Data gathering

\[ ] Share system information

* *Includes relevant system configuration settings, database specs (with CPU, RAM) & application specs (with CPU, RAM)*

\[ ] Share Grafana screenshots

* *Include DB calls, API latency, Store latency, Top HTTP requests, Top API requests, CPU utilization, memory utilization*

\[ ] Share output from support bundle

* *Link to relevant docs*

\[ ] Share output from slow query logs

* *Link to relevant docs*

\[ ] Pin data to channel

* *Link to relevant docs*

## 3 - Data review

\[ ] Review system configuration settings that may impact performance

* *Includes user typing timeout, user typing message, max notifications per channel & db replica lag settings*

\[ ] Review Grafana screenshots to identify potential issues

* *Includes XXX*

\[ ] Review support bundle output to identify potential issues

* *Includes XXX*

\[ ] Review slow query log output to identify potential issues

* *Includes XXX*

\[ ] Summarize findings from data review

* *Includes XXX*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into the customer communication. Include a timeline for anticipated resolution.*
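
For the configuration review at the start of this section, it can help to pull the relevant values out of the customer's `config.json` in one pass. The sketch below is illustrative only: the key paths are assumptions about where these settings live in a Mattermost `config.json` and should be verified against the configuration reference for the customer's server version. The helper name `extract_settings` and the sample values are hypothetical.

```python
import json

# Assumed config.json paths for the settings named in the checklist.
# Verify these against the Mattermost configuration reference before use.
SETTINGS_OF_INTEREST = [
    ("ServiceSettings", "EnableUserTypingMessages"),
    ("ServiceSettings", "TimeBetweenUserTypingUpdatesMilliseconds"),
    ("TeamSettings", "MaxNotificationsPerChannel"),
    ("SqlSettings", "ReplicaLagSettings"),
]


def extract_settings(config: dict) -> dict:
    """Collect the settings of interest; keys that are absent map to None."""
    result = {}
    for section, key in SETTINGS_OF_INTEREST:
        result[f"{section}.{key}"] = config.get(section, {}).get(key)
    return result


# Hypothetical sample standing in for a customer's parsed config.json.
sample = json.loads("""
{
  "ServiceSettings": {
    "EnableUserTypingMessages": true,
    "TimeBetweenUserTypingUpdatesMilliseconds": 5000
  },
  "TeamSettings": {"MaxNotificationsPerChannel": 1000}
}
""")

for name, value in extract_settings(sample).items():
    print(f"{name} = {value}")
```

Reporting missing keys as `None` rather than raising keeps the output usable when the customer shares a partial or sanitized configuration.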

## 4 - Code investigation

\[ ] Based on findings from data review, identify areas of codebase with potential root cause

* *Includes XXX*

\[ ] Identify potential root cause based on the code

* *Includes XXX*

\[ ] Identify solution for root cause

* *Includes XXX*

\[ ] Submit PR for solution

* *Includes XXX*

\[ ] Determine whether verification of the fix is required for the release candidate

* *If yes, provide clear step-by-step instructions for QA to verify the fix, including specifications for test server such as database type (MySQL vs Postgres)*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into the customer communication. Include a timeline for anticipated resolution.*

## 5 - Release preparation

\[ ] Merge PR to master branch

* *Includes XXX*

\[ ] Cherry pick PR to dot release branch

* *Includes XXX*

\[ ] Cut dot release candidate

* *Includes XXX*

\[ ] Verify fix in dot release candidate

* *Includes XXX*

\[ ] Cut dot release

* *Includes XXX*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into the customer communication. Include a timeline for anticipated resolution.*

## 6 - Dot release deployment

\[ ] Send dot release binary to customer

* *Includes XXX*

\[ ] Upgrade customer’s dev/staging environment with dot release

* *Includes XXX*

\[ ] Verify fix in customer’s dev/staging environment

* *Includes XXX*

\[ ] Upgrade customer’s production environment with dot release

* *Includes XXX*

\[ ] Verify fix in customer’s production environment

* *Includes XXX*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into the customer communication. Include a timeline for anticipated resolution.*

## 7 - Resolution

\[ ] Monitor fix in customer environment for 24 hours

* *Includes XXX*

\[ ] Receive confirmation from customer about issue resolution

* *Includes XXX*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into the customer communication*

## 8 - Retrospective

\[ ] Complete incident retrospective within 1 business day from resolution

* *Includes XXX*

\[ ] Draft incident summary analysis within 2 business days from resolution

* *Includes XXX*

\[ ] Send completed incident summary analysis to customer within 3 business days

* *Includes XXX*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires looping the CEO into the customer communication*
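
The retrospective deadlines in this section are counted in business days, which is easy to miscount across a weekend. Below is a minimal sketch of the arithmetic, under two assumptions: business days are Monday through Friday with no holiday calendar, and the 3-business-day deadline is measured from resolution like the other two. The function and key names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def retrospective_deadlines(resolved_on: date) -> dict:
    """Map each retrospective step to its due date (holidays are ignored)."""
    return {
        "retrospective": add_business_days(resolved_on, 1),
        "summary_draft": add_business_days(resolved_on, 2),
        "summary_sent": add_business_days(resolved_on, 3),
    }


# Example: an incident resolved on Friday, 2024-03-01.
for step, due in retrospective_deadlines(date(2024, 3, 1)).items():
    print(step, due.isoformat())
```

For a Friday resolution, the three deadlines fall on the following Monday, Tuesday and Wednesday.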
