# Playbook for MME Sev 1 Outages

Below is a codified playbook used to respond to MME Sev 1 Outages.

For the latest version, refer to the playbook in our Mattermost community instance: <https://community.mattermost.com/playbooks/playbooks/9agdqr7jdtda7p4g8dxbppcibw>

## 1 - Escalation

\[ ] Create Incident Channel, run MME Sev 1 Playbook

* *Once an MME Sev 1 issue is escalated by a CSM, TAM, or CRE, create an incident channel and run the MME Sev 1 Playbook.*

\[ ] Add CSM, TAM & DE to Incident Channel

* *Add the CSM, TAM & DE leaders (@Brent Fox @Stu Doherty @Jason Blais) to the channel so they can add the appropriate staff members. Also add @Ian Tien to view Playbooks in motion for L2 and L1 incidents.*

\[ ] Start audio & screen share with customer

* *Include a Mattermost engineer & customer DBA on the call who can run queries to support troubleshooting*

\[ ] Reply to customer (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into the customer reply.*
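The one-hour rule above recurs in every later stage, so it can help to express it as a single check. The following is a minimal sketch only: the function name `ceo_must_be_looped_in` and the constant are hypothetical, and it assumes the clock starts when the outage begins, which the playbook does not specify.

```python
from datetime import datetime, timedelta

# Threshold taken from the playbook: escalate once the outage exceeds 1 hour.
CEO_ESCALATION_THRESHOLD = timedelta(hours=1)


def ceo_must_be_looped_in(outage_start: datetime, now: datetime) -> bool:
    """Return True once the outage has lasted strictly longer than one hour.

    Assumption: the clock starts at outage_start (hypothetical; the playbook
    does not say whether it starts at outage or at escalation).
    """
    return now - outage_start > CEO_ESCALATION_THRESHOLD


start = datetime(2024, 1, 15, 9, 0)
print(ceo_must_be_looped_in(start, datetime(2024, 1, 15, 9, 45)))   # False
print(ceo_must_be_looped_in(start, datetime(2024, 1, 15, 10, 1)))   # True
```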

## 2 - Data gathering

\[ ] Share system information

* *Includes relevant system configuration settings, database specs (with CPU, RAM) & application specs (with CPU, RAM)*

\[ ] Share Grafana screenshots

* *Include DB calls, API latency, Store latency, Top HTTP requests, Top API requests, CPU utilization, memory utilization*

\[ ] Share output from support bundle

* *Link to relevant docs*

\[ ] Share output from slow query logs

* *Link to relevant docs*

\[ ] Pin data to channel

* *Link to relevant docs*

## 3 - Data review

\[ ] Review system configuration settings that may impact performance

* *Includes user typing timeout, user typing message, max notifications per channel & db replica lag settings*
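One way to make the configuration review above repeatable is to pull the listed settings out of the server's `config.json` in one pass. This is a sketch under assumptions: the dotted key paths below are my best mapping of the settings named above onto Mattermost config keys and should be verified against the deployed version, the helper `extract_settings` is hypothetical, and the sample values are illustrative, not recommendations.

```python
import json

# Assumed key paths for the settings named in the playbook; verify them
# against the config.json of the customer's Mattermost version.
SETTINGS_TO_REVIEW = [
    "ServiceSettings.EnableUserTypingMessages",
    "ServiceSettings.TimeBetweenUserTypingUpdatesMilliseconds",
    "TeamSettings.MaxNotificationsPerChannel",
    "SqlSettings.ReplicaLagSettings",
]


def extract_settings(config: dict, paths: list) -> dict:
    """Look up each dotted path in the config; mark missing keys explicitly."""
    result = {}
    for path in paths:
        node = config
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = "<not set>"
                break
        result[path] = node
    return result


# Illustrative sample only; in practice, load the customer's config.json.
sample = json.loads("""
{
  "ServiceSettings": {
    "EnableUserTypingMessages": true,
    "TimeBetweenUserTypingUpdatesMilliseconds": 5000
  },
  "TeamSettings": {"MaxNotificationsPerChannel": 1000},
  "SqlSettings": {}
}
""")

for name, value in extract_settings(sample, SETTINGS_TO_REVIEW).items():
    print(f"{name}: {value}")
```

The output can be pasted into the incident channel alongside the other gathered data so reviewers see the same values.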

\[ ] Review Grafana screenshots to identify potential issues

* *Includes XXX*

\[ ] Review support bundle output to identify potential issues

* *Includes XXX*

\[ ] Review slow query log output to identify potential issues

* *Includes XXX*

\[ ] Summarize findings from data review

* *Includes XXX*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into the customer reply. Include a timeline for anticipated resolution.*

## 4 - Code investigation

\[ ] Based on findings from data review, identify areas of codebase with potential root cause

* *Includes XXX*

\[ ] Identify potential root cause based on the code

* *Includes XXX*

\[ ] Identify solution for root cause

* *Includes XXX*

\[ ] Submit PR for solution

* *Includes XXX*

\[ ] Determine whether verification of the fix is required for the release candidate

* *If yes, provide clear step-by-step instructions for QA to verify the fix, including specifications for the test server, such as database type (MySQL vs. PostgreSQL)*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into the customer reply. Include a timeline for anticipated resolution.*

## 5 - Release preparation

\[ ] Merge PR to master branch

* *Includes XXX*

\[ ] Cherry pick PR to dot release branch

* *Includes XXX*

\[ ] Cut dot release candidate

* *Includes XXX*

\[ ] Verify fix in dot release candidate

* *Includes XXX*

\[ ] Cut dot release

* *Includes XXX*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into the customer reply. Include a timeline for anticipated resolution.*

## 6 - Dot release deployment

\[ ] Send dot release binary to customer

* *Includes XXX*

\[ ] Upgrade customer’s dev/staging environment with dot release

* *Includes XXX*

\[ ] Verify fix in customer’s dev/staging environment

* *Includes XXX*

\[ ] Upgrade customer’s production environment with dot release

* *Includes XXX*

\[ ] Verify fix in customer’s production environment

* *Includes XXX*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into the customer reply. Include a timeline for anticipated resolution.*

## 7 - Resolution

\[ ] Monitor fix in customer environment for 24 hours

* *Includes XXX*

\[ ] Receive confirmation from customer about issue resolution

* *Includes XXX*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into the customer reply.*

## 8 - Retrospective

\[ ] Complete incident retrospective within 1 business day from resolution

* *Includes XXX*

\[ ] Draft incident summary analysis within 2 business days from resolution

* *Includes XXX*

\[ ] Send completed incident summary analysis to customer within 3 business days

* *Includes XXX*

\[ ] Reply to customer with update (CEO if MME Sev 1 > 1 hour)

* *An MME Sev 1 outage lasting more than 1 hour requires the CEO to be looped into the customer reply.*
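The three retrospective deadlines above are counted in business days, which is easy to miscount around a weekend. A minimal sketch of the arithmetic follows, assuming that business days are Monday through Friday, that holidays are ignored, and that the 3-day deadline is also measured from resolution (the checklist item does not say so explicitly). The function names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance by the given number of Mon-Fri days; holidays are not handled."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def retrospective_deadlines(resolved_on: date) -> dict:
    """Map each retrospective task to its due date, counted from resolution."""
    return {
        "retrospective_complete": add_business_days(resolved_on, 1),
        "summary_drafted": add_business_days(resolved_on, 2),
        "summary_sent_to_customer": add_business_days(resolved_on, 3),
    }


# Example: incident resolved on Thursday, 18 January 2024.
for task, due in retrospective_deadlines(date(2024, 1, 18)).items():
    print(task, due.isoformat())
# retrospective_complete 2024-01-19
# summary_drafted 2024-01-22
# summary_sent_to_customer 2024-01-23
```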

