# Site Reliability Engineering

### Who We Are

The Site Reliability Engineering (SRE) team ensures the availability of Mattermost Cloud user-facing services and builds the tools and automation needed to monitor and maintain that availability. These services span multiple environments, including testing, development, and production, among others.

### Vision

Enable developers to build reliable, scalable, cost-efficient services that keep Mattermost Cloud customers' reliability targets (SLAs, SLOs) and SRE principles (observability, SLOs, etc.) in mind.
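
As a rough illustration of how an availability SLO translates into a concrete allowance, the sketch below computes an error budget (the downtime a target permits over a window). The 99.9% target, the 30-day window, and the function names are assumptions chosen for the example, not Mattermost Cloud's actual targets or tooling.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over a window.

    slo_target is a fraction, e.g. 0.999 for a hypothetical 99.9% target.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining_minutes(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Return the unused error budget; a negative value means the SLO was missed."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # Illustrative numbers only: a 99.9% target over 30 days, with 10 minutes of downtime so far.
    budget = error_budget_minutes(0.999)
    remaining = budget_remaining_minutes(0.999, downtime_minutes=10)
    print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

Under these assumed numbers, a 99.9% target over 30 days leaves roughly 43 minutes of allowable downtime, which is the kind of figure a reliability review can track against.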

### Actions

* On-call, Incident Management and Incident Review
* Change management
* Influence and encourage best practices
* Maintain and update documentation, training, and Runbooks
* Empower developers with self-serve tools
* Meet regularly to review reliability, quality, and cost KPIs

### Principles

* Be proactive instead of reactive
* Be open and data-driven
* Embrace risk and chaos
* Promote and embrace communication
* Evangelise cost effectiveness
* Evangelise and adopt SRE & DevOps practices

### Ask for Help

If you need our assistance, please follow the [General Workflow](#general-workflow).

You can also reach us on the Mattermost Community Server as follows:

1. For support questions related to production, you can reach us in the `~cloud-support` channel.
2. For all other questions, you can reach us in the `~cloud-sre-team` channel.

### How We Work

Each quarter, we maintain our OKRs as epics in this [board](https://mattermost.atlassian.net/jira/software/c/projects/CLD/boards/109/roadmap?statuses=2%2C4), which is divided into the following areas:

* Reliability & Resiliency
* Availability & Cost Optimisation
* Automation
* Security
* Documentation
* Keep the lights on

#### General Workflow

1. Mattermost members open new issues in the [SRE Board](https://mattermost.atlassian.net/jira/software/c/projects/CLD/boards/109).
2. New issues are reviewed weekly via Issue Triage.

### Areas of Ownership

The team regularly works on the following areas, listed in priority order:

* Ensuring availability of Mattermost Cloud user-facing services
* Ensuring availability of community.mattermost.com (e.g. daily releases, calls)
* Ensuring availability of the internal Cloud Platform (e.g. self-serve testing environments, Observability Platform, GitOps Platform, IaC Platform)

### Meetings

| Topics                            | Meeting                 | Participants                                  | Cadence  |
| --------------------------------- | ----------------------- | --------------------------------------------- | -------- |
| Incident Review & Knowledge share | Reliability Engineering | Cloud leadership, SRE, Release                | Monday   |
| Triage & Planning                 | SRE Planning            | SRE, Platform                                 | Tuesday  |
| Cross-org collaboration           | Infrastructure Guild    | Leadership, Infrastructure, Product, Security | Thursday |
| Bring chaos to our systems        | Chaos Gameday           | SRE                                           | Monthly  |