# Site Reliability Engineering

### Who We Are

The Site Reliability Engineering (SRE) team ensures the availability of Mattermost Cloud user-facing services and builds the tools and automation to monitor and enable that availability. These services span multiple environments, including testing, development, and production.

### Vision

Enable developers to build reliable, scalable, cost-efficient services that keep Mattermost Cloud customers' reliability targets (SLAs, SLOs) and SRE principles (observability, SLOs, etc.) in mind.

### Actions

* On-call, Incident Management and Incident Review
* Change management
* Influence and encourage best practices
* Maintain and update documentation, training, and runbooks
* Empower developers with self-serve tools
* Meet regularly to review reliability, quality, and cost KPIs

### Principles

* Be proactive instead of reactive
* Be open and data-driven
* Embrace risk and chaos
* Promote and embrace communication
* Evangelise cost-effectiveness
* Evangelise and adopt SRE & DevOps practices

### Ask for Help

If you need our assistance, please follow the [General Workflow](#general-workflow).

You can also reach us on the Mattermost Community Server as follows:

1. For support questions related to production, reach us in the `~cloud-support` channel.
2. For all other questions, reach us in the `~cloud-sre-team` channel.

### How We Work

Each quarter, we maintain our OKRs as epics on this [board](https://mattermost.atlassian.net/jira/software/c/projects/CLD/boards/109/roadmap?statuses=2%2C4), which is divided into the following areas:

* Reliability & Resiliency
* Availability & Cost Optimisation
* Automation
* Security
* Documentation
* Keep the lights on

#### General Workflow

1. Mattermost members open new issues on the [SRE Board](https://mattermost.atlassian.net/jira/software/c/projects/CLD/boards/109).
2. New issues are reviewed weekly via Issue Triage.

### Areas of Ownership

The team regularly works on the following tasks, listed in priority order:

* Ensuring the availability of Mattermost Cloud user-facing services
* Ensuring the availability of community.mattermost.com (e.g. daily releases, calls)
* Ensuring the availability of the internal Cloud Platform (e.g. self-serve testing environments, Observability Platform, GitOps Platform, IaC Platform)

### Meetings

| Topics                            | Meeting                 | Participants                                  | Cadence  |
| --------------------------------- | ----------------------- | --------------------------------------------- | -------- |
| Incident Review & Knowledge share | Reliability Engineering | Cloud leadership, SRE, Release                | Monday   |
| Triage & Planning                 | SRE Planning            | SRE, Platform                                 | Tuesday  |
| Cross-org collaboration           | Infrastructure Guild    | Leadership, Infrastructure, Product, Security | Thursday |
| Bring chaos to our systems        | Chaos Gameday           | SRE                                           | Monthly  |


