Site Reliability Engineering

Who We Are

The Site Reliability Engineering (SRE) team ensures the availability of Mattermost Cloud user-facing services and builds the tools and automation needed to monitor and enable that availability. These services span multiple environments, including testing, development, and production.

Vision

Enable developers to build reliable, scalable, cost-efficient services that keep Mattermost Cloud customers' reliability targets (SLAs, SLOs) and SRE principles (observability, SLOs, etc.) in mind.

Actions

  • On-call, Incident Management and Incident Review

  • Change management

  • Influence and encourage best practices

  • Maintain and update documentation, training, and Runbooks

  • Empower developers with self-serve tools

  • Meet & review reliability, quality, and cost KPIs regularly

Principles

  • Be proactive instead of reactive

  • Be open and data-driven

  • Embrace risk and chaos

  • Promote and embrace communication

  • Evangelise cost effectiveness

  • Evangelise and adopt SRE & DevOps practices

Ask for Help

If you need our assistance, please follow the General Workflow.

You can also reach us on the Mattermost Community Server as follows:

  1. For support questions related to production, you can reach us in the ~cloud-support channel.

  2. For all other questions, you can reach us in the ~cloud-sre-team channel.

How We Work

Each quarter, we maintain our OKRs as epics in this board, which is divided into the following areas:

  • Reliability & Resiliency

  • Availability & Cost Optimisation

  • Automation

  • Security

  • Documentation

  • Keep the lights on

General Workflow

  1. Mattermost members open new issues in the SRE Board.

  2. New issues are reviewed weekly via Issue Triage.

Areas of Ownership

The team regularly works on the following tasks, listed below in priority order:

  • Ensuring availability of Mattermost Cloud user-facing services

  • Ensuring availability of community.mattermost.com (e.g. daily releases, calls)

  • Ensuring availability of the internal Cloud Platform (e.g. self-serve testing environments, Observability Platform, GitOps Platform, IaC Platform)

Meetings

Topics                            | Meeting                 | Participants                                  | Cadence
Incident Review & Knowledge share | Reliability Engineering | Cloud leadership, SRE, Release                | Monday
Triage & Planning                 | SRE Planning            | SRE, Platform                                 | Tuesday
Cross-org collaboration           | Infrastructure Guild    | Leadership, Infrastructure, Product, Security | Thursday
Bring chaos to our systems        | Chaos Gameday           | SRE                                           | Monthly
