Site Reliability Engineering

Who We Are

The Site Reliability Engineering (SRE) team ensures the availability of Mattermost Cloud user-facing services and builds the tools and automation needed to monitor and maintain that availability. These services span multiple environments, including testing, development, and production.

Vision

Enable developers to build reliable, scalable, cost-efficient services that keep Mattermost Cloud customers' reliability targets (SLAs, SLOs) and SRE principles (observability, SLOs, etc.) in mind.

Actions

  • On-call, Incident Management and Incident Review
  • Change management
  • Influence and encourage best practices
  • Maintain and update documentation, training, and Runbooks
  • Empower developers with self-serve tools
  • Meet & review reliability, quality, and cost KPIs regularly

Principles

  • Be proactive instead of reactive
  • Be open and data-driven
  • Embrace risk and chaos
  • Promote and embrace communication
  • Evangelise cost effectiveness
  • Evangelise and adopt SRE & DevOps practices

Ask for Help

If you need our assistance, please follow the General Workflow.
You can also reach us on the Mattermost Community Server as follows:
  1. For support questions related to production, you can reach us in the ~cloud-support channel.
  2. For all other questions, you can reach us in the ~cloud-sre-team channel.

How we work

Each quarter, we maintain our OKRs as epics in this board, which is divided into the following areas:
  • Reliability & Resiliency
  • Availability & Cost Optimisation
  • Automation
  • Security
  • Documentation
  • Keep the lights on

General Workflow

  1. Mattermost members open new issues in the SRE Board.
  2. New issues are reviewed weekly via Issue Triage.

Areas of Ownership

The team regularly works on the following tasks, listed in priority order:
  • Ensuring the availability of Mattermost Cloud user-facing services
  • Ensuring the availability of community.mattermost.com (e.g. daily releases, calls)
  • Ensuring the availability of the internal Cloud Platform (e.g. self-serve testing environments, Observability Platform, GitOps Platform, IaC platform)

Meetings

Topics                            | Meeting                 | Participants                                  | Cadence
Incident Review & Knowledge share | Reliability Engineering | Cloud leadership, SRE, Release                | Monday
Triage & Planning                 | SRE Planning            | SRE, Platform                                 | Tuesday
Cross-org collaboration           | Infrastructure Guild    | Leadership, Infrastructure, Product, Security | Thursday
Bring chaos to our systems        | Chaos Gameday           | SRE                                           | Monthly