Site Reliability Engineering
Who We Are
The Site Reliability Engineering (SRE) team ensures the availability of Mattermost Cloud user-facing services and builds the tools and automation needed to monitor and maintain that availability. These services span multiple environments, including testing, development, and production.
Vision
Enable developers to build reliable, scalable, cost-efficient services that keep Mattermost Cloud customers' reliability targets (SLA, SLO) and SRE principles (observability, SLOs, etc.) in mind.
Actions
On-call, Incident Management and Incident Review
Change management
Influence and encourage best practices
Maintain and update documentation, training, and Runbooks
Empower developers with self-serve tools
Meet & review reliability, quality, and cost KPIs regularly
Principles
Be proactive instead of reactive
Be open and data driven
Embrace risk and chaos
Promote and embrace communication
Evangelise cost effectiveness
Evangelise and adopt SRE & DevOps practices
Ask for Help
If you need our assistance, please follow the General Workflow.
You can also reach us on the Mattermost Community Server as follows:
For support questions related to production, you can reach us in the ~cloud-support channel.
For all other questions, you can reach us in the ~cloud-sre-team channel.
How we work
Each quarter, we maintain our OKRs as epics in this board, which is divided into the following areas:
Reliability & Resiliency
Availability & Cost Optimisation
Automation
Security
Documentation
Keep the lights on
General Workflow
Mattermost members open new issues in the SRE Board.
New issues are reviewed weekly via Issue Triage.
Areas of Ownership
The team regularly works on the following tasks, in priority order:
Ensuring availability of Mattermost Cloud user-facing services
Ensuring availability of community.mattermost.com (e.g. daily releases, calls)
Ensuring the availability of the internal Cloud Platform (e.g. self-serve testing environments, Observability Platform, GitOps Platform, IaC platform)
Meetings
Incident Review & Knowledge share | Reliability Engineering | Cloud leadership, SRE, Release | Monday
Triage & Planning | SRE Planning | SRE, Platform | Tuesday
Cross-org collaboration | Infrastructure Guild | Leadership, Infrastructure, Product, Security | Thursday
Bring chaos to our systems | Chaos Gameday | SRE | Monthly