Site Reliability Engineering
The Site Reliability Engineering (SRE) team ensures the availability of Mattermost Cloud user-facing services and builds the tools and automation needed to monitor and maintain that availability. These user-facing services span multiple environments, including testing, development, and production.
Enable developers to build reliable, scalable, cost-efficient services that keep Mattermost Cloud customers' reliability targets and SRE principles (observability, SLOs, etc.) in mind.
- On-call, Incident Management and Incident Review
- Change management
- Influence and encourage best practices
- Maintain and update documentation, training, and Runbooks
- Empower developers with self-serve tools
- Meet & review reliability, quality, and cost KPIs regularly
- Be proactive instead of reactive
- Be open and data-driven
- Embrace risk and chaos
- Promote and embrace communication
- Evangelise cost effectiveness
- Evangelise and adopt SRE & DevOps practices
You can also reach us on the Mattermost Community Server as follows:
1. For support questions related to production you can reach us in the ~cloud-support channel.
2. For all the other questions you can reach us in the ~cloud-sre-team channel.
Each quarter, we maintain our OKRs as epics in this board, which is divided into the following areas:
- Reliability & Resiliency
- Availability & Cost Optimisation
- Automation
- Security
- Documentation
- Keep the lights on
1.
2. New issues are reviewed weekly via Issue Triage.
The team regularly works on the following tasks, listed below in priority order:
- Ensuring availability of Mattermost Cloud user-facing services
- Ensuring availability of community.mattermost.com (e.g. daily releases, calls)
- Ensuring the availability of the internal Cloud Platform (e.g. self-serve testing environments, Observability Platform, GitOps Platform, IaC Platform)
Topics | Meeting | Participants | Cadence |
---|---|---|---|
Incident Review & Knowledge share | Reliability Engineering | Cloud leadership, SRE, Release | Monday |
Triage & Planning | SRE Planning | SRE, Platform | Tuesday |
Cloud & Growth Sync | Cloud Engineering | SRE, Release, Platform, Growth | Weekly |
Bring chaos to our systems | Chaos Gameday | SRE | Monthly |