Reliability Manifesto

The Reliability Manifesto is a collection of rules, guidelines, and best practices that reflect our current thinking on what it takes to build a reliable and cost-effective system following a cloud-first approach. This manifesto should keep things short, clear, and easy to implement and monitor. This document should be acknowledged by all Mattermost teams that use the Mattermost cloud services and that develop code that will run in the production infrastructure.

To keep things simple, we split this document into distinct categories, each of which can be considered an individual manifesto:

  • Foundation

  • Continuous Improvement

  • Design & Architecture

  • Observability

  • Performance & Cost Efficiency

  • Resilience

  • Security

Foundation

The set of rules that should set the foundation for this reliability manifesto.

F-1 We do not silently bypass the rules in this document. Instead, we start a discussion to change the rules when we think something does not fit our situation.

F-2 This set of rules should be reflected in measurable targets in the quarterly OKRs of the teams that use the Infrastructure platform.

F-3 We assess the validity of the rules in this document twice a year and we adapt accordingly.

Continuous Improvement

There is always room for improvement in our knowledge and our systems. The continuous improvement manifesto aims to make us more efficient, resilient and better engineers.

C-1 We should track all our production incidents. Major and critical incidents are reported and updated using our status page. All incidents are tracked via an Infrastructure Incident Response v2.0 playbook run. More details here.

C-2 Post-mortem meetings and documents should follow every major and critical production incident. Runbooks, if any, should be added to the internal documentation for future reference.

C-3 No single engineer should fully deploy a new service across all environments. We share knowledge and reduce the team bus factor.

C-4 A blameless culture is key. People make mistakes and systems do fail. We need to learn from them and help each other.

Design & Architecture

Better design and architectures create more reliable systems and reduce technical debt. This is what the Design & Architecture manifesto is all about.

D-1 The Infrastructure team should be involved early in the design process for applications that will run on the infrastructure systems. Read about the Production Readiness Review.

D-2 All new services should go through a POC phase, and the findings should be presented in a team meeting together with the proposal, design architecture, and implementation steps. Example reference here.

D-3 We should avoid monolithic designs for custom services when possible. Microservice design should be the aim and services should be open sourced for community usage when they don’t expose private company information.

D-4 All services should be accompanied by documentation and, when possible, architecture diagrams (e.g. Lucidchart).

D-5 Always be mindful that services behave differently at scale. Configurations that work in local environments could have a major impact on production systems. One example is excessive use of database connections.

D-6 All services should support being turned off and on on demand without affecting the rest of the infrastructure services.
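To illustrate D-5, the sketch below estimates how a per-instance connection pool setting multiplies once a service is scaled out. The function name, replica counts, pool size, and database limit are illustrative assumptions, not Mattermost values.

```python
def total_db_connections(replicas: int, pool_size_per_replica: int) -> int:
    """Worst-case number of connections opened against one database."""
    return replicas * pool_size_per_replica


DB_MAX_CONNECTIONS = 1000  # assumed database limit

# Hypothetical numbers: a pool size that is harmless locally...
local = total_db_connections(replicas=1, pool_size_per_replica=50)
# ...can exceed the database limit once the service runs 40 replicas.
production = total_db_connections(replicas=40, pool_size_per_replica=50)

print(local <= DB_MAX_CONNECTIONS)       # True
print(production <= DB_MAX_CONNECTIONS)  # False
```

The same setting that uses 5% of the assumed limit on a laptop asks for twice the limit in production, which is the kind of impact D-5 warns about.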

Observability

The observability manifesto should cover three main pillars:

  • Logs

  • Metrics

  • Tracing

Logs

O-1 Logs should include meaningful messages. A stack trace should not be required to understand a log message.

O-2 Service logs are enriched with metadata & attributes that differentiate the service (e.g. service: ldap).

O-3 Logging levels should be consistent across services and false errors should be avoided to enable generation of metrics, patterns and dashboards.

O-4 Logging format structure should be consistent across services and should be agreed across all teams. String and JSON formats should be used for all logs.

O-5 Production system logs should be retained for 30 days if critical and for 7 days if non-critical. As an exception, logs can be kept for up to 1 year in S3 using Glacier options.
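As a minimal sketch of O-1, O-2 and O-4, the example below emits one JSON object per log line and attaches a service attribute to every record. It uses only the Python standard library. The field names (timestamp, level, service, msg) are assumptions for illustration, since the manifesto leaves the exact format to be agreed across teams.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with service metadata."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,  # O-2: attribute that identifies the service
            "msg": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Create a logger whose only handler writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("ldap")
# O-1: the message is understandable without a stack trace.
logger.warning("LDAP sync skipped: bind to directory timed out after 30s")
```

Keeping the level names and keys identical across services is what makes it possible to build shared metrics and dashboards on top of the logs (O-3).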

Metrics

O-6 All services should export key metrics under the /metrics path, enriched with metadata & attributes that differentiate the service (e.g. service: playbooks).

O-7 All application metrics should be exposed via port 8067.

O-8 Software version metrics should be exposed for all applications and services.

O-9 Applications and services should expose availability and latency metrics.

O-10 Raw metrics should be kept for 7 days and 5m/1h resolution metrics should be kept for 365 days for all production systems.
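The metrics rules can be sketched with the standard library alone. The example assumes the Prometheus text exposition format, which the manifesto does not name. The metric names (app_build_info, app_up, app_request_latency_seconds) and the sample values are hypothetical. A real service would normally use a metrics client library and a histogram for latency rather than a single value.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS_PORT = 8067  # O-7


def render_metrics(service: str, version: str, up: bool, latency_seconds: float) -> str:
    """Render metrics as text lines (Prometheus exposition format is an assumption)."""
    labels = f'service="{service}"'  # O-6: attribute that identifies the service
    lines = [
        f'app_build_info{{{labels},version="{version}"}} 1',  # O-8: version
        f"app_up{{{labels}}} {1 if up else 0}",  # O-9: availability
        f"app_request_latency_seconds{{{labels}}} {latency_seconds}",  # O-9: latency
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":  # O-6: metrics live under /metrics
            self.send_error(404)
            return
        body = render_metrics("playbooks", "1.2.3", True, 0.042).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve() -> None:
    """Blocks forever; call this from the service entry point."""
    HTTPServer(("", METRICS_PORT), MetricsHandler).serve_forever()


print(render_metrics("playbooks", "1.2.3", True, 0.042), end="")
```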

Performance & Cost Efficiency

Improving performance and cutting costs of our systems makes our services more competitive and efficient. This is what the Performance & Cost Efficiency manifesto is all about.

P-1 We avoid throwing money at the problem. If a system constantly needs more money to perform, we need to seek other solutions.

P-2 We utilize cost optimization at scale. The more customers we add, the better we should utilize our shared infrastructure, and the cost per customer KPI should decrease.

P-3 We should aim for a 2% increase in our spot machines each quarter, up to a maximum of 50% of the total fleet.

P-4 The cost of non-production environments should not exceed 30% of the total Mattermost infrastructure environments cost.
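The targets in P-3 and P-4 are easy to check mechanically. The sketch below reads the "2% increase" in P-3 as two percentage points of the fleet per quarter, which is an assumption because the rule does not say whether the increase is absolute or relative. The function names and sample figures are illustrative.

```python
SPOT_STEP = 0.02       # P-3: quarterly increase, read here as percentage points
SPOT_CAP = 0.50        # P-3: maximum share of the total fleet
NON_PROD_LIMIT = 0.30  # P-4: maximum share of total infrastructure cost


def next_spot_target(current_share: float) -> float:
    """Spot share to aim for next quarter, capped at SPOT_CAP."""
    return min(round(current_share + SPOT_STEP, 4), SPOT_CAP)


def non_prod_within_budget(non_prod_cost: float, total_cost: float) -> bool:
    """True when non-production spend respects the P-4 limit."""
    return non_prod_cost / total_cost <= NON_PROD_LIMIT


print(next_spot_target(0.30))                   # 0.32
print(next_spot_target(0.49))                   # 0.5
print(non_prod_within_budget(25_000, 100_000))  # True
```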

Resilience

Resilience ensures that our services are there for our customers even on bad days. This is what this manifesto is about.

R-1 We should always aim for higher SLOs than the ones promised to our customers. The SRE team advises other teams on what needs to be improved to meet higher SLAs.

R-2 Each service should have a dedicated SLO target, which should always be higher than the external SLO.

R-3 We run monthly gamedays, injecting chaos into the system. The impact should be measured and post-gameday actions should be defined. The whole team should be involved.

R-4 Disaster recovery protocols and playbooks should be defined for all our key components (K8s Clusters, Database Clusters, DNS, etc.).
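For R-1 and R-2, it can help to see what a stricter internal target means in allowed downtime. The sketch below assumes availability SLOs over a 30-day window. The two target values are hypothetical, not Mattermost's actual SLOs.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the period."""
    return (1 - slo) * period_days * 24 * 60


def internal_slo_is_stricter(internal_slo: float, external_slo: float) -> bool:
    """R-2: the per-service target must be higher than the external one."""
    return internal_slo > external_slo


EXTERNAL_SLO = 0.999   # hypothetical customer-facing target
INTERNAL_SLO = 0.9995  # hypothetical internal target

print(internal_slo_is_stricter(INTERNAL_SLO, EXTERNAL_SLO))  # True
print(round(error_budget_minutes(EXTERNAL_SLO), 1))          # 43.2
print(round(error_budget_minutes(INTERNAL_SLO), 1))          # 21.6
```

With these example numbers, the internal target leaves half the downtime budget of the external one, so the team is alerted well before a customer-facing commitment is at risk.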

Security

Keeping our systems secure keeps our customers safe and our nights lighter. This is what the Security manifesto is all about.

S-1 We follow security team guidelines and we get security team approval on new system deployments.

S-2 We reduce our attack surface by not exposing to the public internet whatever should not be public.

S-3 Secrets should be stored in secret management services (e.g. Vault, AWS Secrets Manager) and not in GitLab/GitHub repos.

S-4 Security groups should be applied to all machines and should limit access to only the ports and sources needed. Both deployment and evaluation should be done in an automated way.

S-5 IAM roles should be used when possible, IAM users should be avoided, and IAM policies should grant least privilege.

S-6 We should always aim to eliminate high severity issues arising from security tooling analysis (e.g. StackRox).
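The automated evaluation mentioned in S-4 could take many forms. The sketch below works on plain in-memory data rather than a cloud API, so the IngressRule type, the allowed-port policy, and the sample rules are all hypothetical. It flags any rule that opens a non-approved port to the whole internet, which also relates to S-2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    port: int
    source_cidr: str


ALLOWED_PUBLIC_PORTS = {443}  # assumed policy: only HTTPS may be public
PUBLIC_CIDRS = ("0.0.0.0/0", "::/0")


def find_violations(rules: list[IngressRule]) -> list[IngressRule]:
    """Return rules open to the whole internet on a port that should not be public."""
    return [
        rule
        for rule in rules
        if rule.source_cidr in PUBLIC_CIDRS and rule.port not in ALLOWED_PUBLIC_PORTS
    ]


rules = [
    IngressRule(443, "0.0.0.0/0"),     # allowed: public HTTPS
    IngressRule(22, "0.0.0.0/0"),      # violation: SSH open to the world
    IngressRule(5432, "10.0.0.0/16"),  # allowed: database reachable only internally
]
print(find_violations(rules))
```

A check like this could run in CI against the declared infrastructure so that violations are caught before deployment.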


Infrastructure engineering

Mission

The Infrastructure group empowers Mattermost to provide a SaaS Platform as a Product which serves internal and external users by guaranteeing that we operate an enterprise-grade SaaS platform with self-serve powers.

The Infrastructure group achieves this by focusing on quality, availability, reliability, scalability, and security objectives. In addition, we prioritize cost efficiency and awareness by adopting a FinOps culture, which is strengthened by appropriately prioritized dogfooding initiatives.

For the success of the SaaS Platform as a Product, there are many other teams which also contribute. The responsibility of the Infrastructure group is to evolve the SaaS platform enabled by platform observability data.

Vision

Operate a fast, secure, and reliable SaaS platform to which everyone can contribute.

Teams

Principles

  • Be open and data driven

  • Use our own product to complete our mission

  • Align our strategy with the industry trends, company direction and customer needs

Design

The Infrastructure group uses RFCs - requests for comment - or Design Docs as a common tool to describe the problem we are solving and represent the current state for any topic.

Dogfooding

The Infrastructure group uses Mattermost features as a core tool for the following:

  • Secure collaboration with Channels

  • Incident Management using Playbooks

  • Release and QA approval process using Playbooks

With the above in mind, everyone is encouraged to:

  • Feel comfortable to contribute to Mattermost open source projects

  • Sharing is caring and everyone should have the mindset to open source a project in order to give back to the community

  • Use our product to achieve our goals and dogfood

Prioritisation

As a group, we believe that predictability is an important piece of our DNA, so we aim for predictability first and trust that speed will follow. High impact and customer obsession are among the Mattermost leadership principles that are core to our success and to how we prioritize. We prioritize work items and ideas using the PICK chart as a framework. The acronym PICK stands for Possible, Implement, Challenge, and Kill.

  • Possible: Low payoff, easy to do

  • Implement: High payoff, easy to do

  • Challenge: High payoff, hard to do

  • Kill: Low payoff, hard to do

Any idea that has:

  • Low ROI & Low Cost/Risk is considered a Possibility.

  • High ROI & Low Cost/Risk is Implemented (“Quick Wins”).

  • High ROI & High Cost/Risk is considered a Challenge.

  • Low ROI & High Cost/Risk is Killed or postponed to next year.
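The PICK quadrants reduce to two yes/no questions, which the following sketch makes explicit. The function name and the boolean inputs are illustrative. How ROI and cost/risk are judged "high" or "low" is left to the team.

```python
def pick_category(high_roi: bool, high_cost_or_risk: bool) -> str:
    """Map an idea onto the four PICK quadrants."""
    if high_roi:
        return "Challenge" if high_cost_or_risk else "Implement"
    return "Kill" if high_cost_or_risk else "Possible"


print(pick_category(high_roi=True, high_cost_or_risk=False))   # Implement
print(pick_category(high_roi=False, high_cost_or_risk=True))   # Kill
```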

Meetings

Topics
Meeting
Participants
Cadence

Common Links

General Channels

Team Channels

JIRA trackers

Resources

Cloud data export process

This page has now moved to docs.mattermost.com. Please see the latest processes for migrating your Cloud data, including data imports and exports.

https://docs.mattermost.com/manage/cloud-data-export.html#migrate-from-cloud-to-self-hosted

Influence and educate best practices
Self-serve powers using Plugins and Slash Commands
  • Community as internal release ring for testing potential releases

  • It’s part of Leads’ responsibility to influence this culture
    Kill: Low payoff, hard to do
    Low ROI & High Cost/Risk is Killed/Next year

    Leadership, Infrastructure, Product, Security

    Thursday

    Infrastructure Library
  • Incident Response Playbook

  • Incident Review & Knowledge share

    Reliability Engineering Guild

    Leadership, SRE, Delivery

    Monday

    Cross-org collaboration

    SRE
    Delivery
    Platform
    leadership principles
    PICK chart
    Infrastructure Engineering calendar
    Incident Management Framework
    Status page
    ~infrastructure
    ~cloud-support
    ~infrastructure-delivery-team
    ~infrastructure-sre-team
    ~infrastructure-platform-team
    Delivery
    SRE
    Platform
    On-boarding Playbook
    Reliability Manifesto
    Production Readiness Reviews

    Infrastructure Guild

    Production Readiness Review

    The production readiness review (PRR) is a process that identifies the reliability needs of a service, a feature (mid to large only), or a significant change to infrastructure. The Mattermost PRR was inspired by the production readiness review from the SRE book. A PRR is considered a prerequisite for the Infrastructure team to accept responsibility for managing the production aspects of a service, feature, or infrastructure change.

    The goal of the readiness review is to improve the following:

    • System architecture and interservice dependencies

    • Observability (metrics, monitoring, logging)

    • Incident Response

    • Capacity planning

    • Change management

    • Performance, availability, latency and efficiency

    • Security posture

    • Increase awareness and boost collaboration with the goal of building a cloud-native product

    This will help to increase the collaboration between Product, Security and Infrastructure teams in order to bridge any gaps about a new service, feature or infrastructure change. The review document is not intended to be constantly updated and it reflects a snapshot of the reality of what is being deployed and the discussions around it.

    Process

    The PRR process is initiated by the DRI of the related work with the following steps:

    1. Run the Mattermost PRR Playbook.

    2. The title of the Playbook run should be self-descriptive for the change.

    3. Include the relevant reviewers in the PRR Mattermost channel.

    4. Use the checklist in the Playbook, which will guide you through completing the review.

    Completion

    Once all comments have been addressed and the reviewers are satisfied, they can check off the task items they own. When all required approvals are in place, the playbook run needs to be closed.

    Pro-tips approvals

    The PRR playbook uses Playbooks task actions, so when you are done with your review and ready to approve, you can just write the following in the channel:

    • SRE review is completed

    • Platform review is completed

    • etc.

    This will automatically mark the corresponding task item as done based on the current members of the engineering teams.

    Library

    You can find the Infrastructure library here.

    Cloud infrastructure cost KPIs

    Cloud Cost Optimization Reports

    Early each month, a cost report is prepared with the spend of the previous month. The report takes its inputs from AWS monthly billing and Zesty's cost per AWS account information. That information is incorporated into various sheets, from which the diagrams embedded in the monthly cost report are created.

    The reports - presentations

    The Cloud Cost Optimization Reports can be found in Google Slides here. A new one is created each month. For example, the one for January 2024 is the Cloud Cost Optimization Report #37.

    Sheets and tools to create the reports

    • The sheet where all the information is combined and then populated into the cost reports: FY24 Cost tracking AWS accounts

    • Zesty's report with the savings per account

    • The sheet from which the older cost reports were populated, now used for the average cost per workspace: Monthly costs per environment

    • Shows the cost analysis in Cloud Production per type, such as S3, ELB, VPCs, NAT, etc.: Cost analysis by type

    • How much each cluster costs in Production

    • Shows the AWS Cost Projection

    Cloud Infrastructure Cost KPIs

    The Cloud SRE team uses certain KPIs to monitor cloud infrastructure costs from the engineering standpoint. These KPIs are used to help the business and the Cloud SRE team to set goals for cost optimization.

    As of July 2021, the goals for the SRE team are to:

    • Decrease the Average Production Workspace Cost (APWC).

    • Keep development and test environment costs (DM) below 20% of total costs.

    • Increase the WB fraction (ratio of paid workspaces relative to total workspaces).

    Below are the primary and secondary KPIs used to measure each.

    Primary KPIs

    Average Production Workspace Cost

    • Description: The fraction which compares the amount of Production Environment costs with the number of workspaces which are active in a given month of the Cloud Infrastructure service. Production Environment will be defined as the environment where we host all our customers, including paid and free workspaces.

    • Formula: APWC = Production Environment Costs/Number of Workspaces

    Average Production Freemium Cost

    • Description: An approximation of what free workspaces cost us. The fraction compares the amount of Production Environment costs with the number of active free workspaces.

    • Formula: APFC = Production Environment Costs/Freemium Active Workspaces

      Note: Until February 2021 we measured all workspaces as active. Hibernation functionality was introduced in the last three days of February 2021.

    Average Subscription Workspace Cost

    • Description: The fraction which compares the amount of Production Environment costs with the number of workspaces that have 11 or more users within a given month of the Cloud Infrastructure service. A paid workspace is defined as successful receipt of payment for the time period (e.g. payment for the previous month completed, when measuring ASWC for the previous month).

    • Formula: ASWC = Production Environment Costs/Paid Workspaces

    Development Metric

    • Description: The fraction which compares the amount we spend on environments related to our features testing with the total Cloud infrastructure costs.

    • Formula: DM = (Test Environment Costs + Dev Environment Costs)/Total AWS Cloud Costs

    Customers Platform Balance Metric

    • Description: The fraction which compares the variable costs we have because we run the Cloud product with the costs that are directly correlated with customers' usage (both paid and freemium).

    • Formula: CPBM = Customer Specific Costs/Variable Platform Costs
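The primary KPI formulas above can be worked through with a small example. The sketch below encodes each formula as written. The dataclass names, field names, and all figures are made up for illustration and are not real Mattermost cost data.

```python
from dataclasses import dataclass


@dataclass
class MonthlyCosts:
    production: float
    test: float
    dev: float
    total_aws: float
    customer_specific: float
    variable_platform: float


@dataclass
class WorkspaceCounts:
    total: int
    paid: int
    freemium_active: int


def primary_kpis(c: MonthlyCosts, w: WorkspaceCounts) -> dict[str, float]:
    """Compute the primary KPIs exactly as the formulas define them."""
    return {
        "APWC": c.production / w.total,
        "APFC": c.production / w.freemium_active,
        "ASWC": c.production / w.paid,
        "DM": (c.test + c.dev) / c.total_aws,
        "CPBM": c.customer_specific / c.variable_platform,
    }


# Hypothetical month
costs = MonthlyCosts(
    production=100_000, test=8_000, dev=12_000,
    total_aws=160_000, customer_specific=30_000, variable_platform=60_000,
)
counts = WorkspaceCounts(total=5_000, paid=500, freemium_active=2_000)

print(primary_kpis(costs, counts))
# {'APWC': 20.0, 'APFC': 50.0, 'ASWC': 200.0, 'DM': 0.125, 'CPBM': 0.5}
```

In this made-up month, DM is 12.5%, which would satisfy the goal of keeping development and test costs below 20% of total costs.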

    Secondary KPIs

    Workspaces Balance Metric

    • Description: The fraction which compares the paid workspaces with the total number of workspaces within a given month of the Cloud Infrastructure service.

    • Formula: WB = Paid Workspaces/Total Workspaces

    Secondary Average Production Workspace Cost

    • Description: The fraction which compares the amount of Production Environment and Core Environment costs with the number of workspaces within a given month of the Cloud Infrastructure service.

    • Formula: SAPWC = (Production Environment Costs + Core Environment Costs)/Workspaces

    Secondary Average Subscription Workspace Cost

    • Description: The fraction which compares the amount of Production Environment and Core Environment costs with the number of workspaces that have 11 or more users within a given month of the Cloud Infrastructure service.

    • Formula: SASWC = (Production Environment Costs + Core Environment Costs)/Paid Workspaces

    Active Average Production Workspace Cost

    • Description: The fraction which compares the amount of Production Environment costs with the number of active workspaces within a given month of the Cloud Infrastructure service.

    • Formula: AAPWC = Production Environment Costs/Active Workspaces

    Secondary Active Average Subscription Workspace Cost

    • Description: The fraction which compares the amount of Production Environment and Core Environment costs with the number of active workspaces within a given month of the Cloud Infrastructure service.

    • Formula: SAASWC = (Production Environment Costs + Core Environment Costs)/Active Workspaces

    Average Total Production Freemium Cost

    • Description: Roughly how much free workspaces cost us. The fraction which compares the amount of Production Environment costs with the number of all free workspaces (active and inactive).

    • Formula: ATPFC = Production Environment Costs/Freemium Workspaces (Active and Inactive)

    Non Production Environment Cost

    • Description: The fraction which compares the amount we spend on all non-production environments with the total AWS Cloud costs.

    • Formula: NPEC = (Test Environment Costs + Dev Environment Costs + Core Environment Costs + Staging Environment Costs)/Total AWS Cloud Costs

    Active Hibernated Workspaces Ratio

    • Description: The ratio of active to hibernated workspaces in any given month.

    • Formula: AHR = Active Workspaces/Hibernated Workspaces

    Note: Until February 2021 we measured all workspaces as active. Hibernation functionality was introduced in the last three days of February 2021.
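The secondary KPIs follow the same pattern as the primary ones. The sketch below encodes each formula as written. The function name, parameter names, and sample figures are assumptions for illustration only.

```python
def secondary_kpis(
    *,
    production: float,
    core: float,
    test: float,
    dev: float,
    staging: float,
    total_aws: float,
    workspaces: int,
    paid: int,
    active: int,
    freemium_all: int,
    hibernated: int,
) -> dict[str, float]:
    """Compute the secondary KPIs exactly as the formulas define them."""
    prod_and_core = production + core
    return {
        "WB": paid / workspaces,
        "SAPWC": prod_and_core / workspaces,
        "SASWC": prod_and_core / paid,
        "AAPWC": production / active,
        "SAASWC": prod_and_core / active,
        "ATPFC": production / freemium_all,
        "NPEC": (test + dev + core + staging) / total_aws,
        "AHR": active / hibernated,
    }


# Hypothetical month
kpis = secondary_kpis(
    production=100_000, core=20_000, test=8_000, dev=12_000, staging=10_000,
    total_aws=160_000, workspaces=5_000, paid=500, active=2_500,
    freemium_all=4_500, hibernated=2_500,
)
for name, value in kpis.items():
    print(f"{name}: {value:.4f}")
```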

    Infrastructure Library

    This is the library of the Design docs.

    Cost analysis by type
    Cluster costs in Production
    AWS Cost Projection
  • Service Environment
    Migration from MySQL Aurora to Aurora Postgres with AWS DMS
    Self-serve GitOps
    Crossplane
    Self-serve Observability
    Self-serve cost observability
    Data Engineering Infra
    Fault-tolerant service databases
    SLOs automation
    Disaster Recovery
    Unified logging platform, Loki
    Spot Instances
    PGBouncer fine-tuning
    Chimera OAuth2 proxy
    Provisioner multi-domains support
    Provisioner events
    CI in Forks
    Unified CI
    Unified CI - Self-service platform

    Cloud churn process

    When a Mattermost Cloud customer requests their workspace to be deleted, the following steps are taken:

    1 - Customer sends a request to delete workspace.

    2 - Support team asks for:

    • Cloud URL and workspace admin email

    • Churn-related information:

      • Why are you no longer interested in using Mattermost?

      • What tool/product will you use instead of Mattermost?

      • Would anything have changed your decision?

    3 - Support team opens a private Jira ticket with information from step 2 and tags SRE and Product Manager.

    4 - SRE checks whether the workspace a) is paid or b) has 5 or more monthly active users:

    • If yes, ask Support to introduce the Product Manager to the customer via email. The Product Manager then owns the effort to retain the customer, working with Customer Success as appropriate, or to gather more detailed information on why they are discontinuing Mattermost.

    • Otherwise, delete the workspace.
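The decision in step 4 can be written as a small function. This is only a sketch of the rule as stated. The function name and return strings are illustrative, and the actual deletion and email introduction are manual steps in the process above.

```python
MIN_ACTIVE_USERS_FOR_ESCALATION = 5  # threshold from step 4


def churn_next_step(is_paid: bool, monthly_active_users: int) -> str:
    """Step 4 of the churn process: escalate to retain the customer, or delete."""
    if is_paid or monthly_active_users >= MIN_ACTIVE_USERS_FOR_ESCALATION:
        return "introduce Product Manager"
    return "delete workspace"


print(churn_next_step(is_paid=False, monthly_active_users=3))  # delete workspace
print(churn_next_step(is_paid=False, monthly_active_users=5))  # introduce Product Manager
print(churn_next_step(is_paid=True, monthly_active_users=1))   # introduce Product Manager
```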