The Reliability Manifesto is a collection of rules, guidelines, and best practices that reflect our current thinking on what it takes to build a reliable and cost-effective system following a cloud-first approach. The manifesto aims to be short, clear, and easy to implement and monitor. It should be acknowledged by all Mattermost teams that use the Mattermost cloud services or develop code that will run in the production infrastructure.
To make things simple, we decided to split this document into a number of distinct categories that can be considered individual manifestos and can be seen below.
Foundation
Continuous Improvement
Design & Architecture
Observability
Performance & Cost Efficiency
Resilience
Security
The set of rules that should set the foundation for this reliability manifesto.
F-1 We do not silently bypass the rules in this document, but instead, start a discussion to change the rules when we think something does not fit our situation.
F-2 This set of rules should reflect measurable targets in the quarter OKRs of the teams that utilize the Infrastructure platform.
F-3 We assess the validity of the rules in this document twice a year and we adapt accordingly.
There is always room for improvement in our knowledge and our systems. The continuous improvement manifesto aims to make us more efficient, resilient and better engineers.
C-1 We should track all our production incidents. Major and critical incidents are reported and updated using our statuspage. All incidents are tracked via an Infrastructure Incident Response v2.0 playbook run. More details here.
C-2 Post mortem meetings and documents should follow every major and critical production incident. Runbooks if any should be added in the internal documentation for future reference.
C-3 No single engineer should fully deploy a new service across all environments. We share knowledge and reduce the team bus factor.
C-4 A blameless culture is key. People make mistakes and systems fail. We need to learn from failures and help each other.
Better design and architectures create more reliable systems and reduce technical debt. This is what the Design & Architecture manifesto is all about.
D-1 Infrastructure team should be involved early in the design process for applications that will run in the infrastructure systems. Read about the
D-2 All new services should go through the POC phase and findings should be presented in a team meeting, together with proposal, design architecture and implementation steps. Example reference .
D-3 We should avoid monolithic designs for custom services when possible. Microservice design should be the aim and services should be open sourced for community usage when they don’t expose private company information.
D-4 All services should be accompanied by documentation and, when possible, architecture diagrams (e.g. Lucidchart).
D-5 Always be mindful that services behave differently at scale. Configurations that work in local environments can have a major impact on production systems. One example is excessive use of database connections.
D-6 All services should support being turned on and off on demand without affecting the rest of the infrastructure services.
The observability manifesto should cover three main pillars:
Logs
Metrics
Tracing
O-1 Logs should include meaningful messages, and a stack trace should not be required to understand the log message.
O-2 Service logs are enriched with metadata & attributes that differentiate the service (e.g. service: ldap).
O-3 Logging levels should be consistent across services and false errors should be avoided to enable generation of metrics, patterns and dashboards.
O-4 Logging format structure should be consistent across services and should be agreed across all teams. String and JSON formats should be used for all logs.
O-5 Logs of critical production systems should be retained for 30 days and logs of non-critical production systems for 7 days. As an exception, logs can be kept for up to 1 year in S3 utilizing Glacier options.
O-6 All services should export key metrics under the /metrics path, enriched with metadata & attributes that differentiate the service (e.g. service: playbooks).
O-7 All application metrics should be exposed on port 8067.
O-8 All applications and services should expose their software version as a metric.
O-9 Applications and services should expose availability and latency metrics.
O-10 Raw metrics should be kept for 7 days and 5m/1h resolution metrics should be kept for 365 days for all production systems.
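As an illustration of O-2 and O-4, the following is a minimal sketch of structured JSON logging that attaches a service attribute to every log line. It uses only the Python standard library; the field names (timestamp, level, service, message) and the helper names JsonFormatter and build_logger are assumptions made for this example, not an agreed Mattermost format.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of keys."""

    def __init__(self, service: str) -> None:
        super().__init__()
        # Metadata that identifies the emitting service (rule O-2).
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines tagged with the service name."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    logger.propagate = False
    return logger


# Hypothetical usage: the message explains the problem without a stack trace (O-1).
logger = build_logger("ldap")
logger.warning("LDAP sync skipped %d users: bind account is locked", 12)
```

Keeping the key set identical across services is what makes it practical to build shared patterns and dashboards on top of the logs.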
Improving performance and cutting costs of our systems makes our services more competitive and efficient. This is what the Performance & Cost Efficiency manifesto is all about.
P-1 We avoid solving problems by throwing money at them. If a system constantly needs more money to perform, we need to seek other solutions.
P-2 We utilize cost optimization at scale. The more customers we add, the better we should utilize our shared infrastructure and cost per customer KPI should be reduced.
P-3 We should aim for a 2% increase in our spot machines in each quarter with a maximum of 50% of the total fleet.
P-4 The cost of non production environments should not exceed 30% of the total Mattermost infrastructure environments cost.
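The targets in P-3 and P-4 can be checked with simple arithmetic. The sketch below assumes that the "2% increase" in P-3 means two percentage points of the total fleet per quarter, which the rule does not state explicitly; the function names and the sample figures are hypothetical.

```python
SPOT_STEP_PCT = 2.0        # assumed: percentage points added per quarter (P-3)
SPOT_CAP_PCT = 50.0        # maximum share of the fleet on spot machines (P-3)
NON_PROD_LIMIT_PCT = 30.0  # maximum non-production share of total cost (P-4)


def next_spot_target_pct(current_pct: float) -> float:
    """Spot share to aim for next quarter, never exceeding the cap."""
    return min(current_pct + SPOT_STEP_PCT, SPOT_CAP_PCT)


def non_prod_within_limit(non_prod_cost: float, total_cost: float) -> bool:
    """True when non-production cost stays within the P-4 limit."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return 100.0 * non_prod_cost / total_cost <= NON_PROD_LIMIT_PCT


# Hypothetical figures for illustration only.
print(next_spot_target_pct(20.0))
print(non_prod_within_limit(25_000.0, 100_000.0))
```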
Resilience ensures that our services are there for our customers even on bad days. This is what this manifesto is about.
R-1 We should always aim for higher SLOs than the ones promised to our customers. The SRE team would influence other teams on what needs to be improved to meet higher SLAs.
R-2 Each service should have a dedicated SLO target, which should always be higher than the external SLO.
R-3 We run monthly gamedays, injecting chaos to the system. The impact should be measured and post game day actions should be defined. The whole team should be involved.
R-4 Disaster recovery protocols and playbooks should be defined for all our key components (K8s Clusters, Database Clusters, DNS, etc.)
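To make R-1 and R-2 concrete, an availability SLO can be translated into an error budget, that is, the amount of downtime the target tolerates over a window. The sketch below is one common way to do this; the 30-day window and the SLO values used in the example are assumptions, not Mattermost's actual targets.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO allows over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def internal_slo_is_stricter(internal_percent: float, external_percent: float) -> bool:
    """R-2: the internal target should be higher than the external one."""
    return internal_percent > external_percent


# Hypothetical targets: 99.9% promised externally, 99.95% aimed for internally.
external, internal = 99.9, 99.95
print(internal_slo_is_stricter(internal, external))
print(round(error_budget_minutes(external), 1), "minutes allowed externally")
print(round(error_budget_minutes(internal), 1), "minutes allowed internally")
```

The gap between the two budgets is the safety margin the team has before a customer-facing commitment is at risk.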
Keeping our systems secure keeps our customers safe and our nights lighter. This is what the Security manifesto is all about.
S-1 We follow security team guidelines and we get security team approval on new system deployments.
S-2 We reduce our attack surface by not exposing to the public internet whatever should not be public.
S-3 Secrets should be stored in secret management services (e.g. Vault, AWS Secrets Manager) and not in GitLab/GitHub repos.
S-4 Security groups should be applied to all machines and limit access to only the ports and sources needed. Both deployment and evaluation should be done in an automated way.
S-5 IAM roles should be used when possible, IAM users should be avoided, and IAM policies should grant least privilege.
S-6 We should always aim to eliminate high severity issues arising from security tooling analysis (e.g. StackRox).
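The automated evaluation mentioned in S-4 could, for example, flag ingress rules that are open to the whole internet on ports that are not meant to be public. The sketch below works on a simplified, hypothetical rule model (IngressRule with a port and a CIDR source); it does not call any cloud provider API, and a real check would first load the rules from the provider.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical, simplified view of one security group ingress rule."""
    port: int
    source: str  # CIDR block, e.g. "10.0.0.0/16" or "0.0.0.0/0"


def is_world_open(rule: IngressRule) -> bool:
    """True for 0.0.0.0/0 and ::/0, i.e. any source on the internet."""
    return ipaddress.ip_network(rule.source).prefixlen == 0


def find_violations(rules: Iterable[IngressRule], public_ports: Set[int]) -> List[IngressRule]:
    """Return rules that expose a non-public port to every source."""
    return [r for r in rules if is_world_open(r) and r.port not in public_ports]


# Hypothetical rule set: only HTTPS is intended to be public.
rules = [
    IngressRule(443, "0.0.0.0/0"),
    IngressRule(22, "0.0.0.0/0"),
    IngressRule(5432, "10.0.0.0/16"),
]
for rule in find_violations(rules, public_ports={443}):
    print(f"port {rule.port} is open to {rule.source}")
```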
The Infrastructure group empowers Mattermost to provide a SaaS Platform as Product which serves internal and external users by guaranteeing that we operate an enterprise-grade SaaS platform with self-serve powers.
The Infrastructure group achieves this by focusing on quality, availability, reliability, scalability, and security objectives. In addition to this we prioritize cost efficiency and awareness adopting FinOps culture, which is strengthened by appropriately prioritized dogfooding initiatives.
For the success of the SaaS Platform as a Product, there are many other teams which also contribute. The responsibility of the Infrastructure group is to evolve the SaaS platform enabled by platform observability data.
Operate a fast, secure and reliable SaaS platform to which everyone can contribute
Be open and data driven
Use our own product to complete our mission
Align our strategy with the industry trends, company direction and customer needs
The Infrastructure group uses RFCs - requests for comment - or Design Docs as a common tool to describe the problem we are solving and represent the current state for any topic.
The Infrastructure group uses Mattermost features as a core tool for the following:
Secure collaboration with Channels
Incident Management using Playbooks
Release and QA approval process using Playbooks
With the above in mind, everyone is encouraged to:
Feel comfortable contributing to Mattermost open source projects
Remember that sharing is caring, and be ready to open source a project in order to give back to the community
Use our product to achieve our goals and dogfood
As a group, we believe that predictability is an important part of our DNA, so we aim for predictability first and believe that speed will follow. High impact and customer obsession are core factors in our success at Mattermost and in how we prioritize work items and ideas, using PICK as a prioritization framework. The acronym PICK stands for Possible, Implement, Challenge and Kill.
Possible: Low payoff, easy to do
Implement: High payoff, easy to do
Challenge: High payoff, hard to do
Kill: Low payoff, hard to do
Any idea that has:
Low ROI & Low Cost/Risk is considered a Possibility.
High ROI & Low Cost/Risk is Implemented (“Quick Wins”).
High ROI & High Cost/Risk is considered a Challenge.
Low ROI & High Cost/Risk is Killed.
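The PICK quadrants can be expressed as a small classification function. This is only a sketch: reducing ROI and cost/risk to two yes/no inputs is a simplification made for the example, and the Kill quadrant is taken to be low ROI with high cost/risk, following the usual PICK chart convention.

```python
from enum import Enum


class Pick(Enum):
    POSSIBLE = "Possible"
    IMPLEMENT = "Implement"
    CHALLENGE = "Challenge"
    KILL = "Kill"


def classify(high_roi: bool, high_cost_or_risk: bool) -> Pick:
    """Map an idea to its PICK quadrant."""
    if high_roi:
        return Pick.CHALLENGE if high_cost_or_risk else Pick.IMPLEMENT
    return Pick.KILL if high_cost_or_risk else Pick.POSSIBLE


# A quick win: high payoff and easy to do.
print(classify(high_roi=True, high_cost_or_risk=False).value)
```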
This page has now moved to docs.mattermost.com. Please see the latest processes for migrating your Cloud data, including data imports and exports.
https://docs.mattermost.com/manage/cloud-data-export.html#migrate-from-cloud-to-self-hosted
Community as internal release ring for testing potential releases
Leadership, Infrastructure, Product, Security
Thursday
Incident Review & Knowledge share
Reliability Engineering Guild
Leadership, SRE, Delivery
Monday
Cross-org collaboration

Infrastructure Guild
The production readiness review (PRR) is a process that identifies the reliability needs of a service, feature (mid to large only) or a significant change to infrastructure. The Mattermost PRR has been inspired from the production readiness review from the SRE book. A PRR is considered a prerequisite for the Infrastructure team to accept responsibility for managing the production aspects of a service, feature or infrastructure change.
The goal of the readiness review is to improve the following:
System architecture and interservice dependencies
Observability (metrics, monitoring, logging)
Incident Response
Capacity planning
Change management
Performance, availability, latency and efficiency
Security posture
Increase awareness and boost collaboration with the goal of building a cloud-native product
This will help to increase the collaboration between Product, Security and Infrastructure teams in order to bridge any gaps about a new service, feature or infrastructure change. The review document is not intended to be constantly updated and it reflects a snapshot of the reality of what is being deployed and the discussions around it.
The PRR process is initiated by the DRI of the related work with the following steps:
Run the Mattermost PRR Playbook
The title of the Playbook run should be self-descriptive for the change.
Include the relevant reviewers in the PRR Mattermost channel
Once all comments have been addressed and the reviewers are satisfied, they can check off the task items they own. When all required approvals are in place, the playbook run should be closed.
The PRR playbook uses , so when you are done with your review and you are ready to approve you can just write in the channel the following:
SRE review is completed
Platform review is completed
etc.
This will automatically mark the task item list as done based on the current members of the engineering teams.
You can find the Infrastructure library
Early each month, a cost report is prepared covering the previous month's spend. The report takes its inputs from AWS monthly billing and Zesty cost per AWS account information. That information is incorporated into various sheets, from which the diagrams embedded in the monthly cost report are created.
The Cloud Cost Optimization Reports can be found in Google Slides here. A new report is created each month. For example, the report for January 2024 is Cloud Cost Optimization Report #37.
The sheet where all the information is combined and then populated into the cost reports: FY24 Cost tracking AWS accounts
From that sheet, the older cost reports were populated; it is now used for the average cost per workspace: Monthly costs per environment
Shows the cost analysis in Cloud Production per type, such as S3, ELB, VPCs, NAT, etc.
How much each
Shows the
The Cloud SRE team uses certain KPIs to monitor cloud infrastructure costs from the engineering standpoint. These KPIs are used to help the business and the Cloud SRE team to set goals for cost optimization.
As of July 2021, the goals for the SRE team are to:
Decrease the Average Cost Per Workspace (APWC).
Keep development and test environments costs (DM) below 20% of total costs.
Increase the WB fraction (ratio of paid workspaces relative to total workspaces).
Below are the primary and secondary KPIs used to measure each.
Description: The fraction which compares the amount of Production Environment costs with the number of workspaces which are active in a given month of the Cloud Infrastructure service. Production Environment will be defined as the environment where we host all our customers, including paid and free workspaces.
Formula: APWC = Production Environment Costs/Number of Workspaces
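As a worked example of the APWC formula, the sketch below computes the value for two months and the relative change between them, which is what the goal of decreasing APWC tracks. All figures are hypothetical and the function names are chosen for this example.

```python
def apwc(production_costs: float, workspaces: int) -> float:
    """APWC = Production Environment Costs / Number of Workspaces."""
    if workspaces <= 0:
        raise ValueError("workspaces must be positive")
    return production_costs / workspaces


def relative_change(previous: float, current: float) -> float:
    """Fractional change from the previous month; negative means a decrease."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous


# Hypothetical months: costs grow more slowly than the number of workspaces.
last_month = apwc(120_000.0, 8_000)   # 15.0 per workspace
this_month = apwc(126_000.0, 9_000)   # 14.0 per workspace
print(last_month, this_month, relative_change(last_month, this_month) < 0)
```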
Description: An approximation of what free workspaces cost us. The fraction compares the amount of Production Environment costs with the number of active free workspaces.
Formula: APFC = Production Environment Costs/Freemium Active Workspaces
Note: Until February 2021 we measured all workspaces as active. Hibernation functionality was introduced in the last three days of February 2021.
Description: The fraction which compares the amount of Production Environment costs with the number of workspaces that have 11 or more users within a given month of the Cloud Infrastructure service. A paid workspace is defined as successful receipt of payment for the time period (e.g. payment for the previous month completed, when measuring ASWC for the previous month).
Formula: ASWC = Production Environment Costs/Paid Workspaces
Description: The fraction which compares the amount we spend on environments related to our features testing with the total Cloud infrastructure costs.
Formula: DM = (Test Environment Costs + Dev Environment Costs)/Total AWS Cloud Costs
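The DM formula can be combined with the stated goal of keeping development and test costs below 20% of total costs. The sketch below uses hypothetical cost figures and function names.

```python
DM_GOAL = 0.20  # development and test costs should stay below 20% of total costs


def dm(test_costs: float, dev_costs: float, total_costs: float) -> float:
    """DM = (Test Environment Costs + Dev Environment Costs) / Total AWS Cloud Costs."""
    if total_costs <= 0:
        raise ValueError("total_costs must be positive")
    return (test_costs + dev_costs) / total_costs


def meets_dm_goal(test_costs: float, dev_costs: float, total_costs: float) -> bool:
    """True when the DM fraction is strictly below the goal."""
    return dm(test_costs, dev_costs, total_costs) < DM_GOAL


# Hypothetical month: 9,000 on test and 6,000 on dev out of 100,000 in total.
print(dm(9_000.0, 6_000.0, 100_000.0))
print(meets_dm_goal(9_000.0, 6_000.0, 100_000.0))
```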
Description: The fraction which compares the variable costs we have because we run the Cloud product with the costs that are directly correlated with customers' usage (both paid and freemium).
Formula: CPBM = Customer Specific Costs/Variable Platform Costs
Description: The fraction which compares the paid workspaces with the total number of workspaces within a given month of the Cloud Infrastructure service.
Formula: WB = Paid Workspaces/Total Workspaces
Description: The fraction which compares the amount of Production Environment and Core Environment costs with the number of workspaces within a given month of the Cloud Infrastructure service.
Formula: SAPWC = (Production Environment Costs + Core Environment Costs)/Workspaces
Description: The fraction which compares the amount of Production Environment and Core Environment costs with the number of workspaces that have 11 or more users within a given month of the Cloud Infrastructure service.
Formula: SASWC = (Production Environment Costs + Core Environment Costs)/Paid Workspaces
Description: The fraction which compares the amount of Production Environment costs with the number of active workspaces within a given month of the Cloud Infrastructure service.
Formula: AAPWC = Production Environment Costs/Active Workspaces
Description: The fraction which compares the amount of Production Environment and Core Environment costs with the number of active workspaces within a given month of the Cloud Infrastructure service.
Formula: SAASWC = (Production Environment Costs + Core Environment Costs)/Active Workspaces
Description: Roughly how much free workspaces cost us. The fraction which compares the amount of Production Environment costs with the number of all free workspaces (active and inactive).
Formula: ATPFC = Production Environment Costs/Freemium Workspaces (Active and Inactive)
Description: The fraction which compares the amount we spend on all non-production environments with the total AWS Cloud costs.
Formula: NPEC = (Test Environment Costs + Dev Environment Costs + Core Environment Costs + Staging Environment Costs)/Total AWS Cloud Costs
Description: The ratio of active to hibernated workspaces in any given month.
Formula: AHR = Active Workspaces/Hibernated Workspaces
Note: Until February 2021 we measured all workspaces as active. Hibernation functionality was introduced in the last three days of February 2021.
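The count-based ratios WB and AHR can be computed from one monthly snapshot of workspace counts. The sketch below uses a hypothetical WorkspaceCounts structure and made-up figures. It returns no AHR value when there are no hibernated workspaces, which is one way to handle months before hibernation existed; that handling is an assumption of this example, not a documented rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkspaceCounts:
    """Hypothetical monthly snapshot of workspace counts."""
    total: int
    paid: int
    active: int
    hibernated: int

    def wb(self) -> float:
        """WB = Paid Workspaces / Total Workspaces."""
        if self.total <= 0:
            raise ValueError("total must be positive")
        return self.paid / self.total

    def ahr(self) -> Optional[float]:
        """AHR = Active Workspaces / Hibernated Workspaces, or None if undefined."""
        if self.hibernated == 0:
            return None  # e.g. months in which every workspace counted as active
        return self.active / self.hibernated


# Hypothetical month with hibernation in place.
month = WorkspaceCounts(total=8_000, paid=400, active=6_000, hibernated=2_000)
print(month.wb(), month.ahr())
```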
This is the library of the Design docs.
When a Mattermost Cloud customer requests their workspace to be deleted, the following steps are taken:
1 - Customer sends a request to delete workspace.
2 - Support team asks for:
Cloud URL and workspace admin email
Churn-related information:
Why are you no longer interested in using Mattermost?
What tool/product will you use instead of Mattermost?
Would anything have changed your decision?
3 - Support team opens a private Jira ticket with information from step 2 and tags SRE and Product Manager.
4 - SRE checks if workspace is a) paid or b) has 5 or more monthly active users:
If yes, ask Support to introduce the Product Manager to the customer via email. The Product Manager then owns trying to keep the customer, working with Customer Success as appropriate, or getting more detailed information on why they are discontinuing Mattermost.
Else, delete workspace.
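The decision in step 4 can be sketched as a small function. The names below (Action, next_action, MAU_THRESHOLD) are hypothetical and only mirror the rule as written above; the actual process is carried out by people through the Jira ticket.

```python
from enum import Enum

MAU_THRESHOLD = 5  # "5 or more monthly active users" from step 4


class Action(Enum):
    INTRODUCE_PRODUCT_MANAGER = "ask Support to introduce the Product Manager"
    DELETE_WORKSPACE = "delete workspace"


def next_action(is_paid: bool, monthly_active_users: int) -> Action:
    """Decide what SRE does with a deletion request."""
    if is_paid or monthly_active_users >= MAU_THRESHOLD:
        return Action.INTRODUCE_PRODUCT_MANAGER
    return Action.DELETE_WORKSPACE


# A free workspace with three monthly active users is deleted directly.
print(next_action(is_paid=False, monthly_active_users=3).value)
```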