Infrastructure engineering

Mission

The Infrastructure group empowers Mattermost to provide a SaaS Platform as a Product which serves internal and external users by guaranteeing that we operate an enterprise-grade SaaS platform with self-serve capabilities.

The Infrastructure group achieves this by focusing on quality, availability, reliability, scalability, and security objectives. In addition, we prioritize cost efficiency and cost awareness by adopting a FinOps culture, which is strengthened by appropriately prioritized dogfooding initiatives.

Many other teams also contribute to the success of the SaaS Platform as a Product. The responsibility of the Infrastructure group is to evolve the SaaS platform, enabled by platform observability data.

Vision

Operate a fast, secure and reliable SaaS platform to which everyone can contribute

Teams

Principles

Be open and data driven
Use our own product to complete our mission
Align our strategy with the industry trends, company direction and customer needs

Design

The Infrastructure group uses RFCs (requests for comment) or Design Docs as a common tool to describe the problem we are solving and to represent the current state of any topic.

Dogfooding

The Infrastructure group uses Mattermost features as a core tool for the following:

Secure collaboration with Channels
Incident Management using Playbooks
Release and QA approval process using Playbooks

With the above in mind, everyone is encouraged to:

Feel comfortable contributing to Mattermost open source projects
Remember that sharing is caring, and keep the mindset of open sourcing a project in order to give back to the community
Use our product to achieve our goals and dogfood

Prioritisation

As a group we believe that predictability is an important piece of our DNA, so we aim for predictability first and we believe that speed will follow. High impact and customer obsession are core factors in our success at Mattermost and in how we prioritize work items and ideas, using PICK as a prioritization framework. The acronym PICK stands for Possible, Implement, Challenge and Kill.

Possible: Low payoff, easy to do
Implement: High payoff, easy to do
Challenge: High payoff, hard to do
Kill: Low payoff, hard to do

Ideas are classified as follows:

Low ROI & Low Cost/Risk: considered a Possibility.
High ROI & Low Cost/Risk: Implemented (“Quick Wins”).
High ROI & High Cost/Risk: considered a Challenge.
Low ROI & High Cost/Risk: Killed.
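
The PICK quadrants can be expressed as a small decision function. The following is a minimal sketch, not a tool the group is known to use: the names Pick and classify are hypothetical, and the Kill quadrant (low payoff, hard to do) follows the usual reading of a PICK chart.

```python
from enum import Enum


class Pick(Enum):
    POSSIBLE = "Possible"    # low payoff, easy to do
    IMPLEMENT = "Implement"  # high payoff, easy to do ("Quick Wins")
    CHALLENGE = "Challenge"  # high payoff, hard to do
    KILL = "Kill"            # low payoff, hard to do (assumed from the acronym)


def classify(high_payoff: bool, hard_to_do: bool) -> Pick:
    """Map an idea's payoff and difficulty onto a PICK quadrant."""
    if high_payoff:
        return Pick.CHALLENGE if hard_to_do else Pick.IMPLEMENT
    return Pick.KILL if hard_to_do else Pick.POSSIBLE
```

For example, classify(high_payoff=True, hard_to_do=False) returns Pick.IMPLEMENT, the quick-win quadrant.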

Meetings

Topics

Meeting

Participants

Cadence

Common Links

General Channels

Team Channels

JIRA trackers

Resources

Cloud infrastructure cost KPIs

Cloud Cost Optimization Reports

Early each month, a cost report is prepared covering the spend of the previous month. This report takes its inputs from AWS monthly billing and from Zesty cost-per-AWS-account information. That information is incorporated into various sheets, from which diagrams are created and embedded in the monthly cost report.

Cloud data export process

This page has now moved to docs.mattermost.com. Please see the latest processes for migrating your Cloud data, including data imports and exports.

https://docs.mattermost.com/manage/cloud-data-export.html#migrate-from-cloud-to-self-hosted

Cloud churn process

When a Mattermost Cloud customer requests their workspace to be deleted, the following steps are taken:

1 - Customer sends a request to delete workspace.

2 - Support team asks for:

Cloud URL and workspace admin email
Churn-related information:
- Why are you no longer interested in using Mattermost?
- What tool/product will you use instead of Mattermost?
- Would anything have changed your decision?

3 - Support team opens a private Jira ticket with information from step 2 and tags SRE and Product Manager.

4 - SRE checks whether the workspace a) is paid or b) has 5 or more monthly active users:

If yes, ask Support to introduce the Product Manager to the customer via email. The Product Manager then owns trying to keep the customer, working with Customer Success as appropriate, or getting more detailed information on why they are discontinuing Mattermost.
Else, delete the workspace.
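
The decision in step 4 can be sketched in code. This is an illustration only: the Workspace type, its field names, and the returned action strings are hypothetical, and the real process is driven through the Jira ticket rather than automated.

```python
from dataclasses import dataclass

MAU_THRESHOLD = 5  # "5 or more monthly active users" from step 4


@dataclass
class Workspace:
    url: str
    is_paid: bool
    monthly_active_users: int


def churn_action(ws: Workspace) -> str:
    """Return the next action for a workspace deletion request."""
    if ws.is_paid or ws.monthly_active_users >= MAU_THRESHOLD:
        # Paid or actively used: the Product Manager follows up first.
        return "introduce_product_manager"
    return "delete_workspace"
```

A free workspace with 4 monthly active users is deleted directly, while a free workspace with 5 is routed to the Product Manager.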

Reliability Manifesto

The Reliability Manifesto is a collection of rules, guidelines, and best practices that reflect our current thinking on what it takes to build a reliable and cost-effective system following a cloud-first approach. The manifesto should stay short, clear, and easy to implement and monitor. It should be acknowledged by all Mattermost teams that use the Mattermost cloud services or develop code that will run in the production infrastructure.

To make things simple, we decided to split this document into a number of distinct categories that can be considered individual manifestos and can be seen below.

Foundation
Continuous Improvement
Design & Architecture
Observability
Performance & Cost Efficiency
Resilience
Security

Foundation

The rules that lay the foundation for this reliability manifesto.

F-1 We do not silently bypass the rules in this document, but instead, start a discussion to change the rules when we think something does not fit our situation.

F-2 This set of rules should be reflected as measurable targets in the quarterly OKRs of the teams that utilize the Infrastructure platform.

F-3 We assess the validity of the rules in this document twice a year and we adapt accordingly.

Continuous Improvement

There is always room for improvement in our knowledge and our systems. The continuous improvement manifesto aims to make us more efficient, resilient and better engineers.

C-1 We should track all our production incidents. Major and critical incidents are reported and updated using our status page. All incidents are tracked via an Infrastructure Incident Response v2.0 playbook run. More details here.

C-2 Post-mortem meetings and documents should follow every major and critical production incident. Runbooks, if any, should be added to the internal documentation for future reference.

C-3 No single engineer should fully deploy a new service across all environments. We share knowledge and reduce the team bus factor.

C-4 A blameless culture is key. People make mistakes and systems do fail. We need to learn from them and help each other.

Design & Architecture

Better design and architectures create more reliable systems and reduce technical debt. This is what the Design & Architecture manifesto is all about.

D-1 The Infrastructure team should be involved early in the design process for applications that will run on the infrastructure systems. Read about the

D-2 All new services should go through a POC phase, and the findings should be presented in a team meeting together with the proposal, design architecture and implementation steps. Example reference .

D-3 We should avoid monolithic designs for custom services when possible. Microservice design should be the aim and services should be open sourced for community usage when they don’t expose private company information.

D-4 All services should be accompanied by documentation and, when possible, architecture diagrams (e.g. Lucidchart).

D-5 Always be mindful that services behave differently at scale. Configurations that work in local environments could have a major impact on production systems, for example by using an excessive number of database connections.

D-6 All services should support being turned off and on on demand without affecting the rest of the infrastructure services.

Observability

The observability manifesto should cover three main pillars:

Logs
Metrics
Tracing

Logs

O-1 Logs should include meaningful messages; a stack trace should not be required to understand a log message.

O-2 Service logs are enriched with metadata & attributes that differentiate the service (e.g. service: ldap).

O-3 Logging levels should be consistent across services, and false errors should be avoided, to enable generation of metrics, patterns and dashboards.

O-4 Logging format structure should be consistent across services and should be agreed across all teams. String and JSON formats should be used for all logs.

O-5 Critical production system logs should be retained for 30 days and non-critical production system logs for 7 days. By exception, logs can be kept for up to 1 year in S3 utilizing Glacier options.
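
As an illustration of the logging rules above, here is a minimal sketch of JSON log lines that carry a service attribute. It uses only the Python standard library; the field names (timestamp, level, service, msg) and the helper build_logger are assumptions for the example, not an agreed Mattermost format.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object tagged with the service name."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,  # metadata that differentiates the service
            "msg": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines for the given service."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

Calling build_logger("ldap").warning("bind failed for user %s: invalid credentials", "alice") emits a single JSON line whose message is understandable without a stack trace.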

Metrics

O-6 All services should export key metrics under the /metrics path, enriched with metadata & attributes that differentiate the service (e.g. service: playbooks).

O-7 All application metrics should be exposed via port 8067.

O-8 Application and service software version metrics of all applications should be exposed.

O-9 The applications and services should expose availability and latency metrics.

O-10 Raw metrics should be kept for 7 days and 5m/1h resolution metrics should be kept for 365 days for all production systems.
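
A minimal sketch of a service exposing version, availability and latency metrics under /metrics on port 8067 might look like the following. It assumes the Prometheus text exposition format, which this document does not name. The metric names (app_build_info, app_up, app_request_latency_seconds), the sample values, and the SERVICE and VERSION constants are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

SERVICE = "playbooks"  # hypothetical service label value
VERSION = "1.2.3"      # hypothetical build version
METRICS_PORT = 8067    # port named in the metrics rules above


def render_metrics(up: bool, latency_seconds: float) -> str:
    """Render version, availability and latency metrics as plain text."""
    labels = f'service="{SERVICE}"'
    lines = [
        "# TYPE app_build_info gauge",
        f'app_build_info{{{labels},version="{VERSION}"}} 1',
        "# TYPE app_up gauge",
        f"app_up{{{labels}}} {1 if up else 0}",
        "# TYPE app_request_latency_seconds gauge",
        f"app_request_latency_seconds{{{labels}}} {latency_seconds}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Static sample values; a real service would report measured ones.
        body = render_metrics(up=True, latency_seconds=0.042).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = METRICS_PORT) -> None:
    """Serve /metrics until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. In practice a metrics client library would normally replace the hand-written rendering shown here.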

Performance & Cost Efficiency

Improving performance and cutting costs of our systems makes our services more competitive and efficient. This is what the Performance & Cost Efficiency manifesto is all about.

P-1 We avoid throwing money at the problem. If a system constantly needs more money to perform, we need to seek other solutions.

P-2 We utilize cost optimization at scale. The more customers we add, the better we should utilize our shared infrastructure, and the cost-per-customer KPI should decrease.

P-3 We should aim for a 2% increase in our spot machines each quarter, up to a maximum of 50% of the total fleet.

P-4 The cost of non-production environments should not exceed 30% of the total Mattermost infrastructure environments cost.
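
The numeric targets for spot machines and non-production cost can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it reads the 2% quarterly increase as two percentage points of the total fleet, which is an assumption.

```python
SPOT_STEP = 0.02     # quarterly increase, read here as percentage points
SPOT_CAP = 0.50      # maximum spot share of the total fleet
NON_PROD_CAP = 0.30  # maximum non-production share of total cost


def next_spot_target(current_share: float) -> float:
    """Return next quarter's target spot share, capped at SPOT_CAP."""
    return min(round(current_share + SPOT_STEP, 4), SPOT_CAP)


def non_prod_within_budget(non_prod_cost: float, total_cost: float) -> bool:
    """Check that non-production cost stays within its share of total cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return non_prod_cost / total_cost <= NON_PROD_CAP
```

With a current spot share of 30%, the next target is 32%. A fleet already at 49% is capped at 50%.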

Resilience

Resilience ensures that our services are there for our customers even on bad days. This is what this manifesto is about.

R-1 We should always aim for higher SLOs than the ones promised to our customers. The SRE team will advise other teams on what needs to be improved to meet those higher targets.

R-2 Each service should have a dedicated SLO target, which should always be higher than the external SLO.

R-3 We run monthly gamedays, injecting chaos into the system. The impact should be measured and post-gameday actions should be defined. The whole team should be involved.

R-4 Disaster recovery protocols and playbooks should be defined for all our key components (K8s Clusters, Database Clusters, DNS, etc.).
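
To make the relationship between internal and external SLOs concrete, the following sketch converts an availability SLO into the downtime it allows over a window. The function names and the percentages in the example are illustrative assumptions, not Mattermost's actual targets.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime budget implied by an availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def internal_slo_is_stricter(internal_percent: float, external_percent: float) -> bool:
    """Check that the internal target is higher than the external one."""
    return internal_percent > external_percent
```

For instance, an external SLO of 99.9% over 30 days allows roughly 43.2 minutes of downtime, while an internal target of 99.95% allows roughly 21.6 minutes, leaving a margin before the customer-facing promise is at risk.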

Security

Keeping our systems secure keeps our customers safe and our nights lighter. This is what the Security manifesto is all about.

S-1 We follow security team guidelines and we get security team approval on new system deployments.

S-2 We reduce our attack surface by not exposing to the public internet anything that should not be public.

S-3 Secrets should be stored in secret management services (e.g. Vault, AWS Secrets Manager) and not in GitLab/GitHub repos.

S-4 Security groups should be applied to all machines and should limit access to only the ports and sources needed. Both deployment and evaluation should be done in an automated way.

S-5 IAM roles should be used when possible, IAM users should be avoided, and IAM policies should grant least privilege.

S-6 We should always aim to eliminate high severity issues arising from security tooling analysis (e.g. StackRox).
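
Automated evaluation of security groups could start from a check like the one below. It is a sketch under stated assumptions: the input dictionaries mirror the shape of the EC2 DescribeSecurityGroups response (GroupId, IpPermissions, FromPort, ToPort, IpRanges, Ipv6Ranges), while the function name and the allowed_ports policy are hypothetical. No AWS call is made here.

```python
from typing import Iterable, List, Set

OPEN_CIDRS = {"0.0.0.0/0", "::/0"}  # sources meaning "the whole internet"


def open_ingress_findings(groups: Iterable[dict], allowed_ports: Set[int]) -> List[str]:
    """List ingress rules open to the internet on ports not explicitly allowed."""
    findings = []
    for group in groups:
        for perm in group.get("IpPermissions", []):
            cidrs = [r.get("CidrIp") for r in perm.get("IpRanges", [])]
            cidrs += [r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])]
            if not OPEN_CIDRS.intersection(cidrs):
                continue  # rule is restricted to specific sources
            from_port, to_port = perm.get("FromPort"), perm.get("ToPort")
            if from_port is None or to_port is None:
                # No port range means the rule covers all traffic.
                findings.append(f"{group['GroupId']}: all ports open to the internet")
            elif any(p not in allowed_ports for p in range(from_port, to_port + 1)):
                findings.append(
                    f"{group['GroupId']}: ports {from_port}-{to_port} open to the internet"
                )
    return findings
```

Run against a group that opens port 22 to 0.0.0.0/0 with allowed_ports={443}, the function reports that group and stays silent for a group that only opens 443.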

Production Readiness Review

The production readiness review (PRR) is a process that identifies the reliability needs of a service, a feature (mid to large only) or a significant change to infrastructure. The Mattermost PRR is inspired by the production readiness review described in the SRE book. A PRR is considered a prerequisite for the Infrastructure team to accept responsibility for managing the production aspects of a service, feature or infrastructure change.

The goal of the readiness review is to improve the following:

System architecture and interservice dependencies
Observability (metrics, monitoring, logging)
Incident Response
Capacity planning
Change management
Performance, availability, latency and efficiency
Security posture
Awareness and collaboration, with the goal of building a cloud-native product

This will help to increase collaboration between the Product, Security and Infrastructure teams in order to bridge any gaps around a new service, feature or infrastructure change. The review document is not intended to be constantly updated; it reflects a snapshot of what is being deployed and the discussions around it.

Process

The PRR process is initiated by the DRI of the related work with the following steps:

Run the Mattermost PRR Playbook
The title of the Playbook run should be self-descriptive for the change.
Include the relevant reviewers in the PRR Mattermost channel

Completion

Once all comments have been addressed and the reviewers are satisfied, they can check off the task items they own. When all the required approvals are in place, the playbook run needs to be closed.

Pro-tips approvals

When you are done with your review and ready to approve, you can just write the following in the channel:

SRE review is completed
Platform review is completed
etc.

This will automatically mark the task item list as done based on the current members of the engineering teams.

Library

You can find the Infrastructure library

Infrastructure Library

This is the library of the Design docs.
