Mattermost Handbook
Need help?How to spend company moneyHow to update the HandbookRelease overview
0.2.1
0.2.1
  • Mattermost Handbook
  • Company
    • About Mattermost
      • List of terms
      • Business model
      • Mindsets
    • "How to" guides for staff
      • How to set up a 1-1 channel
      • How to update the handbook
      • How to manage Handbook notifications
      • How to change mobile device
        • How to handle a lost mobile device
      • How to do a mini-retrospective
      • How to autolink keywords in Mattermost
  • Operations
    • Company operations
      • Areas of Responsibility
      • Mattermost Leadership Team (MLT)
        • MLT cadence
      • Company measures
        • Metrics definitions
        • FY23 goals board
        • MLT metrics
      • Company cadence
      • Company policies
        • Community response policy
        • Security policy
      • Company processes
        • Issue/solution process
        • Company agreements
        • Publishing
          • Public web properties
          • Publishing guidelines
            • Brand and visual design guidelines
            • Voice, tone, and writing style guidelines
              • Contribute to documentation
            • Confidentiality guidelines
          • Post-publication quality control process
      • Handbook processes and policies
        • Handbook onboarding
      • Fiscal year planning
    • Research and Development
      • Organization
        • Tech Writing
        • Data engineering
        • Delivery
        • Cloud Platform
        • Site Reliability Engineering
        • GRC
        • Product Security
        • Security Operations
      • Processes
        • Feature Labels
      • Product
        • Product planning
          • Product philosophy and principles
          • Prioritization process
          • Release planning process
          • Roadmap views
          • Release plan
          • Launch plan
          • Feature requests
        • Development process
          • Mobile feature guidelines
          • Deprecation policy
          • Mattermost software requirements process
          • Jira ticket lifecycle
          • Creating new Jira bug tickets
            • Priority levels for tickets
            • Jira fix versions
        • Release process
          • Release overview
          • Feature release process
          • Dot release process
          • Security release process
          • Mobile app release process
          • Desktop app release process
          • Release tips
          • Release scorecard definitions
        • How-to guides for Product
          • How to use productboard
          • How to record a roadmap video
          • How to update integrations directory
          • How to write a feature release announcement
        • Product Management team handbook
          • Product Management Areas of Ownership
          • Product Manager onboarding
          • Product Manager levels
          • Professional development
        • Product Design team handbook
          • Product Design levels
        • Technical Writing team handbook
          • Work with us
          • User interface text guidelines
          • Documentation style guide
          • Our terminology
          • Guidelines for PMs and developers
          • Guidelines for community contributions
          • Technical Writer levels
          • Docathon 2021
            • Getting started with contributing
        • Growth
          • A/B testing methodology
          • PQL definition
        • Analytics
          • Product Analyst Engineer levels
          • Looker
            • Dashboards
            • Explores
          • Telemetry
        • Developer relations
        • Product team hangouts
      • Engineering
        • Infrastructure engineering
          • Cloud infrastructure cost KPIs
          • Cloud data export process
          • Cloud churn process
          • Reliability Manifesto
          • Production Readiness Review
          • Infrastructure Library
        • Integrations team processes
        • Plugin release process
        • Data Engineering
        • Sustained Engineering
          • On call
        • How to go to a conference
        • Public speaking
        • Core contributor expanded access policy
      • Quality Assurance
        • QA workflow
        • QA testing tips and tools
        • Rainforest process
    • Messaging and Math
      • How-to guides for M&M
        • How to create release announcements
        • How to create screenshots and GIFs
        • How to write Mattermost case studies
        • How to write guest blog posts for Mattermost apps and services
        • How to write Mattermost recipes
        • How to compose tweets
        • How to create a split test for web page
        • How to run meetups
        • How to run executive dinners
      • Checklists for M&M
        • Blog post checklist
        • Bio checklist
      • Mattermost websites
      • Demand generation reporting
      • M&M Asana guidelines
      • Content marketing
        • How to use the editorial calendar
        • Content development and distribution
        • Video content guidelines
        • How to contribute content
    • Sales
      • Deal Desk
      • Partner programs
      • Lead management
    • Deployment Engineering
      • Overview
      • Workflows
      • Frequently Asked Questions
      • Playbook for MME Sev 1 Outages
      • Status Update Template
    • Program Management
    • Customer Success
      • Customer Support
    • Legal
      • Contracts
      • Ironclad Basics
        • Company-Wide Workflows
        • Sales Contracts and Workflows
        • Signing a Contract and Contract Repository
    • Finance
      • Budget
      • How to use Airbase
        • Access Airbase
        • Navigate Airbase
        • How to submit a purchase request
        • How to submit a reimbursement request
        • How to review a reimbursement request
        • Vendor portal guide
        • Frequently asked questions
      • Onboarding
        • Vendor onboarding
        • ROW staff onboarding
      • Staff member expenses
        • How to spend company money
        • How to spend company money: Internships
        • Corporate credit card policy
        • How to access Airbase
        • Gifting policy
        • How to book airfare and travel
        • How to reimburse the company
        • How to convert currencies
        • How to get paid
      • Arrange a Bounty Program
      • Naming files and agreements
      • Risk management
        • Mattermost U.S. consulting agreements
      • Operations playbook
    • Security
      • Policies
      • Privacy
        • Data deletion requests
        • Data subject access requests
      • Product Security
        • Product Vulnerability Process
        • Working on security-sensitive pull requests
        • Secure Software Development guide
      • Security Operations
        • User guides
    • Workplace
      • PeopleOps
        • HR cadences
        • HR systems
        • HR Processes
        • Working at Mattermost
          • Onboarding
            • Things everyone must know
            • Staff onboarding
            • Engineer onboarding timeline and expectations
            • Manager onboarding
            • Frequently asked questions
          • Learning and development
          • Mattermost communication best practices
          • Paid time off
            • Out of office email example
          • Travel
            • Business travel insurance
          • Leaves of absence
            • Pregnancy leave
            • Baby bonding parental leave
            • Jury duty
          • Workplace program
          • Relocation
          • Total rewards
        • Performance reviews
          • Formal review process
          • New staff performance review
          • Informal review process
        • Transfers and promotions
        • Offboarding instructions for managers
        • People compliance
      • People policies
      • Groups
        • Staff Resource Groups
      • Approvals and iteration
      • IT
        • IT helpdesk
        • Hardware and software purchases
        • Hardware buy back policy
        • Software systems
  • Contributors
    • Contributors
      • Equity, diversity, and inclusion
      • How to contribute to Mattermost
        • Community Content program
        • Documentation contributions
        • Help Wanted tickets
        • Localization
        • Contribution events
      • Mattermost community
      • Contributor kindness
      • Community systems
      • Guidelines and playbooks
        • Social engagement guidelines
        • Contribution guidelines and code of conduct
        • Mattermost Community playbook
        • How to run a Hackathon
        • Hacktoberfest event organizer guide for Mattermost
    • MatterCon
      • Staff information privacy management
      • Mattermost events code of conduct
      • MatterCon2021
    • Join us
      • Ice-breakers
      • Help Wanted tickets
      • Localization
      • Mattermost GitHub sponsorship
      • Things candidates should know
      • Staff recruiting
      • Recruiting cadences
        • Product Manager hiring process
      • Exec recruiting
        • EA logistics
  • Help and support
    • Contact us
Powered by GitBook
On this page
  • Foundation
  • Continuous Improvement
  • Design & Architecture
  • Observability
  • Logs
  • Metrics
  • Performance & Cost Efficiency
  • Resilience
  • Security

Was this helpful?

Edit on Git
Export as PDF
  1. Operations
  2. Research and Development
  3. Engineering
  4. Infrastructure engineering

Reliability Manifesto

The Reliability Manifesto is a collection of rules, guidelines, and best practices that reflect our current thinking on what it takes to build a reliable and cost-effective system following a cloud first approach. This manifesto should keep things short, clear and easy to implement and monitor. This document should be acknowledged by all Mattermost teams that use the Mattermost cloud services and that develop code that will run in the production infrastructure.

To make things simple, we decided to split this document into a number of distinct categories that can be considered individual manifestos and can be seen below.

  • Foundation

  • Continuous Improvement

  • Design & Architecture

  • Observability

  • Performance & Cost Efficiency

  • Resilience

  • Security

Foundation

The set of rules that should set the foundation for this reliability manifesto.

F-1 We do not silently bypass the rules in this document, but instead, start a discussion to change the rules when we think something does not fit our situation.

F-2 This set of rules should reflect measurable targets in the quarter OKRs of the teams that utilize the Infrastructure platform.

F-3 We assess the validity of the rules in this document twice a year and we adapt accordingly.

Continuous Improvement

There is always room for improvement in our knowledge and our systems. The continuous improvement manifesto aims to make us more efficient, resilient and better engineers.

C-1 We should track all our production incidents. Major and critical incidents are reported and updated using our statuspage. All incidents are tracked via a Infrastructure Incident Response v2.0 playbook run. More details here.

C-2 Post mortem meetings and documents should follow every major and critical production incident. Runbooks if any should be added in the internal documentation for future reference.

C-3 No single engineer should fully deploy a new service across all environments. We share knowledge and reduce the team bus factor.

C-4 Blameless culture is a key. People make mistakes and systems do fail. We need to learn from them and help each other.

Design & Architecture

Better design and architectures create more reliable systems and reduce technical debt. This is what the Design & Architecture manifesto is all about.

D-3 We should avoid monolithic designs for custom services when possible. Microservice design should be the aim and services should be open sourced for community usage when they don’t expose private company information.

D-4 All services should be followed by architecture diagrams (e.g. Lucidchart) when possible and documentation.

D-5 Always be mindful that services work in a different way at scale. Configurations that might work in local environments could have a major impact in production systems. For example, a possible abuse of the amount of database connections used.

D-6 All services should support the ability to turn off/on on demand without affecting the rest of the infrastructure services.

Observability

The observability manifesto should cover three main pillars:

  • Logs

  • Metrics

  • Tracing

Logs

O-1 Logs should include meaningful messages and stack trace should not be a requirement to understand the log message.

O-2 Service logs are enriched with metadata & attributes that differentiate the service (e.g service: ldap).

O-3 Logging levels should be consistent across services and false errors should be avoided to enable generation of metrics, patterns and dashboards.

O-4 Logging format structure should be consistent across services and should be agreed across all teams. String and JSON formats should be used for all logs.

O-5 Retention of 30 days for all critical and 7 days of all non-critical production system logs should be met. With exceptions, logs can be kept up to 1 year in S3 utilizing Glacier options.

Metrics

O-6 All services should export key metrics under /metrics path and should be enriched with metadata & attributes that differentiate the service (e.g service: playbooks).

O-7 All the application metrics should be exposed via 8067 port.

O-8 Application and service software version metrics of all applications should be exposed.

O- The application and services should expose availability and latency metrics.

O-10 Raw metrics should be kept for 7 days and 5m/1h resolution metrics should be kept for 365 days for all production systems.

Performance & Cost Efficiency

Improving performance and cutting costs of our systems makes our services more competitive and efficient. This is what the Performance & Cost Efficiency manifesto is all about.

P-1 We avoid the solution of throwing money to the problem. If a system constantly needs money to perform, we need to seek for other solutions.

P-2 We utilize cost optimization at scale. The more customers we add, the better we should utilize our shared infrastructure and cost per customer KPI should be reduced.

P-3 We should aim for a 2% increase in our spot machines in each quarter with a maximum of 50% of the total fleet.

P-4 The cost of non production environments should not exceed 30% of the total Mattermost infrastructure environments cost.

Resilience

Resilience ensures that our services are there for our customers even on bad days. This is what this manifesto is about.

R-1 We should always aim for higher SLOs than the ones promised to our customers. The SRE team would influence other teams on what needs to be improved to meet higher SLAs.

R-2 Each service should have a dedicated SLO target, which should always be higher than the external SLO.

R-3 We run monthly gamedays, injecting chaos to the system. The impact should be measured and post game day actions should be defined. The whole team should be involved.

R-4 Disaster recovery protocols and playbooks should be defined for all our key components (K8s Clusters, Database Clusters, DNS, etc.)

Security

Keeping our systems secure, keeps our customers safe and our nights lighter. This is what the Security manifesto is all about.

S-1 We follow security team guidelines and we get security team approval on new system deployments.

S-2 We reduce our attack surface by not exposing to the public internet whatever should not be public.

S-3 Secrets should be stored in secret management services (eg. Vault, AWS Secret Manager) and not in Gitlab/Github repos.

S-4 Security groups should be applied in all machines and limit access only to ports and sources needed. Both deployment and evaluation should be done in an automated way.

S-5 IAM Roles should be used when possible, IAM users should be avoided and IAM policies should be granting least privilege.

S-6 We should always aim to eliminate high severity issues arising from security tooling analysis (e.g Stackrox).

PreviousCloud churn processNextProduction Readiness Review

Last updated 1 year ago

Was this helpful?

D-1 Infrastructure team should be involved early in the design process for applications that will run in the infrastructure systems. Read about the

D-2 All new services should go through the POC phase and findings should be presented in a team meeting, together with proposal, design architecture and implementation steps. Example reference .

Production Readiness Review
here