Cloud infrastructure cost KPIs

Cloud Cost Optimization Reports

Early each month, a cost report is prepared with the spend of the previous month. The report takes its inputs from AWS monthly billing and from Zesty's cost-per-AWS-account information. That information is incorporated into various sheets, from which diagrams are created and embedded in the monthly cost report.

The reports - presentations

The Cloud Cost Optimization Reports can be found in Google Slides. A new one is created each month, for example the one for January 2024.

Sheets and tools to create the reports

  • The sheet where all the information is combined and then populated into the cost reports.

  • The sheet from which the older cost reports were populated; it is now used for the average cost per workspace.

Cloud Infrastructure Cost KPIs

The Cloud SRE team uses certain KPIs to monitor cloud infrastructure costs from the engineering standpoint. These KPIs are used to help the business and the Cloud SRE team to set goals for cost optimization.

As of July 2021, the goals for the SRE team are to:

  • Decrease the Average Production Workspace Cost (APWC).

  • Keep development and test environment costs (DM) below 20% of total costs.

  • Increase the WB fraction (ratio of paid workspaces relative to total workspaces).

Below are the primary and secondary KPIs used to measure each.

Primary KPIs

Average Production Workspace Cost

  • Description: The fraction which compares the amount of Production Environment costs with the number of workspaces which are active in a given month of the Cloud Infrastructure service. Production Environment will be defined as the environment where we host all our customers, including paid and free workspaces.

  • Formula: APWC = Production Environment Costs/Number of Workspaces

Average Production Freemium Cost

  • Description: An approximation of what free workspaces cost us. The fraction compares the amount of Production Environment costs with the number of active free workspaces.

  • Formula: APFC = Production Environment Costs/Freemium Active Workspaces

    Note: Until February 2021 we measured all workspaces as active. Hibernation functionality was introduced in the last three days of February 2021.

Average Subscription Workspace Cost

  • Description: The fraction which compares the amount of Production Environment costs with the number of workspaces that have 11 or more users within a given month of the Cloud Infrastructure service. A paid workspace is defined as successful receipt of payment for the time period (e.g. payment for the previous month completed, when measuring ASWC for the previous month).

  • Formula: ASWC = Production Environment Costs/Paid Workspaces

Development Metric

  • Description: The fraction which compares the amount we spend on environments related to our features testing with the total Cloud infrastructure costs.

  • Formula: DM = (Test Environment Costs + Dev Environment Costs)/Total AWS Cloud Costs

Customers Platform Balance Metric

  • Description: The fraction which compares the variable costs we have because we run the Cloud product with the costs that are directly correlated with customers' usage (both paid and freemium).

  • Formula: CPBM = Customer Specific Costs/Variable Platform Costs
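The primary KPI formulas above are simple ratios, so they can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MonthlyInputs` class, its field names, and the sample figures are hypothetical and do not come from the actual cost sheets. Returning `None` for a zero denominator is also an assumption, made so that a month with no paid workspaces does not raise an error.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MonthlyInputs:
    """Hypothetical monthly inputs; field names are illustrative only."""
    production_costs: float          # Production Environment Costs
    test_costs: float                # Test Environment Costs
    dev_costs: float                 # Dev Environment Costs
    total_aws_costs: float           # Total AWS Cloud Costs
    customer_specific_costs: float   # Customer Specific Costs
    variable_platform_costs: float   # Variable Platform Costs
    workspaces: int                  # Number of Workspaces
    freemium_active_workspaces: int  # Freemium Active Workspaces
    paid_workspaces: int             # Paid Workspaces


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def primary_kpis(m: MonthlyInputs) -> Dict[str, Optional[float]]:
    """Compute the primary KPIs exactly as written in the formulas above."""
    return {
        "APWC": ratio(m.production_costs, m.workspaces),
        "APFC": ratio(m.production_costs, m.freemium_active_workspaces),
        "ASWC": ratio(m.production_costs, m.paid_workspaces),
        "DM": ratio(m.test_costs + m.dev_costs, m.total_aws_costs),
        "CPBM": ratio(m.customer_specific_costs, m.variable_platform_costs),
    }


# Made-up sample month, for illustration only.
sample = MonthlyInputs(
    production_costs=50_000, test_costs=6_000, dev_costs=4_000,
    total_aws_costs=80_000, customer_specific_costs=20_000,
    variable_platform_costs=40_000, workspaces=10_000,
    freemium_active_workspaces=8_000, paid_workspaces=500,
)
kpis = primary_kpis(sample)
# DM is 0.125 for this sample, which is below the 20% goal.
dm_goal_met = kpis["DM"] is not None and kpis["DM"] < 0.20
```

Keeping each KPI as a plain ratio of named inputs makes it easy to compare the code with the formulas line by line.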

Secondary KPIs

Workspaces Balance Metric

  • Description: The fraction which compares the paid workspaces with the total number of workspaces within a given month of the Cloud Infrastructure service.

  • Formula: WB = Paid Workspaces/Total Workspaces

Secondary Average Production Workspace Cost

  • Description: The fraction which compares the amount of Production Environment and Core Environment costs with the number of workspaces within a given month of the Cloud Infrastructure service.

  • Formula: SAPWC = (Production Environment Costs + Core Environment Costs)/Workspaces

Secondary Average Subscription Workspace Cost

  • Description: The fraction which compares the amount of Production Environment and Core Environment costs with the number of workspaces that have 11 or more users within a given month of the Cloud Infrastructure service.

  • Formula: SASWC = (Production Environment Costs + Core Environment Costs)/Paid Workspaces

Active Average Production Workspace Cost

  • Description: The fraction which compares the amount of Production Environment costs with the number of active workspaces within a given month of the Cloud Infrastructure service.

  • Formula: AAPWC = Production Environment Costs/Active Workspaces

Secondary Active Average Subscription Workspace Cost

  • Description: The fraction which compares the amount of Production Environment and Core Environment costs with the number of active workspaces within a given month of the Cloud Infrastructure service.

  • Formula: SAASWC = (Production Environment Costs + Core Environment Costs)/Active Workspaces

Average Total Production Freemium Cost

  • Description: Roughly how much free workspaces cost us. The fraction which compares the amount of Production Environment costs with the number of all free workspaces (active and inactive).

  • Formula: ATPFC = Production Environment Costs/Freemium Workspaces (Active and Inactive)

Non Production Environment Cost

  • Description: The fraction which compares the amount we spend on all non-production environments with the total AWS Cloud costs.

  • Formula: NPEC = (Test Environment Costs + Dev Environment Costs + Core Environment Costs + Staging Environment Costs)/Total AWS Cloud Costs

Active Hibernated Workspaces Ratio

  • Description: The ratio of active to hibernated workspaces in a given month.

  • Formula: AHR = Active Workspaces/Hibernated Workspaces

Note: Until February 2021 we measured all workspaces as active. Hibernation functionality was introduced in the last three days of February 2021.
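A few of the secondary KPIs can be sketched the same way. The function names and sample numbers below are hypothetical. One point worth illustrating is the note above: for months before hibernation existed, the hibernated count is zero and AHR has no defined value, so this sketch returns `None` in that case (an assumption, not a documented convention).

```python
from typing import Optional


def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Divide, returning None instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else None


def workspaces_balance(paid: int, total: int) -> Optional[float]:
    """WB = Paid Workspaces / Total Workspaces."""
    return safe_ratio(paid, total)


def non_production_share(test: float, dev: float, core: float,
                         staging: float, total_aws: float) -> Optional[float]:
    """NPEC = (Test + Dev + Core + Staging costs) / Total AWS Cloud Costs."""
    return safe_ratio(test + dev + core + staging, total_aws)


def active_hibernated_ratio(active: int, hibernated: int) -> Optional[float]:
    """AHR = Active Workspaces / Hibernated Workspaces.

    Returns None when there are no hibernated workspaces, for example for
    months in which every workspace was counted as active.
    """
    return safe_ratio(active, hibernated)
```

For example, `active_hibernated_ratio(900, 300)` gives `3.0`, while `active_hibernated_ratio(1200, 0)` gives `None` rather than a division error.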

  • Shows the cost analysis in Cloud Production per type, such as S3, ELB, VPCs, NAT, etc.

  • How much each

  • Shows the

  • here
    Cloud Cost Optimization Report #37
    FY24 Cost tracking AWS accounts
    Zesty's report with the savings per account
    Monthly costs per environment
    Cost analysis by type
    Cluster costs in Production
    AWS Cost Projection

    Cloud data export process

    This page has now moved to docs.mattermost.com. Please see the latest processes for migrating your Cloud data, including data imports and exports.

    https://docs.mattermost.com/manage/cloud-data-export.html#migrate-from-cloud-to-self-hosted

    Cloud churn process

    When a Mattermost Cloud customer requests their workspace to be deleted, the following steps are taken:

    1 - Customer sends a request to delete workspace.

    2 - Support team asks for:

    • Cloud URL and workspace admin email

    • Churn-related information:

      • Why are you no longer interested in using Mattermost?

      • What tool/product will you use instead of Mattermost?

      • Would anything have changed your decision?

    3 - Support team opens a private Jira ticket with information from step 2 and tags SRE and Product Manager.

    4 - SRE checks if workspace is a) paid or b) has 5 or more monthly active users:

    • If yes, ask Support to introduce the Product Manager to the customer via email. The Product Manager then owns trying to keep the customer, working with Customer Success as appropriate, or getting more detailed information on why they are discontinuing Mattermost.

    • Else, delete workspace.
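The check in step 4 is a small decision rule, sketched below. The `Workspace` class, its field names, and the returned action strings are hypothetical; only the rule itself (paid, or 5 or more monthly active users) comes from the steps above.

```python
from dataclasses import dataclass

# Threshold taken from step 4: 5 or more monthly active users.
MAU_THRESHOLD = 5


@dataclass
class Workspace:
    """Hypothetical view of a workspace at the time of a deletion request."""
    is_paid: bool
    monthly_active_users: int


def churn_action(ws: Workspace) -> str:
    """Return the next action for a deletion request, following step 4."""
    if ws.is_paid or ws.monthly_active_users >= MAU_THRESHOLD:
        # Support introduces the Product Manager to the customer via email.
        return "escalate_to_product_manager"
    return "delete_workspace"
```

A free workspace with 3 monthly active users would be deleted directly, while a free workspace with 5 users, or any paid workspace, is routed to the Product Manager first.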

    Infrastructure Library

    This is the library of the Design docs.

    • Self-serve GitOps

    • Crossplane

    • Self-serve Observability

    • Self-serve cost observability

    • Data Engineering Infra

    • Fault-tolerant service databases

    • SLOs automation

    • Disaster Recovery

    • Unified logging platform, Loki

    • Spot Instances

    • PGBouncer fine-tuning

    • Chimera OAuth2 proxy

    • Provisioner multi-domains support

    • Provisioner events

    • CI in Forks

    • Unified CI

    • Unified CI - Self-service platform

    • Service Environment

    • Migration from MySQL Aurora to Aurora Postgres with AWS DMS

    Production Readiness Review

    The production readiness review (PRR) is a process that identifies the reliability needs of a service, a feature (mid to large only) or a significant change to infrastructure. The Mattermost PRR is inspired by the production readiness review from the SRE book. A PRR is considered a prerequisite for the Infrastructure team to accept responsibility for managing the production aspects of a service, feature or infrastructure change.

    The goals of the readiness review are to improve the following:

    • System architecture and interservice dependencies

    • Observability (metrics, monitoring, logging)

    • Incident Response

    • Capacity planning

    • Change management

    • Performance, availability, latency and efficiency

    • Security posture

    • Increase awareness and boost collaboration with a goal to build a cloud-native product

    This will help to increase the collaboration between Product, Security and Infrastructure teams in order to bridge any gaps about a new service, feature or infrastructure change. The review document is not intended to be constantly updated; it reflects a snapshot of what is being deployed and the discussions around it.

    Process

    The PRR process is initiated by the DRI of the related work with the following steps:

    1. Run the Mattermost PRR Playbook here.

    2. The title of the Playbook run should be self-descriptive for the change.

    3. Include the relevant reviewers in the PRR Mattermost channel.

    4. Use the checklist in the Playbook, which will guide you through completing the review.

    Completion

    Once all comments have been addressed and the reviewers are satisfied, they can check off the task items they own. When all required approvals are in place, the playbook run should be closed.

    Pro-tips approvals

    The PRR playbook uses Playbooks task actions, so when you are done with your review and ready to approve, you can simply write one of the following in the channel:

    • SRE review is completed

    • Platform review is completed

    • etc.

    This automatically marks the corresponding task item as done, based on the current members of the engineering teams.

    Library

    You can find the Infrastructure library here.

    Infrastructure engineering

    Mission

    The Infrastructure group empowers Mattermost to provide a SaaS Platform as a Product which serves internal and external users by guaranteeing that we operate an enterprise-grade SaaS platform with self-serve powers.

    The Infrastructure group achieves this by focusing on quality, availability, reliability, scalability, and security objectives. In addition, we prioritize cost efficiency and awareness by adopting a FinOps culture, which is strengthened by appropriately prioritized dogfooding initiatives.

    Many other teams also contribute to the success of the SaaS Platform as a Product. The responsibility of the Infrastructure group is to evolve the SaaS platform, enabled by platform observability data.

    Vision

    Operate a fast, secure and reliable SaaS platform to which everyone can contribute.

    Teams

    Principles

    • Be open and data driven

    • Use our own product to complete our mission

    • Align our strategy with the industry trends, company direction and customer needs

    Design

    The Infrastructure group uses RFCs - requests for comment - or Design Docs as a common tool to describe the problem we are solving and represent the current state for any topic.

    Dogfooding

    The Infrastructure group uses Mattermost features as a core tool for the following:

    • Secure collaboration with Channels

    • Incident Management using Playbooks

    • Release and QA approval process using Playbooks

    Having the above in mind, everyone is recommended to:

    • Feel comfortable to contribute to Mattermost open source projects

    • Sharing is caring and everyone should have the mindset to open source a project in order to give back to the community

    • Use our product to achieve our goals and dogfood

    Prioritisation

    As a group we believe that predictability is an important piece of our DNA, so we aim for predictability first and we believe that speed will follow. High impact and customer obsession are some of the leadership principles in Mattermost which are core factors in our success and in how we prioritize work items and ideas, using the PICK chart as a prioritization framework. The acronym PICK stands for Possible, Implement, Challenge and Kill.

    • Possible: Low payoff, easy to do

    • Implement: High payoff, easy to do

    • Challenge: High payoff, hard to do

    Any ideas that are:

    • Low ROI & Low Cost/Risk is considered as a Possibility.

    • High ROI & Low Cost/Risk is Implemented. (“Quick Wins”)

    • High ROI & High Cost/Risk is considered as a Challenge
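The PICK quadrants can be read as a two-by-two lookup on payoff (ROI) and effort (cost/risk). The sketch below assumes each idea has already been judged as high or low on both axes; the function name and boolean inputs are hypothetical simplifications. The Kill quadrant (low payoff, hard to do) completes the four categories named in the acronym.

```python
def pick_category(high_payoff: bool, easy_to_do: bool) -> str:
    """Map an idea's payoff and effort onto a PICK quadrant.

    high_payoff: True for high ROI, False for low ROI.
    easy_to_do:  True for low cost/risk, False for high cost/risk.
    """
    if high_payoff:
        # High ROI: quick win when easy, otherwise a Challenge.
        return "Implement" if easy_to_do else "Challenge"
    # Low ROI: a Possibility when easy, otherwise Kill.
    return "Possible" if easy_to_do else "Kill"
```

In practice the judgment of "high" versus "low" is the hard part; the lookup only makes the resulting category explicit and consistent.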

    Meetings

    Topics
    Meeting
    Participants
    Cadence

    Common Links

    General Channels

    Team Channels

    JIRA trackers

    Resources
    Resources

    Reliability Manifesto

    The Reliability Manifesto is a collection of rules, guidelines, and best practices that reflect our current thinking on what it takes to build a reliable and cost-effective system following a cloud first approach. This manifesto should keep things short, clear and easy to implement and monitor. This document should be acknowledged by all Mattermost teams that use the Mattermost cloud services and that develop code that will run in the production infrastructure.

    To make things simple, we decided to split this document into a number of distinct categories that can be considered individual manifestos and can be seen below.

    • Foundation

    • Continuous Improvement

    • Design & Architecture

    • Observability

    • Performance & Cost Efficiency

    • Resilience

    • Security

    Foundation

    The set of rules that should set the foundation for this reliability manifesto.

    F-1 We do not silently bypass the rules in this document, but instead, start a discussion to change the rules when we think something does not fit our situation.

    F-2 This set of rules should be reflected as measurable targets in the quarterly OKRs of the teams that utilize the Infrastructure platform.

    F-3 We assess the validity of the rules in this document twice a year and we adapt accordingly.

    Continuous Improvement

    There is always room for improvement in our knowledge and our systems. The continuous improvement manifesto aims to make us more efficient, resilient and better engineers.

    C-1 We should track all our production incidents. Major and critical incidents are reported and updated using our statuspage. All incidents are tracked via an Infrastructure Incident Response v2.0 playbook run. More details here.

    C-2 Post mortem meetings and documents should follow every major and critical production incident. Runbooks, if any, should be added to the internal documentation for future reference.

    C-3 No single engineer should fully deploy a new service across all environments. We share knowledge and reduce the team bus factor.

    C-4 A blameless culture is key. People make mistakes and systems do fail. We need to learn from them and help each other.

    Design & Architecture

    Better design and architectures create more reliable systems and reduce technical debt. This is what the Design & Architecture manifesto is all about.

    D-1 The Infrastructure team should be involved early in the design process for applications that will run in the infrastructure systems. Read about the Production Readiness Review.

    D-2 All new services should go through the POC phase and findings should be presented in a team meeting, together with the proposal, design architecture and implementation steps. Example reference here.

    D-3 We should avoid monolithic designs for custom services when possible. Microservice design should be the aim and services should be open sourced for community usage when they don’t expose private company information.

    D-4 All services should be accompanied by documentation and, when possible, architecture diagrams (e.g. Lucidchart).

    D-5 Always be mindful that services work in a different way at scale. Configurations that might work in local environments could have a major impact in production systems, for example by using an excessive number of database connections.

    D-6 All services should support being turned off and on on demand without affecting the rest of the infrastructure services.

    Observability

    The observability manifesto should cover three main pillars:

    • Logs

    • Metrics

    • Tracing

    Logs

    O-1 Logs should include meaningful messages, and a stack trace should not be required to understand a log message.

    O-2 Service logs are enriched with metadata and attributes that differentiate the service (e.g. service: ldap).

    O-3 Logging levels should be consistent across services and false errors should be avoided to enable generation of metrics, patterns and dashboards.

    O-4 Logging format structure should be consistent across services and should be agreed across all teams. String and JSON formats should be used for all logs.

    O-5 Production system logs should be retained for 30 days if critical and for 7 days if non-critical. As an exception, logs can be kept for up to 1 year in S3 using Glacier options.
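As an illustration of O-1, O-2 and O-4, the sketch below emits JSON log lines that carry a `service` attribute, using only the Python standard library. The field names (`timestamp`, `level`, `service`, `msg`) and the `JsonFormatter` and `build_logger` helpers are assumptions for this example; the actual agreed format is whatever the teams settle on under O-4.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format log records as one JSON object per line (illustrative fields)."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,       # O-2: attribute identifying the service
            "msg": record.getMessage(),    # O-1: message readable on its own
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines tagged with the service name."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger
```

Calling `build_logger("ldap").error("bind failed for %s", "user-1")` would write a single JSON line whose `service` field is `ldap` and whose `msg` field is `bind failed for user-1`.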

    Metrics

    O-6 All services should export key metrics under the /metrics path, and the metrics should be enriched with metadata and attributes that differentiate the service (e.g. service: playbooks).

    O-7 All application metrics should be exposed on port 8067.

    O-8 Application and service software version metrics should be exposed for all applications.

    O-9 Applications and services should expose availability and latency metrics.

    O-10 Raw metrics should be kept for 7 days and 5m/1h resolution metrics should be kept for 365 days for all production systems.
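A minimal sketch of O-6 to O-9 follows, using only the Python standard library. It assumes a Prometheus-style text format, which the rules above do not name explicitly; the metric names (`app_build_info`, `app_up`, `app_request_latency_seconds`), the version string, and the latency value are all hypothetical placeholders.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS_PORT = 8067          # O-7: application metrics port
SERVICE = "playbooks"        # O-6: example service label
VERSION = "0.0.0-example"    # placeholder version, not a real release


def render_metrics(service: str, version: str, up: bool,
                   latency_seconds: float) -> str:
    """Render version, availability and latency metrics as plain text."""
    labels = f'service="{service}"'
    lines = [
        f'app_build_info{{{labels},version="{version}"}} 1',            # O-8
        f'app_up{{{labels}}} {1 if up else 0}',                          # O-9
        f'app_request_latency_seconds{{{labels}}} {latency_seconds}',    # O-9
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on the /metrics path only."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SERVICE, VERSION, True, 0.042).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve() -> None:
    """Start the metrics endpoint; blocks until interrupted."""
    HTTPServer(("", METRICS_PORT), MetricsHandler).serve_forever()
```

Calling `serve()` would expose the endpoint at port 8067 under `/metrics`. A real service would more likely use its language's metrics client library; the hand-rendered text here is only meant to show which labels and metrics the rules ask for.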

    Performance & Cost Efficiency

    Improving performance and cutting costs of our systems makes our services more competitive and efficient. This is what the Performance & Cost Efficiency manifesto is all about.

    P-1 We avoid throwing money at the problem. If a system constantly needs more money to perform, we need to seek other solutions.

    P-2 We utilize cost optimization at scale. The more customers we add, the better we should utilize our shared infrastructure, and the cost per customer KPI should decrease.

    P-3 We should aim for a 2% increase in our spot machines in each quarter with a maximum of 50% of the total fleet.

    P-4 The cost of non-production environments should not exceed 30% of the total Mattermost infrastructure environments cost.
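The arithmetic behind P-3 and P-4 can be sketched as follows. This assumes the 2% in P-3 means two percentage points of the total fleet per quarter, which is one possible reading of the rule; the function names and sample values are hypothetical.

```python
from typing import List


def spot_share_targets(current_share: float, quarters: int,
                       step: float = 0.02, cap: float = 0.50) -> List[float]:
    """Project quarterly spot-fleet share targets under P-3.

    Adds `step` to the share each quarter and never exceeds `cap`.
    Shares are fractions of the total fleet (0.45 means 45%).
    """
    targets = []
    share = current_share
    for _ in range(quarters):
        share = min(cap, share + step)
        targets.append(round(share, 4))
    return targets


def non_production_within_limit(non_prod_cost: float, total_cost: float,
                                limit: float = 0.30) -> bool:
    """Check P-4: non-production cost must not exceed `limit` of total cost."""
    return non_prod_cost / total_cost <= limit
```

Starting from a 45% spot share, `spot_share_targets(0.45, 4)` yields `[0.47, 0.49, 0.5, 0.5]`: the target grows by two points per quarter and then stays at the 50% ceiling.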

    Resilience

    Resilience ensures that our services are there for our customers even on bad days. This is what this manifesto is about.

    R-1 We should always aim for higher SLOs than the ones promised to our customers. The SRE team would influence other teams on what needs to be improved to meet higher SLAs.

    R-2 Each service should have a dedicated SLO target, which should always be higher than the external SLO.

    R-3 We run monthly gamedays, injecting chaos into the system. The impact should be measured and post-gameday actions should be defined. The whole team should be involved.

    R-4 Disaster recovery protocols and playbooks should be defined for all our key components (K8s Clusters, Database Clusters, DNS, etc.)

    Security

    Keeping our systems secure keeps our customers safe and our nights lighter. This is what the Security manifesto is all about.

    S-1 We follow security team guidelines and we get security team approval on new system deployments.

    S-2 We reduce our attack surface by not exposing to the public internet whatever should not be public.

    S-3 Secrets should be stored in secret management services (e.g. Vault, AWS Secrets Manager) and not in GitLab/GitHub repos.

    S-4 Security groups should be applied to all machines and should limit access to only the ports and sources needed. Both deployment and evaluation should be done in an automated way.

    S-5 IAM roles should be used when possible, IAM users should be avoided, and IAM policies should grant least privilege.

    S-6 We should always aim to eliminate high severity issues arising from security tooling analysis (e.g. StackRox).

    Influence and educate best practices
    Self-serve powers using Plugins and Slash Commands
  • Community as internal release ring for testing potential releases

  • It’s part of Leads’ responsibility to influence this culture
    Kill: Low payoff, hard to do
    Low ROI & High Cost/Risk is Killed/Next year

    Leadership, Infrastructure, Product, Security

    Thursday

    Infrastructure Library
  • Incident Response Playbook

  • Incident Review & Knowledge share

    Reliability Engineering Guild

    Leadership, SRE, Delivery

    Monday

    Cross-org collaboration

    SRE
    Delivery
    Platform
    leadership principles
    PICK chart
    Infrastructure Engineering calendar
    Incident Management Framework
    Status page
    ~infrastructure
    ~cloud-support
    ~infrastructure-delivery-team
    ~infrastructure-sre-team
    ~infrastructure-platform-team
    Delivery
    SRE
    Platform
    On-boarding Playbook
    Reliability Manifesto
    Production Readiness Reviews

    Infrastructure Guild