Teams choose Mattermost because it provides a flexible and secure environment for collaboration. This is particularly valuable during incidents that require timely resolution, that's why many of our customers today leverage Mattermost as a key part of their response workflow. This document summarizes our observation and vision for how Mattermost can help your team be more proactive and effective.
For software operators, the most important incidents tend to be related to site reliability, information security, or development and operations. Regardless of the domain, the response protocol tend to have four broad stages in common: preparation, detection, recovery, summary.
By integrating with your favourite monitoring tools such as Prometheus and Nagios, information flows into Mattermost about potential incidents. Incident Managers can then be notified to review and escalate as appropriate based on standard operating procedure.
Our Incident Response plug-in can be configured to conditionally escalate alerts and automatically start a workflow - typically by first creating a dedicated channel.
The workflow plug-in can be customized to streamline your standard operating procedure, for example:
Spin up a Docker instance to initiate investigation infrastructure
Start real-time audio/video meeting using Zoom
Configure integrations to funnel relevant posts to the incident channel so response actions are centrally logged.
Perform diagnostic commands on other systems and post results in the channel
This stage typically consists of several swim lanes with different team members potentially working in parallel. For example:
Coordinate efforts and collaboration to keep the team on track
Investigate cause of issue by querying other systems
Communicate status and update with stakeholders such as Leadership, Legal, or Public Relations.
Here are some of the ways that Mattermost is used to help teams respond more effectively:
Centralize investigation results and actions performed into one channel so they're visible to the whole team
Track statistics such as mean-time-to-acknowledgment (MTTA) and mean-time-to-resolution (MTTR) in a glanceable report
Save time jumping between other tools by leveraging slash command to perform actions on other systems and log who did what and when
Pin key messages so that they can be easily reviewed as a mini-timeline of the incident at any time
Once the issues have been resolved, the incident channel becomes a complete record of the entire response, without any additional effort to compile, which is essential for post-mortem and Root Cause Analysis (RCA). Message within the channel can be searched, filtered, and eventually exported for reporting and analysis that will help the team learn and improve process.
Based on our observation of how teams leverage Mattermost today, we believe there is amazing potential for us to be more intentional in supporting them. We will work towards making Mattermost the best collaboration platform for software operators by investing in these four areas:
Integration with monitoring, ticketing, and CI/CD tools to enable out-of-the-box experience most of the time while supporting the flexibility for customization
Workflow framework to enable inter-plugin communication and automate actions that can be tailored to your operating procedure
Configurable slash commands to query data and perform actions on external systems from within Mattermost to keep information in-context
Channel data export and cleansing functionality to support analysis and reporting as well as meet compliance requirements