Originally posted here: Alegion’s blog
In a world where risks seemingly grow at an exponential rate, an engineering organization’s ability to detect and respond to incidents is critical to its survival and success.
But how do we define an ‘incident’? And what does it mean to be ‘prepared’ for one? In this article, we hope to answer some of these fundamental questions, and more. We believe sharing this information is beneficial for engineering teams everywhere looking to improve resiliency, reduce incident severity/downtime, and increase confidence in their product.
The goal of this article is to outline incident response best practices collected and implemented at Alegion. Our response plan has been forged from input provided by industry experts, as well as plenty of hands on experience managing incidents while growing our platform over the years.
A good analogy for incident response in the physical world is first responders. Paramedics, firefighters, and police officers all fill critical roles in society. Their services are extremely time-sensitive, where delays cost lives and money. They are specialists who practice frequently and can handle a variety of situations by following predefined game plans. There are different roles, and the type of incident dictates the form and shape of the team that responds.
Fortunately, in information technology, most of us do not work with systems that have life-and-death criticality. That doesn’t mean we can’t learn something from first responders in how they organize, prepare, and handle problems.
Anyone involved in ops, development, security, or product management can benefit from reading this article. As you will see, good incident response requires bidirectional support between different business functions – there is plenty of room for everyone to be involved and informed.
Whether your team is struggling to keep up with constant firefights, or if you already have a strong incident response game, there should be something here for everyone.
The time to define what is and is not an incident is not when something anomalous is happening. That is too late. Take some time after reading this section to consider incident categories that apply to your business model. Then, share that distilled information with relevant stakeholders. An example document is linked below.
At a high level, there are three broad categories of incidents.
Deployment – A deployment has introduced an undesirable side effect.
Security – Evidence exists of unauthorized access, misconfiguration, malicious activity, or other related concern.
Performance – Part of the infrastructure has become a bottleneck and is affecting end user experience. Alternatively, the platform is at risk of failing its service level objectives, potentially triggering complications with clients.
One example of a document outlining types of incidents is Alegion’s incident triage checklist, which is designed around the information security CIA triad of confidentiality, integrity, and availability. Checking any box on the incident triage checklist is a strong indication that you are currently experiencing an incident.
Typically, if your company is in compliance with SOC or a similar security auditing regime, you’ll already have an incident response plan in place. If so, review it and ensure the plan is relevant and up to date.
If no incident response plan is in place, do some research and create one as soon as possible. Below are critical elements of a successful incident response plan.
It’s important to keep everyone on the same page about the status of the response. One strategy to help with that is to define stages, so that a single word quickly communicates the rough state of the response.
Prepare – what you are doing right now.
Detect – logging, monitoring, observability.
Respond – targeted changes, hypothesized to fix the problem.
Recover – retro and develop action items.
Successful incident response requires defined roles for the responders. Role titles we have defined at Alegion include:
|Role|Description|
|---|---|
|Lead|Incident commander, and the individual ultimately accountable for response effectiveness.|
|Coordinator|Incident admin, responsible for fulfilling the needs of the response team. Provides supplies as needed (food, equipment, responder requests).|
|Responder|The individuals who are actively responding to the incident.|
|Scribe|Documents important timeline events, including decisions made and key players involved.|
|Communicator|Communicates to the rest of the org while the incident is ongoing. Communicators provide status updates through pre-defined communication channels (email, slack, etc), and state when to expect the next update. Communicators also field questions from the org in order to isolate the rest of the team and allow them to focus on the response.|
Not all incidents have the same severity or prioritization. Define priority levels for your org, and acceptable response times for the different priorities. You may not know the priority right away, so handle reports with appropriate urgency until triaging can occur. The table below is an example of defined priority levels:
|Priority|Acknowledge|Respond|Resolve|Retrospective|
|---|---|---|---|---|
|P1|30 minutes|1 hour|24 hours|36 hours|
|P2|4 hours|12 hours|3 days|3 days|
|P3|24 hours|24 hours|7 days|7 days|
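Targets like the ones in the table above can be encoded so that tooling flags a breach automatically. Below is a minimal sketch; the priorities, target durations, and function names are illustrative and should be adapted to your org’s actual commitments.

```python
from datetime import datetime, timedelta

# Hypothetical acknowledgment targets per priority, mirroring the example
# table above. Adjust these to match your org's documented commitments.
ACK_TARGETS = {
    "P1": timedelta(minutes=30),
    "P2": timedelta(hours=4),
    "P3": timedelta(hours=24),
}

def ack_deadline(reported_at: datetime, priority: str) -> datetime:
    """Latest time an incident of this priority must be acknowledged."""
    return reported_at + ACK_TARGETS[priority]

def ack_breached(reported_at: datetime, acked_at: datetime, priority: str) -> bool:
    """True if acknowledgment happened after the target window closed."""
    return acked_at > ack_deadline(reported_at, priority)
```

The same pattern extends to the respond/resolve/retro columns: one dictionary per milestone, checked by a scheduled job or bot.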
This is similar to, but not exactly like, defining service level objectives, where a given metric is assigned SLOs for a predefined unit of time. An example SLO would be: availability (the metric) is at a minimum of 99.5% (the service objective) over a one year period (the unit of time).
Read more about SLOs: https://en.wikipedia.org/wiki/Service-level_objective
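To make the SLO example concrete, the availability target implies a fixed budget of allowable downtime over the window. A quick sketch of that arithmetic:

```python
# Downtime "budget" implied by an availability SLO: the fraction of the
# window during which the service may be unavailable while still meeting
# the objective.

def downtime_budget_hours(slo_percent: float, window_days: float = 365) -> float:
    """Hours of downtime permitted while still meeting the SLO."""
    return (1 - slo_percent / 100) * window_days * 24

# The 99.5%-over-one-year example above allows roughly 43.8 hours of downtime.
```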
Document what channels to use for incident reporting. Typically, email is too slow for effective incident response – if this is true for you, then say so in your policy and communicate that to your internal stakeholders. Require the usage of real time communication methods, such as: slack, a phone call, or an in-person report if co-located.
Any incident reported through these channels falls under the prioritization scheme described above.
The Berkeley Information Security Office has a good guide for incident response planning, including a detailed list of components for incident response plans.
Make sure your plan includes how to actually conduct the response itself. This is crucial, because once a problem occurs many things typically happen at once, and events can quickly spiral out of control if the process isn’t defined and practiced.
When things are actively broken, it is tempting to shotgun possible solutions as fast as possible until service is restored. Avoid this temptation. It’s extremely easy, and in fact likely, that more harm will come from panic patching. Instead, have the response team use a private communication channel to triage, understand, and isolate anomalies. Share evidence, and wait to take action until consensus can be reached.
An explanation of how and when communications occur should be part of the incident response plan. For example, the plan can dictate hourly updates to internal stakeholders over Slack, #incident-command, for the duration of an incident. #incident-command should then be restricted, so that chatter happens on other channels or in threads. The important announcements must be easy to find.
The incident communicator should use the predefined communication channels to announce:
when an incident is declared
timely updates, including when to expect the next update
when an incident is closed
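The announcements above follow a predictable shape, so they are easy to template. Here is a minimal sketch of an update formatter; the incident ID, statuses, and message layout are hypothetical, not a prescribed format.

```python
from datetime import datetime, timedelta

def incident_update(incident_id: str, status: str, summary: str,
                    now: datetime, update_interval: timedelta) -> str:
    """Render a status update for the announcement channel, including
    when to expect the next update (per the plan's cadence)."""
    next_update = now + update_interval
    return (
        f"[{incident_id}] Status: {status}\n"
        f"{summary}\n"
        f"Next update by {next_update:%H:%M}."
    )
```

A template like this keeps updates consistent even when the communicator rotates mid-incident.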
Discuss privately amongst the incident response team what to do and assign one specific responder to apply patches. The scribe should monitor this private team communication channel in order to document what was changed, when, and what impact the change had, if any. The communicator should craft updates based on a distillation of activity from this channel.
Once the problem is fixed, we are done, right? Not exactly.
We still have a critical step to complete: a retrospective meeting. Ideally this happens within a few days of incident closure, while details are still fresh for everyone. The retrospective needs to cover a few key things:
Timeline – the scribe should retell the timeline, from the start of the anomaly, to its discovery, important decisions that were made, as well as any remediation steps that the response team took.
Blameless RCA – identify the underlying problem, as far as reasonably possible. One method to dig below the surface is asking the five whys, where you start with a high-level question, e.g. “why did the service break?”, and each answer forms the basis of the next question. By the end of five iterations, practitioners should be closer to the root cause of an issue.
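The five-whys chain is simple enough that the scribe can capture it as structured data for the retro notes. A sketch, with entirely illustrative answers:

```python
# Capture a five-whys chain: each answer becomes the subject of the next
# "why?" question, drilling toward a root cause. Limited to five rounds.

def five_whys(problem: str, answers: list) -> list:
    """Pair each generated 'why?' question with its recorded answer."""
    chain = []
    subject = problem
    for answer in answers[:5]:
        chain.append((f"Why did {subject} happen?", answer))
        subject = answer
    return chain
```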
Action items – don’t stop at determining RCA. We must learn, improve, and avoid repeating mistakes. Create action items and assign them during the retrospective. Get a commitment on when to expect updates. These action items should be documented in the main ticketing system the rest of the company uses. At this point you have everything needed in order to follow up later and check on improvement progress.
As information security professionals, our customers and internal stakeholders place a lot of trust in us. We have an enormous job to be prepared for, well, just about anything. In order to be successful in this role we need every resource at our disposal, when it’s needed. This kind of organization just doesn’t happen organically. You need to plan, document, practice, and retro every incident for process improvement. I hope I’ve convinced you of the value in taking these extra steps. Your customers and coworkers will thank you when things break and you are able to confidently and swiftly navigate operations back to normal.
Incident response training provided by:
Boyd Hemphill - https://www.linkedin.com/in/boydhemphill/
Karl Katzke - https://www.linkedin.com/in/karlkatzke/
Alegion’s Incident Triage Checklist: https://github.com/Alegion/incident-triage-checklist
Information Security CIA Triad: https://en.wikipedia.org/wiki/Information_security#Key_concepts
Service Level Objectives: https://en.wikipedia.org/wiki/Service-level_objective
UC Berkeley IR Planning Guidelines: https://security.berkeley.edu/incident-response-planning-guideline
Five Whys: https://en.wikipedia.org/wiki/Five_whys