How do you know you are "Ready to Respond"?


The Continuous Improvement Framework - A framework designed to help improve a team’s response readiness through data-driven actions

Authors: Angelika Rohrer, Jon Brown

Contributors: Joachim Metz

January 2024

___

About this paper

What is the CI Framework?

Introduction

What does “Ready to Respond” mean?

Measuring Response Readiness

Continuous Improvement (CI) Framework

Benefits

How do you implement the CI Framework?

So, where do you start?

1. Response Strategy

2. Critical Phases

3. Measurements and Metric Selection (KPIs)

4. Procedural Health Assessment

5. Gap Analysis Report & Planning input

Conclusion

Appendix

Appendix A: CI Framework - Response Strategy Categorisation Template

Appendix B: CI Framework - Sample Evaluation Phishing

Appendix C: CI Framework - Sample Response Category Catalog

Appendix D: CI Framework - Sample Gap Analysis Report

___


About this paper

In this paper we introduce the idea of a “Continuous Improvement (CI) Framework”, which enables an organization to self-assess the health of the underlying operational infrastructure that needs to be in place for an incident response team to be effective. The central focus of the CI Framework is maintaining and maturing the level at which an organization is "Ready to Respond" to any given type of incident.


----------------------------- 


Choice of words and phrases

- CI - Continuous Improvement

- CMMI - Capability Maturity Model Integration

- Frameworks are seen as tools to solve big-picture problems. They should be used to create a common language that other organizations, including customers and regulators, can understand when wanting to learn more about an organization’s security posture (see Trapped in a frame).

- Incident Response (IR) is a structured, well-documented, and formalized strategic approach to responding to an incident, with the goal of limiting or preventing damage to an organization and remediating the cause to reduce the risk of future incidents. IR is part of the broader Incident Management (IM) process and focuses on handling technical tasks and considerations. In our case, the term “incident” refers to security incidents such as cybersecurity threats, data breaches or system failures.

- KPI - Key performance indicator

- Operations or operational work describes ongoing, often repetitive, activities that need to be completed to keep the Incident Response Team’s lights on. Activities include administration, training, process documentation, system, tooling and lab maintenance.

- SLA - Service Level Agreement

- SLO - Service Level Objective



----------------------------- 


NOTE: In the context of existing Cyber Security Incident Response Team (CSIRT) Security Maturity Models, the CI Framework can be compared with SIM3 v2 interim Self Assessment Tool, section P-8: Audit and Feedback Process which "describes how the CSIRT assesses their set-up and operations by self-assessment, external or internal assessment and a subsequent feedback mechanism. Those elements considered not up-to-standard by the CSIRT and their management are considered for future improvement." [see SIM3-mkXVIIIc]



-----------------------------   ⏺ ⏺ ⏺   -----------------------------


What is the CI Framework?


Introduction

Within the hectic, reactive world of Incident Response (IR), the key to being effective is meticulous preparation and planning, as outlined in Data Incident Response Process and Building Secure and Reliable Systems. Effective incident response teams are dedicated to learning from every incident. IR teams use findings to improve their incident handling and are always on the lookout for ways to implement additional preventive measures. Striving for continuous improvement in this area can feel like a full-time job. IR teams recognize that the available tools, capabilities, and processes are often good enough, but they could always be better. However, improving existing operational infrastructure and procedures, response capabilities, and partnerships is often an afterthought, done on the spur of the moment. It often feels like maturing underlying operational processes has a lower priority than responding to an incident, doing research, or taking part in exercises, tabletops, and training - and yet, not maturing the operational infrastructure can lead to a lag in general team preparedness.

The CI Framework's strength lies in its ability to assess the current level of a team’s IR readiness and evaluate their preparedness to effectively respond to potential major incidents. It considers various factors related to the health and maturity of essential operational infrastructure, including playbooks, partnerships, and tools. By doing so, the framework provides insights into the team's ability to efficiently handle future incidents.



What does “Ready to Respond” mean?

The term "Ready to Respond" can hold various meanings for different teams. In the context of this paper, it means the IR team has not only the ability but also the necessary, well maintained operational infrastructure and healthy resources available, to successfully engage in managing an incident.

Some examples of what this looks like are:

- Incidents have a well defined escalation path;

- Playbooks exist and are up to date, reviewed regularly and gaps are known;

- Critical tools are always available and capable of handling any incident type;

- Partnerships with essential stakeholders for an incident are clearly identified (i.e. legal counsel, communications department, etc.);

- Teams operate at a high level and meet their SLA/SLOs and/or KPIs;

- After an incident, follow-up occurs where needed to verify root cause analysis and action item completion.



Measuring Response Readiness

Anyone who has attempted to measure the success of incident response within the fast-paced, reactive security landscape has likely come to the conclusion that this is a difficult problem to solve, with no readily available "one-size-fits-all" solution. Numerous industry security maturity models exist that can assess an organization's overall security posture to determine whether the organization has an adequate security management program in place. Some great examples are NIST 800-61, ISO 27035, SIM3 and COBIT. All of these examples have one thing in common: they focus on how well a team performs, the effectiveness of individual responders, or the time to recovery. Measuring performance provides insights on what went well and what went wrong while working an incident. However, it does not give detailed insights into the state of the underlying infrastructure. The team will most likely not be able to easily answer questions such as:

  • Are we ready to respond to the next big incident?

  • Are we ready for issues that happen infrequently?

  • Are our processes sufficiently up to date for new regulations?



Continuous Improvement (CI) Framework

One way of getting efficient answers to these questions is to implement the Continuous Improvement Framework. This framework enables creating an accurate, comprehensive, and holistic picture of capability and process health - a clear picture of which areas to invest in, which projects to prioritize, and where to allocate bandwidth and budget.


The CI Framework is designed to track and measure capability and process health to ensure the team is ‘Ready to Respond’.


The framework categorizes incident response efforts into clearly defined response strategies and critical phases common to all response strategies. Specifically selected points of measurement (KPIs) highlight gaps and areas of improvement within the IR team’s operational infrastructure, which helps to mature overall response readiness.


In the diagram above, “Phishing Response”, “Malware Response” and “Ransomware Response” are response strategies that consist of multiple critical phases such as “intake”, “playbooks” and “tools”.




Benefits

Once strategies and phases are identified, the CI Framework is easily set up and it does not require much maintenance bandwidth. 

Benefits include:

- It helps the organization understand and improve response readiness and capabilities, regardless of the type or severity of the security incident being managed.

- It measures and tracks the health of processes, tooling, and response strategies over time.

- It creates a scalable and flexible way to onboard new response categories. Categories and phases can be freely chosen based on specific needs or data at hand.

- Over time the measurements can be refined and will expose additional gaps which in turn can be prioritized and fixed at an appropriate pace. 

- Additional points of measurement can be added at any time. 

- It simplifies planning and bandwidth prioritization by providing a big picture view and highlighting gaps and shortcomings.

- It helps to improve process resilience by identifying single points of failure.

- It allows Incident Commanders, Responders and Security Engineers to have a clear vision of the impact of their actions and roadmaps.



-----------------------------   ⏺ ⏺ ⏺   -----------------------------


How do you implement the CI Framework?


So, where do you start?

The Continuous Improvement Framework is designed as a long-term goal that can be achieved by making regular incremental changes. The Plan-Do-Check-Act (PDCA) project management approach is a great supporting tool to highlight the individual steps necessary for implementation.

The full setup cycle, as shown in Figure 1, only needs to be performed once in its entirety. Upon defining response strategies, identifying critical phases, and selecting appropriate measurements, the primary focus of the ongoing time investment should shift to consistently completing the procedural health assessments, conducting gap analysis, and planning and prioritizing projects. The average time investment for the setup cycle is 30-60 minutes per strategy per quarter.

IR teams can onboard or offboard response categories, critical phases, and measurements at any time to increase the depth of the analysis.


Figure 1 - The CI Framework setup cycle


  1. Response Strategy 

Start by creating a Response Strategy Catalog (overview) tracking all defined response strategies. Bucket individual response strategies by impact or common threat, ensuring all cases that fall into the same category are of a similar nature. Start with high-level categories like malware, phishing, fraud, and ransomware, and then refine them in later iterations.



Example: Creating a Response Strategy Catalog

Iteration 1 (What type of response?): Compromise

  Impact:      Security breach
  Definition:  Attacks targeting infrastructure, services, products, users or devices
  In Scope:    Malware, Social Engineering, Credential Theft, …

Iteration 2 (What type of compromise?): Malware

  Impact:      Security breach / compromise
  Definition:  Malicious software targeting infrastructure, services, products, users or devices
  In Scope:    Malicious Apps, Man-in-the-Middle, third party software issues, Mobile phone malware, …


Result (Malware response strategy with 3 specific sub-categories):

  Impact          | Response Strategy (Main) | Response Strategy (Sub)    | PoC
  Security Breach | Malware Response         | Android Malware            | Rob
                  |                          | Third party software issue | Derek
                  |                          | Malicious Apps             | Tarik
  Abuse           | Cloud Response           | Coinmining                 | Jason
                  |                          | Ransomware                 | Ollie




Once all individual response strategies have been categorized, formally document the process. For each strategy, provide a definition, clearly define its scope, establish exit criteria, and assign a primary owner or subject matter expert. The owner will be able to assist initially, potentially owning the full assessment and planning of their strategy once they are comfortable with the process.


Example: Phishing

  Definition:     phishing attacks within your organization
  In Scope:       email phishing, spear phishing, whaling
  Out of Scope:   bad marketing / recruiting practices
  Edge Cases:     smishing and vishing
  Exit Strategy:  mitigation and remediation completed



With this structured approach, subject matter experts can easily add new response strategies or refine existing ones. 

Note: The chosen strategies will differ based on the organization.
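Capturing each catalog entry in a small structured record can make the catalog easy to extend and query. The following Python sketch is a minimal illustration under assumed names (the ResponseStrategy class and its fields are hypothetical, not part of the framework); it simply mirrors the definition, scope, exit criteria, and owner described above.

```python
from dataclasses import dataclass, field

@dataclass
class ResponseStrategy:
    """One entry in a Response Strategy Catalog (illustrative fields only)."""
    impact: str                              # e.g. "Security Breach"
    main: str                                # e.g. "Malware Response"
    sub: str                                 # e.g. "Android Malware"
    owner: str                               # primary owner / subject matter expert (PoC)
    definition: str = ""
    in_scope: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)
    exit_criteria: str = ""

# Example entry based on the phishing strategy documented above
# (the sub-category and owner are placeholders).
phishing = ResponseStrategy(
    impact="Security Breach",
    main="Phishing Response",
    sub="Email Phishing",
    owner="TBD",
    definition="Phishing attacks within your organization",
    in_scope=["email phishing", "spear phishing", "whaling"],
    out_of_scope=["bad marketing / recruiting practices"],
    edge_cases=["smishing", "vishing"],
    exit_criteria="mitigation and remediation completed",
)
```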



  2. Critical Phases

The next step, after defining response strategies, is to identify critical phases that need to be completed during each response, regardless of the response strategy. 

For the purpose of measuring response readiness, consider the following critical phases: 




Intake

Quality / accuracy and completeness of information reported. Is this information routed to the most appropriate team with the fewest handoffs possible? 


Playbooks

Playbooks (or response plans or runbooks) are written, reviewed and have all necessary features to enable responders to consistently achieve their goals.

Tools and Capabilities

Tools support responders to efficiently and consistently respond to the incident type.

Partnerships / Stakeholders

Responders know the appropriate internal partner teams or points of contact to engage throughout an incident. Partner teams are trained in participating in an incident response.

Operational Excellence

Operational SLAs and SLOs (or other organizational measures like KPIs) are met.


Aftercare

If a post mortem was needed, action items are clearly identified and brought to conclusion.


Trends and common (root) causes identified and actioned.




  3. Measurements and Metric Selection (KPIs)

Now select appropriate data points with the goal of measuring the health of each critical phase. For a first iteration, hereafter also referred to as “V1”, focus on existing and easily available data points like "Does a playbook exist?" and "Is it up to date?". Depending on what data points are already recorded, the measurements used may either be very detailed and elaborate or start off as a simple checklist.

Note: Adjust these measurements to fit your specific needs or the metrics available.



Phase                   | Example V1 Measurement
Intake                  | • Does a defined escalation path exist?
                        | • Does an automated email to ticket system exist?
Playbooks               | • Does a playbook exist?
Tools and Capabilities  | • Are critical tools available?
                        | • Are critical capabilities available?
Partnerships            | • Are support partners identified?
                        | • Are escalation path(s) in place?
Operational Excellence  | • Are processes completed within SLA?
Aftercare               | • Was a post mortem written?



Over time, the chosen measurements can be refined and tweaked to expose more granular shortcomings (gaps). These gaps can then be fixed at an appropriate pace. 



Phase                   | Example V2 Measurement
Intake                  | • Does a defined escalation path exist?
                        | • Does an automated email to ticket system exist?
                        | • How often are tickets misrouted?
Playbooks               | • Does a playbook exist?
                        | • Is it up to date?
Tools and Capabilities  | • Are critical tools available?
                        | • Are critical capabilities available?
Partnerships            | • Are support partners identified?
                        | • Are escalation path(s) in place?
                        | • Does a hand-off process exist?
Operational Excellence  | • Are processes completed within SLA?
Aftercare               | • Was a post mortem written?
                        | • Was a trend analysis completed?
                        | • Are essential action items (AIs) identified?



After the selection of appropriate data points, the next step is to measure the health of each identified data point. For measuring each data point, we chose the Capability Maturity Model Integration (CMMI) framework. However, any other predefined, industry-standard maturity model is equally suitable.



Using the CMMI as a reference, the answers given in the “Measurements and Metric Selection” section can then easily be converted to a simple score of 1-5, as shown in the Procedural Health Assessment section below.


Note: The measurement scale itself will not change over time; however, the KPI per "critical phase" will move upwards or downwards once you start adding more data points to measure and/or fix identified gaps.
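To make the conversion concrete, here is a minimal Python sketch of turning checklist answers into a 1-5 phase KPI. The answer values (Yes = 1.0, Partially = 0.5, No = 0.0), the weights, and the rounding onto the 1-5 scale are illustrative assumptions; the framework does not prescribe a specific conversion formula, so adjust the mapping to your own scoring rules.

```python
# Minimal sketch: convert weighted checklist answers into a 1-5 phase KPI.
# The Yes/Partially/No values, weights, and rounding rule are illustrative
# assumptions, not a prescribed part of the CI Framework.

ANSWER_VALUE = {"Yes": 1.0, "Partially": 0.5, "No": 0.0}

def phase_kpi(measurements: list[tuple[str, float]]) -> int:
    """measurements: (answer, weight) pairs; weights should sum to 1.0."""
    fraction = sum(ANSWER_VALUE[answer] * weight for answer, weight in measurements)
    # Map the weighted fraction (0.0 - 1.0) onto the 1-5 maturity scale.
    return round(1 + 4 * fraction)

# Example: Intake phase with two equally weighted V1 measurements.
intake = [
    ("Yes", 0.5),  # Does a defined escalation path exist?
    ("No", 0.5),   # Does an automated email to ticket system exist?
]
print(phase_kpi(intake))  # -> 3
```

With this particular mapping the example happens to reproduce the Intake score of 3 shown in Figure 2, but any consistent mapping works as long as it is applied the same way across evaluations.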




  4. Procedural Health Assessment

Now that the initial setup of the CI Framework is done, the evaluation and assessment part starts. The simplest approach is to ask the owner of each response strategy category to help with the evaluation, assessment, and scoring. For V1, shown below in Figure 2, keep track of capabilities that are working well, but also highlight areas that are broken or missing.



Example Evaluation: Phishing (V1 Measurement)

Intake (KPI: 3)
  • Does a defined escalation path exist? Yes
  • Does an automated email to ticket system exist? No
  Known gaps: Automated ticketing system missing

Playbooks (KPI: 4)
  • Does a playbook exist? Yes

Tools and Capabilities (KPI: 4)
  • Are critical tools available? Partially
  • Are critical capabilities available? Yes
  Known gaps: Some minor tooling features to obtain full automation are missing

Partnerships (KPI: 4)
  • Are support partners identified? Partially
  • Are escalation path(s) in place? Yes

Operational Excellence (KPI: 2)
  • Are processes completed within SLA? Partially
  Known gaps: Process Z is broken and causes delays

Aftercare (KPI: 4)
  • Was a post mortem written? Yes


Figure 2: Results of the V1 evaluation and assessment



The selected measurements and data points can be refined at any time. Figure 3 below shows a refined next iteration (V2) of the phishing example. Certain scores were adversely affected by the newly added measurements, which elevates the priority of resolving those gaps promptly.



Example Evaluation: Phishing (V2 Measurement)

Intake (KPI: 2)
  • Does a defined escalation path exist? Yes
  • Does an automated email to ticket system exist? No
  • How often are tickets misrouted? 23%
  Known gaps: Automated ticketing system missing; misrouting due to missing ticketing system

Playbooks (KPI: 4)
  • Does a playbook exist? Yes
  • Is it up to date? Yes

Tools and Capabilities (KPI: 4)
  • Are critical tools available? Partially
  • Are critical capabilities available? Yes
  Known gaps: Some minor tooling features to obtain full automation are missing

Partnerships (KPI: 3)
  • Are support partners identified? Partially
  • Are escalation path(s) in place? Yes
  • Does a hand-off process exist? No
  Known gaps: Missing hand-off process / playbook

Operational Excellence (KPI: 2)
  • Are processes completed within SLA? Partially
  Known gaps: Process Z is broken and causes delays

Aftercare (KPI: 2)
  • Was a post mortem written? Yes
  • Was a trend analysis completed? No
  • Are essential action items identified? No
  Known gaps: Process to write trend analysis missing; action item and feature tracking process incomplete

Figure 3: Results of the V2 evaluation and assessment



  • New Response Strategies can quickly be evaluated and documented.

  • Additional points of measurement can be added at any time. 


Note: When additional measurements are added, run the new evaluation alongside the old version for at least one cycle, and compare only results that use the same measurement version so that progress is reflected accurately.




  5. Gap Analysis Report & Planning input

Prioritize the identified gaps based on the risk they pose to successfully responding to an incident. Use this prioritized list to present the results to your major stakeholders. The results provide a big-picture overview of the IR team’s overall incident response health and serve as input for management planning. The IR team and/or management can explore the list of identified gaps and determine whether they are already covered by ongoing efforts or need to be evaluated for future work. The results of the gap analysis can also be integrated alongside other ongoing work.
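The sample reports below could, for example, be generated from a small data structure that records per-strategy phase KPIs and known gaps. The following Python sketch is one possible illustration; the names are hypothetical, and using the phase KPI (lowest score first) as a proxy for risk is an assumption rather than a rule of the framework.

```python
# Minimal sketch of gap prioritization: collect known gaps per strategy and
# phase, then list them lowest-scoring phase first. Names and the
# KPI-as-risk-proxy heuristic are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class StrategyAssessment:
    strategy: str
    phase_kpis: dict[str, int]    # e.g. {"Intake": 3, "Playbooks": 4, ...}
    gaps: dict[str, list[str]]    # known gaps per phase

def prioritized_gaps(assessments: list[StrategyAssessment]) -> list[tuple[int, str, str, str]]:
    """Return (kpi, strategy, phase, gap) tuples, lowest KPI first."""
    rows = [
        (a.phase_kpis[phase], a.strategy, phase, gap)
        for a in assessments
        for phase, gap_list in a.gaps.items()
        for gap in gap_list
    ]
    return sorted(rows)

# Values taken from the March 2023 sample report below.
phishing = StrategyAssessment(
    strategy="Phishing",
    phase_kpis={"Intake": 3, "Playbooks": 4, "Tools & Capabilities": 3,
                "Partners": 3, "Ops Excellence": 2, "Post Mortems": 3},
    gaps={"Intake": ["Ticketing system missing"],
          "Ops Excellence": ["Process Z broken"],
          "Partners": ["Missing hand off process"],
          "Post Mortems": ["Trend analysis missing",
                           "Action Item tracking incomplete"]},
)

for kpi, strategy, phase, gap in prioritized_gaps([phishing]):
    print(f"{kpi}  {strategy:10} {phase:16} {gap}")
```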



1st Evaluation, March 2023, V1 Measurement

Response Strategy | Intake | Playbooks | Tools & Capabilities | Partners | Ops Excellence | Post Mortems
Phishing          |   3    |     4     |          3           |    3     |       2        |      3
Malware           |   2    |     3     |          3           |    4     |       2        |      3
Ransomware        |   2    |     4     |          3           |    2     |       3        |      5

Notes (Phishing) - GAPS:

- Intake: Ticketing system missing

- Operational Excellence: Process Z broken

- Partners: Missing hand off process

- Post-mortems: Trend analysis missing

- Post-mortems: Action Item tracking incomplete




2nd Evaluation, September 2023, V1 Measurement

Response Strategy | Intake | Playbooks | Tools & Capabilities | Partners | Ops Excellence | Post Mortems
Phishing          |   3    |     4     |          3           |    4     |       3        |      4
Malware           |   3    |     3     |          3           |    4     |       2        |      4
Ransomware        |   3    |     4     |          3           |    3     |       3        |      5

Notes (Phishing) - GAPS:

- Intake: Ticketing system missing

- Post-mortems: Action Item tracking incomplete

Notes (Phishing) - FIXED:

- Partners: Hand off process launched

- Operational Excellence: Process Z fixed

- Post-mortems: Trend analysis implemented




The results of the CI assessments provide key insights into the following questions about the organization’s response readiness:

  • How well prepared are we to respond?

  • What are our biggest gaps and how critical are those?

  • Are we ready to respond to the next big incident?

  • Are we ready for issues that happen infrequently?

  • Do we set the right priorities for ongoing project and tooling work?

  • Is bandwidth for project work allocated correctly?

  • Are our processes sufficiently up to date for new regulations?




Conclusion

In uncertain landscapes with ad-hoc decisions and reactive workflows, maintaining a stable foundation for a consistently ready response team is challenging.


The Continuous Improvement Framework offers a unified perspective on response health and operational maturity, enabling informed resource investments and risk assessments. It prevents surprises and minimizes mishandling risks that could harm reputation and finances.


The framework's simplicity, flexibility, and scalability make it applicable to various areas, including response/investigation strategies, partnerships, and tooling resilience.


In Summary: The CI Framework outlines the high-level purpose of all ongoing work, categorizes it, and measures its impact. Through the defined KPIs it allows for correct project prioritization, helping you reach annual objectives, which ultimately helps you achieve your mission to “always remain ready to respond”.

Note: In case we piqued your interest and you want to try it out for your team, we have included several templates and examples to get you started.  



-----------------------------   ⏺ ⏺ ⏺   -----------------------------


Appendix


Appendix A: CI Framework - Response Strategy Categorisation Template


Security Response Continuous Improvement Framework

Response Classification $Template


Status: Draft | Authors: {...} | Version: 1.1 | Last update: 



Response Strategy Definition: $ Name

Incident Response Evaluation

1st Evaluation, V1, {Date}


Response Strategy Definition: $ Name

This is what the response is about



Response Strategy Scope
  Description: Describe the bounds within which the incident is handled
  Notes / Reasoning:

Edge Cases
  Description: What are the one-off scenarios that might require additional support?
  Notes / Reasoning:

Exit / de-escalation / transition criteria
  Description: The point when the incident can be closed or transitioned to another team
  Notes / Reasoning:



Incident Response Evaluation 

How well are we responding?


1 | Initial / Ad hoc        | Completely missing, no instructions or predefined partner team to help
2 | Managed                 | Would work with extreme effort, no documentation, unclear partners, minimal critical artifacts/logs/processes available
3 | Defined                 | Slow and manual effort, some documentation, some partnerships, some critical artifacts/logs/processes available
4 | Quantitatively Managed  | Slow but automated, fairly well documented, strong partnerships, most critical artifacts/logs/processes available
5 | Optimizing              | Fully automated and documented, strong partnerships, has specialized tools / processes, all critical artifacts/logs/processes available, business as usual


1st Evaluation, V1, {Date}


Category: Intake
  Measure V1:
    - (100%) How often are tickets misrouted?
      * currently only 1 measurement, the answer makes up 100% of the KPI
  KPI: NA
  Notes / Reasoning:

Category: Playbooks
  Measure V1:
    - (50%) Does a playbook exist?
    - (50%) Is it up to date?
      * 2 measurements available, both answers are equally important, so each makes up 50% of the KPI
  KPI: NA
  Notes / Reasoning:
    • Link to Playbook:
    • Up to date: Yes / No

Category: Tools and Capabilities
  Measure V1:
    - (50%) Are critical tools available?
    - (50%) Are critical capabilities available?
  KPI: NA
  Notes / Reasoning:
    • Critical tools: Yes / No / Partially
      • Tool 1
      • Tool 2
    • Critical capabilities: Yes / No / Partially
      • Capability 1

Category: Support Partnerships (contributors to an incident)
  Measure V1:
    - (60%) Are support partners identified?
    - (40%) Are escalation paths in place?
      * 2 measurements available, measurement 1 is considered more important and receives a higher weight in % contributing to the overall KPI
  KPI: NA
  Notes / Reasoning:
    • RACI chart: Yes / No / Partially
    • Escalation path up to date: Yes / No / Partially

Category: Ops Excellence
  Measure V1:
    - (100%) Are tickets within our defined SLAs/SLOs?
  KPI: NA
  Notes / Reasoning:
    • SLA: Yes / No / Partially
    • SLO: Yes / No / Partially

Category: Post Mortems (PM)
  Measure V1:
    - (100%) Was a PM written?
  KPI: NA
  Notes / Reasoning:
    • PM written: Yes / No



Other Examples



Example: Training
  Measure:
    - (40%) Do we have incident specific training?
    - (20%) Do we have a test plan?
    - (20%) Was a "lessons learned" doc created after the end of the test?
    - (20%) Has training been completed within the last 6 months?
  KPI: NA
  Notes / Reasoning:
    • Training in place: Yes / No / Partially
    • Test plan: Yes / No / Partially
    • "Lessons learned": Yes / No / Partially
    • Training completed in time: Yes / No / Partially





Appendix B: CI Framework - Sample Evaluation Phishing



Security Response Continuous Improvement Framework

Response Classification Phishing


Status: Draft | Authors: {...} | Version: 1.1 | Last update: 


Response Category Definition: Phishing Response

Incident Response Evaluation

1st Evaluation, V1 - September 2023


Response Strategy Definition: Phishing Response

This is what the response is about



Response Strategy Scope
  Description: Describe the bounds within which the incident is handled
  Notes / Reasoning:
    • In scope: phishing attacks within your organization (email phishing, spear phishing, whaling)
    • Out of Scope: bad marketing / recruiting practices

Edge Cases
  Description: What are the one-off scenarios that might require additional support?
  Notes / Reasoning:
    • smishing and vishing

Exit / de-escalation / transition criteria
  Description: The point when the incident can be closed or transitioned to another team
  Notes / Reasoning:
    • mitigation and remediation completed


Incident Response Evaluation 

How well are we responding?


1 | Initial / Ad hoc        | Completely missing, no instructions or predefined partner team to help
2 | Managed                 | Would work with extreme effort, no documentation, unclear partners, minimal critical artifacts/logs/processes available
3 | Defined                 | Slow and manual effort, some documentation, some partnerships, some critical artifacts/logs/processes available
4 | Quantitatively Managed  | Slow but automated, fairly well documented, strong partnerships, most critical artifacts/logs/processes available
5 | Optimizing              | Fully automated and documented, strong partnerships, has specialized tools / processes, all critical artifacts/logs/processes available, business as usual


1st Evaluation, V1 - September 2023


Category: Intake
  Measure V1:
    - (40%) Does a defined escalation path exist?
    - (40%) Does an automated email to ticket system exist?
    - (20%) Are tickets misrouted?
  KPI: 3
  Notes / Reasoning:
    • Escalation Path: YES
    • Ticket System: YES
    • Misrouting: 23% of the time

Category: Playbooks
  Measure V1:
    - (50%) Does a playbook exist?
    - (50%) Is it up to date?
  KPI: 4
  Notes / Reasoning:
    • Link to Playbook: YES
    • Up to date: YES

Category: Tools and Capabilities
  Measure V1:
    - (50%) Are critical tools available?
    - (50%) Are critical capabilities available?
  KPI: 3
  Notes / Reasoning:
    • Critical tools: Partially
    • Critical capabilities: Partially

Category: Support Partnerships (contributors to an incident)
  Measure V1:
    - (60%) Are support partners identified?
    - (40%) Are escalation paths in place?
  KPI: 3
  Notes / Reasoning:
    • RACI chart: Yes
    • Escalation path up to date: No

Category: Ops Excellence
  Measure V1:
    - (100%) Are tickets within our defined SLAs/SLOs?
  KPI: 3
  Notes / Reasoning:
    • SLA: No
    • SLO: Partially

Category: Post Mortems (PM)
  Measure V1:
    - (100%) Was a PM written?
  KPI: 3
  Notes / Reasoning:
    • PM written: Not always





Appendix C: CI Framework - Sample Response Category Catalog


Impact          | Response Strategy (Main) | Response Strategy (Sub)    | Definition | PoC
Security Breach | Malware Response         | Android Malware            |            | Alice
                |                          | Third party software issue |            | Bob
                |                          | Malicious Apps             |            | Cedric
Abuse           | Cloud Response           | Coinmining                 |            | John
                |                          | Ransomware                 |            | Nicole




Appendix D: CI Framework - Sample Gap Analysis Report


Columns: Impact | Response Strategy (Main / Sub) | Point of Contact | Ticket # | Strategy Health Score (Ø all critical phases) | Intake (Ø score) | Playbooks (Ø score) | GAPS | Scoring changes

Rows (template, one per impact category, to be filled in per evaluation):

- Data

- Abuse

- Security Breach









