Resilience and Recovery Testing
(OBJ 3.4)
Goal: Plan for the worst and learn how to overcome any obstacle
Resilience Testing
- Assesses a system's ability to withstand and adapt to disruptive events
- Both resilience testing and recovery testing serve as a "fire drill" for enterprise networks and operations
- Ensures the system can recover from unforeseen incidents
- Conducted through tabletop exercises, failover tests, simulations, and parallel processing
- Helps prepare for events like power loss, natural disasters, ransomware attacks, and data breaches
Recovery Testing
- Evaluates the system's capacity to restore normal operation after a disruptive event
- Involves executing planned recovery actions
- Performed through failover tests, simulations, and parallel processing
- Ensures that planned recovery procedures work effectively in a real-world scenario
Tabletop Exercises
- Scenario-based discussion among key stakeholders
- A simulated discussion to improve crisis readiness without deploying resources
- Assess and improve an organization's preparedness and response
- No deployment of actual resources
- Identifies gaps and seams in response plans
- Promotes team-building among stakeholders
- Lets each stakeholder and their team work out how they would respond to a given inject (a scenario event introduced during the exercise)
- A fairly low-cost option that still provides a highly engaging environment
Failover Tests
- Controlled experiment for transitioning from primary to backup components
- Verifies a seamless transition to the backup so that functionality continues uninterrupted during disasters
- Example: plans to shift business operations to an alternate hot site after a large-scale disaster can be verified with a failover test, for instance by actually performing the cutover from the East Coast to the West Coast
- Requires more resources, time, and energy, but verifies that the planned actions will work
- Validates the effectiveness of disaster recovery plans
- Can identify and rectify issues in the failover process
- Example: fly a small team out to the remote hot site to confirm that operations can continue smoothly
- Keep a rollback plan in place so that, if issues arise, operations can shift back to the main facility during troubleshooting
- Typically performed once or twice a year
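The cutover-and-rollback flow described above can be sketched in code. This is a minimal illustration only, assuming hypothetical `Site` and `FailoverTest` classes and made-up site names; a real failover test involves actual infrastructure, not in-memory objects.

```python
class Site:
    """Hypothetical processing site (main facility or hot site)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is unavailable")
        return f"{self.name} handled {request}"


class FailoverTest:
    """Controlled cutover from a primary site to a backup, with rollback."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.active = primary
        self.log = []

    def cut_over(self):
        self.active = self.backup
        self.log.append(f"cutover to {self.backup.name}")

    def roll_back(self):
        self.active = self.primary
        self.log.append(f"rollback to {self.primary.name}")

    def run(self, requests):
        """Cut over, replay sample requests on the backup, roll back on failure."""
        self.cut_over()
        try:
            results = [self.active.handle(r) for r in requests]
        except RuntimeError as err:
            # An issue was found: record it and return to the main facility.
            self.log.append(f"issue found: {err}")
            self.roll_back()
            return False, []
        return True, results


if __name__ == "__main__":
    test = FailoverTest(Site("east"), Site("west", healthy=False))
    print(test.run(["order-1"]))
    print(test.log)
```

The point of the sketch is the ordering: the rollback path exists before the cutover is attempted, so a failed test returns operations to the main facility while the issue is investigated.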
Simulations
- Computer-generated representation of a real-world scenario
- Allows for hands-on response actions in a virtual environment
- Example: spin up a virtual copy of the corporate network in the cloud and have a red team attack it while the defenders (the blue team) try to detect the attacks and apply proper incident response techniques to isolate the attackers from the network
- Assesses incident responders and system administrators in real-time
- Often involves penetration testing resources and staff
- Helps evaluate reactions and staff performance
- Provides each side with feedback for learning and improvement
Parallel Processing
- Replicates data and system processes onto a secondary system
- Runs primary and secondary systems concurrently
- Tests reliability and stability of the secondary setup to make sure it can handle processing data without disruption
- Ensures no disruption to day-to-day operations
- Assesses the system's ability to handle multiple failure scenarios simultaneously
- Requires meticulous planning, flawless execution, and an eagle eye for detail to ensure zero disruption
- Uses of parallel processing:
- In resilience testing: tests the ability of the system to handle multiple failure scenarios
- In recovery testing: tests the efficiency of the system in recovering from multiple points of failure
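The idea of running primary and secondary systems concurrently can be illustrated with a short sketch. It assumes two hypothetical stand-in functions, `primary_system` and `secondary_system`, that are fed the same records; only the primary's output is used for production, and the secondary's output is compared against it.

```python
from concurrent.futures import ThreadPoolExecutor


def primary_system(record):
    """Hypothetical production processing step."""
    return record * 2


def secondary_system(record):
    """Hypothetical secondary system, expected to mirror the primary."""
    return record * 2


def parallel_run(records, primary=primary_system, secondary=secondary_system):
    """Feed the same records to both systems and report any mismatches.

    Only the primary's results are returned for production use, so a
    faulty secondary cannot disrupt day-to-day operations.
    """
    records = list(records)
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary_future = pool.submit(lambda: [primary(r) for r in records])
        secondary_future = pool.submit(lambda: [secondary(r) for r in records])
        primary_out = primary_future.result()
        try:
            secondary_out = secondary_future.result()
        except Exception as err:
            # The secondary crashed: report it, but keep the primary output.
            return primary_out, [f"secondary failed: {err}"]
    mismatches = [
        f"record {r}: primary={p}, secondary={s}"
        for r, p, s in zip(records, primary_out, secondary_out)
        if p != s
    ]
    return primary_out, mismatches


if __name__ == "__main__":
    print(parallel_run([1, 2, 3]))
```

An empty mismatch list over a representative workload is the kind of evidence that suggests the secondary setup can handle the processing; any mismatch or failure is reported without affecting the primary's results.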