Resilience and Recovery Testing

(OBJ 3.4)
Goal: Plan for the worst and learn how to overcome any obstacle

Resilience Testing
Assesses a system's ability to withstand and adapt to disruptive events
Both resilience testing and recovery testing serve as a "fire drill" for enterprise networks and operations
Ensures the system can recover from unforeseen incidents
Conducted through tabletop exercises, failover tests, simulations, and parallel processing
Helps prepare for events like power loss, natural disasters, ransomware attacks, and data breaches

Recovery Testing
Evaluates the system's capacity to restore normal operation after a disruptive event
Involves executing planned recovery actions
Performed through failover tests, simulations, and parallel processing
Ensures that planned recovery procedures work effectively in a real-world scenario

Tabletop Exercise
Scenario-based discussion among key stakeholders
- A simulated discussion to improve crisis readiness without deploying resources
Assesses and improves an organization's preparedness and response
No deployment of actual resources
Identifies gaps and seams in response plans
Promotes team-building among stakeholders
- Each stakeholder and their team works out how they would respond to a given inject (a scripted scenario event). This is a fairly low-cost option that still provides a highly engaging environment.

Failover Test
Controlled experiment for transitioning from primary to backup components
- Verifies that the system transitions seamlessly to a backup
Ensures uninterrupted functionality during disasters
Example:
- Plans to shift business operations to an alternate hot site after a large-scale disaster can be verified through failover tests
- The cutover itself, for example from the East Coast to the West Coast, can actually be attempted
Requires more resources, time, and energy, but verifies that the planned actions will work
Validates the effectiveness of disaster recovery plans
Can identify and rectify issues in the failover process
Example:
- Fly a small team out to the remote hot site to confirm that operations can continue smoothly there
- If any issues arise, a rollback plan is always in place to shift operations back to the main facility while troubleshooting
- Performed once or twice a year
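
The cutover-with-rollback flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Site` and `FailoverTest` classes, the `health_check` method, and the site names are all hypothetical, and a real failover test would involve DNS, replication, and staffing steps that are not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical stand-in for a processing site (main facility or hot site)."""
    name: str
    healthy: bool = True

    def health_check(self) -> bool:
        # Assumption: a single boolean represents "operations run smoothly here".
        return self.healthy


@dataclass
class FailoverTest:
    primary: Site
    backup: Site
    active: Site = field(init=False)
    log: list = field(default_factory=list)

    def __post_init__(self):
        # Operations start at the primary site.
        self.active = self.primary

    def run(self) -> bool:
        """Cut over to the backup; roll back to the primary if the backup fails its check."""
        self.log.append(f"cutover: {self.primary.name} -> {self.backup.name}")
        self.active = self.backup
        if self.backup.health_check():
            self.log.append("backup healthy: failover verified")
            return True
        # Rollback plan: shift operations back to the main facility while troubleshooting.
        self.log.append(f"backup unhealthy: rolling back to {self.primary.name}")
        self.active = self.primary
        return False
```

Calling `FailoverTest(Site("east"), Site("west")).run()` returns `True` and leaves the backup active; passing a backup with `healthy=False` triggers the rollback branch and leaves the primary active.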

Simulation
Computer-generated representation of a real-world scenario
Allows for hands-on response actions in a virtual environment
Example:
- Spin up a virtual (simulated) version of the corporate network in the cloud and have a red team attack it. The defenders, known as the blue team, try to detect the red team's attacks and use proper incident response techniques to isolate the attackers from the network.
Assesses incident responders and system administrators in real time
- Often involves penetration testing resources and staff
Helps evaluate reactions and staff performance
Provides feedback for learning and improvement to each side

Parallel Processing
Replicates data and system processes onto a secondary system
Runs primary and secondary systems concurrently
Tests reliability and stability of the secondary setup to make sure it can handle processing data without disruption
Ensures no disruption to day-to-day operations
Assesses the system's ability to handle multiple failure scenarios simultaneously
Requires meticulous planning, flawless execution, and an eagle eye for detail to ensure zero disruption
Uses of Parallel Processing
- Resilience Testing
  - Tests the ability of the system to handle multiple failure scenarios
- Recovery Testing
  - Tests the efficiency of the system to recover from multiple points of failure
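
To make the idea of running primary and secondary systems concurrently more concrete, here is a minimal Python sketch. It assumes that both systems can be modeled as plain functions over the same list of records; `primary_process`, `secondary_process`, and `parallel_run` are hypothetical names chosen for illustration, not part of any real tool.

```python
from concurrent.futures import ThreadPoolExecutor


def primary_process(record: int) -> int:
    # Hypothetical production logic.
    return record * 2


def secondary_process(record: int) -> int:
    # Hypothetical replica; it should produce the same results as the primary.
    return record + record


def parallel_run(records, primary, secondary):
    """Feed the same records to both systems concurrently and return any mismatches.

    `records` must be a list (or other re-iterable sequence), since both
    systems iterate over it.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary_future = pool.submit(lambda: [primary(r) for r in records])
        secondary_future = pool.submit(lambda: [secondary(r) for r in records])
        primary_out = primary_future.result()
        secondary_out = secondary_future.result()
    # Each mismatch is reported as (record, primary result, secondary result).
    return [
        (r, p, s)
        for r, p, s in zip(records, primary_out, secondary_out)
        if p != s
    ]
```

An empty result from `parallel_run(records, primary_process, secondary_process)` suggests the secondary setup handled the same data as the primary without diverging. Any tuples returned point at records to investigate before relying on the secondary system.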