Questions and Answers
What is one of the main reasons for the Skype outage mentioned?
What characteristic of the system recovery process is highlighted?
Which architecture emphasizes the assumption that everything is insecure?
In disaster recovery terminology, what is the purpose of 'Pilot Light'?
What does the term 'warm standby' refer to in the context of disaster recovery?
What does 'USE' stand for in the context of monitoring systems?
What was one of the actions taken by major video streaming companies during the COVID-19 pandemic?
What is the purpose of implementing a disaster recovery center as done by Infosys?
Which software is mentioned for event monitoring and alerting?
What is suggested by the architecture of Prometheus?
What is the primary focus of service continuity management?
What is defined as a serious disruption to the functioning of a service?
What was the estimated financial impact of the CrowdStrike outage in July 2024?
What was a significant cause of the Skype outage in December 2010?
Which version of Windows clients was unable to process delayed messages during the Skype outage?
How long did the Skype outage last on 22nd December 2010?
What percentage of Fortune 1000 companies was affected by the CrowdStrike outage?
What mechanism do super-nodes use to manage high traffic during overload situations?
Study Notes
Service Continuity and Disaster Recovery Architectural Strategies
- Service continuity is planning for incident prevention, prediction, and management.
- The goal is maintaining service availability and performance at high levels during and after a disaster-level incident.
- During = service continuity
- After = disaster recovery
- A disaster is any serious disruption to a service that exceeds its capacity to cope using its own resources.
- A real-world example of a disaster is the CrowdStrike/Microsoft Outage in July 2024.
- ~8.5 million systems crashed and failed to restart.
- Considered the largest outage in IT history.
- Estimated financial impact of at least $10 billion USD worldwide.
- ~60% of Fortune 500 and ~50% of Fortune 1000 companies were affected.
- Another example is the Skype network outage of December 2010.
- The Skype network experienced a critical failure.
- The outage lasted approximately 24 hours.
- More than 20 million users were kicked off Skype's network.
- Signed-in accounts decreased from 23.3 million to fewer than 1.6 million in a few hours.
- Main causes
- The cluster responsible for offline messaging became overloaded.
- The overloaded cluster sent delayed responses.
- Windows clients (version 5.0.0152) could not process the delayed messages and crashed.
- The crashed clients included ~30% of the available super-nodes.
- The remaining super-nodes received traffic 100 times greater than normal.
- Super-nodes have a built-in mechanism to avoid impacting their host systems; under this load, the mechanism caused more super-nodes to shut down.
Introduction to HA/DR
- HA/DR is a combination of activities:
- RPO/RTO Analysis and Planning
- Load management - Scale in/out
- Observability and Reporting
- Planned testing and disruption
- Note: HA/DR is an iterative process.
RTO & RPO
- RTO: Recovery Time Objective. How long can you wait for your system to come back up after it fails?
- RPO: Recovery Point Objective. How far back in time can you afford to lose data?
- Generally, the lower the RTO and RPO, the higher the cost to the business.
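As a rough illustration of how the two objectives are measured, the sketch below compares a single incident against an RTO and an RPO. The class name, function name, timestamps, and the 4-hour/1-hour targets are all hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable window of lost data


def evaluate_incident(objectives, last_backup, failure, restored):
    """Compare one incident's downtime and data-loss window with the objectives."""
    downtime = restored - failure      # compared with the RTO
    data_loss = failure - last_backup  # compared with the RPO
    return {
        "downtime": downtime,
        "data_loss_window": data_loss,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss <= objectives.rpo,
    }


# Hypothetical targets and timestamps, for illustration only.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
report = evaluate_incident(
    objectives,
    last_backup=datetime(2024, 7, 19, 3, 30),
    failure=datetime(2024, 7, 19, 4, 15),
    restored=datetime(2024, 7, 19, 9, 0),
)
print(report["rto_met"], report["rpo_met"])
```

In this made-up incident the data-loss window (45 minutes) is within the RPO, but the downtime (4 hours 45 minutes) exceeds the RTO.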
Best Practice: 101
- Avoid single points of failure. Implement redundancy wherever possible to prevent single failures from bringing down an entire system.
- Assume everything fails and design back from there. Zero-trust architecture applies a similar principle to security: assume everything is insecure.
Single Point of Failure
- This occurs when a single component failure can take down the entire system; the database is a common example of this.
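The effect of removing a single point of failure can be estimated with basic availability arithmetic. The sketch below assumes component failures are independent, which real systems only approximate; the availability figures and function names are hypothetical.

```python
from math import prod


def series_availability(availabilities):
    """All components are required, so availabilities multiply."""
    return prod(availabilities)


def redundant_availability(availability, replicas):
    """The group is up unless every replica is down at the same time."""
    return 1 - (1 - availability) ** replicas


# Hypothetical figures: a web tier at 99.9% and a single database at 99%.
single_db = series_availability([0.999, 0.99])

# Same system with the database duplicated (two independent replicas).
duplicated_db = series_availability([0.999, redundant_availability(0.99, 2)])

print(f"single database:     {single_db:.4%}")
print(f"duplicated database: {duplicated_db:.4%}")
```

With a single database, the whole system can be no more available than its weakest required component. Adding a second replica moves the database from the weakest link to the strongest one in this example.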
A Review of DR Strategies
- There is no one-size-fits-all DR strategy. Solutions need to be tailored based on specific requirements.
- Increasing RPO/RTO lowers cost.
- The choice between active/passive and active/active is a fundamental part of any DR strategy. Multi-site active/active DR carries high cost and complexity.
Active/Passive v Active/Active
- Different Recovery Strategies exist, differentiated by their RPO and RTO (Recovery Point Objective and Recovery Time Objective)
Backup & Restore
- A low-cost & relatively simple approach to DR.
- Suited to lower-priority use cases and less crucial functionality.
- Prioritizes backup and restore procedures.
DR: Pilot Light
- Low cost, medium complexity.
- A less comprehensive approach.
- Keeps a persistent, minimal copy of core components (such as data stores) in the recovery site; the remaining resources are switched on and scaled up only when DR is triggered.
- Warm standby extends pilot light by copying the whole production environment into a different region.
Warm Standby
- Advantages include cost savings compared with a full-scale duplicate, the ability to handle some production traffic, and readiness for full DR.
- On failure: fail over critical functions, adjust DNS records to point to the standby environment in AWS, and scale up the system.
DR: Multi-Site Active-Active
- Complex and expensive, but provides both high availability and disaster recovery.
- At any given time, this solution can handle full production loads.
- It resembles a standby environment, except that each site scales in and out with production load instead of running at low capacity.
- Ideal for disaster recovery.
- The solution supports immediate failover and global replication.
Summary
- A summary table showing the different recovery strategies, RPO/RTO, cost, and complexity.
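One way to read such a table is as a lookup: pick the cheapest strategy whose achievable RTO and RPO satisfy the requirement. The sketch below orders the strategies by cost as the notes describe them. The RTO/RPO figures are illustrative placeholders only, not values from the presentation, and the function name is hypothetical.

```python
from datetime import timedelta

# Ordered from lowest to highest cost, following the notes.
# (name, achievable RTO, achievable RPO): the durations are placeholders.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=1), timedelta(minutes=15)),
    ("Warm Standby", timedelta(minutes=15), timedelta(minutes=5)),
    ("Multi-Site Active-Active", timedelta(minutes=1), timedelta(seconds=5)),
]


def cheapest_strategy(required_rto, required_rpo):
    """Return the first (cheapest) strategy that meets both objectives."""
    for name, rto, rpo in STRATEGIES:
        if rto <= required_rto and rpo <= required_rpo:
            return name
    return None  # no listed strategy is fast enough


print(cheapest_strategy(timedelta(hours=2), timedelta(minutes=30)))
```

Relaxing the required RTO or RPO lets the search stop earlier in the list, which mirrors the point that increasing RPO/RTO lowers cost.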
Observability
- Investigations using USE methodology (Utilization, Saturation, and Errors).
- Resources are physical components such as CPU, disk, networking, and RAM, as well as software resources like threads, PIDs, or inode IDs.
- Utilization is the average proportion of time a resource is busy, measured on a scale of 0 to 1.
- Saturation is the amount of work that a resource cannot handle immediately and that typically has to queue.
- Errors are the count of error events for the resource.
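The two definitions above can be turned into simple calculations. This is a minimal sketch: treating saturation as the average queue length is one common choice rather than the only one, and the sample values and function names are hypothetical.

```python
def utilization(busy_seconds, interval_seconds):
    """Fraction of the interval (0 to 1) during which the resource was busy."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return busy_seconds / interval_seconds


def saturation(queue_length_samples):
    """Average amount of queued work, i.e. work the resource could not take immediately."""
    if not queue_length_samples:
        raise ValueError("at least one sample is required")
    return sum(queue_length_samples) / len(queue_length_samples)


# Hypothetical measurements for one CPU over a 60-second window.
cpu_utilization = utilization(busy_seconds=54, interval_seconds=60)
cpu_saturation = saturation([0, 2, 4, 2])  # run-queue length sampled four times

print(cpu_utilization, cpu_saturation)
```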
USE vs RED
- RED method focuses on request rate, error rate, and duration.
- USE method focuses on utilization, saturation, and errors.
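Where USE looks at resources, RED looks at requests. The sketch below derives the three RED figures from a list of request records. The record type, the field names, and the choice to report errors as a ratio of failed requests are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request
    ok: bool           # False if the request failed


def red_metrics(requests, window_s):
    """Compute request rate, error ratio, and mean duration over a time window."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": total / window_s,
        "error_ratio": failed / total if total else 0.0,
        "mean_duration_s": sum(r.duration_s for r in requests) / total if total else 0.0,
    }


# Hypothetical sample: four requests observed in a two-second window.
sample = [
    Request(0.5, True),
    Request(0.25, True),
    Request(0.25, False),
    Request(1.0, True),
]
metrics = red_metrics(sample, window_s=2)
print(metrics)
```

In practice, duration is usually tracked as a distribution (for example percentiles) rather than a mean; the mean is used here only to keep the sketch short.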
USE Methodology
- Detailed step-by-step investigation process using the USE method.
Observability: Data Collection
- Prometheus is free, open-source software for event monitoring and alerting. It is useful for distributed systems.
Prometheus Architecture
- Prometheus architecture details, including components like the Prometheus server, client libraries, exporters, time-series storage, and visualization (Grafana).
Prometheus Agent
- Setting up a Prometheus agent using code.
Sample Output
- A sample of the output that Prometheus collects.
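The presentation's own sample is not reproduced in these notes. As a stand-in, the sketch below renders one metric in the Prometheus text exposition format using only the Python standard library. The metric name, labels, and value are hypothetical, and the helper does not escape special characters in label values, which a real client library handles.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format."""
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            # Labels are written as key="value" pairs inside braces.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter with two label combinations.
output = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests.",
    [
        ({"method": "get", "code": "200"}, 1027),
        ({"method": "post", "code": "500"}, 3),
    ],
)
print(output)
```

A Prometheus server scrapes text like this over HTTP from each monitored target at a regular interval.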
Language Bindings
- Client libraries let applications written in languages other than Go or Java expose metrics to Prometheus.
Grafana Dashboard
- Example of Grafana dashboard for visualizing data.
Real-World Example: YouTube et al.
- Video streaming companies lowered video quality during the COVID-19 pandemic to manage internet traffic.
Real-World Example: Infosys
- Infosys invested in a DR center in Mauritius that is ready to accommodate 2,000+ employees quickly. The center maintains operational continuity if disaster strikes India, where a large percentage of Infosys employees live and work.
Additional Considerations
- These notes cover the main points of a presentation; they may be incomplete.