Untitled Quiz

Questions and Answers

What is one of the main reasons for the Skype outage mentioned?

Inadequate fault-recovery testing (correct)
Insufficient user feedback mechanisms
Incompatibility with mobile devices
Poor video compression algorithms

What characteristic of the system recovery process is highlighted?

It is completely linear in nature
It can be fully automated without human oversight
It is a simple one-time process
It is iterative and ongoing (correct)

Which architecture emphasizes the assumption that everything is insecure?

Monolithic architecture
Distributed architecture
Client-server architecture
Zero-trust architecture (correct)

In disaster recovery terminology, what is the purpose of 'Pilot Light'?

To keep data always on while services stay off until DR is triggered (B)

What does the term 'warm standby' refer to in the context of disaster recovery?

A fully functional scaled-down environment ready for activation (B)

What does 'USE' stand for in the context of monitoring systems?

Utilization Saturation Errors (A)

What was one of the actions taken by major video streaming companies during the COVID-19 pandemic?

Lowered video quality to accommodate internet traffic (A)

What is the purpose of implementing a disaster recovery center as done by Infosys?

To ensure service continuity in case of a disaster (A)

Which software is mentioned for event monitoring and alerting?

Prometheus (B)

What is suggested by the architecture of Prometheus?

It is a distributed system designed to monitor other distributed systems (C)

What is the primary focus of service continuity management?

Planning for incident prediction and management (C)

What is defined as a serious disruption to the functioning of a service?

A disaster (D)

What was the estimated financial impact of the CrowdStrike outage in July 2024?

$10 billion (A)

What was a significant cause of the Skype outage in December 2010?

Overload of the offline messaging cluster (C)

Which version of Windows clients was unable to process delayed messages during the Skype outage?

Version 5.0.0152 (A)

How long did the Skype outage last on 22nd December 2010?

24 hours (D)

What percentage of Fortune 1000 companies was affected by the CrowdStrike outage?

50% (C)

What mechanism do super-nodes use to manage high traffic during overload situations?

Built-in traffic regulation (C)

Flashcards

Skype Outage Cause

Skype's system couldn't handle failures effectively due to insufficient fault-recovery and load testing on super-nodes.

High Availability

A system design approach that ensures continuous operation by handling failures and high loads effectively, supported by practices such as fault-recovery testing.

Zero-Trust Architecture

Security approach assuming all network resources are untrusted, requiring strict authentication and authorization for all access.

Disaster Recovery

Strategies & procedures to restore IT services & systems after significant disruptions, like natural disasters or outages.

Warm Standby

A disaster recovery strategy maintaining a scaled-down, fully functional replica of the production environment in a different location.

Service Continuity

Planning for preventing, predicting, and managing incidents to maintain service availability and performance during and after a disaster.

Disaster (IT)

A significant disruption in service beyond the system's ability to handle using its own resources.

CrowdStrike/Microsoft Outage (example)

A major IT incident in July 2024 that caused widespread system crashes, with an estimated financial impact of at least $10 billion, affecting roughly 60% of Fortune 500 and 50% of Fortune 1000 companies.

Skype Outage (example)

A significant disruption in the Skype network causing a 24-hour outage, impacting millions of users in December 2010.

Prometheus architecture

A distributed system for monitoring other (potentially) distributed systems.

USE metric

A method used in monitoring and troubleshooting; USE stands for Utilization, Saturation, and Errors.

Cost and Complexity

A factor to consider when evaluating systems and applications.

Disaster Recovery Center (DR)

A backup facility to maintain business continuity in case of a major disruption.

Impact of COVID-19

The large-scale increase in internet traffic led some streaming companies to reduce video quality.

Study Notes

Service Continuity and Disaster Recovery Architectural Strategies

Service continuity is planning for incident prevention, prediction, and management.
The goal is maintaining service availability and performance at high levels during and after a disaster-level incident.
- During = service continuity
- After = disaster recovery
A disaster is any serious disruption to a service that exceeds its capacity to cope using its own resources.
A real-world example of a disaster is the CrowdStrike/Microsoft Outage in July 2024.
- ~8.5 million systems crashed and failed to restart.
- Considered the largest outage in IT history.
- Estimated financial impact of at least $10 billion USD worldwide.
- ~60% of Fortune 500 and ~50% of Fortune 1000 companies were affected.
Another example is the Skype network outage of December 2010.
- The Skype network experienced a critical failure.
- The outage lasted approximately 24 hours.
- More than 20 million users were kicked off Skype's network.
- Signed-in accounts decreased from 23.3 million to fewer than 1.6 million in a few hours.
- Main causes
  - The cluster responsible for offline messaging became overloaded.
  - The overloaded cluster sent delayed responses.
  - Windows clients (version 5.0.0152) could not process the delayed messages and crashed.
  - The crashed clients included ~30% of the available super-nodes.
  - The remaining super-nodes received traffic 100 times greater than normal.
  - Super-nodes had built-in mechanisms to avoid impacting their host systems; under this load, those mechanisms shut more super-nodes down.

Introduction to HA/DR

HA/DR is a combination of activities:
- RPO/RTO Analysis and Planning
- Load management - Scale in/out
- Observability and Reporting
- Planned testing and disruption
Note: HA/DR is an iterative process.

RTO & RPO

RTO: Recovery Time Objective. How long can you wait for your system to come back up if it fails?
RPO: Recovery Point Objective. To what point in time will you accept lost data?
Generally, the lower the RTO & RPO, the higher the cost to the business.
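As a minimal sketch of how RTO and RPO can be checked against a backup plan, the example below assumes a simple periodic-backup setup. The class `RecoveryObjectives`, the function `meets_objectives`, and all of the numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # longest acceptable downtime
    rpo: timedelta  # longest acceptable window of lost data


def meets_objectives(objectives, backup_interval, restore_time):
    """Return (rpo_ok, rto_ok) for a simple periodic-backup setup.

    Worst-case data loss equals the backup interval (a failure just before
    the next backup); worst-case downtime is taken to be the restore time.
    """
    rpo_ok = backup_interval <= objectives.rpo
    rto_ok = restore_time <= objectives.rto
    return rpo_ok, rto_ok


# Hypothetical targets: at most 4 hours down, at most 1 hour of lost data.
targets = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))

# Nightly backups that take 2 hours to restore: RTO is met, RPO is not.
print(meets_objectives(targets, timedelta(hours=24), timedelta(hours=2)))
```

Tightening either objective in this model means backing up more often or restoring faster, which is where the extra cost comes from.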

Best Practice: 101

Avoid single points of failure. Implement redundancy wherever possible to prevent single failures from bringing down an entire system.
Assume everything fails and design back from there. A similar principle is zero-trust architecture—assume everything is insecure.

Single Point of Failure

This occurs when a single component failure can take down the entire system; the database is a common example of this.
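The effect of removing a single point of failure can be illustrated with a rough availability calculation. This sketch assumes that replicas fail independently, which real systems only approximate; the function names and the availability figures are hypothetical.

```python
def parallel_availability(component_availability, replicas):
    """Availability of N redundant replicas, assuming independent failures.

    The group is down only if every replica is down at the same time.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1 - (1 - component_availability) ** replicas


def serial_availability(*availabilities):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical numbers: a web tier at 99.9% in front of a database at 99%.
single_db = serial_availability(0.999, parallel_availability(0.99, 1))
redundant_db = serial_availability(0.999, parallel_availability(0.99, 2))
print(f"single database:    {single_db:.4%}")
print(f"redundant database: {redundant_db:.4%}")
```

With one database the whole chain is limited by the weakest component; adding a second replica raises the database tier's availability, so the web tier becomes the limiting factor instead.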

A Review of DR Strategies

There is no one-size-fits-all DR strategy. Solutions need to be tailored based on specific requirements.
Accepting a higher RPO/RTO lowers cost.
Active/Passive vs. Active/Active is a fundamental choice in any DR strategy; multi-site active-active DR carries high cost and complexity.

Active/Passive v Active/Active

Different recovery strategies exist, differentiated by their RPO and RTO (Recovery Point Objective and Recovery Time Objective).

Backup & Restore

A low-cost & relatively simple approach to DR.
Suited to lower-priority use cases and less crucial functionality.
Prioritizes backup and restore procedures.

DR: Pilot Light

Low cost, medium complexity.
A less comprehensive approach.
Keeps a persistent copy of core data always on, while services stay switched off until DR is triggered.
Warm standby extends pilot light by keeping a scaled-down but fully functional copy of the whole production environment in a different region.

Warm Standby

Advantages include cost savings, the ability to handle production traffic, and readiness for full DR.
Failover involves failing over critical functions, adjusting DNS records to point to AWS, and scaling up the system.
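The failover steps for a warm standby (detect failure, redirect DNS, scale up) can be sketched as a small in-memory model. Everything here is hypothetical: the `WarmStandby` class, the instance counts, and the three-failure threshold are assumptions for illustration, and no real DNS or cloud API is called.

```python
from dataclasses import dataclass, field


@dataclass
class WarmStandby:
    """Hypothetical model of a primary site plus a scaled-down standby."""
    dns_target: str = "primary"
    standby_instances: int = 1       # scaled down but fully functional
    production_instances: int = 10   # capacity needed for full load
    failure_threshold: int = 3       # consecutive failed checks before failover
    _failures: int = field(default=0, init=False, repr=False)

    def record_health_check(self, primary_healthy: bool) -> None:
        """Feed one health-check result; fail over after repeated failures."""
        if primary_healthy:
            self._failures = 0
            return
        self._failures += 1
        if self._failures >= self.failure_threshold and self.dns_target == "primary":
            self.fail_over()

    def fail_over(self) -> None:
        # 1. Point DNS at the standby so traffic reaches it.
        self.dns_target = "standby"
        # 2. Scale the standby up to production capacity.
        self.standby_instances = self.production_instances


site = WarmStandby()
for healthy in [True, False, False, False]:
    site.record_health_check(healthy)
print(site.dns_target, site.standby_instances)
```

Requiring several consecutive failures before failing over is one common way to avoid reacting to a single missed health check; the right threshold depends on the RTO.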

DR: Multi-Site Active-Active

Complex and expensive, but provides both high availability and disaster recovery.
At any given time, this solution can handle full production loads.
It resembles a standby environment, but at full capacity, scaling in and out with production loads.
Ideal for disaster recovery.
The solution facilitates immediate failover and global replication.

Summary

A summary table showing the different recovery strategies, RPO/RTO, cost, and complexity.

Observability

Investigations can follow the USE methodology (Utilization, Saturation, and Errors), applied to each resource in turn.
Resources are physical components such as CPU, disk, networking, and RAM, as well as software resources like threads, PIDs, or inode IDs.
Utilization is measured on a scale of 0-1—the portion of time a resource is used, on average.
Saturation is the amount of work that a resource cannot handle immediately.
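The definitions above can be turned into a small calculation. This is a minimal sketch that assumes busy time, queue-length samples, and an error count have already been collected for one resource; the function names and the sample values are hypothetical.

```python
def utilization(busy_seconds, interval_seconds):
    """Fraction of the interval (0 to 1) during which the resource was busy."""
    return busy_seconds / interval_seconds


def saturation(queue_length_samples):
    """Average amount of queued work the resource could not serve immediately."""
    return sum(queue_length_samples) / len(queue_length_samples)


def use_report(busy_seconds, interval_seconds, queue_length_samples, error_count):
    """Collect the three USE values for one resource into a dictionary."""
    return {
        "utilization": utilization(busy_seconds, interval_seconds),
        "saturation": saturation(queue_length_samples),
        "errors": error_count,
    }


# Hypothetical one-minute sample for a single disk.
report = use_report(busy_seconds=45, interval_seconds=60,
                    queue_length_samples=[0, 2, 4, 2], error_count=1)
print(report)
```

Measuring saturation as an average queue length is one possible choice; other resources may express it as wait time or as the number of blocked threads.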

USE vs RED

RED method focuses on request rate, error rate, and duration.
USE method focuses on utilization, saturation, and errors.
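For comparison, the RED values can be computed from a list of observed requests. This sketch assumes each request records its duration and whether it failed; the `Request` class, the `red_metrics` function, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_seconds: float
    failed: bool


def red_metrics(requests, window_seconds):
    """Compute Rate, Errors, and Duration for requests seen in one window."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_rate": 0.0, "mean_duration": 0.0}
    failures = sum(1 for r in requests if r.failed)
    return {
        "rate": total / window_seconds,   # requests per second
        "error_rate": failures / total,   # fraction of failed requests
        "mean_duration": sum(r.duration_seconds for r in requests) / total,
    }


# Hypothetical sample: four requests observed over a 2-second window.
sample = [Request(0.10, False), Request(0.30, False),
          Request(0.20, True), Request(0.20, False)]
metrics = red_metrics(sample, window_seconds=2)
print(metrics)
```

RED describes a service from the point of view of its requests, while USE describes the resources underneath it, so the two are often used together.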

USE Methodology

Detailed step-by-step investigation process using the USE method.

Observability: Data Collection

Data is collected with Prometheus, a free software solution for event monitoring and alerting that is useful for distributed systems.

Prometheus Architecture

Prometheus architecture details, including components like the Prometheus server, client libraries, exporters, and storage, with Grafana used for visualization.

Prometheus Agent

Setting up a Prometheus agent using code.

Sample Output

A sample of Prometheus output.
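The original sample is not reproduced in these notes. As a stand-in, the sketch below renders a metric in Prometheus' text exposition format using only the standard library. The `render_metric` helper and the metric shown are hypothetical, and the helper does not escape special characters in label values, which a real client library would handle.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in Prometheus' text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: HTTP requests counted per status code.
print(render_metric(
    "http_requests_total",
    "Total HTTP requests handled.",
    "counter",
    [({"code": "200"}, 1027), ({"code": "500"}, 3)],
), end="")
```

Text in this shape is what a Prometheus server reads when it scrapes a target's metrics endpoint.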

Language Bindings

Utilizing libraries and code for Prometheus to support languages other than Go or Java.

Grafana Dashboard

Example of Grafana dashboard for visualizing data.

Real-World Example: YouTube et al.

Video streaming companies lowered video quality during the COVID-19 pandemic to manage internet traffic.

Real-World Example: Infosys

Infosys invested in a DR center in Mauritius, ready to handle 2,000+ employees quickly. It maintains operational continuity if disaster strikes in India, where a large percentage of Infosys employees live and work.

Additional Considerations

These notes cover the main points of a presentation; they may be incomplete.
