Untitled Quiz
18 Questions

Questions and Answers

What is one of the main reasons for the Skype outage mentioned?

  • Inadequate fault-recovery testing (correct)
  • Insufficient user feedback mechanisms
  • Incompatibility with mobile devices
  • Poor video compression algorithms

What characteristic of the system recovery process is highlighted?

  • It is completely linear in nature
  • It can be fully automated without human oversight
  • It is a simple one-time process
  • It is iterative and ongoing (correct)

Which architecture emphasizes the assumption that everything is insecure?

  • Monolithic architecture
  • Distributed architecture
  • Client-server architecture
  • Zero-trust architecture (correct)

In disaster recovery terminology, what is the purpose of 'Pilot Light'?

  • To keep data always on while services stay off until DR is triggered (correct)

What does the term 'warm standby' refer to in the context of disaster recovery?

  • A fully functional scaled-down environment ready for activation (correct)

What does 'USE' stand for in the context of monitoring systems?

  • Utilization, Saturation, Errors (correct)

What was one of the actions taken by major video streaming companies during the COVID-19 pandemic?

  • Lowered video quality to accommodate internet traffic (correct)

What is the purpose of implementing a disaster recovery center as done by Infosys?

  • To ensure service continuity in case of a disaster (correct)

Which software is mentioned for event monitoring and alerting?

  • Prometheus (correct)

What is suggested by the architecture of Prometheus?

  • It is a distributed system designed to monitor other distributed systems (correct)

What is the primary focus of service continuity management?

  • Planning for incident prediction and management (correct)

What is defined as a serious disruption to the functioning of a service?

  • A disaster (correct)

What was the estimated financial impact of the CrowdStrike outage in July 2024?

  • $10 billion (correct)

What was a significant cause of the Skype outage in December 2010?

  • Overload of the offline messaging cluster (correct)

Which version of Windows clients was unable to process delayed messages during the Skype outage?

  • Version 5.0.0152 (correct)

How long did the Skype outage last on 22nd December 2010?

  • Approximately 24 hours (correct)

What percentage of Fortune 1000 companies was affected by the CrowdStrike outage?

  • 50% (correct)

What mechanism do super-nodes use to manage high traffic during overload situations?

  • Built-in traffic regulation (correct)

    Study Notes

    Service Continuity and Disaster Recovery Architectural Strategies

    • Service continuity management is the planning for incident prevention, prediction, and management.
    • The goal is maintaining service availability and performance at high levels during and after a disaster-level incident.
      • During = service continuity
      • After = disaster recovery
    • A disaster is any serious disruption to a service that exceeds its capacity to cope using its own resources.
    • A real-world example of a disaster is the CrowdStrike/Microsoft outage in July 2024.
      • ~8.5 million systems crashed and failed to restart.
      • Widely described as the largest outage in IT history.
      • Estimated financial impact of at least $10 billion USD worldwide.
      • ~60% of Fortune 500 and ~50% of Fortune 1000 companies were affected.
    • Another example is the Skype network outage of December 2010.
      • The Skype network experienced a critical failure.
      • The outage lasted approximately 24 hours.
      • More than 20 million users were kicked off Skype's network.
      • Signed-in accounts decreased from 23.3 million to fewer than 1.6 million in a few hours.
      • Main causes (each step triggered the next)
        • The cluster responsible for offline messaging became overloaded.
        • The overloaded cluster sent delayed responses.
        • Windows clients (version 5.0.0152) could not process the delayed messages and crashed.
        • The crashed clients included ~30% of the available super-nodes.
        • The remaining super-nodes received traffic 100 times greater than normal.
        • Super-nodes have built-in mechanisms that limit their impact on the host system; under this load, those mechanisms caused further super-nodes to shut down.

    Introduction to HA/DR

    • HA/DR is a combination of activities:
      • RPO/RTO Analysis and Planning
      • Load management - Scale in/out
      • Observability and Reporting
      • Planned testing and disruption
    • Note: HA/DR is an iterative process.

    RTO & RPO

    • RTO: Recovery Time Objective. How long can you wait for your system to come back up if it fails?
    • RPO: Recovery Point Objective. To what point in time will you accept lost data?
    • Generally, the lower the RTO & RPO, the higher the cost to the business of achieving them.
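
As a rough illustration of how the two objectives can be checked, the sketch below assumes a service protected by periodic backups. The function name `meets_objectives` and every figure are hypothetical. Worst-case data loss is taken to be the backup interval, and recovery time is modelled as detection time plus restore time.

```python
from datetime import timedelta


def meets_objectives(backup_interval, detect_time, restore_time, rpo, rto):
    """Return (rpo_ok, rto_ok) for a service protected by periodic backups.

    Worst-case data loss equals the backup interval: a failure just before
    the next backup loses everything written since the previous one.
    Recovery time is modelled as detection time plus restore time.
    """
    worst_case_loss = backup_interval
    recovery_time = detect_time + restore_time
    return worst_case_loss <= rpo, recovery_time <= rto


# Hypothetical figures: nightly backups, a 4-hour RPO and a 2-hour RTO.
rpo_ok, rto_ok = meets_objectives(
    backup_interval=timedelta(hours=24),
    detect_time=timedelta(minutes=15),
    restore_time=timedelta(hours=1),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=2),
)
print(rpo_ok, rto_ok)  # False True
```

With these numbers the nightly backup misses the 4-hour RPO. Backing up at least every 4 hours would meet it, at a higher running cost.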

    Best Practice: 101

    • Avoid single points of failure. Implement redundancy wherever possible to prevent single failures from bringing down an entire system.
    • Assume everything fails and design back from there. Zero-trust architecture applies a similar principle: assume everything is insecure.

    Single Point of Failure

    • A single point of failure exists when the failure of one component can take down the entire system; a lone database is a common example.
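
The effect of redundancy on a single point of failure can be estimated with basic availability arithmetic. The sketch below is an illustration only: the function names and the 99% figures are hypothetical, and it assumes component failures are independent, which real deployments only approximate.

```python
def parallel_availability(single, copies):
    """Availability of `copies` redundant components, any one of which suffices.

    Assumes failures are independent of each other.
    """
    return 1 - (1 - single) ** copies


def serial_availability(*components):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


# Hypothetical figures: a 99% web tier in front of a single 99% database.
single_db = serial_availability(0.99, 0.99)
# The same web tier, but the database now has one redundant replica.
redundant_db = serial_availability(0.99, parallel_availability(0.99, 2))

print(f"{single_db:.4f}")     # 0.9801
print(f"{redundant_db:.4f}")  # 0.9899
```

Under these assumptions, adding one database replica raises overall availability from about 98.0% to about 99.0%, and the web tier becomes the limiting component.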

    A Review of DR Strategies

    • There is no one-size-fits-all DR strategy. Solutions need to be tailored based on specific requirements.
    • Increasing RPO/RTO lowers cost.
    • Active/Passive vs. Active/Active is a fundamental choice in any DR strategy; multi-site active-active carries the highest cost and complexity.

    Active/Passive v Active/Active

    • Recovery strategies are differentiated by the RPO and RTO (Recovery Point Objective and Recovery Time Objective) they can achieve.

    Backup & Restore

    • A low-cost & relatively simple approach to DR.
    • Suited to lower-priority use cases and less crucial functionality.
    • Relies on regular backups and tested restore procedures.

    DR: Pilot Light

    • Low cost, medium complexity.
    • A less comprehensive approach than warm standby.
    • Core data is kept live and up to date, while services stay switched off until DR is triggered.
    • Warm standby extends pilot light by keeping a scaled-down but fully functional copy of the whole production environment running in a different region.

    Warm Standby

    • Advantages include cost savings, the ability to handle some production traffic, and readiness for full DR.
    • Failover involves failing over critical functions, adjusting DNS records to point to the standby environment in AWS, and scaling the system up.
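
The failover steps above can be sketched in a few lines. This is a toy model, not a real cloud or DNS API: the `Environment` class, the `fail_over` function, the hostnames, and the instance counts are all hypothetical, and a plain dictionary stands in for the DNS service.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    name: str
    endpoint: str
    instances: int


def fail_over(dns, hostname, standby, production_capacity):
    """Point `hostname` at the standby, then scale it up to production size."""
    dns[hostname] = standby.endpoint  # adjust the DNS record
    # Scale up, never down: keep the standby's size if it is already larger.
    standby.instances = max(standby.instances, production_capacity)
    return standby


# Hypothetical names and sizes throughout.
primary = Environment("primary", "app.primary.example.com", instances=10)
standby = Environment("standby", "app.standby.example.com", instances=2)
dns = {"www.example.com": primary.endpoint}

fail_over(dns, "www.example.com", standby, production_capacity=primary.instances)
print(dns["www.example.com"], standby.instances)  # app.standby.example.com 10
```

Because the standby is already running at reduced size, failover only needs a DNS change and a scale-up rather than a full rebuild.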

    DR: Multi-Site Active-Active

    • Complex and expensive, but provides both high availability and disaster recovery.
    • At any given time, this solution can handle full production loads.
    • It resembles a warm standby environment, but runs at full capacity and scales in and out with production load.
    • Ideal where the lowest RPO and RTO are required.
    • The solution facilitates immediate failover and global replication.

    Summary

    • A summary table showing the different recovery strategies, RPO/RTO, cost, and complexity.

    Observability

    • Performance investigations can follow the USE method (Utilization, Saturation, and Errors), applied to each resource in turn.
    • Resources are physical components such as CPU, disk, networking, and RAM, as well as software resources like threads, PIDs, or inode IDs.
    • Utilization is measured on a scale of 0 to 1: the average proportion of time a resource is busy.
    • Saturation is the amount of work that a resource cannot handle immediately and that therefore has to queue.
    • Errors are the count of error events.
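
A minimal sketch of a USE check for one resource follows. The `ResourceSample` fields, the `use_report` function, the 0.8 utilization threshold, and the sample figures are all assumptions made for illustration; they do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    name: str
    busy_seconds: float    # time the resource spent doing work
    window_seconds: float  # length of the observation window
    queued_work: int       # work items waiting because the resource was busy
    error_count: int       # error events seen during the window


def use_report(sample, utilization_threshold=0.8):
    """Summarize one resource with the three USE metrics."""
    utilization = sample.busy_seconds / sample.window_seconds
    findings = []
    if sample.error_count > 0:
        findings.append("errors")
    if utilization >= utilization_threshold:  # threshold is an assumption
        findings.append("high utilization")
    if sample.queued_work > 0:
        findings.append("saturated")
    return {
        "resource": sample.name,
        "utilization": utilization,
        "saturation": sample.queued_work,
        "errors": sample.error_count,
        "findings": findings,
    }


# Hypothetical sample: a disk busy for 54 of 60 seconds with 7 queued requests.
disk = ResourceSample("disk", busy_seconds=54.0, window_seconds=60.0,
                      queued_work=7, error_count=0)
report = use_report(disk)
print(report["utilization"], report["findings"])
# 0.9 ['high utilization', 'saturated']
```

Running the same check over every resource in the system gives a quick list of likely bottlenecks to investigate further.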

    USE vs RED

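    • RED method focuses on request rate, error rate, and duration; it looks at a service from the perspective of its requests.
    • USE method focuses on utilization, saturation, and errors; it looks at the resources underneath the service.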
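
For comparison with USE, the RED metrics of a service can be derived from a log of its requests. The sketch below is illustrative: `red_metrics` and the sample data are hypothetical, and errors are reported here as a ratio of failed requests, which is one common variant.

```python
def red_metrics(requests, window_seconds):
    """Compute request rate, error ratio and mean duration.

    requests: list of (duration_seconds, succeeded) tuples observed
    during a window of `window_seconds`.
    """
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "mean_duration": 0.0}
    failures = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "rate": total / window_seconds,          # requests per second
        "error_ratio": failures / total,         # share of failed requests
        "mean_duration": sum(d for d, _ in requests) / total,
    }


# Hypothetical sample: four requests seen in a two-second window, one failed.
sample = [(0.10, True), (0.30, True), (0.20, False), (0.20, True)]
metrics = red_metrics(sample, window_seconds=2.0)
print(metrics["rate"], metrics["error_ratio"])  # 2.0 0.25
```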

    USE Methodology

    • Detailed step-by-step investigation process using the USE method.

    Observability: Data Collection

    • Prometheus is a free, open-source tool for event monitoring and alerting, well suited to distributed systems.

    Prometheus Architecture

    • Prometheus architecture details, including components like the Prometheus server and its storage, client libraries, and exporters, with Grafana used for visualization.

    Prometheus Agent

    • Setting up a Prometheus agent using code.

    Sample Output

    • A sample of Prometheus output.
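
The presentation's own sample is not reproduced in these notes. As a stand-in, the sketch below renders a metric in the Prometheus text exposition format using only the Python standard library. The `render_metric` helper and the metric shown are hypothetical, and a real application would normally use an official client library instead.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    Label values are not escaped here, so they must not contain
    quotes, backslashes or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_text = ",".join(
                f'{key}="{val}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: a request counter with two labels.
output = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "get", "code": "200"}, 1027)],
)
print(output, end="")
# # HELP http_requests_total Total HTTP requests.
# # TYPE http_requests_total counter
# http_requests_total{code="200",method="get"} 1027
```

Text in this shape is what a Prometheus server reads when it scrapes a target's metrics endpoint.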

    Language Bindings

    • Client libraries let applications written in languages other than Go or Java expose metrics to Prometheus.

    Grafana Dashboard

    • Example of Grafana dashboard for visualizing data.

    Real-World Example: YouTube et al.

    • Video streaming companies lowered video quality during the COVID-19 pandemic to manage internet traffic.

    Real-World Example: Infosys

    • Infosys invested in a DR center in Mauritius that is ready to take on 2,000+ employees quickly. It maintains operational continuity if disaster strikes India, where a large percentage of Infosys employees live and work.

    Additional Considerations

    • These notes cover the main points of a presentation; they may be incomplete.
