Lecture03 Service Continuity PDF

Summary

This lecture discusses service continuity and disaster recovery, with examples of outages such as the CrowdStrike/Microsoft outage and Skype outage. It explores architectural strategies for high availability and various deployment models.

Full Transcript


What is service continuity (or service, or continuity)? Any thoughts?

(IT) service continuity management focuses on planning for incident prevention, prediction, and management. The goal is to maintain service availability and performance at the highest possible levels when a disaster-level incident occurs:
- During the incident (i.e., service continuity).
- After the incident (i.e., disaster recovery).

What is a disaster? Any thoughts?

Very similar to a real-world one! It is any serious disruption to the functioning of a service that exceeds its capacity to cope using its own resources.

Example of Disaster: CrowdStrike/Microsoft Outage

In July 2024, CrowdStrike distributed a faulty update of its software for Microsoft Windows.
- ~8.5 million systems crashed and could not restart.
- It has been called the largest outage in IT history. Its financial impact is estimated at no less than US$10 billion worldwide.
- ~60% of Fortune 500 and ~50% of Fortune 1000 companies were affected!

https://www.ctol.digital/news/global-it-chaos-crowdstrike-development-cpp-complexity-enterprise-it-failures/

Example of Disaster: Skype Outage

- Skype is a peer-to-peer VoIP client developed by the creators of KaZaa in 2003.
- Purchased by Microsoft in May 2011.
- 22nd December 2010: the Skype network suffered a critical failure.
- The failure lasted approximately 24 hours.
- It affected more than 23 million online users.

Skype Outage: Network Topology

Skype Outage: Causes

1. The cluster responsible for offline messaging becomes overloaded.
2. The cluster sends delayed responses.
3. Windows clients (version 5.0.0152) cannot process the delayed responses and crash.
4. The crashed clients included ~30% of the publicly available super-nodes.
5. The remaining super-nodes receive about 100 times the normal traffic.
6. Super-nodes have a built-in mechanism to avoid having a huge impact on the host system.
This causes more super-nodes to shut down.

(Figure: >90% of affected users)

Skype Outage: Causes Recap

- The whole system was unable to properly handle such failures. Was there enough fault-recovery testing?
- The mechanism used to avoid the impact of load on host systems caused more failures. Was there enough load testing on the super-nodes?

(Recovery Point Objective, Recovery Time Objective)

IMPORTANT NOTE: Although it looks linear in nature, it is an iterative (and never-ending) process!

https://www.slideshare.net/AmazonWebServices/high-availability-websites-part-one/12

CTO @ Amazon
https://aws.amazon.com/message/41926/

Btw, have you heard about zero-trust architecture? It follows a similar principle (assume everything is insecure!).

Too many ways to prevent it. A fair summary is available here.
Also, make sure you know what an anti-pattern in CS/SE is!

https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/
https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html

Btw, do you remember my advice about using OneDrive to keep your module work safe? (Such a strategy falls in this category.)

In pilot light, data is always on, but services are not (until DR is triggered!). Warm standby extends pilot light by having a scaled-down (but fully functional!) copy of the whole production environment in another region.

Cost & Complexity

USE: Utilization, Saturation, Errors

USE vs RED
https://www.observeinc.com/resources/microservices-logging-and-troubleshooting-with-observability/

https://prometheus.io/ (free software for event monitoring and alerting)

Prometheus Architecture

A distributed system to monitor other (potentially!)
distributed systems.

    docker run -d \
      --net="host" \
      --pid="host" \
      -v "/:/host:ro,rslave" \
      quay.io/prometheus/node-exporter \
      --path.rootfs=/host

Where:
- --net="host" and --pid="host": use the host's network and PID namespaces inside the container.
- -v "/:/host:ro,rslave": mounts the host's root filesystem read-only at /host.
- --path.rootfs=/host: tells node-exporter where the host's root filesystem is mounted.

An example of a Grafana dashboard is available at: https://grafana.com/grafana/dashboards/11074

Real World Example: YouTube et al.

During the COVID-19 pandemic, the main video streaming companies lowered their video quality:
- To support heavy internet traffic amid quarantine.
- To avoid breaking the internet (as 80%+ of the total internet traffic is video nowadays!).
- Part capacity planning, part governments' requests.

Real World Example: Infosys

Infosys is the second-largest Indian IT company. Its main customer base is in the USA and Europe. It has invested ~US$25 million to set up a DR centre in Mauritius:
- Ready to operate at full capacity (~2,000 people) in days.
- To support its customers in case of a disaster in India (as ~80% of its employees work there).
- To prevent service continuity disruption!

Afaik, it has not been used yet.

That is all, folks!
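
Supplementary sketch: the USE method named in the monitoring slides (Utilization, Saturation, Errors) can be illustrated with a few lines of Python. This is only a rough illustration, not how Prometheus or node-exporter computes anything. The `ResourceSample` fields, the `use_report` function, and the 0.8 utilization threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation window for a resource (all field names are hypothetical)."""
    name: str
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the observation window
    queue_length: int        # work waiting because the resource was busy
    error_count: int         # errors reported during the window


def use_report(sample: ResourceSample, util_threshold: float = 0.8) -> dict:
    """Summarise a sample with the three USE signals.

    The 0.8 threshold is an assumed default, not a value from the lecture.
    """
    utilization = sample.busy_seconds / sample.interval_seconds
    saturated = sample.queue_length > 0   # any queued work counts as saturation
    has_errors = sample.error_count > 0
    return {
        "resource": sample.name,
        "utilization": round(utilization, 2),
        "saturated": saturated,
        "errors": sample.error_count,
        "needs_attention": utilization >= util_threshold or saturated or has_errors,
    }


if __name__ == "__main__":
    disk = ResourceSample("disk", busy_seconds=54.0, interval_seconds=60.0,
                          queue_length=3, error_count=0)
    print(use_report(disk))
```

In practice the three inputs would come from collected metrics (for example, those exposed by node-exporter) rather than hand-written samples. The point is only that each resource is checked against all three signals, not utilization alone.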
