Resilience and System Reliability Quiz

Questions and Answers

What is a common way to achieve resilience in a system under high load?

Processing requests more cheaply (correct)
Increasing the average cost of requests
Shutting down the system temporarily
Adding more components to the system

How can system design address component failures effectively?

Removing all redundancy in the system
Increasing the impact of failures by rerouting requests
Processing more requests to overload the system
Incorporating redundancy and distinct failure domains (correct)

What concept can help limit the impact of system failures by rerouting requests?

Defense in depth
Increasing the processing cost for each request
Adding more load to the system
Distinct failure domains (correct)

In terms of resilience, what is the function of 'defense in depth'?

Applying multiple defense mechanisms (correct)

How can a sufficiently complex system be made more immune to compromise?

By using defense in depth and distinct failure domains (correct)

What can happen when complexity accumulates inadvertently in a system?

A system's reliability and security can be compromised (correct)

How did the bug in the Debian GNU/Linux version of the OpenSSL library affect the pseudo-random number generator?

It seeded the generator only with a process ID (correct)

What led to YouTube's global downtime in October 2018?

A small change in a generic logging library (correct)

In the context of software development, what lesson can be learned from the OpenSSL bug?

Even small changes should be rigorously tested for potential consequences (correct)

What critical aspect should designers and developers consider to prevent major failures in a system?

Understanding the impact of even minor changes (correct)

What caused the cascade of failures in the internal service at Google on September 27, 2012?

The sudden change in the WiFi password for the guest system (correct)

Why did the primary replica of the password manager become unresponsive during the incident?

It was not designed to handle traffic from a large audience (correct)

What caused the secondary replica of the password manager to fail similarly to the primary replica during the incident?

Overload from diverted traffic due to primary replica failure (correct)

Why was the on-call engineer unable to restart the password manager during the incident?

The engineer did not have access to the necessary tools (correct)

What critical element was missing in the design of Google's password manager that led to its failure under high load?

Scalability considerations for a larger audience (correct)

What was the initial reason that the engineer in New York City could not retrieve a smart card?

The combination to the safe was stored in a password manager. (correct)

Why did the engineer in Australia face difficulty opening the safe?

The combination to the safe was stored in the password manager, which was offline because of the outage. (correct)

What error message did the engineers encounter even after successfully inserting the newly retrieved cards?

"The password could not load any of the cards protecting this key." (correct)

Why did the engineers in Australia resort to a brute-force approach to open the safe?

They lost the combination to the safe. (correct)

What delayed the engineers from realizing that the smart card was not inserted correctly?

The green light on the reader indicated successful insertion. (correct)

Which of the following is one of the best ways to improve the assessment of a system's reliability and security?

Minimizing the system's overall design complexity (correct)

What is the primary benefit of a simpler system design, as discussed in the text?

It reduces the potential for unanticipated system interactions and makes the system easier for humans to comprehend and reason about. (correct)

What is the value of understandability, especially during emergencies?

It can help responders mitigate symptoms quickly and reduce mean time to repair (MTTR). (correct)

What is the primary reason why systems rarely remain unchanged over time, according to the text?

All of the above: new features, changes in scale, and evolution of the underlying infrastructure all contribute to the fact that systems rarely remain unchanged over time. (correct)

Which chapter in the SRE book provides more information on error budgets?

Chapter 3 (correct)

Which of the following is NOT a primary reliability risk?

An adversary actively trying to exploit vulnerabilities (correct)

When designing for reliability, what assumption should be made?

Things will go wrong at some point (correct)

Which of the following is a design consideration for security, but NOT for reliability?

Assuming an adversary could try to make things go wrong at any point (correct)

In the absence of an adversary, how are systems often designed to respond to failures?

Fail safe (or open) (correct)

What can happen when fail safe/open behavior is implemented without considering security?

Security vulnerabilities can arise (correct)

What is the primary goal of a good system design in terms of security?

To limit an adversary's ability to exploit a compromised host or stolen credentials (correct)

How can distinct failure domains be implemented to enhance security?

By compartmentalizing permissions and restricting the scope of credentials (correct)

What is an example of a mechanism mentioned in the text for implementing distinct failure domains?

Google's internal infrastructure supporting credentials explicitly scoped to a geographic region (correct)

Why is it often recommended to encrypt data at the application layer, even if device-level encryption is implemented?

To protect against flawed implementations of encryption algorithms in drive controllers (correct)

Which principle can help mitigate threats from malicious insiders?

The principle of least privilege (correct)

How do the threats from external attackers and malicious insiders often compare in practice?

They often don't differ much in practice (correct)

What is the purpose of implementing defense in depth strategies, such as multiple layers of encryption?

To provide multiple layers of protection against different types of threats (correct)

Which of the following is NOT a recommended practice for enhancing system security, according to the text?

Restricting access to the system only to trusted insiders (correct)

What is the primary goal of implementing the principle of least privilege in system design?

To mitigate threats from both external attackers and malicious insiders (correct)

Which of the following statements is NOT true, based on the information provided in the text?

Encrypting data at the application layer is unnecessary if device-level encryption is implemented (correct)

Study Notes

System Failure and Resilience

A global service outage occurred due to a memory utilization problem, which highlighted the importance of designing systems to be resilient under adverse circumstances.
Systems can be made resilient by shedding some of the incoming load or reducing the processing cost for each request.
Redundancy and distinct failure domains can help limit the impact of failures by rerouting requests.
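
The two mitigations above, shedding load and lowering the per-request cost, can be sketched as a small admission controller. This is only an illustration under stated assumptions: the LoadShedder class, its capacity of 10, and its degrade threshold of 7 are hypothetical and do not come from the lesson.

```python
from dataclasses import dataclass


@dataclass
class LoadShedder:
    """Hypothetical admission controller; all numbers are illustrative."""
    capacity: int = 10           # max requests handled concurrently
    degrade_threshold: int = 7   # above this, serve a cheaper response
    in_flight: int = 0

    def admit(self) -> str:
        # At capacity: reject the request outright (load shedding).
        if self.in_flight >= self.capacity:
            return "shed"
        # Near capacity: accept, but serve a cheaper (degraded) response.
        degraded = self.in_flight >= self.degrade_threshold
        self.in_flight += 1
        return "degraded" if degraded else "full"

    def finish(self) -> None:
        # Call when a request completes to free its slot.
        if self.in_flight > 0:
            self.in_flight -= 1


def simulate_burst(n_requests: int) -> list:
    """Admit a burst of requests with none completing in between."""
    shedder = LoadShedder()
    return [shedder.admit() for _ in range(n_requests)]
```

With these assumed thresholds, a burst of 12 simultaneous requests gets 7 full responses, 3 degraded ones, and 2 rejections, so the server stays up instead of collapsing under the extra load.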

Complexity and Reliability

As systems become more complex, it becomes difficult to demonstrate their reliability and security.
Defense in depth and distinct failure domains can help address this problem by limiting the "blast radius" of a failure.
Complexity can lead to tipping-point situations where a small change has major consequences for a system's reliability or security.
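
One way to picture a limited "blast radius" is a credential that is only valid inside one failure domain, in the spirit of the region-scoped credentials mentioned in the quiz. The sketch below is hypothetical: the Credential type, the region names, and the resource map are invented for illustration and are not Google's actual mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    """Hypothetical credential that is valid in exactly one region."""
    holder: str
    region: str


def authorize(cred: Credential, resource_region: str) -> bool:
    # A credential only works inside the failure domain it is scoped to.
    return cred.region == resource_region


def blast_radius(cred: Credential, resources: dict) -> list:
    """List the resources a stolen credential could reach."""
    return [name for name, region in resources.items()
            if authorize(cred, region)]


# Illustrative inventory: resource name -> region (made-up values).
RESOURCES = {"db-eu": "eu", "db-us": "us", "cache-us": "us"}
```

Here a stolen "us" credential reaches only db-us and cache-us; the "eu" resources sit in a separate failure domain and stay out of reach.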

Case Study: OpenSSL Bug

A bug in the Debian GNU/Linux version of the OpenSSL library, introduced in 2006 and discovered in 2008, made it possible to break cryptographic keys by brute force.
The bug was caused by removing two lines of code to eliminate warnings about memory used prior to initialization.
As a result, OpenSSL's pseudo-random number generator was seeded only with a process ID, which left so few possible seeds that an attacker could try them all.
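
To see why a process ID is a dangerously small seed, the sketch below derives a "key" from a generator seeded only with a PID and then recovers it by trying every PID. It is a toy model, not the Debian code: Python's random.Random stands in for OpenSSL's generator, and the PID limit of 32768 is an assumed value (a common Linux default).

```python
import random

MAX_PID = 32768  # assumed PID range; a common Linux default


def weak_keygen(pid: int) -> int:
    """Derive a 128-bit 'key' from a PRNG seeded only with the PID.

    random.Random is a stand-in here, NOT OpenSSL's generator.
    """
    rng = random.Random(pid)
    return rng.getrandbits(128)


def brute_force(target_key: int):
    """Try every possible PID until the generated key matches."""
    for pid in range(1, MAX_PID + 1):
        if weak_keygen(pid) == target_key:
            return pid
    return None  # key was not produced by this weak scheme


victim_key = weak_keygen(4242)       # victim's process happened to be 4242
recovered_pid = brute_force(victim_key)
```

The key is 128 bits long, but only about 2^15 distinct keys can ever be produced, so the search finishes almost instantly. The length of a key says nothing about its strength if the seed space is small.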

Case Study: YouTube Outage

A small change in a generic logging library caused YouTube to go down globally for more than an hour in 2018.
The change was intended to improve the granularity of event logging but was not fully tested at YouTube scale.
The change caused YouTube servers to run out of memory and crash under production load.

Case Study: Password Manager Failure

A Google-wide announcement about a change in the WiFi password caused a spike in traffic to the password manager, leading to a cascading failure.
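When the primary replica became unresponsive, the load balancer diverted traffic to the secondary replica, which failed in the same way and took the service down.
Restarting the service required a smart card, and engineers in Australia resorted to a power drill to open the safe that held one. Even then, the restart failed until the team realized that the green light on the card reader had misled them into believing the card was inserted correctly.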
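
The failover step in this incident can be modeled in a few lines: redundancy protects against one replica breaking, but not against a load that exceeds every replica's capacity. The function and all numbers below are hypothetical, chosen only to illustrate the pattern.

```python
def failover(load: int, replicas: list):
    """Send all traffic to the first replica that can carry it.

    `replicas` is an ordered list of (name, capacity) pairs.
    Returns (serving_replica_or_None, replicas_that_failed).
    """
    failed = []
    for name, capacity in replicas:
        if load <= capacity:
            return name, failed
        # This replica is overloaded and fails; traffic moves to the next.
        failed.append(name)
    return None, failed  # every replica failed: full outage


# Illustrative capacities, sized for a small audience.
REPLICAS = [("primary", 100), ("secondary", 100)]

normal = failover(50, REPLICAS)    # everyday traffic
spike = failover(500, REPLICAS)    # company-wide announcement
```

Under everyday traffic the primary serves everything. Under the spike, the primary fails, the diverted traffic overloads the identically sized secondary, and no replica is left to serve requests.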

Simplicity and Trustworthiness

Simplicity in system design is crucial for building reliable and secure systems.
A simpler design reduces the attack surface, decreases the potential for unanticipated system interactions, and makes it easier for humans to comprehend and reason about the system.

Evolution and Complexity

Systems rarely remain unchanged over time, and new features, changes in scale, and evolution of infrastructure can introduce complexity.
Pressure to meet market demands can lead to technical debt and increased complexity.
It is essential to keep up with evolving attacks and new adversaries to maintain system security.

Reliability vs. Security

Reliability and security require different design considerations and have different risks.
Reliability risks are non-malicious, while security risks come from adversaries trying to exploit system vulnerabilities.
Systems must be designed to respond to failures and security threats differently, with security considerations taking into account an adversary's ability to exploit a compromised system.
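
The difference between a reliability-first and a security-first response to failure shows up clearly in an access check whose backend is down. The sketch below is a minimal illustration with invented names (BackendUnavailable, check_access, broken_lookup): failing open keeps users working during an outage, while failing closed denies access so that an adversary cannot gain entry by causing the outage.

```python
class BackendUnavailable(Exception):
    """Raised when the (hypothetical) policy backend cannot be reached."""


def check_access(user: str, lookup, fail_open: bool) -> bool:
    """Return whether `user` is allowed.

    `lookup` is any callable that returns True/False or raises
    BackendUnavailable. `fail_open` chooses the behavior on failure.
    """
    try:
        return lookup(user)
    except BackendUnavailable:
        # fail_open=True favors availability; False favors security.
        return fail_open


def broken_lookup(user: str) -> bool:
    # Simulates an outage of the policy backend.
    raise BackendUnavailable("policy backend is down")


def healthy_lookup(user: str) -> bool:
    return user == "alice"


open_result = check_access("mallory", broken_lookup, fail_open=True)
closed_result = check_access("mallory", broken_lookup, fail_open=False)
```

With the backend down, the fail-open configuration admits "mallory" while the fail-closed one rejects the request. Which default is right depends on whether an attacker could trigger the failure on purpose.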
