Questions and Answers
What is the purpose of dependency scanning in software development?
Which tool is NOT mentioned as a dependency scanning tool?
What is the function of a vulnerability database in the delivery pipeline?
What does Continuous Delivery enable in software development?
Which of the following tools is used for release orchestration?
What is the main goal of fuzzing in automated testing?
Review apps are designed to facilitate which of the following?
Which type of scanning ensures that a container image does not contain known vulnerabilities?
What is the primary goal of Site Reliability Engineering (SRE)?
How do Site Reliability Engineers (SREs) allocate their working time?
Which statement about SRE is true?
What was a primary reason for the creation of Site Reliability Engineering at Google?
Which of the following is NOT one of the key pillars of success for DevOps as defined at Google?
What percentage of their time do SREs spend on monitoring, alerting, and automation?
Which of these activities is primarily associated with the 'ops' related work of an SRE?
What does incremental rollout involve?
Who popularized the Site Reliability Engineering discipline?
What is the primary purpose of canary deployments?
What are feature flags primarily used for?
What is the focus of release governance?
What is meant by secrets management?
Which of the following best describes Auto DevOps?
What is a common characteristic of blue/green deployments?
What is the primary purpose of tracing in application management?
What is the significance of LaunchDarkly in feature flags?
What does synthetic monitoring involve?
Which tool actively blocks threats in the production environment?
What is the purpose of a status page in service management?
In incident management, which aspects are captured to ensure service level objectives are met?
What does User and Entity Behavior Analytics (UEBA) primarily focus on?
Which type of monitoring informs the deployment environment's health?
Which of the following is a feature of a Web Application Firewall (WAF)?
What is the primary focus of vulnerability management?
What does DLP stand for in the context of security?
Which metric is used to measure the Mean Time to Recover for components?
What is chaos engineering primarily used for?
How can organizations benefit from anti-fragility?
What does SLO stand for in service metrics?
Which metric measures the Mean Time to Detect failures or incidents?
What is the Recovery Point Objective (RPO) associated with?
What does MTRS represent in terms of service metrics?
What purpose does simulating component failure serve in a resilient system?
Which tool is commonly used for managing the lifecycle of services?
What functionality do tools like Jira and Trello provide in value stream management?
Which of the following is not a capability of issue tracking tools?
Which tools are specifically noted for providing visualization in the DevOps lifecycle?
In Agile Portfolio Management, what is the primary focus?
Which of these is a tool that can be used for peer code reviews?
Which aspect of management is associated with handling test case planning and defect tracking?
What is a primary feature of Kanban boards in relation to issue tracking?
Study Notes
Site Reliability Engineering Foundation Course
- Course goals include learning about SRE, its core vocabulary, principles, practices, and automation.
- The course also aims to explore real-life scenarios and have fun while doing so.
- Passing the SRE Foundation Exam is also a goal; the exam has 40 multiple-choice questions, a 60-minute time limit, and a 65% passing score.
- The exam is accredited by the DevOps Institute.
- A digital badge is awarded upon successful completion.
Course Content
- Module 1: SRE principles and practices
- Module 2: Service Level Objectives & Error Budgets
- Module 3: Reducing Toil
- Module 4: Monitoring & Service Level Indicators
- Module 5: SRE Tools & Automation
- Module 6: Anti-fragility & learning from failure
- Module 7: Organizational Impact of SRE
- Module 8: SRE, other frameworks, the future
Module 1: SRE Principles & Practices
- What is site reliability engineering?
- SRE & DevOps: What is the difference?
- SRE principles & practices
- Site Reliability Engineering (SRE) is a discipline that incorporates software engineering aspects and applies them to infrastructure and operations problems.
- It originated at Google around 2003 and was publicized via the SRE books.
- SREs spend 50% of their time on operations-related tasks (e.g., issue resolution, on-call, manual interventions).
- The other 50% of their time is dedicated to development tasks (e.g., new features, scaling, automation).
- DevOps (at Google) defines 5 key pillars of success:
  - Reduce organizational silos.
  - Accept failure as normal.
  - Implement gradual changes.
  - Leverage tooling and automation.
  - Measure everything.
- SRE is a specific implementation of DevOps with some extensions.
- DevOps is a set of practices, guidelines, and culture designed to break down silos in IT development, operations, architecture, networking, and security.
- SRE is a set of practices found to work and some beliefs animating those practices, as well as a job role.
- Operations is a software problem.
- SRE utilizes software engineering approaches to solve operational problems.
- Estimates suggest anywhere from 40% to 90% of total ownership costs are incurred after launch.
- A Service Level Objective (SLO) is an availability target for a product or service (it's not 100%).
- SRE services are managed to the SLO.
- SLOs need consequences if they are violated.
- Any manual, mandated operational task is considered bad.
- If a task can be automated, it should be automated.
- Tasks can provide wisdom from production to inform better system design and behavior.
- SRE teams have the ability to regulate their workload.
- Automate what is currently done manually.
- Decide what to automate and how to automate it.
- Take an engineering-based approach to problems rather than toiling at them repeatedly.
- Prioritize what to automate, and avoid automating bad processes.
- Late problem (defect) discovery is expensive, so SREs look for ways to avoid it.
- Look to improve MTTR (mean time to repair).
- Smaller changes can address this.
- Canary deployments are also related to this.
- SREs share skill sets with product development teams.
- Boundaries between application development and production (Dev & Ops) should be removed.
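The note that an SLO is an availability target below 100% can be made concrete by converting a target into the downtime it permits. The following is a minimal sketch; the function name and the 30-day window are assumptions chosen for illustration and do not come from the course material.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime an availability SLO permits in the window.

    Hypothetical helper: the 30-day default window is an assumption.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    total_minutes = window_days * 24 * 60
    # The permitted downtime is the share of the window NOT covered by the SLO.
    return total_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is one reason a 100% target is not considered practical.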
Module 2: SLOs & Error Budgets
- Example SLOs and error budgets
- SLIs for measurement
- SLO adoption
- Error Budgets - Good and Bad
- Error Budgets - Fixed?
- Consequences of missed SLOs
- The VALET Dimensions of SLO
- The importance of SLOs in error budgets and policies
- The service level objective is a goal for how well a product or service functions.
- SLOs are strongly related to the user experience.
- Setting and measuring service-level objectives is important for SRE roles.
- Availability is the most widely tracked SLO.
- Products and services often have multiple SLOs.
- SLOs aim to improve the user experience.
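The notes list error budgets as a topic without giving a formula. A common way to express one is as the share of events allowed to fail, that is, 1 minus the SLO. The sketch below assumes that definition; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Event-based error budget (assumed definition: budget = 1 - SLO)."""

    slo: float          # target fraction of good events, e.g. 0.999
    total_events: int   # events observed in the measurement window
    bad_events: int     # events that failed the SLO criterion

    @property
    def allowed_bad_events(self) -> float:
        # Number of failures the SLO tolerates for this volume of events.
        return self.total_events * (1 - self.slo)

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0 or below means it is spent.
        if self.allowed_bad_events == 0:
            return 0.0
        return 1 - self.bad_events / self.allowed_bad_events

    @property
    def exhausted(self) -> bool:
        return self.bad_events > self.allowed_bad_events


if __name__ == "__main__":
    budget = ErrorBudget(slo=0.999, total_events=1_000_000, bad_events=250)
    print(f"Allowed failures: {budget.allowed_bad_events:.0f}")
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
    print(f"Exhausted: {budget.exhausted}")
```

An `exhausted` budget is the natural trigger for whatever consequence the team has agreed on for a missed SLO.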
Module 3: Reducing Toil
- What is toil?
- Why toil is bad
- Doing something about toil
- Work is toil if it is manual, repetitive, automatable, and tactical, lacks enduring value, and scales linearly as the service grows.
- Doing the same test over and over, acknowledging the same alert every morning, dealing with interrupts, physical meetings to approve production deployments, manual starts/resets of equipment and components, and creating users are also forms of toil.
- Known workarounds, on-call responses, and manually scaling infrastructure are also forms of toil.
- Manually extracting data is also a form of toil.
- Toil has a specific definition; it isn't just "stuff I don't like doing."
- Toil reduction requires engineering time.
- Creating external automation, internal automation, or enhancing services to avoid intervention are all choices for reducing toil.
- Google has an advertised goal of keeping operational work (toil) below 50% of an engineer's time.
- At least 50% of each SRE's time should be spent on engineering.
- The 50% rule ensures that no single team or person handles only operational tasks.
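The 50% rule above can be checked against a simple time log. This is a minimal sketch: the activity names, the hours, and the choice of which activities count as toil are all made-up assumptions for illustration.

```python
def toil_fraction(hours_by_activity: dict[str, float], toil_activities: set[str]) -> float:
    """Return the share of logged hours spent on activities classified as toil."""
    total = sum(hours_by_activity.values())
    if total == 0:
        return 0.0
    toil = sum(hours for name, hours in hours_by_activity.items() if name in toil_activities)
    return toil / total


# Hypothetical one-week log for a single engineer (hours).
week = {
    "on-call response": 8,
    "manual scaling": 6,
    "acknowledging alerts": 4,
    "automation work": 14,
    "feature development": 8,
}
# Which activities count as toil is a judgment call; this set is an assumption.
TOIL = {"on-call response", "manual scaling", "acknowledging alerts"}

fraction = toil_fraction(week, TOIL)
within_target = fraction <= 0.5  # at least 50% of time left for engineering

if __name__ == "__main__":
    print(f"Toil share: {fraction:.0%}, within 50% rule: {within_target}")
```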
Module 4: Monitoring & SLIs
- SLIs - Service Level Indicators
- Monitoring
- Observability
- SLIs are ways for engineers to communicate quantitative data about systems.
- Many measurements can function as an SLI, generally expressed as a ratio of good events to total events.
- Service-level indicators may also need client-side data collection.
- SLI measurement needs to be time-bound in some way.
- Monitoring tools frequently used include Catchpoint, Nagios, Prometheus, Splunk, Grafana, and collectd.
- Monitoring is the use of hardware or software components to observe and measure system resources and their performance.
- Telemetry is the automated process of collecting measurements and transmitting them to a receiving system.
- Application Performance Management (APM) monitors and manages application performance and availability.
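The idea of an SLI as a time-bound ratio of good events to total events can be sketched as follows. The `Request` type, the one-hour window, and the rule that any status below 500 counts as "good" are assumptions for illustration, not definitions from the course.

```python
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple, Optional


class Request(NamedTuple):
    timestamp: datetime
    status: int  # HTTP-style status code


def availability_sli(
    requests: Iterable[Request], window_end: datetime, window: timedelta
) -> Optional[float]:
    """Ratio of good requests to all requests inside the time window.

    Returns None when the window contains no requests, since the ratio
    is undefined in that case.
    """
    window_start = window_end - window
    in_window = [r for r in requests if window_start < r.timestamp <= window_end]
    if not in_window:
        return None
    good = sum(1 for r in in_window if r.status < 500)  # assumed "good" criterion
    return good / len(in_window)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    sample = [
        Request(now - timedelta(minutes=50), 200),
        Request(now - timedelta(minutes=30), 200),
        Request(now - timedelta(minutes=10), 503),
        Request(now - timedelta(hours=2), 500),  # outside the one-hour window
    ]
    print(availability_sli(sample, now, timedelta(hours=1)))
```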
Module 5: SRE Tools & Automation
- Automation Defined
- Automation Focus
- Hierarchy of Automation Types
- Secure Automation
- Automation Tools
- Manage (1): Audit Management, Authentication & Authorization
- Manage (2): DevOps Score, Value Stream Management
- Plan (1): Issue Tracking, Kanban Boards, Time Tracking, Agile Portfolio Management
- Plan (2): Service Desk, Requirements Management, Quality Management
- Create (1): Source Code Management, Code Review, Wiki
- Create (2): Web IDE, Snippets
- Verify (1): Continuous Integration, Code Quality
- Verify (2): Performance Testing, Usability Testing
- Package (1): Package Registry, Container Registry, Dependency Proxy
- Package (2): Helm Chart Registry, Dependency Firewall
- Secure (1): SAST, DAST, IAST, Secret Detection
- Secure (2): Dependency Scanning, Container Scanning, License Compliance
- Secure (3): Vulnerability Database, Fuzzing
- Release (1): Continuous Delivery, Release Orchestration, Pages, Review Apps, Incremental Rollout
- Release (2): Canary Deployments, Feature Flags, Release Governance, Secrets Management
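The notes name incremental rollout, canary deployments, and feature flags without describing a mechanism. One common approach, sketched here as an assumption rather than as how any particular tool works, is to hash each user into a stable bucket from 0 to 99 and enable the feature for buckets below the current rollout percentage. All names are hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if the feature is on for this user at the given percentage."""
    return rollout_bucket(user_id, feature) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    for percent in (5, 25, 100):
        enabled = sum(is_enabled(u, "new-checkout", percent) for u in users)
        print(f"{percent}% rollout -> {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the user and feature name, raising the percentage only adds users; anyone who already had the feature keeps it.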
Module 6: Antifragility & Learning from Failure
- Why learn from failure?
- Benefits of antifragility
- Shifting the organizational balance
- MTTD - Mean Time to Detect (Failure/Incidents)
- MTTR - Mean Time to Recover (Components)
- MTRS - Mean Time to Recover (Service)
- SLO - Service Level Objective
- RPO - Recovery Point Objective
- Chaos Engineering Next Steps
- Simulating component failure allows for automation of recovery.
- Chaos engineering approaches identify key interfaces and dependencies across services, pinpointing areas where more resilience may be required.
- Introducing failure to a messaging queue may indicate excessive data loss outside the RPO.
- More frequent backups of the queue data may be needed to meet the RPO.
- A fire drill (where, e.g., a database is taken down) may result in an SLO being broken, but caching data in case of a database outage could mean the SLO is met.
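The metrics above can be computed from incident timestamps, and the RPO point about backup frequency reduces to a simple comparison. This is a minimal sketch with made-up data. It assumes MTTR is measured from the start of the failure to recovery (definitions vary) and that worst-case data loss equals the backup interval.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import NamedTuple, Sequence


class Incident(NamedTuple):
    started: datetime    # when the failure began
    detected: datetime   # when monitoring or a person noticed it
    recovered: datetime  # when the component was working again


def mttd_minutes(incidents: Sequence[Incident]) -> float:
    """Mean Time to Detect: average of (detected - started), in minutes."""
    return mean((i.detected - i.started).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents: Sequence[Incident]) -> float:
    """Mean Time to Recover: average of (recovered - started), in minutes."""
    return mean((i.recovered - i.started).total_seconds() for i in incidents) / 60


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, everything since the last backup is lost, so the
    backup interval must not exceed the RPO."""
    return backup_interval <= rpo


if __name__ == "__main__":
    day = datetime(2024, 1, 1)
    incidents = [
        Incident(day.replace(hour=10), day.replace(hour=10, minute=5), day.replace(hour=10, minute=35)),
        Incident(day.replace(hour=14), day.replace(hour=14, minute=15), day.replace(hour=14, minute=45)),
    ]
    print(f"MTTD: {mttd_minutes(incidents):.0f} min, MTTR: {mttr_minutes(incidents):.0f} min")
    print("Hourly backups meet a 15-minute RPO:", meets_rpo(timedelta(hours=1), timedelta(minutes=15)))
```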
Module 7: Organizational Impact of SRE
- Why organizations embrace SRE
- Patterns for SRE adoption
- SRE Job Description
- Sustainable Incident Response
- Blameless post mortems
- SRE & Scale
- Increased Service Resilience
- Minimize Loss of Revenue
- Average cost of service downtime is $5,600 per minute.
- Downtime costs per hour vary greatly.
- Typical SRE adoption steps include consulting, embedded, platform, slice & dice, and full SRE.
- SRE ownership of common tools and platforms ('platform SRE') may be used.
- Shared responsibility with development teams ('embedded SRE') is a common strategy.
- Automation saves SRE time for crucial development tasks.
- The challenge of scale is a good problem to have.
- Automation techniques (such as auto-scaling, containerization, and clustering), flexible platforms (such as public/private cloud), non-relational (NoSQL) databases (such as MongoDB), and 'as-a-service' capabilities are critical to platform growth.
- SREs own common tools and platforms that other developers use ("platform SRE"), SRE expertise 'shifts left' into dev teams ("embedded SRE"), and toil automation frees up time for development.
- Toil reduction mechanisms include automated ticket responses and "self-service" features, while DRY (don't repeat yourself) solutions prevent toil-related problem repetition.
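The average downtime cost quoted above ($5,600 per minute) can be turned into a rough estimate for an outage of a given length. This is a back-of-the-envelope sketch. The function name is hypothetical, and because the notes say costs vary greatly, the per-minute rate is left as a parameter.

```python
AVERAGE_COST_PER_MINUTE_USD = 5_600  # average figure quoted in the notes


def downtime_cost_usd(minutes: float, cost_per_minute: float = AVERAGE_COST_PER_MINUTE_USD) -> float:
    """Estimate the cost of an outage as duration times a per-minute rate."""
    if minutes < 0:
        raise ValueError("minutes must not be negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    for outage in (5, 45, 60):
        print(f"{outage} min outage: about ${downtime_cost_usd(outage):,.0f}")
```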
Module 8: SRE, Other Frameworks, Trends
- SRE & Other Frameworks
- SRE Evolution
- SRE teams can operate in an agile way.
- SRE can help with ITSM compliance.
- SRE is part of a "system of systems" for delivery.
- SRE evolution includes related reliability roles:
  - Network Reliability Engineer (NRE)
  - Database Reliability Engineer (DBRE)
  - Customer Reliability Engineer (CRE)
  - Heritage Reliability Engineer (HRE)
Description
Test your knowledge on modern software development practices, particularly focusing on dependency scanning, vulnerability management, and Site Reliability Engineering (SRE). This quiz covers essential tools and concepts that are vital for optimizing the software delivery pipeline.