SRE Foundation v1.0 Course Slides PDF
Document Details
2020
Summary
This document provides an overview of Site Reliability Engineering (SRE). It covers the course goals and content, the core principles and practices of SRE, and how SRE relates to DevOps.
Full Transcript
SITE RELIABILITY ENGINEERING FOUNDATION ℠ © DevOps Institute unless otherwise stated 30JUL2020
Site Reliability Engineering Foundation Course Goals
Learn about SRE: understand its core vocabulary, principles, practices and automation; hear and share real-life scenarios; have fun!
Pass the SRE Foundation Exam: 40 multiple choice questions, 60 minutes, 65% is passing. Accredited by DevOps Institute – get your digital badge.
Site Reliability Engineering Foundation Course Content
Module 1 SRE Principles & Practices
Module 2 Service Level Objectives & Error Budgets
Module 3 Reducing Toil
Module 4 Monitoring & Service Level Indicators
Module 5 SRE Tools & Automation
Module 6 Anti-Fragility & Learning from Failure
Module 7 Organizational Impact of SRE
Module 8 SRE, Other Frameworks, The Future
Module 1 SRE PRINCIPLES & PRACTICES
What is Site Reliability Engineering? SRE & DevOps: What is the Difference? SRE Principles & Practices
What is Site Reliability Engineering?
Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. "What happens when a software engineer is tasked with what used to be called operations." – Ben Treynor, Google. Created at Google around 2003 and publicized via the SRE books.
What is Site Reliability Engineering?
The goal is to create ultra-scalable and highly reliable distributed software systems. SREs spend 50% of their time doing "ops"-related work such as issue resolution, on-call and manual interventions, and the other 50% on development tasks such as new features, scaling or automation. Monitoring, alerting and automation are a large part of SRE.
SRE & DevOps – What is the Difference?
DevOps (at Google) defines 5 key pillars of success:
1. Reduce organizational silos
2. Accept failure as normal
3. Implement gradual changes
4. Leverage tooling and automation
5.
Measure everything SRE is a "specific implementation of DevOps with some extensions.” Google Module 1: SRE Principles & Practices 7 © DevOps Institute unless otherwise stated Module 1: SRE Principles & Practices 8 © DevOps Institute unless otherwise stated SRE Principles & Practices Module 1: SRE Principles & Practices 9 © DevOps Institute unless otherwise stated #1 Operations Is a Software Problem The basic tenet of SRE is that doing operations well is a software problem SRE should therefore use software engineering approaches to solve that problem Software engineering as a discipline focuses on designing and building rather than operating and maintaining Estimates suggest that anywhere between 40% and 90% of the total cost of ownership are incurred after launch Module 1: SRE Principles & Practices 10 © DevOps Institute unless otherwise stated #2 Service Levels A Service Level Objective (SLO) is an availability target for a product or service (this is never 100%) In SRE services are managed to the SLO SLOs need consequences if they are violated Module 1: SRE Principles & Practices 11 © DevOps Institute unless otherwise stated #3 Toil Any manual, mandated operational task is bad If a task can be automated then it should be automated Tasks can provide the "wisdom of production" that will inform better system design and behavior SREs must have time to make tomorrow better than today Module 1: SRE Principles & Practices 12 © DevOps Institute unless otherwise stated #4 Automation Automate what is currently done manually Decide what to automate, and how to automate it Take an engineering-based approach to problems rather than just toiling at them over and over This should dominate what an SRE does Don’t automate a bad process – fix the process first SRE teams have the ability to regulate their workload Module 1: SRE Principles & Practices 13 © DevOps Institute unless otherwise stated #5 Reduce the Cost of Failure Late problem (defect) discovery is expensive so SRE looks for ways to avoid this Look to improve MTTR (mean time to repair) Smaller changes help with this Canary deployments Failure is an opportunity to improve Module 1: SRE Principles & Practices 14 © DevOps Institute unless otherwise stated #6 Shared Ownership SRE's share skill sets with product development teams Boundaries between “application development” and “production” (Dev & Ops) should be removed SRE's "shift left" and provide "wisdom of production" to development teams Incentives across the organization are not currently aligned Module 1: SRE Principles & Practices 15 © DevOps Institute unless otherwise stated Module 2 SERVICE LEVEL OBJECTIVES & ERROR BUDGETS Service Level Objectives Error budgets Error budget policies © DevOps Institute unless otherwise stated What is an SLO? An SLO (“Service Level Objective”) is a goal for how well a product or service should operate SLO’s are tightly related to the user experience – if SLO’s are being met then the user will be happy Setting and measuring1 service level objectives is a key aspect of the SRE role The most widely tracked SLO is availability2 Products and services could (and should) have several SLO’s SLO’s are about making the user experience better 1. SLI’s will be covered in Monitoring 2. 2019 Catchpoint SRE Survey Module 1717 © DevOps Institute unless otherwise stated Module2:2:SLO's SLO's&&Error ErrorBudgets Budgets Example: SLO’s & Error Budgets We decide that 99.9% of web requests (www.....) 
per month should be successful – this is a "service level objective". If there are 1 million web requests in a particular month, then up to 1,000 of those are allowed to fail – this is an "error budget". Failure to hit an SLO must have consequences – if more than 1,000 web requests fail in a month then some remediation work must take place – this is an "error budget policy". (Example courtesy of Yaroslav Molochko, SRE Team Lead, AnchorFree.)
Adoption of SLOs
According to the 2019 Catchpoint SRE Survey the most popular SLOs are:
Availability 72%
Response time 47%
Latency 46%
We don't have SLOs 27%
Error Budgets – Good and Bad
Bad: We have error budgets in SRE, as going over budget usually means someone somewhere will have to work over-time or respond to out-of-hours issues. Not hitting 99.9% of HTTP requests in a month usually means scalability issues, so "ops" need to do something.
Good: On the other hand, SRE practices encourage you to strategically burn the budget to zero every month, whether it is for feature launches or architectural changes. This way you know you are running as fast as you can (velocity) without compromising availability.
Error Budgets – Fixed?
But watch out – high-risk deployments or large "big-bang" changes have a higher likelihood of issues and therefore more chance of the error budget being blown. This should encourage the Lean preference for small changes ("smaller batch size") to stay within the error budget. In some cases the error budget may need to change to accommodate complex releases, but this needs to be agreed between Dev, Ops and the Business.
Consequences of Missed SLOs
Missed SLOs have noticeable impacts on business performance:
Lost Revenue 70%
Drop in Employee Productivity 57%
Lost Customers 49%
Social Media Backlash 36%
The VALET Dimensions of SLO
V – Volume/traffic: Does the service handle the right volumes of data or traffic? Budget: 99.99% of HTTP requests per month succeed with 200 OK. Policy: Address scalability issues.
A – Availability: Is the service available to users when they need it? Budget: 99.9% availability/uptime. Policy: Address downtime issues/outages, zero-downtime deployments.
L – Latency: Does the service deliver in a user-acceptable period of time? Budget: Payload of 90% of HTTP responses returned in under 300ms. Policy: Address performance issues.
E – Errors: Is the service delivering the capabilities being requested? Budget: 0.01% of HTTP requests return 4xx or 5xx status codes. Policy: Analyze and respond to main status codes; new functionality or infrastructure may be required.
T – Tickets: Are our support services efficient? Budget: 75% of service tickets are automatically resolved. Policy: Automate more manual processes.
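To make the arithmetic of the example above concrete (a 99.9% SLO, one million requests, 1,000 allowed failures), here is a minimal Python sketch, not part of the original slides, for deriving an error budget and checking how much of it remains; the request and failure counts are illustrative only.

    # Minimal error-budget sketch. Mirrors the worked example above:
    # 99.9% SLO and 1 million web requests in the month.

    def error_budget(slo_target: float, total_events: int) -> int:
        """Number of failed events the SLO allows in the period."""
        return round(total_events * (1.0 - slo_target))

    def budget_remaining(slo_target: float, total_events: int, failed_events: int) -> int:
        """Remaining error budget; a negative value means the SLO is breached."""
        return error_budget(slo_target, total_events) - failed_events

    if __name__ == "__main__":
        total, failed = 1_000_000, 800                 # illustrative monthly counts
        budget = error_budget(0.999, total)            # -> 1000 allowed failures
        left = budget_remaining(0.999, total, failed)  # -> 200 failures still available
        print(f"Error budget: {budget}, remaining: {left}")
        if left < 0:
            print("SLO missed: the error budget policy applies and remediation work must be scheduled")

The same calculation works for any of the VALET budgets above once the relevant events are counted.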
Module 3 REDUCING TOIL
What is toil? Why toil is bad. Doing something about toil.
Work is Toil if it is:
Manual
Repetitive
Automatable
Tactical
No enduring value
Linear scaling
Examples of toil include: manual or semi-manual releases; connecting to infrastructure to check something; constant password resets; doing the same test over and over; acknowledging the same alert every morning; dealing with interrupts; physical meetings to approve production deployments; manual starts/resets of equipment and components; creating users; known workarounds; on-call responses; extracting some data; manual scaling of infrastructure.
Toil: finally a name for a problem we've all felt. Toil is not "stuff I do not like doing".
Why Toil is Bad – Impact of High Toil
Slow Progress – Individual: Manual work and firefighting (toil) takes up the majority of time. Organization: New features do not get released quickly, missed value opportunity; shortage of team capacity.
Poor Quality – Individual: Manual work often results in mistakes, which are time-consuming to fix and damage reputation. Organization: Excessive costs in support of services.
Career Stagnation – Individual: Career progress slows down due to working on the same things, no time for skills development; best engineers working on low-level requests. Organization: Reputational damage, not a great place to work; staff attrition rates increase.
Attritional – Individual: Toil is demotivating, meaning people start looking elsewhere. Organization: Staff turnover results in extra costs and lost knowledge.
Unending – Individual: Never-ending deluge of manual tasks, no time to find solutions, more time spent managing the backlog of tasks than fixing them. Organization: Toil requires engineering effort to fix; if there is no engineering time available it won't be fixed, and SLAs are breached.
Burnout – Individual: Personal and health problems due to overload and disruptive work patterns. Organization: Potential for litigation and negative publicity.
What Can be Done About Toil
Reducing toil requires engineering time. Engineering work needed to reduce toil will typically be a choice of: creating external automation (i.e. scripts and automation tools outside of the service); creating internal automation (i.e. automation delivered as part of the service); or enhancing the service to not require intervention.
Making Engineering Time Available
Google has an advertised goal of keeping operational work (i.e.
toil) below 50% of an engineer's time At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features The 50% rule ensures that one teams or person does not become the “ops” (team/person) 50% is an average to reflect real world scenarios Module 3: Reducing Toil 33 © DevOps Institute unless otherwise stated Module 4 MONITORING & SERVICE LEVEL INDICATORS SLI’s - Service Level Indicators Monitoring Observability © DevOps Institute unless otherwise stated SLI's for Measurement “SLI's are ways for engineers to communicate quantitative data about systems.” Ram Lyengar, Plumbr.io Module 4: Monitoring & SLI’s 35 © DevOps Institute unless otherwise stated Let’s Revisit an Earlier Example We decided that 99.9% of web requests (www.....) per month should be successful – this was the “service level objective” If there were 1 million web requests in a particular month, then up to 1,000 of those were allowed to fail – this was the “error budget” In this example the “service level indicator” (SLI) is “web requests” so we need a way to track and record this data Module 4: Monitoring & SLI’s 36 © DevOps Institute unless otherwise stated SLI Measurement While many numbers can function as an SLI, it is generally recommended to treat the SLI as the ratio of two numbers: the number of good events divided by the total number of events. For our example this is: Number of successful (HTTP) web requests / total (HTTP) requests (success rate) Many indicator metrics are naturally gathered on the server side, using a monitoring system such as Prometheus, or with periodic log analysis—for instance, HTTP 500 responses as a fraction of all requests. Some service level indicators may also need client-side data collection, because not measuring behavior at the client can miss a range of problems that affect users but don’t affect server-side metrics. Module 4: Monitoring & SLI’s 37 © DevOps Institute unless otherwise stated SLI Measurement SLI measurement needs also to be time-bound in some way The time horizon may vary depending on the organization and the SLO For web requests per month, the time horizon is clear SLO’s such as “successful bank payments” may require a broader horizon if bank payments are only made once or twice per month Module 4: Monitoring & SLI’s 38 © DevOps Institute unless otherwise stated SLO’s & SLI’s We use monitoring tools to measure SLI’s constantly, aggregating across suitable time periods Our SLO’s are what we expect - monitoring our SLI’s will tell us if we are meeting a SLO or not – they also tell us how much of our error budget is left (if any) Module 4: Monitoring & SLI’s 39 © DevOps Institute unless otherwise stated Monitoring Module 4: Monitoring & SLI’s 40 © DevOps Institute unless otherwise stated Monitoring Definitions Monitoring is the use of a hardware or software Application Performance component to monitor the system Management (APM) resources and performance of a is the monitoring and management of computer system performance and availability of software applications. APM strives to Telemetry detect and diagnose application is the highly automated communications performance problems to maintain an process by which measurements are made expected level of service. and other data collected at remote or inaccessible points and transmitted to receiving equipment for monitoring. 
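As a hedged illustration of the "good events divided by total events" ratio described under SLI Measurement above, the following Python sketch asks a Prometheus server for an availability SLI over a 30-day window. The server address and the http_requests_total metric with its code label are assumptions made for the example, not values prescribed by the course.

    # Sketch only: compute an availability SLI as good events / total events
    # using the Prometheus HTTP query API. Metric and label names are assumed.
    import requests

    PROMETHEUS = "http://prometheus.example.internal:9090"   # hypothetical server

    SLI_QUERY = (
        'sum(rate(http_requests_total{code!~"5.."}[30d]))'
        ' / sum(rate(http_requests_total[30d]))'
    )

    def current_sli() -> float:
        resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": SLI_QUERY})
        resp.raise_for_status()
        result = resp.json()["data"]["result"]   # assumes the query returns one sample
        return float(result[0]["value"][1])

    if __name__ == "__main__":
        sli, slo = current_sli(), 0.999
        print(f"SLI over the window: {sli:.4%} (SLO {slo:.1%})")
        print("Meeting the SLO" if sli >= slo else "Error budget exhausted or burning fast")

Running the same query over a shorter window, such as one hour, and comparing it with the monthly target is one common way to spot fast error-budget burn early.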
Module 4: Monitoring & SLI’s 41 © DevOps Institute unless otherwise stated Monitoring Anatomy What we need Installed on the hosts to be monitored Passes information AGENTS to the Core Holds configuration about hosts/services Dashboards and Distributed across number UI displays of SLO’s and of masters associated SLI’s Check execution (poke) CORE Result queue (poke response) Delivery of appropriate ALERT information to people in a position to respond GRAPHING ANOMALY Aggregation across a DETECTION time horizon graphing at an appropriate scale Check/thresholds against metrics collected Module 4: Monitoring & SLI’s 42 © DevOps Institute unless otherwise stated SLI Supporting Tools Catchpoint Nagios Prometheus Splunk Grafana Collectd Monitoring Graphing Logstash Rsyslogd Collectd Pager Logging Alerting duty Module 4: Monitoring & SLI’s 43 © DevOps Institute unless otherwise stated Observability © DevOps Institute unless otherwise stated Module 4: Monitoring & SLI’s 44 Monitoring & Observability Distributed, complex services running at scale with unpredictable users and variable throughput means there are millions of different ways that things can go wrong But we can’t anticipate them all (monitoring myth) Externalizing all the outputs of a service allows us to infer the internal state of that service (observable) © DevOps Institute unless otherwise stated Module 4: Monitoring & SLI’s 45 Monitoring & Observability “Monitoring is a verb; something we perform against our applications and systems to determine their state. From basic fitness tests and whether they’re up or down, to more proactive performance health checks. We monitor applications to detect problems and anomalies. “ Peter Waterhouse, CA Module 4: Monitoring & SLI’s 46 © DevOps Institute unless otherwise stated Monitoring & Observability “Observability, as a noun, is a property of a system, it’s a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Therefore, if our IT systems don’t adequately externalize their state, then even the best monitoring can fall short” Peter Waterhouse, CA Module 4: Monitoring & SLI’s 47 © DevOps Institute unless otherwise stated Why Observability is Important Bolting on monitoring tools after the event does not scale: Rapid rate of service growth Dynamic architectures Container workloads Dependencies between services Customer experience matters more © DevOps Institute unless otherwise stated Module 4: Monitoring & SLI’s 48 Observability = Better Alerting We need to improve our “signal” to “noise” ratio so we focus alerts on key issues 1. Generate One Alert for One Service (Versus One Metric) 2. Use Analytics to Learn Normal Behavior 3. Improve Alerting with Multi- Criteria Alerting Policies We need to infer what is “normal” about a service Module 4: Monitoring & SLI’s 49 © DevOps Institute unless otherwise stated What Observability looks like Technical Elements Human Elements Distributed tracing Identify individual user Event logging experiences Internal performance data Fewer paging alerts - not more Application instrumentation Inquisitive / what-if questions © DevOps Institute unless otherwise stated Module 4: Monitoring & SLI’s 50 SLO’s, SLI’s & Observability SLO’s are from a user perspective and help identify what is important E.g. 90% of users should complete the full payment transaction in less than one elapsed minute SLI’s give detail on how we are currently performing E.g. 
98% of users in a month complete a payment transaction in less than one minute Observability gives use the normal state of the service 38 seconds is the “normal” time it takes users to complete a payment transaction when all monitors are healthy © DevOps Institute unless otherwise stated Module 4: Monitoring & SLI’s Module 5 SRE TOOLS & AUTOMATION Automation Defined Automation Focus Hierarchy of Automation Types Secure Automation Automation Tools © DevOps Institute unless otherwise stated Automation Gives Us Automation Requires: Consistency – a machine A problem to be solved: will be more consistent Eliminating toil than a human Improving SLO’s A platform upon which to Appropriate tooling build, re-use, extend Engineering effort Faster action, faster fixes Measurable outcomes Time savings “For SRE, automation is a force multiplier, not a panacea.” – Niall Murphy, Google SRE © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation 53 Automation Focus: Typical DevOps Delivery Pipeline (Dev Focused) Commit ID: 113 Build Run Unit Tests Code Analysis Create Test Env Deploy Code Load Test Data Run Tests Committer: jdoe Create Pre-Prod Deploy Code Run Perf Test Run Security Test Check Monitors Create Prod Prod deploy Module 5: SRE Tools & Automation 54 © DevOps Institute unless otherwise stated Automation Focus: In SRE-Led Service Automation Automation effort is “Ops” led (“shifting left”), to ensure reliability engineering priorities Commit ID: 113 Build Run Unit Tests Code Analysis Create Test Env Deploy Code Load Test Data Run Tests Committer: jdoe Create Pre-Prod Deploy Code Run Perf Test Run Security Test Check Monitors Create Prod Prod deploy Environments must be provisioned as Infrastructure- (and Configuration-) as-Code Module 5: SRE Tools & Automation 55 © DevOps Institute unless otherwise stated SRE-Led Service Automation All code can be rebuilt from a code repository e.g. GitLab, Azure DevOps, Bitbucket. Components Example Tools 1. Environments provisioned using Infrastructure as Servers, Terraform, AWS Infrastructure/Config as Code Code networks, CloudFormation, Azure 2. Automated functional and non- (IaC) storage Resource Manager functional tests in production Configuration as Software, Puppet, Chef, Ansible, 3. Versioned (& signed) artefacts to Code dependencies, Saltstack, Docker, GCP deploy system components (CaC) containers Deployment Manager 4. Instrumentation in place to make the service externally viewable 5. Future growth envelope outlined 6. Clear anti-fragility strategy Module 5: SRE Tools & Automation 56 © DevOps Institute unless otherwise stated SRE-Led Service Automation Environment progression includes prod – Dev, Test, Pre-Prod, Prod (inc. “hidden live”) Tests Example Tools 1. Environments provisioned using Extend build Automated Selenium, Cucumber, Infrastructure/Config as Code pipeline functional Jasmine, Mocha, 2. Automated functional and non- Zephyr, Mockito functional tests in production Extend test Automated JMeter, Sonatype Nexus 3. Versioned (& signed) artefacts to pipeline non- Lifecycle, SoapUI, deploy system components functional WhiteSource, 4. Instrumentation in place to make Veracode, Nagios the service externally viewable 5. Future growth envelope outlined 6. 
Clear anti-fragility strategy Test history is recorded in the pipeline logs Module 5: SRE Tools & Automation 57 © DevOps Institute unless otherwise stated SRE-Led Service Automation All service components, libraries and dependencies (or containers) stored in a artifact repository Artifacts How Example Tools 1. Environments provisioned using Digitally versioned With Nexus, Artifactory Infrastructure/Config as Code semantic 2. Automated functional and non- versioning functional tests in production x.y.z 3. Versioned (& signed) artifacts to Digitally signed For security Nexus, Artifactory deploy system components and 4. Instrumentation in place to make auditability the service externally viewable 5. Future growth envelope outlined 6. Clear anti-fragility strategy Module 5: SRE Tools & Automation 58 © DevOps Institute unless otherwise stated SRE-Led Service Automation Alignment with SLAs, SLOs, SLIs and telemetry everywhere Consider What Example Tools 1. Environments provisioned using Service Level Are OpsGenie Infrastructure/Config as Code Indicators understood 2. Automated functional and non- and functional tests in production published 3. Versioned (& signed) artifacts to Instrumentation Provides Nagios, Dynatrace, deploy system components additional AppDynamics, 4. Instrumentation in place to make data and Prometheus the service externally viewable analytics 5. Future growth envelope outlined Log files Aggregated Splunk, LogStash 6. Clear anti-fragility strategy and ready for access Module 5: SRE Tools & Automation 59 © DevOps Institute unless otherwise stated SRE-Led Service Automation Current and future scale estimates Consider What Example Tools 1. Environments provisioned using Autoscaling In place Amazon Cloud Auto Infrastructure/Config as Code Scaling, Kubernetes Pod 2. Automated functional and non- Scaling functional tests in production Administrative Automated Custom built tooling 3. Versioned (& signed) artifacts to activities Cloud API’s deploy system components Databases Scalable Amazon Cloud RDS, 4. Instrumentation in place to make NoSQL-type databases the service externally viewable like MongoDB, 5. Future growth envelope outlined Couchbase 6. Clear anti-fragility strategy Module 5: SRE Tools & Automation 60 © DevOps Institute unless otherwise stated SRE-Led Service Automation Current and future scale estimates Consider What Example Tools 1. Environments provisioned using Disaster Recovery Tests Fire drills Infrastructure/Config as Code (DR) complete 2. Automated functional and non- Chaos engineering Practiced Chaos Monkey functional tests in production On call In place PagerDuty, VictorOps, 3. Versioned (& signed) artifacts to mechanisms Squadcast deploy system components 4. Instrumentation in place to make the service externally viewable 5. Future growth envelope outlined 6. Clear anti-fragility strategy Module 5: SRE Tools & Automation 61 © DevOps Institute unless otherwise stated Hierarchy of Automation Types Module 5: SRE Tools & Automation 62 © DevOps Institute unless otherwise stated Hierarchy of Automation Types The database notices problems, and Systems that require no intervention automatically fails over without human intervention. 
Internally maintained system- The database ships with its own failover specific automation script that is used if there is a problem Externally maintained The SRE adds database support to a generic automation "generic failover" script that everyone uses if there is a problem Externally maintained A SRE has a failover script in his or her system-specific home directory that is used if there is a automation problem None Database master is failed over manually if there is a problem Module 5: SRE Tools & Automation 63 © DevOps Institute unless otherwise stated Secure Automation Automation removes the chance of human error (or willful sabotage) and provides security opportunities We can secure automated steps in the pipeline – we cannot provenly secure manual steps Artifacts generated and used by the pipeline can be validated and checked for compliance DevSecOps introduced security into the build-test- deploy lifecycle SRE places extra emphasis on security of production Module 5: SRE Tools & Automation 64 © DevOps Institute unless otherwise stated Secure Build Application, infrastructure and configuration code changes run through code analysis tools that check for security issues Digitally signing build artifacts avoids possibility of “fake” code ”Having proper configuration Secure coding practices widely management does play a published and embraced huge role in compliance.” Secure code repositories with access control Coding done in the open (‘open Will Gregorian, source”) with community Director of Security Technical Operations feedback Omada Health Module 5: SRE Tools & Automation 65 © DevOps Institute unless otherwise stated Secure Test When infrastructure or configuration is changed then test environments do not mutate, they are ”immutable” and are-created, guaranteeing compliance ”The only system which is truly with code repository secure is one which is switched off and unplugged. Even then, I We deploy the artifacts wouldn’t stake my life on it.” securely built by our engineers to our test environments Eugene Howard Spafford, Test data and test scenarios Professor of Computer Science – Computer Security built to test security Purdue University Module 5: SRE Tools & Automation 66 © DevOps Institute unless otherwise stated Secure Staging Staging environments are also immutable Same artifact deployed to staging/pre-prod environment Production-like data introduced into staging ”I’m more and more convinced introduces data security that staging environments are like considerations e.g. GDPR, PCI mocks - at best a pale imitation of the genuine article and potentially Dedicated security scanning the worst form of confirmation will try and uncover security bias.” vulnerabilities Dependencies and integration to other services Cindy Sridharan may introduce vulnerabilities Author – Scaling Microservices or proxy security requirements O’Reilly Module 5: SRE Tools & Automation 67 © DevOps Institute unless otherwise stated Secure Production Production environments are also immutable Same artifact deployed to production environment ”Having proper configuration Production data requires data security compliance e.g. 
GDPR, PCI, management does play a huge SOX role in compliance.” Dedicated security scanning will try and uncover security vulnerabilities Will Gregorian, Regulatory compliance needs to be evidenced Director of Security Technical Failure testing can help with audit Operations compliance Omada Health Module 5: SRE Tools & Automation 68 © DevOps Institute unless otherwise stated Selecting Automation Tools Any discussion around tools quickly turns into a ”favorite tech” talk Tools are constantly changing Organizations will have bias towards certain types of tools, from “open source” to “big IT” vendors Autonomy to use the most appropriate tools to do a job is often best delegated to those with the most knowledge of the job Engineers will be more productive with the tools they are familiar You wouldn’t hire a plumber and give them a joiners tools, would you? © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Manage (1) Audit Management The use of automated tools to ensure products and services are auditable, including keeping audit logs of build, test and deploy activities, auditing configurations and users, as well as log files from production operations. Authentication & Authorization Mechanisms for ensuring appropriate access to products, services and tools. For example user and password management and two-factor authentication. Cloud providers use their own tools such as AWS IAM (Identity and Access Management) alongside cloud users tools. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Manage (2) DevOps Score A metric showing DevOps adoption across an organization and the corresponding impact on delivery velocity. Value Stream Management The ability to visualize the flow of value delivery through the DevOps lifecycle. Gitlab CI and the Jenkins extension (from Cloud Bees) DevOptics can provide this visualization. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Plan (1) Issue Tracking Tools like Jira, Trello, CA’s Agile Central and VersionOne can be used to capture incidents or backlogs of work Kanban Boards On the back of issue tracking, the same tools can represent delivery flow through Scum and Kanban workflow boards Time Tracking Similarly, issue tracking tools also allow for time to be tracked, either against individual issues or other work or project types. Agile Portfolio Management Involves evaluating in-flight projects and proposed future initiatives to shape and govern the ongoing investment in projects and discretionary work. CA’s Agile Central and VersionOne are examples. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Plan (2) Service Desk Service Now is a well used platform for managing the lifecycle of services as well as internal and external stakeholder engagement. Requirements Management Tools than handle requirements definition, traceability, hierarchies & dependency. Often also handles code requirements and test cases for requirements. Quality Management Tools that handle test case planning, test execution, defect tracking (often into backlogs), severity and priority analysis. CA’s Agile Central © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Create (1) Source Code Management Tools to securely store source code and make it available in a scalable, multi- user environment. Git and SVN are popular examples. 
Code Review The ability to perform peer code-reviews to check quality can be enforced through tools like Gerrit, TFS (Team Foundation Service), Crucible and Gitlab. Wiki Knowledge sharing can be enabled by using tools like Confluence which create a rich Wiki of content © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Create (2) Web IDE Tools that have a web client integrated development environment. Enables developer productivity without having to use a local development tool. Gitlab Snippets Stored and shared code snippets to allow collaboration around specific pieces of code. Also allows code snippets to be used in other code-bases. BitBucket and GitLab allow this. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Verify (1) Continuous Integration “Refers to integrating, building, and testing code within the development environment” – Martin Fowler, Chief Scientist, ThoughtWorks. Code Quality Also referred to as code analysis, Sonar and Checkmarks are examples of tools that automatically check the seven main dimensions of code quality – comments, architecture, duplication, unit test coverage, complexity, potential defects, language rules. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Verify (2) Performance Testing Performance testing is the process of determining the speed, responsiveness and stability of a computer, network, software program or device under a workload. Tools like Gatling can performance test services. Usability Testing Usability testing is a way to see how easy to use something is by testing it with real users. Tools can be used to track how a users works with a service e.g. with scroll recording, eye checking, mouse tracking. Crazy Egg, Optimizely are examples. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Package (1) Package Registry A repository for software packages, artifacts and their corresponding metadata. Can store files produced by an organization itself or for third party binaries. Artifactory and Nexus are amongst the most popular. Container Registry Secure and private registry for Container images. Typically allowing for easy upload and download of images from the build tools. Docker Hub, Artifactory, Nexus, Dependency Proxy For many organizations, it is desirable to have a local proxy for frequently used upstream images/packages. In the case of CI/CD, the proxy is responsible for receiving a request and returning the upstream image from a registry, acting as a pull-through cache. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Package (2) Helm Chart Registry Helm charts are what describe related Kubernetes resources. Artifactory and Codefresh support a registry for maintaining master records of Helm Charts. Dependency Firewall Many projects depend on packages that may come from unknown or unverified providers, introducing potential security vulnerabilities. There are tools to scan dependencies but that is after they are downloaded. These tools prevent those vulnerabilities from being downloaded to begin with. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Secure (1) SAST Static Application Security Testing test applications from the “inside out” by looking a source code, byte code or binaries. Gitlab, CA’s Veracode, DAST Dynamic Application Security Testing tests applications from the “outside in” to detect security vulnerabilities. 
Gitlab, CA’s Veracode IAST Interactive Application Security Testing combines SAST and DAST approaches but involves application tests changing in “real time” based on information fed back from SAST and DAST, creating new test cases on the fly. Synopisis, Acunetix, Parasoft and Quotium are solutions evolving in this direction. Secret Detection Secret Detection aims to prevent that sensitive information, like passwords, authentication tokens, and private keys are unintentionally leaked as part of the repository content. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Secure (2) Dependency Scanning Used to automatically find security vulnerabilities in your dependencies while you are developing and testing your applications. Synopisis, Gemnasium, Retire.js and bundler-audit are popular tools in this area. Container Scanning When building a Container image for your application, tools can run a security scan to ensure it does not have any known vulnerability in the environment where your code is shipped. Blackduck, Synopsis, Synk, Claire and klar are examples. License Compliance Tools, such as Blackduck and Synopsis, that check that licenses of your dependencies are compatible with your application, and approve or blacklist them. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Secure (3) Vulnerability Database Is aimed at collecting, maintaining, and disseminating information about discovered computer security vulnerabilities. This is then checked as part of the delivery pipeline. Fuzzing Fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a service and watching for the results. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Release (1) Continuous Delivery “Is a software development discipline where you build software in such a way that the software can be released to production at any time” – Martin Fowler, Chief Scientist, ThoughtWorks Release Orchestration Typically a deployment pipeline, used to detect any changes that will lead to problems in production. Orchestrating other tools will identify performance, security, or usability issues. Tools like Jenkins and Gitlab CI can “orchestrate” releases. Pages For creating supporting web pages automatically as part of a CI/CD pipeline. Review Apps Allow code to be committed and launched in real time – environments are spun up to allow developers to review their application. Gitlab CI has this capability. Incremental Rollout Incremental rollout means deploying many small, gradual changes to a service instead of a few large changes. Users are incrementally moved across to the new version of the service until eventually all users are moved across. Sometimes referred to by colored environments e.g. Blue/green deployment. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Release (2) Canary Deployments Similar to incremental rollout, it is where a small portion of the user base is updated to a new version first. This subset, the canaries, then serve as the proverbial “canary in the coal mine”. If something goes wrong then a release is rolled back and only a small subset of the users are impacted. Feature Flags Sometimes called feature toggles, a technique that allows system behavior to change without changing the underlying code, through the use of “flags” to decide which behavior is invoked. 
A programming practice primarily there are tools such as Launch Darkly which can help with flag management and invocation. Release Governance Release Governance is all about the controls and automation (security, compliance, or otherwise) that ensure your releases are managed in an auditable and trackable way, in order to meet the need of the business to understand what is changing. Gitlab CI. Secrets Management Secrets management refers to the tools and methods for managing digital authentication credentials (secrets), including passwords, keys, APIs, and tokens for use in applications, services, privileged accounts and other sensitive parts of the IT ecosystem © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Configure (1) Auto DevOps Auto DevOps brings DevOps best practices to your project by automatically configuring software development lifecycles. It automatically detects, builds, tests, deploys, and monitors applications. Gitlab and AWS Code Pipelines are strong examples. ChatOps The ability to execute common DevOpsactions directly from chat (build, deploy, test, incident management, rollback ,etc) with the output sent back to a channel. Runbooks A collection of procedures necessary for the smooth operation of a service. Previously manual in nature they are now usually automated with tools like Ansible. Serverless A code execution paradigm were no underlying infrastructure or dependencies are needed, moreover a piece of code is executed by a service provider (typically cloud) who takes over the creation of the execution environment. Lambda functions in AWS and Azure Functions are examples. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Configure (2) Auto DevOps Auto DevOps brings DevOps best practices to your project by automatically configuring software development lifecycles. It automatically detects, builds, tests, deploys, and monitors applications. Gitlab and AWS Code Pipelines are strong examples. ChatOps The ability to execute common DevOpsactions directly from chat (build, deploy, test, incident management, rollback ,etc) with the output sent back to a channel. Runbooks A collection of procedures necessary for the smooth operation of a service. Previously manual in nature they are now usually automated with tools like Ansible. Serverless A code execution paradigm were no underlying infrastructure or dependencies are needed, moreover a piece of code is executed by a service provider (typically cloud) who takes over the creation of the execution environment. Lambda functions in AWS and Azure Functions are examples. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Monitor (1) Metrics Tools that collect and display performance metrics for deployed apps, such as Prometheus. Logging The capture, aggregation and storage of all logs associated with system performance including, but not limited to, process calls, events, user data, responses, error and status codes. Logstash and Nagios are popular examples. Tracing Tracing provides insight into the performance and health of a deployed application, tracking each function or microservice which handles a given request. Cluster Monitoring Tools that let you know the health of your deployment environments running in clusters such as Kubernetes. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Monitor (2) Error Tracking Tools to easily discover and show the errors that application may be generating, along with the associated data. 
Incident Management Involves capturing the who, what, when of service incidents and the onward use of this data in ensuring service level objectives are being met. Synthetic Monitoring The ability to monitor service behavior by creating scripts to simulate the action or path taken by a customer/end-user and the associated outcome. Status Page Self-explanatory, service pages that easily communicate the status of services to customers and users. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Defend (1) RASP Runtime Application Self Protection (RASP) – tools that actively monitor and block threats in the production environment before they can exploit vulnerabilities. WAF Web Application Firewall – tools that examine traffic being sent to an application and can block anything that looks malicious. Threat Detection Refers to the ability to detect, report, and support the ability to respond to attacks. Intrusion detection systems and denial-of-service systems allow for for some level of threat detection and prevention. UEBA User and Entity Behavior Analytics (UEBA) is a machine learning technique to analyze normal and “abnormal” user behavior with the aim of preventing the latter. © DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Defend (2) Vulnerability Management Is about ensuring that assets and applications are scanned for vulnerabilities and then the subsequent processes to record, manage, and mitigate those vulnerabilities. DLP Data Loss Protection – tools that prevent files and content from being removed from within a service environment or organization. Storage Security a specialty area of security that is concerned with securing data storage systems and ecosystems and the data that resides on these systems. Container Network Security Used to prove that any app that can be run on a container cluster with any other app can be confident that there is no unintended use of the other app or any unintended network traffic between them. 
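To illustrate the Synthetic Monitoring entry above, here is a minimal Python sketch of a scripted user-journey probe; the endpoint URL, the 300 ms latency threshold and the check interval are illustrative assumptions rather than course material.

    # Sketch of a synthetic monitoring probe: simulate a user request on a schedule,
    # record latency and status, and flag samples that would burn the error budget.
    import time
    import requests

    TARGET_URL = "https://shop.example.com/checkout/health"   # hypothetical endpoint
    LATENCY_SLO_SECONDS = 0.3        # e.g. responses expected in under 300 ms
    CHECK_INTERVAL_SECONDS = 60

    def probe_once() -> dict:
        started = time.monotonic()
        try:
            response = requests.get(TARGET_URL, timeout=5)
            latency = time.monotonic() - started
            ok = response.status_code < 400 and latency <= LATENCY_SLO_SECONDS
            return {"ok": ok, "status": response.status_code, "latency_s": round(latency, 3)}
        except requests.RequestException as exc:
            return {"ok": False, "status": None, "latency_s": None, "error": str(exc)}

    if __name__ == "__main__":
        while True:
            sample = probe_once()
            print(sample)   # in practice the result would be shipped to the monitoring core and alerting
            time.sleep(CHECK_INTERVAL_SECONDS)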
© DevOps Institute unless otherwise stated Module 5: SRE Tools & Automation Module 6 ANTI-FRAGILITY AND LEARNING FROM FAILURE Why learn from failure Benefits of anti-fragility Shifting the organizational balance © DevOps Institute unless otherwise stated Improving Metrics through Antifragility MTTD – Mean Time to Detect (Failure/Incidents) MTTR – Mean Time to Recover (Components) By introducing failure we MTRS – Mean Time to optimize our monitoring Recover (Service) making it more likely we will Service Level Objective detect real incidents (SLO) Recovery Point Objective (RPO) Module 6: Antifragility & Learning from Failure 92 © DevOps Institute unless otherwise stated Improving Metrics through Antifragility MTTD – Mean Time to Detect (Failure/Incidents) Simulating component MTTR – Mean Time to failure allows us to create Recover (Components) automation to try and auto- MTRS – Mean Time to recover Recover (Service) We can also build in more Service Level Objective (SLO) resilience to prevent failure Recovery Point Objective of single components (RPO) Module 6: Antifragility & Learning from Failure 93 © DevOps Institute unless otherwise stated Improving Metrics through Antifragility MTTD – Mean Time to Detect (Failure/Incidents) Chaos engineering MTTR – Mean Time to approaches identify key Recover (Components) interfaces & dependencies MTRS – Mean Time to across services pinpointing Recover (Service) areas where more resilience Service Level Objective (SLO) may be required Recovery Point Objective (RPO) Module 6: Antifragility & Learning from Failure 94 © DevOps Institute unless otherwise stated Improving Metrics through Antifragility MTTD – Mean Time to Detect (Failure/Incidents) A fire drill where e.g. a MTTR – Mean Time to Recover (Components) database is taken down MTRS – Mean Time to may result in an SLO being Recover (Service) broken Service Level Objective Caching data in the case of (SLO) a database outage instead Recovery Point Objective could mean the SLO is met (RPO) Module 6: Antifragility & Learning from Failure 95 © DevOps Institute unless otherwise stated Improving Metrics through Antifragility MTTD – Mean Time to Detect (Failure/Incidents) Introducing failure to e.g. a MTTR – Mean Time to messaging queue may Recover (Components) indicate excessive data loss, MTRS – Mean Time to outside the RPO Recover (Service) More frequent back ups of Service Level Objective (SLO) the queue data may be Recovery Point Objective needed to meet the RPO (RPO) Module 6: Antifragility & Learning from Failure 96 © DevOps Institute unless otherwise stated Module 6: Antifragility & Learning from Failure Shifting the Organizational Balance © DevOps Institute unless otherwise stated Module 6: Antifragility & Learning from Failure 97 Shifting the balance 1st Play – enable the 3rd of the “three ways” 2nd Play – Benchmark v “Westrum Model” 3rd Play – Introduce “Fire Drills” 4th Play - ”Chaos engineering” next steps “The only real mistake is the one from which we learn nothing.” - Henry Ford Module 6: Antifragility & Learning from Failure 98 © DevOps Institute unless otherwise stated The Third Way: Continual Experimentation and Learning The Third Way encourages a culture that fosters two things: 1. Continual experimentation, taking risks and learning from failure 2. Understanding that repetition and practice is the prerequisite to mastery. 
Allocate time for the improvement of daily work Create rituals that reward the team for taking risks Introduce faults into the system to increase resilience Plan time for safe experimentation and innovation (hackathons) Module 6: Antifragility & Learning from Failure 99 © DevOps Institute unless otherwise stated Shifting the balance 1st Play – enable the 3rd of the “three ways” 2nd Play – Benchmark v “Westrum Model” 3rd Play – Introduce “Fire Drills” 4th Play - ”Chaos engineering” next steps "We make mistakes, and we get back up … “ Jack Dorsey, Twitter Module 6: Antifragility & Learning from Failure 100 © DevOps Institute unless otherwise stated Pathological Bureaucratic Generative (Power-oriented) (Rule-oriented) (Performance-oriented) Information is hidden Information may be ignored Information is actively sought Messengers are ‘shot’ Messengers are isolated Messengers are trained Responsibilities are shirked Responsibility is compartmentalized Responsibilities are shared Bridging is discouraged Bridging is allowed but discouraged Bridging is rewarded Failure is covered up Organization is just and merciful Failure causes enquiry Novelty is crushed Novelty creates problems Novelty is implemented Source: Westrum, A Typology of Organizational Cultures High-trust organizations encourage good information flow, cross-functional collaboration, shared responsibilities, learning from failures and new ideas. Module 6: Antifragility & Learning from Failure 101 © DevOps Institute unless otherwise stated Shifting the balance 1st Play – enable the 3rd of the “three ways” 2nd Play – Benchmark v “Westrum Model” 3rd Play – Introduce “fire drills” 4th Play – “Chaos Engineering” next steps "It’s fine to celebrate success, but it is more important to heed the lessons of failure.” Bill Gates Module 6: Antifragility & Learning from Failure 102 © DevOps Institute unless otherwise stated Introduce Fire Drills Fire drills build on the concepts of business continuity planning (BCP) and disaster recovery (DR) which have been around for decades Need to ensure a business can continue to operate during unforeseen events or failures, such as natural disasters or emergencies This is quite often an audit requirement For a lot of organisations this is an annual data centre failover test Fire drills are a good first step towards chaos engineering © DevOps Institute unless otherwise stated Module 6: Antifragility & Learning from Failure 103 Introduce Fire Drills “Fire Drills” can go beyond technology: 1. Loss of facility (datacentre) or region (cloud) 2. Loss of technology (e.g. database) 3. Loss of resources (e.g. key person) 4. 
Loss of critical third-party vendors
Shifting the balance
1st Play – enable the 3rd of the "three ways"
2nd Play – Benchmark v "Westrum Model"
3rd Play – Introduce "Fire Drills"
4th Play – Chaos engineering next steps
"You shouldn't be afraid of failure" – Niklas Zennström, Skype
What is Chaos Engineering
Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions. The most famous example is the Simian Army from Netflix. https://www.oreilly.com/library/view/chaos-engineering/9781491988459/
Chaos Engineering Next Steps
1. Segregate the system into key components
2. Test the system without key components being available
3. Break the system (in non-prod environments first)
4. Introduce failure of key components in prod
5. Introduce database failure in prod
6. Introduce a total system failure in prod
Chaos Engineering Next Steps
1. Look at "holistic" logging (e.g. what keeps the full service up?)
2. Identify dependencies
3. Improve error handling and recovery (manual-to-automated)
4. Learn from "real" failures
Chaos Engineering Next Steps
Chaos engineering also helps us to "minimize the blast radius". Any outage should affect as little of the ecosystem as possible. Failure testing shows the current span of this radius.
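The experiment loop implied by the steps above can be sketched in a few lines of Python; this is a hedged illustration intended for a non-production environment first, where the health endpoint, the container name and the recovery budget are hypothetical, and stopping a Docker container simply stands in for whichever failure you choose to inject.

    # Minimal chaos-experiment sketch: verify steady state, inject one failure,
    # then check whether the service recovers within an agreed time budget.
    import subprocess
    import time
    import requests

    HEALTH_URL = "http://test.example.internal/health"   # hypothetical test environment
    VICTIM_CONTAINER = "payments-replica-1"              # hypothetical component to lose
    RECOVERY_BUDGET_SECONDS = 120

    def steady_state_ok() -> bool:
        try:
            return requests.get(HEALTH_URL, timeout=2).status_code == 200
        except requests.RequestException:
            return False

    def inject_failure() -> None:
        # Lose a single component to keep the blast radius small.
        subprocess.run(["docker", "stop", VICTIM_CONTAINER], check=True)

    if __name__ == "__main__":
        assert steady_state_ok(), "Steady state not met: do not start the experiment"
        inject_failure()
        recovered = False
        deadline = time.monotonic() + RECOVERY_BUDGET_SECONDS
        while time.monotonic() < deadline and not recovered:
            time.sleep(5)
            recovered = steady_state_ok()
        print("Service absorbed the failure" if recovered
              else "Steady state not restored in time: improve detection and recovery, then re-run")

Like a fire drill, the value is in what the team learns and automates afterwards, not in the outage itself.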
Chaos Engineering Next Steps
What automated recovery looks like:
1. Create immutable infrastructure as code
2. Cover all system state and functional code with automated tests
3. Deploy holistic logging and monitoring systems – make services "observable"
4. Implement smart alerting, triggers and prescriptive analytics
5. Create self-healing infrastructure/applications where most appropriate
6. Test! Then test again!
Do not allow SSH / RDP in production (unless there is a real new problem).
Automation Can Help
Tools like Kubernetes and AWS Auto Scaling can:
1. Detect impaired instances of servers or containers and destroy/replace them
2. Maintain infrastructure at a (code-)defined level
3. Automatically scale infrastructure/applications up and down based on demand
4. Execute maintenance commands "on the fly", e.g. to re-index databases when queries are running slow
5. Integrate with monitoring services
Look to leverage the capabilities of tools and platforms before building.
Module 7 ORGANIZATIONAL IMPACT OF SRE
Why organizations embrace SRE. Patterns for SRE adoption. SRE Job Description. Sustainable Incident Response. Blameless post mortems. SRE & Scale.
Why Organizations Embrace SRE
Increased Service Resilience
As service usage grows, the volume of users grows. Downtime is more widely publicized through social media channels. Brand reputation can be compromised. (Courtesy of downdetector.co.uk) https://downdetector.co.uk/top10/
Minimize Loss of Revenue
According to Gartner, the average cost of service downtime is $5,600 per minute. Because there are so many differences in how businesses operate, downtime can cost as little as $140,000 per hour at the low end, $300,000 per hour on average, and as much as $540,000 per hour at the higher end. https://www.the20.com/blog/the-cost-of-it-downtime/
Patterns for SRE Adoption
Typical SRE Adoption Steps (Consulting, Embedded, Platform, Slice & Dice, Full SRE – these steps are not necessarily sequential):
Consulting – Specialist advice provided by experts. No "hands-on" delivery involvement. Ownership of SRE stays with the service/delivery teams – consultants are not "on call".
Embedded – SRE experts embedded in service/delivery teams. Co-working on SRE activities. Some movement towards "shared responsibility". SREs take initial "on call" before responsibilities become shared.
Platform – SREs "own" the deployment platform and tooling. Guardians of production environments (and maybe others) – provide "on call". One-size-fits-all approach. Little or no shared responsibility.
Slice & Dice – SREs own sections of the service, typically application and infrastructure. Some shared responsibility, although "on-call" responsibility is also sliced.
Full SRE – The full organization embraces SRE. Shared responsibility and shared "on call". Reliability is a first-class citizen.
True North Vision
Set realistic expectations of SRE:
Impossible to achieve 100% uptime – what appropriate SLOs would you set?
Can you afford complete end to end automation?
Is automatic recovery and restore even possible? It may not all be achievable, the aim is to be as close as possible to a “true north” vision © DevOps Institute unless otherwise stated Module 7: Organizational impact of SRE 122 Typical SRE Job Responsibilities Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement. Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews. Maintain services once they are live by measuring and monitoring availability, latency and overall system health. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless post-mortems. © DevOps Institute unless otherwise stated Module 7: Organizational impact of SRE Module 7: Organizational impact of SRE Sustainable Incident Response © DevOps Institute unless otherwise stated Module 7: Organizational impact of SRE 124 Incident Response: On Call Being on-call is a critical duty that operations and engineering teams must undertake: To support services during working and non-working hours To respond to issues (outages, incidents, etc.) To ensure SLO’s are being met Incident response must be "sustainable" as per the Google Job Description © DevOps Institute unless otherwise stated Module 7: Organizational impact of SRE 125 On Call By Numbers (1) How much down time is allowed for service issues? SLO’s introduce a constraint on the amount of time available A service with a “three-nines” availability SLO requires all issues in a