SRE Foundation v1.1 Learner Manual PDF 2021

Document Details

Uploaded by DeservingMossAgate1982

Southdowns College

2021

Tags

site reliability engineering, SRE, DevOps, technology

Summary

This document is the learner manual for the SRE Foundation v1.1 certification program (September 2021) from DevOps Institute. The manual covers SRE principles and practices, including how SRE differs from DevOps, and details topics such as service level objectives, toil reduction, and automation.

Full Transcript


About Bloom’s Taxonomy
Bloom’s Taxonomy is used to categorize learning objectives and, from there, assess learning achievements. Its six levels, from lowest to highest, are: 1. Knowledge, 2. Comprehension, 3. Application, 4. Analysis, 5. Synthesis, 6. Evaluation.

About DevOps Institute
DevOps Institute is dedicated to advancing the human elements of DevOps success. As a global member association, DevOps Institute is the go-to hub connecting IT practitioners, industry thought leaders, talent acquisition, business executives and education partners to help pave the way to support digital transformation and the New IT. DevOps Institute helps advance careers and professional development within the DevOps community through recognized certifications, research and thought leadership, events and the fastest-growing DevOps member community.

Site Reliability Engineering Foundation Course Content
Day 1: Course & Class Welcome; Warming Up Game; Module 1: SRE Principles & Practices; Module 2: Service Level Objectives & Error Budgets; Module 3: Reducing Toil; Module 4: Monitoring & Service Level Indicators.
Day 2: Module 5: SRE Tools & Automation; Module 6: Anti-Fragility & Learning from Failure; Module 7: Organizational Impact of SRE; Module 8: SRE, Other Frameworks, The Future.

Module 1: SRE Principles & Practices
Module 1 content: What is Site Reliability Engineering? SRE & DevOps: What is the Difference? SRE Principles & Practices. Video: DevOps & SRE (Google). Case Story: Bloomberg. Discussion: Principles & Practices. Exercise: What do we do all day?

What is Site Reliability Engineering?
Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. "What happens when a software engineer is tasked with what used to be called operations." – Ben Treynor, Google. SRE was created at Google around 2003 and publicized via the SRE books.

The goal is to create ultra-scalable and highly reliable distributed software systems. SREs spend 50% of their time doing "ops"-related work such as issue resolution, on-call, and manual interventions, and 50% of their time on development tasks such as new features, scaling or automation. Monitoring, alerting and automation are a large part of SRE.

SRE has now spread beyond Google. Many organizations running large-scale services are embracing SRE; case stories in this course include Standard Chartered Bank, UK Dept Work & Pensions, Bloomberg, Home Depot, Trivago and Sage Group. "SRE is an engineering discipline devoted to helping an organization achieve the appropriate level of reliability." – David N. Blank-Edelman, Microsoft.

SRE practices include: Scalability, Availability, Incident Response, Automation.

What's the Difference Between DevOps and SRE?
Video: "What's the Difference Between DevOps and SRE?" with Seth Vargo and Liz Fong-Jones (Google) (05:10) – https://youtu.be/uTEL8Ff1Zvk

SRE & DevOps – What is the Difference?
DevOps (at Google) defines 5 key pillars of success:
1. Reduce organizational silos
2. Accept failure as normal
3. Implement gradual changes
4. Leverage tooling and automation
5. Measure everything
SRE is a "specific implementation of DevOps with some extensions." – Google

#1 Operations Is a Software Problem
The basic tenet of SRE is that doing operations well is a software problem. SRE should therefore use software engineering approaches to solve that problem. Software engineering as a discipline focuses on designing and building rather than operating and maintaining, yet estimates suggest that anywhere between 40% and 90% of the total cost of ownership is incurred after launch.

#2 Service Levels
A Service Level Objective (SLO) is an availability target for a product or service (this is never 100%). In SRE, services are managed to the SLO. SLOs need consequences if they are violated.

#3 Toil
Any manual, mandated operational task is bad. If a task can be automated then it should be automated. Tasks can provide the "wisdom of production" that will inform better system design and behavior. SREs must have time to make tomorrow better than today.

#4 Automation
Automate what is currently done manually. Decide what to automate, and how to automate it. Take an engineering-based approach to problems rather than just toiling at them over and over; this should dominate what an SRE does. Don't automate a bad process – fix the process first. SRE teams have the ability to regulate their workload.

#5 Reduce the Cost of Failure
Late problem (defect) discovery is expensive, so SRE looks for ways to avoid it. Look to improve MTTR (Mean Time to Recover/Repair); smaller changes and canary deployments help with this. Failure is an opportunity to improve.

#6 Shared Ownership
SREs share skill sets with product development teams. Boundaries between "application development" and "production" (Dev & Ops) should be removed. SREs "shift left" and provide the "wisdom of production" to development teams. Incentives across the organization are not currently aligned.

CASE STORY: Bloomberg
"Our SREs are united by a common vision of harnessing the power of automation through software development to deliver reliable, stable services to our clients. They care about how we can manage our infrastructure and applications more efficiently, and they do that through software development."
"As an SRE, the challenge of scale is always a good one."
Benefits: Product stability improvements, through team collaboration. Client cost savings, due to fewer outages. Reduced daily grind managing servers and infrastructure, through more automation.

"I LOVE my job! I mean, I really love it. I enjoyed being a full-stack developer and getting to deliver major projects for clients, but I like being an SRE far more. Before, the majority of my day was spent building features that were requested by some of our clients. Now, the work I do affects the entire platform and ALL of our clients." – Molly Struve, Lead Site Reliability Engineer at Kenna Security Inc

Module One Quiz
1. The term "site reliability engineering" was created by which organization? a) Microsoft b) Apple c) Google d) LinkedIn
2. Which of these is not a pillar of DevOps success, as defined by Google? a) Leverage tooling and automation b) Implement big-bang changes c) Measure everything d) Reduce organizational silos
3. SLO is an acronym for: a) System Load Objective b) Service Life Objective c) Straight Line Organization d) Service Level Objectives
4. Toil is: a) An acronym for "Time Off In Lieu" b) Manual activities with no enduring value c) A Google product for managing SRE d) A KPI for SRE success
5. Embracing SRE will bring value to organizations facing the challenge of what? a) Loss of customers b) Staff attrition c) Scope creep d) Scale

Module One Quiz Answers
1. c) Google
2. b) Implement big-bang changes
3. d) Service Level Objectives
4. b) Manual activities with no enduring value
5. d) Scale

Module 2: Service Level Objectives & Error Budgets
Module 2 content: Service Level Objectives; Error Budgets; Error Budget Policies. Video: Risk & Error Budgets (Google). Case Stories: Evernote, Home Depot. Discussion: Enforcing the Availability SLO. Exercise: SLOs for Your Organization.
What is an SLO?
An SLO ("Service Level Objective") is a goal for how well a product or service should operate. SLOs are tightly related to the user experience – if SLOs are being met then the user will be happy. Setting and measuring service level objectives is a key aspect of the SRE role. The most widely tracked SLO is availability, but products and services could (and should) have several SLOs. SLOs are about making the user experience better.

SLOs Are for Business
"Before getting into the technical details of a SLO, it is important to start the conversation from your customers' point of view: what promises are you trying to uphold?" – Ben McCormack, VP Operations, Evernote

Example 1: SLOs & Error Budgets
We decide that 99.9% of web requests (www.....) per month should be successful – this is a "service level objective". If there are 1 million web requests in a particular month, then up to 1,000 of those are allowed to fail – this is an "error budget". Failure to hit an SLO must have consequences – if more than 1,000 web requests fail in a month then some remediation work must take place – this is an "error budget policy". – Yaroslav Molochko, SRE Team Lead, AnchorFree

Example 2: SLOs & Error Budgets
Our service has an average login rate of 1,000 per hour in a rolling 31-day period (month), or 744,000 per month (31 * 24 * 1,000). We want 99% of logins each month to be successful – this is a "service level objective". This equates to "losing" roughly 7,440 logins a month – this is the "error budget". If more than 7,440 logins are lost in a month then we have breached the error budget. We use a service level indicator (SLI) to tell us how many actual logins we get in a month. For a particular month our actual logins were 726,560 – exceeding our error budget. Failure to hit an SLO must have consequences: in this case we instigated a business protection period preventing new releases – this is the "error budget policy".

Example 3: SLOs & Error Budgets
We decide that 75% of support tickets must complete automatically – this is a "service level objective". If there are 1,000 new support tickets raised each month, 250 can be handled manually – this is an "error budget". Failure to hit an SLO must have consequences – if more than 250 support tickets in a month require manual effort then some engineering work must take place – this is an "error budget policy".
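The arithmetic behind all three examples follows one pattern: the error budget is the total event volume multiplied by the allowed failure fraction. A minimal Python sketch of that calculation, using the figures from the examples above (the function names are illustrative, not from any official SRE toolkit):

```python
def error_budget(total_events: int, slo_target: float) -> int:
    """Events allowed to fail while still meeting the SLO.

    slo_target is the success objective as a fraction, e.g. 0.999 for 99.9%.
    """
    return round(total_events * (1 - slo_target))

def budget_breached(failed_events: int, total_events: int, slo_target: float) -> bool:
    """True if failures exceed the error budget, so the policy should trigger."""
    return failed_events > error_budget(total_events, slo_target)

# Example 1: 99.9% of 1,000,000 web requests -> 1,000 may fail
print(error_budget(1_000_000, 0.999))                     # 1000

# Example 2: 99% of 744,000 logins -> 7,440 may fail;
# 726,560 successful logins means 17,440 failed, breaching the budget
print(budget_breached(744_000 - 726_560, 744_000, 0.99))  # True

# Example 3: 75% of 1,000 tickets auto-resolved -> 250 may need manual effort
print(error_budget(1_000, 0.75))                          # 250
```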
This requests in a month usually way you know you are running means scalability issues so as fast as you can (velocity) “ops” need to do something without compromising availability Module 2: SLO's & Error Budgets 45 © DevOps Institute unless otherwise stated Error Budgets – Fixed? But watch out – high-risk deployments or large ”big-bang” changes have more likelihood of issues and therefore more chance of the error budget being blown This should encourage the Lean preference for small changes (”smaller batch size”) to stay within the error budget. In some cases the error budget may need to change to accommodate complex releases but this needs to be agreed between Dev and Ops and the Business Module 2: SLO's & Error Budgets 46 © DevOps Institute unless otherwise stated Consequence of Missed SLO’s Missed SLO’s have noticeable impacts on business performance Lost Revenue 70% Drop in Employee Productivity 57% Lost Customers 49% Social Media Backlash 36% Module 2: SLO's & Error Budgets 48 © DevOps Institute unless otherwise stated Consequences “There will be no new feature launches allowed. Sprint planning may only pull post- mortem action items from the backlog. Software Development Team must meet with SRE Team daily to outline their improvements.” Jennifer Petoff, Google © DevOps Institute unless otherwise stated Module 2: SLO's & Error Budgets 49 Another Example: Availability Your organization sets an availability SLO of 99.9% Every month this allows for 43 minutes of outages – the ”error budget” New feature releases, patches, planned and un- planned downtime need to fit into this 43 minutes Module 2: SLO's & Error Budgets 50 © DevOps Institute unless otherwise stated CASE STORY: Evernote “We needed to “We wanted to ensure we initially focused on the most ensure the move to important and common customer need: the GCP (Google Cloud availability of the Evernote service for users to access Platform) did not dilute or mask our and sync their content across multiple clients. Our SLO commitment to our journey started from that goal.” users.” Benefits Consistent focus on the user experience, whilst obtaining the benefits of cloud adoption Join clarity around service availability and downtime Monitoring the right things, from a user perspective Module 2: SLO's & Error Budgets 52 © DevOps Institute unless otherwise stated CASE STORY: Home Depot “Beforehand THD didn’t have a culture of SLO’s. Monitoring tools and dashboards were plentiful, but were scattered everywhere and didn’t “We established a track data over time. We weren’t always able to pinpoint the service at the root of a given outage. Often, we began troubleshooting at the catchy acronym user-facing service and worked backward until we found the problem, (VALET; as wasting countless hours. If a service required planned downtime, its discussed later) to dependent services were surprised. These disconnects caused confusion help the idea and disappointment between our software development and spread.” operations teams. We needed to address these disconnects by building a common culture of SLO’s.” Benefits Clearly understood SLO’s across the organization Wider involvement in setting SLO’s Joint responsibility model across Dev and Ops Module 2: SLO's & Error Budgets 53 © DevOps Institute unless otherwise stated The VALET Dimensions of SLO Dimension SLO Budget Policy V Volume/traffic Does the service handle the right volumes of data or Budget: 99.99% of HTTP requests per month Address scalability issues traffic? 
CASE STORY: Evernote
"We wanted to ensure we initially focused on the most important and common customer need: the availability of the Evernote service for users to access and sync their content across multiple clients. Our SLO journey started from that goal."
"We needed to ensure the move to GCP (Google Cloud Platform) did not dilute or mask our commitment to our users."
Benefits: Consistent focus on the user experience, whilst obtaining the benefits of cloud adoption. Joint clarity around service availability and downtime. Monitoring the right things, from a user perspective.

CASE STORY: Home Depot
"Beforehand THD didn't have a culture of SLOs. Monitoring tools and dashboards were plentiful, but were scattered everywhere and didn't track data over time. We weren't always able to pinpoint the service at the root of a given outage. Often, we began troubleshooting at the user-facing service and worked backward until we found the problem, wasting countless hours. If a service required planned downtime, its dependent services were surprised. These disconnects caused confusion and disappointment between our software development and operations teams. We needed to address these disconnects by building a common culture of SLOs."
"We established a catchy acronym (VALET, as discussed later) to help the idea spread."
Benefits: Clearly understood SLOs across the organization. Wider involvement in setting SLOs. Joint responsibility model across Dev and Ops.

The VALET Dimensions of SLO
- V (Volume/traffic): Does the service handle the right volumes of data or traffic? Budget: 99.99% of HTTP requests per month succeed with 200 OK. Policy: address scalability issues.
- A (Availability): Is the service available to users when they need it? Budget: 99.9% availability/uptime. Policy: address downtime issues/outages; zero-downtime deployments.
- L (Latency): Does the service deliver in a user-acceptable period of time? Budget: payload of 90% of HTTP responses returned in under 300ms. Policy: address performance issues.
- E (Errors): Is the service delivering the capabilities being requested? Budget: 0.01% of HTTP requests return 4xx or 5xx status codes. Policy: analyze and respond to the main status codes; new functionality or infrastructure may be required.
- T (Tickets): Are our support services efficient? Budget: 75% of service tickets are automatically resolved. Policy: automate more manual processes.

Module Two Quiz
1. SLO is an acronym for: a) Serious Local Outage b) Service Level Outcome c) Stored Local Object d) Service Level Objective
2. Which of these is not a recognized SLO? a) Availability b) Latency c) Response time d) Total cost of ownership
3. Latency is: a) Another name for velocity b) The time taken for a response to be delivered to a user c) A web request that fails d) An indicator of how well tested a service is
4. Which two items give a higher risk of an error budget being exceeded? a) Automating the creation of users b) A big-bang release c) Rejecting all HTTP requests between 11pm and midnight d) Optimizing the speed of your network
5. The benefit of adopting SLOs in conjunction with your users is what? a) Less chance the user experience will be compromised b) Delivery velocity will be increased c) Development teams deliver better features d) Error budgets are easier to stay within

Module Two Quiz Answers
1. d) Service Level Objective
2. d) Total cost of ownership
3. b) The time taken for a response to be delivered to a user
4. b) A big-bang release, and c) Rejecting all HTTP requests between 11pm and midnight
5. a) Less chance the user experience will be compromised

Module 3: Reducing Toil
Module 3 content: What is toil? Why toil is bad; Doing something about toil. Video: Pragmatic Automation (Google Cloud). Case Story: Accenture. Discussion: Benefits of toil reduction on individuals & teams. Exercise: Reducing Toil.
What is Toil?
"Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows." – Vivek Rau, Google

Work is toil if it is: manual, repetitive, automatable, tactical, of no enduring value, and scales linearly. Examples:
- Manual or semi-manual releases
- Connecting to infrastructure to check something
- Constant password resets
- Doing the same test over and over
- Acknowledging the same alert every morning
- Dealing with interrupts
- Physical meetings to approve production deployments
- Manual starts/resets of equipment and components
- Creating users
- Known workarounds
- On-call responses
- Extracting some data
- Manually scaling infrastructure
Toil: finally, a name for a problem we've all felt.

Toil is NOT: stuff I don't like doing; meetings, community events, planning sessions, HR events; or regular work, such as setting up that new device, developing that new alerting configuration for your service, or working to remove clutter.

Why Toil is Bad – Impact of High Toil
- Slow progress. Individual: manual work and firefighting (toil) take up the majority of time. Organization: new features do not get released quickly, a missed value opportunity; shortage of team capacity.
- Poor quality. Individual: manual work often results in mistakes, which are time-consuming to fix and hurt reputation. Organization: excessive costs in support of services.
- Career stagnation. Individual: career progress slows down due to working on the same things, with no time for skills development; the best engineers work on low-level requests. Organization: reputational damage, not a great place to work; staff attrition rates increase.
- Attritional. Individual: toil is demotivating, meaning people start looking elsewhere. Organization: staff turnover results in extra costs and lost knowledge.
- Unending. Individual: a never-ending deluge of manual tasks, no time to find solutions, more time spent managing the backlog of tasks than fixing them. Organization: toil requires engineering effort to fix; if there is no engineering time available it won't be fixed; SLAs being breached.
- Burnout. Individual: personal and health problems due to overload and disruptive work patterns. Organization: potential for litigation and negative publicity.

Engineering Bankruptcy
"If you aren't careful, the level of toil in an organization can increase to a point where the organization won't have the capacity needed to stop it." – Damon Edwards, Rundeck

"SRE is what happens when you ask a software engineer to design an operations team." – Benjamin Treynor Sloss, VP 24x7 Engineering at Google

Reducing Toil Requires Engineering Time
Engineering work needed to reduce toil will typically be a choice of:
- Creating external automation (i.e. scripts and automation tools outside of the service)
- Creating internal automation (i.e. automation delivered as part of the service), or
- Enhancing the service to not require intervention

Making Engineering Time Available
Google has an advertised goal of keeping operational work (i.e. toil) below 50% of an engineer's time. At least 50% of each SRE's time should be spent on engineering project work that will either reduce future toil or add service features. The 50% rule ensures that one team or person does not become the "ops" team/person. 50% is an average, to reflect real-world scenarios.

Moving Towards SRE at Slack
Slack moved from 100 AWS instances to 15,000 instances over 4 years. Excessive toil was caused by low-quality, noisy alerting. Ops teams were so consumed by interrupt-driven toil that they were unable to make progress on improving reliability. Slack explicitly committed to the importance of reliability over feature velocity. Operational ownership of services was pushed back into the dev teams, resulting in the teams making the code fixes necessary to stop the incident alerts.

Is it Worth Automating Everything?
Video: Pragmatic Automation with Max Luebbe of Google Cloud (04:45)

CASE STORY: Accenture
"The team supporting the platform were inundated with toil to the point where they could do little else."
"So our initial exercise was to come up with a view about what constitutes toil for the ADOP team. We implemented 100 percent use of Jira tickets for tracking work. No gaps. We changed our Jira workflow to pop up the worklog window on every state transition. This allowed us to capture the amount of time spent on each ticket. What did we do with this information? We built automation to remove the need for toil-related work. But, crucially, because we were able to demonstrate just how much above 50 percent of our time was spent on toil, we had a clear mandate to prioritize toil payback stories and to actually work on them."
Benefits: Hugely positive in protecting the team. Reduced staff turnover. All work made visible.
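Accenture's approach boils down to measuring what fraction of logged time lands on toil-tagged work. A hypothetical sketch of that measurement in Python (the ticket structure and the "toil" label are invented for illustration, not Accenture's actual Jira schema):

```python
from dataclasses import dataclass

@dataclass
class Worklog:
    ticket_id: str
    hours: float
    labels: frozenset  # e.g. frozenset({"toil"}) for toil-related tickets

def toil_percentage(worklogs: list[Worklog]) -> float:
    """Share of logged time spent on tickets labelled as toil."""
    total = sum(w.hours for w in worklogs)
    toil = sum(w.hours for w in worklogs if "toil" in w.labels)
    return 100 * toil / total if total else 0.0

logs = [
    Worklog("OPS-1", 6.0, frozenset({"toil"})),  # manual deployment
    Worklog("OPS-2", 2.0, frozenset()),          # feature work
    Worklog("OPS-3", 4.0, frozenset({"toil"})),  # password resets
]
print(f"{toil_percentage(logs):.0f}% of time on toil")  # 83% -> mandate to automate
```

Being able to show a number above 50% is what gave the team the mandate to prioritize toil-payback work.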
Module Three Quiz
1. Which of these is an example of toil? a) Auto-scaling of cloud infrastructure b) Self-service data queries c) Automated password resets d) Manual deployments
2. Engineering bankruptcy is when: a) Fixing toil is not prioritized by the business b) Engineers work 100% on toil c) The development team have no money d) There are no engineers available to work on a task
3. Which of these approaches can be used to reduce toil? a) Hire more engineers b) Shift ownership of toil to the development team c) Build a bigger datacenter d) Reading the Google SRE handbook
4. Which of these is not a way of fixing toil? a) Outsourcing operations b) Internal automation c) External automation d) Service enhancements
5. Which of these can you automate? a) Brainstorming sessions to solve a problem b) Taking time off c) Building infrastructure d) Promotion decisions

Module Three Quiz Answers
1. d) Manual deployments
2. b) Engineers work 100% on toil
3. b) Shift ownership of toil to the development team
4. a) Outsourcing operations
5. c) Building infrastructure

Module 4: Monitoring & Service Level Indicators
Module 4 content: SLIs – Service Level Indicators; Monitoring; Observability. Video: SLIs & Reliability Deep Dive (Microsoft). Case Story: Trivago. Discussion: What do you monitor now? Exercise: Set measurable objectives for your services.

SLIs for Measurement
"SLIs are ways for engineers to communicate quantitative data about systems." – Ram Lyengar, Plumbr.io

Let's Revisit an Earlier Example
We decided that 99.9% of web requests (www.....) per month should be successful – this was the "service level objective". If there were 1 million web requests in a particular month, then up to 1,000 of those were allowed to fail – this was the "error budget". In this example the "service level indicator" (SLI) is "web requests", so we need a way to track and record this data.

SLI Measurement
While many numbers can function as an SLI, it is generally recommended to treat the SLI as the ratio of two numbers: the number of good events divided by the total number of events. For our example this is: number of successful (HTTP) web requests / total (HTTP) requests (the success rate).

Many indicator metrics are naturally gathered on the server side, using a monitoring system such as Prometheus, or with periodic log analysis – for instance, HTTP 500 responses as a fraction of all requests.
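A minimal sketch of that good-events-over-total ratio, computed here from a list of response status codes (the log format and the choice of "no 5xx means good" are illustrative assumptions):

```python
def availability_sli(status_codes: list[int]) -> float:
    """SLI = good events / total events; here 'good' means no 5xx response."""
    total = len(status_codes)
    good = sum(1 for code in status_codes if code < 500)
    return good / total if total else 1.0

# One 500 in ten requests -> a 90% success rate to compare against the SLO
codes = [200, 200, 301, 200, 404, 200, 500, 200, 200, 200]
print(f"SLI: {availability_sli(codes):.1%}")  # SLI: 90.0%
```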
Some service level indicators may also need client-side data collection, because not measuring behavior at the client can miss a range of problems that affect users but don't affect server-side metrics.

SLI measurement also needs to be time-bound in some way. The time horizon may vary depending on the organization and the SLO. For web requests per month, the time horizon is clear; SLOs such as "successful bank payments" may require a broader horizon if bank payments are only made once or twice per month.

SLOs & SLIs
We use monitoring tools to measure SLIs constantly, aggregating across suitable time periods. Our SLOs are what we expect – monitoring our SLIs will tell us whether we are meeting an SLO or not, and also how much of our error budget is left (if any).
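A sketch of that "aggregating across suitable time periods" idea: collapsing interval samples of good/total counts into a single SLI over a rolling window such as 30 days (the Sample structure is invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float  # seconds since epoch
    good: int         # good events observed in this interval
    total: int        # total events observed in this interval

def windowed_sli(samples: list[Sample], now: float, window_seconds: float) -> float:
    """Aggregate an SLI over a rolling time window, e.g. the last 30 days."""
    recent = [s for s in samples if now - s.timestamp <= window_seconds]
    total = sum(s.total for s in recent)
    good = sum(s.good for s in recent)
    return good / total if total else 1.0

# e.g. windowed_sli(samples, now=time.time(), window_seconds=30 * 24 * 3600)
```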
Video: SLI & Reliability Deep-Dive with David N. Blank-Edelman (Microsoft) (08:35)

Monitoring Definitions
Monitoring: system monitoring is the use of a hardware or software component to monitor the system resources and performance of a computer system.
Telemetry: the highly automated communications process by which measurements are made and other data collected at remote or inaccessible points and transmitted to receiving equipment for monitoring. The word is derived from Greek roots: tele = remote, and metron = measure.
Application Performance Management (APM): the monitoring and management of performance and availability of software applications. APM strives to detect and diagnose application performance problems to maintain an expected level of service.

Monitoring Anatomy – what we need:
- Agents: installed on the hosts to be monitored; pass information to the Core.
- Core: holds configuration about hosts/services; distributed across a number of masters; check execution (poke) and result queue (poke response); checks/thresholds against the metrics collected.
- Alerting: delivery of appropriate information to people in a position to respond; anomaly detection.
- Graphing: dashboards and UI displays of SLOs and associated SLIs; aggregation across a time horizon; graphing at an appropriate scale.

SLI Supporting Tools
Monitoring: Catchpoint, Nagios, Prometheus, Splunk. Graphing: Grafana, Collectd. Logging: Logstash, Rsyslogd, Collectd. Alerting: PagerDuty.

Monitor SLIs
"We need to make sure that monitoring is effective without drowning in a sea of non-actionable alerts. The path to success is to instrument everything, but only monitor what truly matters." – Todd Palino, Senior Staff Engineer, Site Reliability at LinkedIn

CASE STORY: Trivago
"My job is to make trivago products faster for customers accessing our services anywhere on the globe. Our SLOs focus on hotel search response times, not on specific parts of our service (or technology stack). You can only improve [SLOs] that you can measure, and digital end-user perspective monitoring [using tools like Catchpoint] really allows us to check that our hotel search response time indicator is delivering our SLO."
"We have full visibility of how our services and systems are delivering hotel search services internationally."
Benefits: SLOs focused on the user experience, not tech. Can spot performance issues with an ISP or CDN anywhere in the world. Data collected by monitoring is used in diagnostics/RCA. Feedback loop to Dev teams results in the optimal solution.

Monitoring & Observability
Distributed, complex services running at scale with unpredictable users and variable throughput mean there are millions of different ways that things can go wrong, and we can't anticipate them all (the monitoring myth). Externalizing all the outputs of a service allows us to infer the internal state of that service (making it observable).

monitor: [ mon-i-ter ] verb (used with object): to observe, record, or detect (an operation or condition) with instruments that have no effect upon the operation or condition. verb (used without object): to serve as a monitor, detector, supervisor, etc.

"Monitoring is a verb; something we perform against our applications and systems to determine their state. From basic fitness tests and whether they're up or down, to more proactive performance health checks. We monitor applications to detect problems and anomalies." – Peter Waterhouse, CA

observable: [ uh b-zur-vuh-buh l ] capable of being or liable to be observed; noticeable; visible; discernible; deserving of attention; noteworthy.

"Observability, as a noun, is a property of a system; it's a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Therefore, if our IT systems don't adequately externalize their state, then even the best monitoring can fall short." – Peter Waterhouse, CA

Why Observability is Important
Bolting on monitoring tools after the event does not scale: rapid rate of service growth; dynamic architectures; container workloads; dependencies between services; customer experience matters more.

Observability = Better Alerting
We need to improve our "signal" to "noise" ratio so we focus alerts on key issues:
1. Generate one alert for one service (versus one metric)
2. Use analytics to learn normal behavior
3. Improve alerting with multi-criteria alerting policies
We need to infer what is "normal" about a service.

What Observability Looks Like
Distributed tracing; event logging; internal performance data; application instrumentation; identifying individual user experiences; fewer paging alerts, not more; inquisitive / what-if questions.

SLOs, SLIs & Observability
SLOs are from a user perspective and help identify what is important, e.g. 90% of users should complete the full payment transaction in less than one elapsed minute.
SLIs give detail on how we are currently performing, e.g. 98% of users in a month complete a payment transaction in less than one minute.
Observability gives us the normal state of the service, e.g. 38 seconds is the "normal" time it takes users to complete a payment transaction when all monitors are healthy.

Be Proactive
"This rich ecosystem of introspection and instrumentation is not particularly biased towards the traditional monitoring stack's concerns of actionable alerts and outages." – Charity Majors, CEO, Honeycomb

EXERCISE: Set measurable objectives for your service
1. Map your user journeys
2. Prioritize the most important user journey
3. Define what "good" means to users
4. Map out high-level system components
5. Define the SLIs
6. Make the service "observable"

Module Four Quiz
1. What does SLI stand for? a) Service Life Indicator b) Service Level Integration c) Service Level Indicator d) Service Light Indicator
2. SLIs are best represented as: a) A ratio of two numbers b) A fixed list of allowable values c) The format HH:MM d) Hexadecimal
3. What approach would you use to measure the internal performance of a service? a) Canary testing b) Application Performance Management c) Ping d) Selenium
4. Which of these is a popular monitoring tool? a) Bladerunner b) Hal c) Yoda d) Prometheus
5. Which of these is not a technical element of observability? a) Number of failed logins b) Distributed tracing c) Event logging d) Internal performance data

Module Four Quiz Answers
1. c) Service Level Indicator
2. a) A ratio of two numbers
3. b) Application Performance Management
4. d) Prometheus
5. a) Number of failed logins

Course Content recap – Day 2: Module 5 SRE Tools & Automation; Module 6 Anti-Fragility & Learning from Failure; Module 7 Organizational Impact of SRE; Module 8 SRE, Other Frameworks, The Future; Sample Examination Review; Examination Time.

Module 5: SRE Tools & Automation
Module 5 content: Automation Defined; Automation Focus; Hierarchy of Automation Types; Secure Automation; Automation Tools. Video: Ironies of Automation (Microsoft). Case Story: Standard Chartered. Discussion: Automation "Greatest Hits". Exercise: How Much Automation Do You Have?

Automation Gives Us
Consistency – a machine will be more consistent than a human. A platform upon which to build, re-use and extend. Faster action, faster fixes. Time savings. "For SRE, automation is a force multiplier, not a panacea." – Niall Murphy, Google SRE

Automation Requires
A problem to be solved: eliminating toil, or improving SLOs. Appropriate tooling. Engineering effort. Measurable outcomes.
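As a concrete (hypothetical) instance of "a problem to be solved, appropriate tooling, measurable outcomes": external automation, in Module 3's sense, that replaces a manual check-and-restart task. The service name, health endpoint and use of systemctl are all assumptions for illustration:

```python
import subprocess
import urllib.request

SERVICE = "payments"                          # hypothetical systemd unit
HEALTH_URL = "http://localhost:8080/healthz"  # hypothetical health endpoint

def healthy() -> bool:
    """Probe the health endpoint - the manual check this script replaces."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP error
        return False

def remediate() -> None:
    """Restart the service - previously a manual, interrupt-driven task."""
    subprocess.run(["systemctl", "restart", SERVICE], check=True)

if not healthy():
    remediate()
    print(f"{SERVICE} restarted")  # log it, so the toil saved can be measured
```

Run from a scheduler, a script like this turns a recurring interrupt into a logged, countable event, which is exactly the "measurable outcome" the slide asks for.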
Automation Focus: Typical DevOps Delivery Pipeline
Commit (ID: 113, committer: jdoe) → Build → Run Unit Tests → Code Analysis → Create Test Env → Deploy Code → Load Test Data → Run Tests → Create Pre-Prod → Deploy Code → Run Perf Test → Run Security Test → Check Monitors → Create Prod → Prod Deploy

A lot of automation effort is "Dev"-led (left-to-right): DevOps with a "big Dev and small Ops". Features are being pushed to those supporting production in ever-increasing numbers. Developers hope/assume that environments are consistent. Testing steps introduce false confidence, as production is always unique. Monitoring and alerting is focused on things that are known to go wrong.

A Change in Focus: SRE-Led Service Automation
In SRE-led service automation, automation effort is "Ops"-led ("shifting left"), to ensure reliability engineering priorities. The checklist:
1. Environments provisioned using Infrastructure/Config as Code
2. Automated functional and non-functional tests in production
3. Versioned (& signed) artifacts to deploy system components
4. Instrumentation in place to make the service externally observable
5. Future growth envelope outlined
6. Clear anti-fragility strategy

1. Infrastructure/Config as Code
Environments must be provisioned as Infrastructure- (and Configuration-) as-Code, and all code can be rebuilt from a code repository, e.g. GitLab, Azure DevOps, Bitbucket.
- Infrastructure as Code (IaC): servers, networks, storage. Example tools: Terraform, AWS CloudFormation, Azure Resource Manager.
- Configuration as Code (CaC): software, dependencies, containers. Example tools: Puppet, Chef, Ansible, Saltstack, Docker, GCP Deployment Manager.

2. Automated functional and non-functional tests in production
Environment progression includes prod – Dev, Test, Pre-Prod, Prod (inc. "hidden live") – and the pipeline is extended with "Run Tests" and "Run NFTs" stages. Test history is recorded in the pipeline logs.
- Extend the build pipeline with automated functional tests. Example tools: Selenium, Cucumber, Jasmine, Mocha, Zephyr, Mockito.
- Extend the test pipeline with automated non-functional tests. Example tools: JMeter, Sonatype Nexus Lifecycle, SoapUI, WhiteSource, Veracode, Nagios.

3. Versioned (& signed) artifacts
All service components, libraries and dependencies (or containers) are stored in an artifact repository.
- Digitally versioned, with semantic versioning (x.y.z). Example tools: Nexus, Artifactory.
- Digitally signed, for security and auditability. Example tools: Nexus, Artifactory.

4. Instrumentation
Alignment with SLAs, SLOs and SLIs, and telemetry everywhere.
- Service Level Indicators are understood and published. Example tool: OpsGenie.
- Instrumentation provides additional data and analytics. Example tools: Nagios, Dynatrace, AppDynamics, Prometheus.
- Log files aggregated and ready for access. Example tools: Splunk, LogStash.
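A minimal sketch of checklist item 4, exposing SLI-ready telemetry from inside a service using the Python prometheus_client library (the metric names, port and failure rate are illustrative assumptions):

```python
import random
import time

from prometheus_client import Counter, start_http_server

# Counters from which an availability SLI (good/total) can be derived
REQUESTS = Counter("http_requests_total", "Total HTTP requests handled")
FAILURES = Counter("http_request_failures_total", "Requests that returned 5xx")

def handle_request() -> None:
    REQUESTS.inc()
    if random.random() < 0.001:  # stand-in for a real failure path
        FAILURES.inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at :8000/metrics for scraping
    while True:
        handle_request()
        time.sleep(0.01)
```

A monitoring system such as Prometheus can then scrape these counters and compute the success-rate SLI described in Module 4.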
