SRE-Led Service Automation PDF

SRE-Led Service Automation

1. Environments provisioned using Infrastructure/Config as Code
2. Automated functional and non-functional tests in production
3. Versioned (& signed) artifacts to deploy system components
4. Instrumentation in place to make the service externally viewable
5. Future growth envelope outlined
6. Clear anti-fragility strategy

Consider | What | Example Tools
Disaster Recovery (DR) | Tests complete | Fire drills
Chaos engineering | Practiced | Chaos Monkey
On call mechanisms | In place | PagerDuty, VictorOps, Squadcast

Module 5: SRE Tools & Automation
© DevOps Institute unless otherwise stated

SRE-Led Service Automation

Overall there is more focus on prod: DevOps gains "wisdom of production", SRE can say "No"

SRE Automation Is Not Just Service Automation

"The team supporting the platform were inundated with toil to the point where they could do little else."
Mark Rendell, Accenture

Hierarchy of Automation Types

Systems that require no intervention: The database notices problems and automatically fails over without human intervention.
Internally maintained system-specific automation: The database ships with its own failover script that is used if there is a problem.
Externally maintained generic automation: The SRE adds database support to a "generic failover" script that everyone uses if there is a problem.
Externally maintained system-specific automation: An SRE has a failover script in his or her home directory that is used if there is a problem.
None: Database master is failed over manually if there is a problem.

Secure Automation

Automation removes the chance of human error (or willful sabotage) and provides security opportunities
We can secure automated steps in the pipeline; we cannot provably secure manual steps
Artifacts generated and used by the pipeline can be validated and checked for compliance
DevSecOps introduced security into the build-test-deploy lifecycle
SRE places extra emphasis on security of production

Secure Build

Application, infrastructure and configuration code changes run through code analysis tools that check for security issues
Digitally signing build artifacts avoids possibility of "fake" code
Secure coding practices widely published and embraced
Secure code repositories with access control
Coding done in the open ("open source") with community feedback

"Having proper configuration management does play a huge role in compliance."
Will Gregorian, Director of Security Technical Operations, Omada Health

Secure Test

When infrastructure or configuration is changed, test environments do not mutate; they are "immutable" and are re-created, guaranteeing compliance with the code repository
We deploy the artifacts securely built by our engineers to our test environments
Test data and test scenarios built to test security
Secure Staging

Staging environments are also immutable
Same artifact deployed to staging/pre-prod environment
Production-like data introduced into staging introduces data security considerations e.g. GDPR, PCI
Dedicated security scanning will try and uncover security vulnerabilities
Dependencies and integration to other services may introduce vulnerabilities O’Reilly or proxy security requirements

Secure Production

Production environments are also immutable
Same artifact deployed to production environment
Production data requires data security compliance e.g. GDPR, PCI, SOX
Dedicated security scanning will try and uncover security vulnerabilities
Regulatory compliance needs to be evidenced
Failure testing can help with audit compliance

Automation Tools

Any discussion around tools quickly turns into a "favorite tech" talk
Tools are constantly changing
Organizations will have bias towards certain types of tools, from "open source" to "big IT" vendors
Autonomy to use the most appropriate tools to do a job is often best delegated to those with the most knowledge of the job
Engineers will be more productive with the tools they are familiar with
You wouldn't hire a plumber and give them a joiner's tools, would you?

CASE STORY: Standard Chartered

"We established some fundamental SRE principles that are being applied across the bank's global footprint, which include 'everything as code', 'everything via APIs', 'one-pipeline', 'self-test and self-heal' as well as 'zero-touch automate and orchestrate'. We are really now seeing the benefits of the organisation adopting these principles."
Shaun Norris, Global Head, Cloud, Standard Chartered

"We gave the ... Cord' so they can ..."

Benefits
28 person-years of effort saved
13,000+ manual reviews of production operations activities avoided
Time to Repair reduced by 25 minutes on average
200 self-inflicted operations incidents avoided

How Much Automation Do You Have?

Manage (1)

Audit Management: The use of automated tools to ensure products and services are auditable, including keeping audit logs of build, test and deploy activities, auditing configurations and users, as well as log files from production operations.
Authentication & Authorization: Mechanisms for ensuring appropriate access to products, services and tools, for example user and password management and two-factor authentication. Cloud providers use their own tools such as AWS IAM (Identity and Access Management) alongside cloud users' tools.

Manage (2)

DevOps Score: A metric showing DevOps adoption across an organization and the corresponding impact on delivery velocity.
Value Stream Management: The ability to visualize the flow of value delivery through the DevOps lifecycle. GitLab CI and DevOptics, the Jenkins extension from CloudBees, can provide this visualization.

Plan (1)

Issue Tracking: Tools like Jira, Trello, CA's Agile Central and VersionOne can be used to capture incidents or backlogs of work.
Kanban Boards: On the back of issue tracking, the same tools can represent delivery flow through Scrum and Kanban workflow boards.
Time Tracking: Similarly, issue tracking tools also allow for time to be tracked, either against individual issues or other work or project types.
Agile Portfolio Management: Involves evaluating in-flight projects and proposed future initiatives to shape and govern the ongoing investment in projects and discretionary work. CA's Agile Central and VersionOne are examples.

Plan (2)

Service Desk: ServiceNow is a widely used platform for managing the lifecycle of services as well as internal and external stakeholder engagement.
Requirements Management: Tools that handle requirements definition, traceability, hierarchies and dependencies. Often they also handle code requirements and test cases for requirements.
Quality Management: Tools that handle test case planning, test execution, defect tracking (often into backlogs), severity and priority analysis. CA's Agile Central is an example.

Create (1)

Source Code Management: Tools to securely store source code and make it available in a scalable, multi-user environment. Git and SVN are popular examples.
Code Review: Peer code reviews to check quality can be enforced through tools like Gerrit, TFS (Team Foundation Server), Crucible and GitLab.
Wiki: Knowledge sharing can be enabled by using tools like Confluence, which create a rich wiki of content.

Create (2)

Web IDE: Tools that have a web client integrated development environment. Enables developer productivity without having to use a local development tool. GitLab is an example.
Snippets: Stored and shared code snippets to allow collaboration around specific pieces of code. Also allows code snippets to be used in other code-bases. Bitbucket and GitLab allow this.
Verify (1)

Continuous Integration: "Refers to integrating, building, and testing code within the development environment" (Martin Fowler, Chief Scientist, ThoughtWorks).
Code Quality: Also referred to as code analysis. Sonar and Checkmarx are examples of tools that automatically check the seven main dimensions of code quality: comments, architecture, duplication, unit test coverage, complexity, potential defects, language rules.

Verify (2)

Performance Testing: The process of determining the speed, responsiveness and stability of a computer, network, software program or device under a workload. Tools like Gatling can performance test services.
Usability Testing: A way to see how easy something is to use by testing it with real users. Tools can be used to track how a user works with a service, e.g. with scroll recording, eye tracking, mouse tracking. Crazy Egg and Optimizely are examples.

Package (1)

Package Registry: A repository for software packages, artifacts and their corresponding metadata. Can store files produced by an organization itself or third-party binaries. Artifactory and Nexus are amongst the most popular.
Container Registry: Secure and private registry for container images, typically allowing for easy upload and download of images from the build tools. Docker Hub, Artifactory and Nexus are examples.
Dependency Proxy: For many organizations, it is desirable to have a local proxy for frequently used upstream images/packages. In the case of CI/CD, the proxy is responsible for receiving a request and returning the upstream image from a registry, acting as a pull-through cache.
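The pull-through cache behaviour described for a dependency proxy can be illustrated with a minimal sketch. This is not how any particular registry product implements it; `PullThroughCache` and `fake_registry` are hypothetical names, and the upstream fetch is a stand-in function rather than a real network call.

```python
from typing import Callable, Dict


class PullThroughCache:
    """Serve artifacts locally, going upstream only on a cache miss."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]) -> None:
        self._fetch_upstream = fetch_upstream
        self._store: Dict[str, bytes] = {}
        self.upstream_calls = 0  # counts how often the upstream was contacted

    def get(self, name: str) -> bytes:
        if name not in self._store:
            # Cache miss: fetch once from upstream and keep a local copy.
            self.upstream_calls += 1
            self._store[name] = self._fetch_upstream(name)
        return self._store[name]


def fake_registry(name: str) -> bytes:
    # Hypothetical stand-in for a request to a real upstream registry.
    return f"image-bytes-for-{name}".encode()


cache = PullThroughCache(fake_registry)
first = cache.get("alpine:3.19")   # fetched from upstream
second = cache.get("alpine:3.19")  # served from the local copy
```

Repeated requests for the same image reach the upstream only once, which is the property a CI/CD pipeline relies on when it pulls the same base images on every build.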
Package (2)

Helm Chart Registry: Helm charts describe a related set of Kubernetes resources. Artifactory and Codefresh support a registry for maintaining master records of Helm charts.
Dependency Firewall: Many projects depend on packages that may come from unknown or unverified providers, introducing potential security vulnerabilities. Dependency scanning tools only act after the packages are downloaded; a dependency firewall prevents those vulnerabilities from being downloaded to begin with.

Secure (1)

SAST: Static Application Security Testing tests applications from the "inside out" by looking at source code, byte code or binaries. GitLab and CA's Veracode are examples.
DAST: Dynamic Application Security Testing tests applications from the "outside in" to detect security vulnerabilities. GitLab and CA's Veracode are examples.
IAST: Interactive Application Security Testing combines SAST and DAST approaches but involves application tests changing in "real time" based on information fed back from SAST and DAST, creating new test cases on the fly. Synopsys, Acunetix, Parasoft and Quotium are solutions evolving in this direction.
Secret Detection: Aims to prevent sensitive information, like passwords, authentication tokens and private keys, from being unintentionally leaked as part of the repository content.

Secure (2)

Dependency Scanning: Used to automatically find security vulnerabilities in your dependencies while you are developing and testing your applications. Synopsys, Gemnasium, Retire.js and bundler-audit are popular tools in this area.
Container Scanning: When building a container image for your application, tools can run a security scan to ensure it does not have any known vulnerability in the environment where your code is shipped. Black Duck, Synopsys, Snyk, Clair and klar are examples.
License Compliance: Tools, such as Black Duck and Synopsys, that check that licenses of your dependencies are compatible with your application, and approve or blacklist them.

Secure (3)

Vulnerability Database: Aimed at collecting, maintaining and disseminating information about discovered computer security vulnerabilities. This is then checked as part of the delivery pipeline.
Fuzzing: Fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected or random data as inputs to a service and watching for the results.

Release (1)

Continuous Delivery: "Is a software development discipline where you build software in such a way that the software can be released to production at any time" (Martin Fowler, Chief Scientist, ThoughtWorks).
Release Orchestration: Typically a deployment pipeline, used to detect any changes that will lead to problems in production. Orchestrating other tools will identify performance, security or usability issues. Tools like Jenkins and GitLab CI can "orchestrate" releases.
Pages: For creating supporting web pages automatically as part of a CI/CD pipeline.
Review Apps: Allow code to be committed and launched in real time; environments are spun up to allow developers to review their application. GitLab CI has this capability.
Incremental Rollout: Deploying many small, gradual changes to a service instead of a few large changes. Users are incrementally moved across to the new version of the service until eventually all users are moved across. Sometimes referred to by colored environments, e.g. blue/green deployment.
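One way to move users across incrementally is to give every user a stable bucket and compare it with the current rollout percentage. The sketch below assumes that approach; the function names and the choice of SHA-256 are illustrative, not taken from any specific deployment tool.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def serves_new_version(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user should be routed to the new version."""
    return rollout_bucket(user_id) < rollout_percent


users = [f"user-{n}" for n in range(1000)]
on_new_at_10 = {u for u in users if serves_new_version(u, 10)}
on_new_at_50 = {u for u in users if serves_new_version(u, 50)}
```

Because the bucket depends only on the user id, raising the percentage only ever adds users to the new version; nobody who has already been moved is sent back to the old one unless the percentage is lowered.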
Release (2)

Canary Deployments: Similar to incremental rollout, it is where a small portion of the user base is updated to a new version first. This subset, the canaries, then serves as the proverbial "canary in the coal mine". If something goes wrong then the release is rolled back and only a small subset of the users is impacted.
Feature Flags: Sometimes called feature toggles, a technique that allows system behavior to change without changing the underlying code, through the use of "flags" to decide which behavior is invoked. This is primarily a programming practice, but there are tools such as LaunchDarkly which can help with flag management and invocation.
Release Governance: The controls and automation (security, compliance, or otherwise) that ensure your releases are managed in an auditable and trackable way, in order to meet the need of the business to understand what is changing. GitLab CI is an example.
Secrets Management: The tools and methods for managing digital authentication credentials (secrets), including passwords, keys, APIs and tokens, for use in applications, services, privileged accounts and other sensitive parts of the IT ecosystem.

Configure (1)

Auto DevOps: Brings DevOps best practices to your project by automatically configuring software development lifecycles. It automatically detects, builds, tests, deploys and monitors applications. GitLab and AWS CodePipeline are strong examples.
ChatOps: The ability to execute common DevOps actions directly from chat (build, deploy, test, incident management, rollback, etc.) with the output sent back to a channel.
Runbooks: A collection of procedures necessary for the smooth operation of a service. Previously manual in nature, they are now usually automated with tools like Ansible.
Serverless: A code execution paradigm where no underlying infrastructure or dependencies are needed; instead a piece of code is executed by a service provider (typically cloud) who takes over the creation of the execution environment. Lambda functions in AWS and Azure Functions are examples.

Configure (2)

Monitor (1)

Metrics: Tools that collect and display performance metrics for deployed apps, such as Prometheus.
Logging: The capture, aggregation and storage of all logs associated with system performance including, but not limited to, process calls, events, user data, responses, error and status codes. Logstash and Nagios are popular examples.
Tracing: Provides insight into the performance and health of a deployed application, tracking each function or microservice which handles a given request.
Cluster Monitoring: Tools that let you know the health of your deployment environments running in clusters such as Kubernetes.

Monitor (2)

Error Tracking: Tools to easily discover and show the errors that an application may be generating, along with the associated data.
Incident Management: Involves capturing the who, what and when of service incidents and the onward use of this data in ensuring service level objectives are being met.
Synthetic Monitoring: The ability to monitor service behavior by creating scripts to simulate the action or path taken by a customer/end-user and the associated outcome.
Status Page: Service pages that easily communicate the status of services to customers and users.
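Synthetic monitoring, as described above, amounts to replaying a scripted user journey and recording the outcome of each step. The following is a minimal sketch under that assumption: `run_synthetic_check` and the step names are hypothetical, and the steps are plain callables standing in for real requests against a service.

```python
import time
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_synthetic_check(steps: List[Step]) -> Dict[str, object]:
    """Run the scripted steps in order and stop at the first failure."""
    results = []
    for name, action in steps:
        started = time.monotonic()
        try:
            ok = bool(action())
        except Exception:
            ok = False  # an exception counts as a failed step
        results.append(
            {"step": name, "ok": ok, "seconds": time.monotonic() - started}
        )
        if not ok:
            break  # later steps depend on earlier ones succeeding
    healthy = len(results) == len(steps) and all(r["ok"] for r in results)
    return {"healthy": healthy, "steps": results}


# Stand-in actions; a real check would call the service here.
good_journey = run_synthetic_check(
    [("open home page", lambda: True), ("log in", lambda: True)]
)
broken_journey = run_synthetic_check(
    [
        ("open home page", lambda: True),
        ("log in", lambda: False),
        ("check out", lambda: True),
    ]
)
```

Recording the duration of each step as well as its outcome lets the same script feed both availability and latency indicators.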
Defend (1)

RASP: Runtime Application Self Protection, tools that actively monitor and block threats in the production environment before they can exploit vulnerabilities.
WAF: Web Application Firewall, tools that examine traffic being sent to an application and can block anything that looks malicious.
Threat Detection: The ability to detect, report and support the response to attacks. Intrusion detection systems and denial-of-service systems allow for some level of threat detection and prevention.
UEBA: User and Entity Behavior Analytics is a machine learning technique to analyze normal and "abnormal" user behavior with the aim of preventing the latter.

Defend (2)

Vulnerability Management: Ensuring that assets and applications are scanned for vulnerabilities, and the subsequent processes to record, manage and mitigate those vulnerabilities.
DLP: Data Loss Protection, tools that prevent files and content from being removed from within a service environment or organization.
Storage Security: A specialty area of security that is concerned with securing data storage systems and ecosystems and the data that resides on these systems.
Container Network Security: Used to prove that an app running on a container cluster alongside any other app is not subject to unintended use by the other app, and that there is no unintended network traffic between them.
Ironies of Automation: A Comedy in Three Parts, with Tanner Lund (Microsoft) (18:32)

Module Five Quiz

1 Chaos Monkey is used for:
a) Automated incident response
b) The development of games
c) The accuracy of Google searches
d) Anti-fragility testing

2 Which two patterns ensure services are consistent across all environments?
a) Configuration as code
b) Cloud as code
c) Security as code
d) Infrastructure as code

3 What does automation not give us?
a) Time savings
b) Consistency
c) Re-use
d) Better meetings

4 What are two benefits of using signed artifacts?
a) Prevent viruses being spread
b) Avoids the possibility of fake code being deployed
c) Compliance verification
d) Enhances user access security

5 What do we need in place to make a service externally observable?
a) Instrumentation
b) Container scanning
c) Secrets management
d) Code analysis

Module Five Quiz Answers

1 Chaos Monkey is used for:
a) Automated incident response
b) The development of games
c) The accuracy of Google searches
d) Anti-fragility testing (correct)

2 Which two patterns ensure services are consistent across all environments?
a) Configuration as code (correct)
b) Cloud as code
c) Security as code
d) Infrastructure as code (correct)

3 What does automation not give us?
a) Time savings
b) Consistency
c) Re-use
d) Better meetings (correct)

4 What are two benefits of using signed artifacts?
a) Prevent viruses being spread
b) Avoids the possibility of fake code being deployed (correct)
c) Compliance verification (correct)
d) Enhances user access security

5 What do we need in place to make a service externally observable?
a) Instrumentation (correct)
b) Container scanning
c) Secrets management
d) Code analysis

Module 6: Antifragility & Learning from Failure

Why learn from failure
Benefits of anti-fragility
Shifting the organizational balance

Component | Module 6 Content
Video | Introducing network failures (Indeed.com)
Case Story | Netflix Simian Army
Discussion | Failure is bad, for organizations and individuals
Exercise | What's the difference between outages?

Changing the Reputation of Failure

"There is no such thing as failure. There are only results. It's time to stop beating yourself up and start realizing that everything you do is a success or a learning experience."
Tony Robbins, author & life coach

Why Learning from Failure is Important

Ask why, 5 times:
We need to be better at breaking things
We need to understand how things work together
So when things break, we know how to fix them
So we can prevent things from breaking
To reduce downtime and to ensure we make money*

Risk versus Benefits
New features versus Anti-fragility

Firefighting

How often do you end up firefighting (as we call it) in the workplace?
Firefighting real fires usually requires a lot of training:
6 to 28 weeks of basic training
Specialist training (woodland) another 6 to 12 weeks
Specialist training (city) another 6 to 12 weeks
Plan for the unexpected in training (what if scenarios?)
Do we do this in tech?

Or Alternatively

Failure Happens, Use It To Your Advantage

"Failure happens, there is no way around it so stop pointing fingers.
Embracing failure will help improve MTTD and MTTR metrics. Proactively addressing failure leads to more robust systems."
Jennifer Petoff, Global Program Manager for SRE Education, Google

Improving Metrics through Antifragility

MTTD: Mean Time to Detect (Failure/Incidents)
MTTR: Mean Time to Recover/Repair (Components)
MTRS: Mean Time to Recover (Service)
Service Level Objective (SLO)
Recovery Point Objective (RPO)

MTTD: By introducing failure we optimize our monitoring, making it more likely we will detect real incidents.
MTTR: Simulating component failure allows us to create automation to try and auto-recover. We can also build in more resilience to prevent failure of single components.
MTRS: Chaos engineering approaches identify key interfaces and dependencies across services, pinpointing areas where more resilience may be required.
SLO: A fire drill where e.g. a database is taken down may result in an SLO being broken. Caching data in the case of a database outage could instead mean the SLO is met.
RPO: Introducing failure to e.g. a messaging queue may indicate excessive data loss, outside the RPO. More frequent backups of the queue data may be needed to meet the RPO.

Creating a Culture of Learning from Failure

"You're either a learning organization or you're losing to somebody who is..."
Andrew Shafer, quoted in 'Beyond the Phoenix Project'
Andrew Clay Shafer is a foundational voice in the DevOps movement

Shifting the balance

1st Play: enable the 3rd of the "three ways"
2nd Play: Benchmark v "Westrum Model"
3rd Play: Introduce "Fire Drills"
4th Play: "Chaos engineering" next steps

"The only real mistake is the one from which we learn nothing." Henry Ford

The Third Way: Continual Experimentation and Learning

The Third Way encourages a culture that fosters two things:
1. Continual experimentation, taking risks and learning from failure
2. Understanding that repetition and practice is the prerequisite to mastery.
Allocate time for the improvement of daily work
Create rituals that reward the team for taking risks
Introduce faults into the system to increase resilience
Plan time for safe experimentation and innovation (hackathons)

Shifting the balance

"We make mistakes, and we get back up ..." Jack Dorsey, Twitter

Pathological (Power-oriented) | Bureaucratic (Rule-oriented) | Generative (Performance-oriented)
Information is hidden | Information may be ignored | Information is actively sought
Messengers are 'shot' | Messengers are isolated | Messengers are trained
Responsibilities are shirked | Responsibility is compartmentalized | Responsibilities are shared
Bridging is discouraged | Bridging is allowed but discouraged | Bridging is rewarded
Failure is covered up | Organization is just and merciful | Failure causes enquiry
Novelty is crushed | Novelty creates problems | Novelty is implemented

Source: Westrum, A Typology of Organizational Cultures

High-trust organizations encourage good information flow, cross-functional collaboration, shared responsibilities, learning from failures and new ideas.
Shifting the balance

"It's fine to celebrate success, but it is more important to heed the lessons of failure." Bill Gates

Introduce Fire Drills

Fire drills build on the concepts of business continuity planning (BCP) and disaster recovery (DR), which have been around for decades
Need to ensure a business can continue to operate during unforeseen events or failures, such as natural disasters or emergencies
This is quite often an audit requirement
For a lot of organisations this is an annual data centre failover test
Fire drills are a good first step towards chaos engineering

Introduce Fire Drills

"Fire Drills" can go beyond technology:
1. Loss of facility (datacentre) or region (cloud)
2. Loss of technology (e.g. database)
3. Loss of resources (e.g. key person)
4. Loss of critical third-party vendors

Shifting the balance

"You shouldn't be afraid of failure" Niklas Zennström, Skype

What is Chaos Engineering

Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions
Most famous example is the Simian Army from Netflix
https://www.oreilly.com/library/view/chaos-engineering/9781491988459/

CASE STORY: Netflix

"The Simian Army is a suite of tools developed by Netflix to test the reliability, security, or resiliency of its Amazon Web Services* infrastructure and includes Chaos Monkey, which disables production instances at random, Latency Monkey, which simulates network delays and Chaos Gorilla, which brings down whole Amazon datacenters (AZs), amongst others."

"... we won't even notice."
Greg Orzell, Software Engineer, Netflix

Benefits
Minimize impact on customer experience and hence revenue
Automated "self healing" improves staff morale due to fewer "on call" incidents

Chaos Engineering Next Steps

1. Segregate the system into key components
2. Test the system without key components being available
3. Break the system (in non-prod environments first)
4. Introduce failure of key components in prod
5. Introduce database failure in prod
6. Introduce a total system failure in prod

Chaos Engineering Next Steps

1. Look at "holistic" logging (e.g. what keeps full service up?)
2. Identify dependencies
3. Improve error handling and recovery (manual-to-automated)
4. Learn from "real" failures

Chaos Engineering Next Steps

Chaos engineering also helps us to "minimize the blast radius"
Any outage should affect as little of the ecosystem as possible
Failure testing shows the current span of this radius

Chaos Engineering Next Steps

What automated recovery looks like:
1. Create immutable infrastructure as code
2. Cover all system state and functional code with automated tests
3. Deploy holistic logging and monitoring systems; make services "observable"
4. Implement smart alerting, triggers and prescriptive analytics
5. Create the self-healing infrastructure/applications where most appropriate
6. Test! Then test again!
Do not allow SSH / RDP in production (unless there is a real new problem)

Automation Can Help

Tools like Kubernetes and AWS Auto-Scaling can:
1. Detect impaired instances of servers or containers and destroy/replace
2. Maintain infrastructure at a (code) defined level
3. Automatically scale infrastructure/applications up and down based on demand
4. Execute maintenance commands "on the fly" e.g. to re-index databases when queries are running slow
5. Integrate with monitoring services
Look to leverage the capabilities of tools and platforms before building

'Sloth, a Tool for Inducing Network Failures' with Preetha Appan, Indeed.com (04:45)

Module Six Quiz

1 Who created the Chaos Monkey service?
a) Amazon Prime
b) BBC iPlayer
c) Netflix
d) Disney

2 Which metric is most concerned with early failure detection?
a) MTTR
b) MTTD
c) MTTA
d) MTTF

3 Which approach can help understand the impact of a key person dependence?
a) Squad health check
b) Value stream map
c) Chaos Monkey
d) Fire Drill

4 If service failures are usually covered up by an organization then that organization is what?
a) Hedonistic
b) Pessimistic
c) Neurologic
d) Pathologic

5 If a service recovery point objective (RPO) is ten minutes then how frequently must the service be backed up?
a) At less than ten minute intervals
b) Every ten minutes
c) At least daily
d) Backups do not need to be taken

Module Six Quiz Answers

1 Who created the Chaos Monkey service?
a) Amazon Prime
b) BBC iPlayer
c) Netflix (correct)
d) Disney

2 Which metric is most concerned with early failure detection?
a) MTTR
b) MTTD (correct)
c) MTTA
d) MTTF

3 Which approach can help understand the impact of a key person dependence?
a) Squad health check
b) Value stream map
c) Chaos Monkey
d) Fire Drill (correct)

4 If service failures are usually covered up by an organization then that organization is what?
a) Hedonistic
b) Pessimistic c) Neurologic d) Pathologic 5 If a service recovery point objective (RPO) is ten minutes a) At less than ten minute intervals then how frequently must the service be backed up b) Every ten minutes c) At least daily d) Backups do not need to be taken © DevOps Institute unless otherwise stated Module 6: Antifragility & Learning from Failure 234 Module 7: Organizational impact of SRE Component Module 7 Content Why organizations embrace SRE Video A history of SRE (Uber) Patterns for SRE adoption Case Story Sage Group & DWP SRE Job Description Discussion Why do you want to Sustainable Incident Response adopt SRE? Who in your Blameless post mortems organization currently SRE & Scale provides SRE? Exercise Your organizational plan for SRE Module 7: Organizational impact of SRE 236 © DevOps Institute unless otherwise stated Increased Service Resilience Courtesy of downdetector.co.uk As service usage grows then the volume of users grows Downtime is more widely published through social media channels Brand reputation can be compromised https://downdetector.co.uk/top10/ © DevOps Institute unless otherwise stated Module 7: Organizational impact of SRE 238 Minimize Loss of Revenue According to Gartner…. The average cost of service downtime is $5,600 per minute. 
Because there are so many differences in how businesses operate, downtime can cost as much as $140,000 per hour at the low end, $300,000 per hour on average, and as much as $540,000 per hour at the higher end
https://www.the20.com/blog/the-cost-of-it-downtime/

Because It’s Cool
How often do you hear this:
“I was just at a conference and...”
“I read this online and...”
“Such and such an organization are doing...”
Do not change your current approach without a good reason

Typical SRE Adoption Steps
Consulting, Embedded, Platform, Slice & Dice, Full SRE
These steps are not necessarily sequential

Consulting
Specialist advice provided by experts
No “hands-on” delivery involvement
Ownership of SRE with the service/delivery teams; consultants not “on call”

Embedded
SRE experts embedded in service/delivery teams
Co-working on SRE activities
Some movement towards “shared responsibility”
SREs take initial “on call” before responsibilities become shared

Platform
SRE “own” the deployment platform and tooling
Guardians of production environments (and maybe others), providing “on call”
One size fits all approach
Little or no shared responsibility

Slice & Dice
SRE own sections of the service, typically application and infrastructure
Some shared responsibility, although “on-call” responsibility is also sliced

Full SRE
Full organization embraces SRE
Shared responsibility and shared “on call”
Reliability is a first-class citizen

True North Vision
Set realistic expectations of SRE:
It is impossible to achieve 100% uptime: what appropriate SLOs would you set?
Can you afford complete end-to-end automation?
Is automatic recovery and restore even possible?
It may not all be achievable; the aim is to be as close as possible to a “true north” vision

A history of SRE at Uber, with Rick Boone (06:24)

Typical Responsibilities
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless post-mortems.
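Measuring availability against a target implies a fixed downtime allowance. The sketch below is a minimal illustration, assuming a simple time-based availability SLO measured over a 30-day window; the function name `downtime_budget` and its parameters are hypothetical and are not taken from the course material. For a “three-nines” (99.9%) target it works out to roughly 43 minutes per month.

```python
from datetime import timedelta


def downtime_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime that an availability SLO permits over a window.

    Assumption: availability is time-based, i.e. uptime / total time.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    # Fraction of the window in which the service may be unavailable.
    unavailable_fraction = 1 - slo_percent / 100
    return window * unavailable_fraction


if __name__ == "__main__":
    month = timedelta(days=30)  # assumed measurement window
    for slo in (99.0, 99.9, 99.99):
        minutes = downtime_budget(slo, month).total_seconds() / 60
        print(f"{slo}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Running the loop for several targets shows how quickly the allowance shrinks with each extra nine, which is one reason the choice of SLO deserves care.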
Incident Response: On Call
Being on call is a critical duty that operations and engineering teams must undertake:
To support services during working and non-working hours
To respond to issues (outages, incidents, etc.)
To ensure SLOs are being met
Incident response must be “sustainable” as per the Google job description

On Call By Numbers: How much downtime is allowed for service issues?
SLOs introduce a constraint on the amount of time available
A service with a “three-nines” availability SLO requires all issues in a month to be fixed inside 43 minutes
This time includes issue identification, alerting, messaging, triage and fix
You can see why appropriate SLOs for services are so important

On Call By Numbers: How often are SREs on call?
As well as the 50% toil limit, Google also advocates a 25% on-call rule
To avoid a single point of failure, at least two SREs are on call
Providing 24/7 support will require eight SREs
SREs are on call one week every month
Spreading on call across multiple sites can help, but beware of under-load

Checklist For Effective On Call
Allocated individuals: giving organizational clarity around who is “on call”
Suitable devices: providing a mechanism for receiving all relevant information
Alert delivery systems: capturing and delivering the right information and any background context
Documented procedures: minimizing the risk of on-call response activities
Blameless postmortems: making sure that the same issue does not repeat
Responses to on-call issues should be rational, focused and deliberate; SREs need to feel “safe”

Using Automation to Replace On-Call
Being on call sucks; there is no avoiding that. However, we can look to use automation to replace the on-call burden
Self-healing services allow us to move away from the traditional operator repair service
Automation avoids the issue of impaired judgement for called-out operators
Tools like Kubernetes and AWS Auto Scaling groups provide self-healing capabilities out of the box
Self-healing actions can be reviewed in blameless post-mortems
Responses to on-call issues should be rational, focused and deliberate; SREs need to feel “safe”

“So, failure happens. This is a foregone conclusion when working with complex systems.” John Allspaw, (former) CTO, Etsy

Reasons for a Blameless Post Mortem
User-visible downtime or degradation beyond a certain threshold (e.g. SLO)
Data loss of any kind
On-call engineer intervention (release rollback, rerouting of traffic, etc.)
A resolution time above some threshold
A monitoring failure (which usually implies manual incident discovery)
Define criteria before an incident occurs so that everyone knows when a postmortem is necessary.

Blameless Post Mortems
Those involved in the failure can give a detailed account of:
What actions they took at what time
What effects they observed
Expectations they had
Assumptions they had made
Their understanding of the timeline of events as they occurred
Without fear of punishment or retribution.

Pathological (Power-oriented) | Bureaucratic (Rule-oriented) | Generative (Performance-oriented)
Information is hidden | Information may be ignored | Information is actively sought
Messengers are ‘shot’ | Messengers are isolated | Messengers are trained
Responsibilities are shirked | Responsibility is compartmentalized | Responsibilities are shared
Bridging is discouraged | Bridging is allowed but discouraged | Bridging is rewarded
Failure is covered up | Organization is just and merciful | Failure causes enquiry
Novelty is crushed | Novelty creates problems | Novelty is implemented
Source: Westrum, A Typology of Organizational Cultures
High-trust organizations encourage good information flow, cross-functional collaboration, shared responsibilities, learning from failures and new ideas.
Creating a Safe Environment
“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand”
Norm Kerth, Project Retrospectives: A Handbook for Team Review

Creating a Safe Environment
Need to balance safety and accountability
We give engineers the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures.
We encourage people who do make mistakes to be the experts on educating the rest of the organization how not to make them in the future.
“To be effective we need more people to access production, rather than less”, Damon Edwards, Rundeck

Changing Failure Reaction
“In order to understand how failures happen, we first have to understand our reactions to failure.” John Allspaw, Etsy
Assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”
Look at how the incident actually happened, treat the engineers involved with respect, and learn from the event.
Which side would you prefer?
Post Mortem Outputs
Details of the incident or failure: summary, impact, trigger, detection, resolution, participants
A list of follow-up actions to mitigate future chances of this incident, or similar ones, happening again
Lessons learned from the incident
A timeline of what happened
Any supporting information
Incident management automation can help streamline the creation of these outputs

CASE STORY: Sage Group
“When an incident occurs a post is made in a Slack channel called “incidents”. This triggers an automated form to capture more details about the incident, all of which is fed into our service management tool, ServiceNow. A dedicated Slack channel is also created, automatically adding interested parties; the channel also posts regular reminders to keep our customers/stakeholders updated. For high impact incidents a Zoom “bridge” is created so participants can collaborate in real time about the incident. Finally the incident cannot be closed without a post mortem taking place.”
“We want to automate the process around the incident as much as we can so our engineers can focus fully on fixing the incident.”
Jon Noble, SRE, Sage Group
Benefits
“Chat Ops” approach ensures everyone is kept up to date
Less admin time dealing with incidents
Workflow ensures all incidents are followed up and learning shared

SRE: A Reminder
Google didn’t invent “SRE” to write some cool books or to add another acronym to the IT world
They did it to solve a problem: how do you handle problems in massively distributed systems operating at mind-blowing scale?
Key Success Factors for SRE Adoption
Exec support and buy-in
Funding
Good relationships across the delivery spectrum: engineers, testing, infrastructure, etc.
An organization that is growing
An organization that is not growing (or scaling) will not face the kinds of challenges that SRE is trying to solve

An organization that is growing
Are you witnessing growth across any of these domains?
Platform growth (large volumes of users, irregular data flows, legacy-to-modern architectures)
Scope growth (new products/services)
Ticket growth (volume of incidents/outages/requests/toil)
“The challenge of scale is always a good one” Stig Sorensen, Bloomberg

Use Engineering Approaches to Scale
“[...] organisation” Jennifer Petoff, Google

SRE Approaches to Platform Growth
There is a range of automation capabilities to safely handle platform growth
Automation techniques such as auto scaling, containerization, clustering
Flexible platforms such as public/private cloud
Non-structured databases: NoSQL, MongoDB
“As-a-service” capabilities around build, deploy, test, monitoring

SRE Approaches to Scope Growth
SRE ownership of common tools and platforms (“platform SRE”) which other development teams use
SRE expertise “shifting left” into development teams (“embedded SRE”)
Automating toil makes more SRE time available for development
“Team size should not scale directly with service growth” Betsy Beyer, Google

SRE Approaches to Ticket Growth
Toil reduction approaches such as automated ticket responses and “self service” features
Prioritizing toil reduction activities
DRY (don’t repeat yourself) solutions to eliminate repetition of problems

EXERCISE
Your organizational plan for SRE

CASE STORY: Department for Work & Pensions
“A major focus of SREs is the aspiration to never see the same issue twice, often using automation as a resolution. We spend a large amount of time on reducing human labor, sharing knowledge among teams, and creating a blameless environment”
“We’re not competing with [...] same.”
Sean Lukel, Chief SRE, DWP
Benefits
Appropriate SLOs by matching business criticality with target availability
Reduced number of high impact outages
Proven scalability of digital services provided to 20+ million users across the UK

Module Seven Quiz
1 When is it best to hold a blameless post mortem?
a) After the daily stand-up b) Every Monday morning c) After every incident d) When an incident matches pre-set criteria
2 What can be done to ensure staff are not being “burnt out” providing on-call support?
a) Hire more SREs b) Pay SREs more c) Set a 25% on call limit for SREs d) Make some SREs 100% on call
3 A geographically spread team are involved in fixing a live issue: what approach can help out?
a) DevOps b) Align teams in one country with a service c) Fly everyone to the same location d) Chat Ops
4 What outputs should be produced after every post mortem?
a) Disciplinary procedures against the on-call engineer b) A new product backlog c) Some follow up actions to mitigate future incidents d) A flexing of the error budget
5 What is a key factor behind SRE success?
a) A growing organization b) A shrinking organization c) An organizational restructure d) Embracing agile principles

Module 8: SRE, Other Frameworks, Trends

Module 8 Content
SRE & Other Frameworks
SRE Evolution
Video: A Look at ITIL4 & SRE (DevOps Institute)
Case Story: VictorOps
Discussion: How does agile inform what we do in SRE? Where do you see the future of SRE heading?
Exercise: Sketch board your understanding of SRE

SRE & Other Frameworks
Sometimes the plethora of competing frameworks can seem like a maze
Agile
Build pipelines
Continuous delivery
DevOps
Etc.
and of course ITSM
If you have services running in production, then you may benefit from SRE adoption

SRE & Other Frameworks
Organizations are looking at their whole value stream and optimizing their approach
Framework opportunities:
Agile unifies the business with delivery
DevOps advocates mechanisms like Continuous Delivery & Continuous Deployment for driving velocity and flow
SRE provides business-wide focus on stability & reliability
ITSM builds organizational learning across the value stream

SRE Does Not Stand Alone
Agile
DevOps & Lean

SRE Does Not Stand Alone
SRE teams can operate in an agile way, using frameworks like Scrum and Kanban
Backlogs of toil make work visible, and automation can be prioritized
Ceremonies ensure co-ordination, visibility and prioritization
Definition of done is more production focused
Value delivered through “working software” (agile principle 1)

SRE Does Not Stand Alone
SRE and DevOps/Lean are complementary approaches
Organizational silos are further broken down
Pipelines of delivery go further
DevOps metrics and measures are further improved
Automation is more widespread and consistent

CASE STORY: VictorOps
“[...] feature”
Dan Jones, CTO

SRE Does Not Stand Alone
SRE can help with ITSM compliance activities through automation & engineering
Like SRE, ITIL processes are underpinned by automation, particularly during transition and operation processes as part of continuous testing and delivery
In SRE failure is a learning opportunity; continuous learning is embedded in ITSM
ITIL provides guidance and structure to processes such as Change, Configuration, Release, Incident and Problem Management, areas that SRE are involved in
AXELOS®, ITIL® and IT Infrastructure Library® are registered trade marks of AXELOS Limited.

ITSM Process Models Support SRE
Predefined procedures:
Steps to be taken
Chronological order and dependencies
Responsibilities
Timescales and thresholds
Escalation procedures
Define steps for handling specific types of transaction
Ensure a defined path or timeline is followed
Can be automated
Examples:
▪ Change models
▪ Release models
▪ Test models
▪ Incident models
▪ Problem models
▪ Request models

ITSM Process Models Support SRE
SRE helps compliance with ITSM by using engineering approaches to remove the human factor from the equation. Automation leaves a documented and auditable trail; this not only raises confidence that ITSM rules aren’t being ignored, but it also increases the deploy and release velocity by removing the time needed to make human decisions.
Based on ITIL Text - ST 4.2.4.5

SRE Does Not Stand Alone
SRE emphasizes the development of systems and software that increase the reliability and performance of applications and services.
DevOps integrates various teams and processes across the development and delivery of software
ITIL 4 emphasizes service quality and consistency and aims for improved stakeholders’ satisfaction through ensuring value from the perspective of the stakeholders.
Stop the Arguments: ITIL 4 and SRE and DevOps All Are Transformation Aids

Value Stream
IDEA, PLAN, DESIGN, BUILD, DEPLOY, TEST, RELEASE, OPERATE
User Centered Design, Agile Development, Continuous Integration, Continuous Delivery, DevOps, ITSM, Site Reliability Engineering
Wisdom of production feedback

A Look at ITIL4 & SRE, with Jayne Groll (11:25)

Trends in Site Reliability Engineering
“The five new trends that I see emerging are: failure as the new normal, automation as a service, cloud is king, observe & learn and the evolution of the network engineer (NRE).” Michael Kehoe, LinkedIn

A Network Reliability Engineer (NRE)
Applies an engineering approach to measure and automate the reliability of networks
Codifies the network (software-defined networks, SDNs) and applies SDLC principles to b
