Untitled Quiz
47 Questions

Questions and Answers

What does MTTR stand for in the context of performance metrics?

  • Mean Time to Detect
  • Mean Time to React
  • Mean Time to Repair/Recover (correct)
  • Mean Time to Restore Service

Which metric indicates the maximum acceptable amount of data loss in a service?

  • Recovery Point Objective (correct)
  • Service Level Objective
  • Mean Time to Recover
  • Mean Time to Detect

What is a primary goal of introducing 'Chaos engineering' in a system?

  • To test how systems handle unexpected failures (correct)
  • To eliminate the need for backups
  • To ensure a 100% uptime
  • To prevent any service disruptions

What concept states that organizations must learn from failures to remain competitive?

    Antifragility

    Which metric is used to measure how quickly an organization can detect failures or incidents?

    Mean Time to Detect

    What is the primary focus of 'The Third Way' in a learning organization?

    Continual experimentation and learning

    Which statement best describes the concept of Service Level Objective (SLO)?

    It specifies the expected performance level of a service.

    What is the disadvantage of excessive data loss in a messaging queue?

    It indicates exceeding the Recovery Point Objective.

    Who developed the Chaos Monkey service?

    Netflix

    Which metric primarily focuses on the detection of early failures?

    MTTD

    What approach can help analyze the implications of relying on a key person?

    Value stream map

    What does it indicate if an organization consistently covers up service failures?

    Fragile

    What is the main purpose of integrating with monitoring services?

    To leverage existing tools and platforms

    What is the primary purpose of tracing in application performance management?

    To track the performance and health of an application.

    Which tool type is specifically designed to report and support responsive actions against attacks?

    Threat Detection systems

    What does synthetic monitoring simulate to evaluate service behavior?

    Customer or end-user interactions.

    What component is essential for capturing details of service incidents in incident management?

    Who, what, when of service incidents.

    What is the function of a Web Application Firewall (WAF)?

    To examine traffic and block malicious content.

    What technique does User and Entity Behavior Analytics (UEBA) employ?

    Machine learning to analyze user behavior.

    What is the primary function of error tracking tools?

    To discover and show application errors.

    What do status pages provide to users?

    Real-time status communication of services.

    What is emphasized as crucial for creating a safe environment in an organization?

    Balancing safety and accountability

    According to the content, how should engineers be treated during a failure analysis?

    They should be given respect and learned from

    What is the main purpose of allowing engineers to contribute to failure discussions?

    To improve safety and educate others

    What does the quote from John Allspaw imply about understanding failures?

    To understand failures, one must analyze reactions to them

    What is the consequence of reacting negatively to failure, as suggested in the content?

    It creates a fear-based environment

    What is conveyed by the statement regarding access to production?

    More access to production leads to better safety

    What does the phrase 'to be effective we need more people to access production' suggest?

    Fostering engagement improves safety

    What is a key belief expressed in the content regarding mistakes made by individuals?

    Everyone tries their best given the circumstances

    What is the primary focus when creating a blameless environment?

    Encouraging open discussion without fear of punishment

    What outcome is associated with appropriate Service Level Objectives (SLOs)?

    Greater alignment between service availability and business needs

    Which practice can help prevent SREs from experiencing burnout while on call?

    Setting realistic on-call limits, such as 25%

    What is the best approach to facilitate a geographically spread team in resolving a live issue?

    Using remote collaborative tools like Chat Ops

    What should be avoided during the implementation of post mortem meetings?

    Assigning disciplinary actions based on incident outcomes

    How does the scalability of digital services benefit large user bases?

    It ensures services can handle increases in user demand

    What is a primary goal of sharing knowledge among teams?

    To enhance collaboration and improve problem-solving capabilities

    What should be done when an incident matches pre-set criteria for a post mortem?

    Conduct a thorough analysis to prevent future occurrences

    What is a key focus of Site Reliability Engineering (SRE)?

    Enhancing the reliability and performance of applications

    How does ITIL 4 aim to ensure value for stakeholders?

    Through improved service quality and consistency

    Which of the following trends in Site Reliability Engineering highlights the importance of adapting to failures?

    Failure as the new normal

    What role does a Network Reliability Engineer (NRE) perform?

    Measures and automates network reliability

    What is the primary purpose of DevOps in relation to software delivery?

    To integrate various teams and processes across development and delivery

    What methodology emphasizes user-centered design within the software development process?

    Agile Development

    Which of the following best describes the goal of Continuous Delivery?

    To ensure software can be released any time reliably

    What does automation as a service entail according to emerging trends in SRE?

    Providing automation tools on-demand

    How does SRE relate to ITIL and DevOps?

    SRE functions as an aid for transformations like ITIL and DevOps

    What indicates the evolution of the network engineer as a trend in SRE?

    A shift away from traditional networking roles

    Study Notes

    SRE-Led Service Automation

    • Environments are provisioned using Infrastructure/Config as Code.
    • Automated functional and non-functional tests are conducted in production.
    • Versioned and signed artifacts are used to deploy system components.
    • Instrumentation is in place to view the service externally.
    • Future growth is anticipated.
    • Clear anti-fragility is evident.
    • Tools used include Fire drills, Chaos Monkey, PagerDuty, VictorOps, and Squadcast.

    SRE-Led Service Automation (Flowchart)

    • The flow starts with Commit ID: 113, followed by Build, Run Unit Tests, Code Analysis, Create Test Env, Deploy Code.
    • It then moves to Load Test Data, Run Tests, Create Pre-Prod, Deploy Code, and Run Perf Test.
    • The next steps are Run Security Test, Check Monitors, Create Prod, Prod Deploy, Run Tests, Run NFTs, Check Monitors, and Failure Tests.
    • The overall emphasis is on production. DevOps gains “wisdom of production”. SRE can say “No”.

    SRE Automation is Not Just Service Automation

    • The team supporting the platform was inundated with "toil," reaching a point where they could do little else.
    • 30% of respondents said maintenance tasks are their main source of toil.
    • Automation has been used to reduce toil by 8.5%.

    Hierarchy of Automation Types

    • The database notices problems and automatically fails over without human intervention.
    • The database comes with its own failover script for handling issues.
    • SREs add database support to a "generic failover" script used by everyone for problems.
    • An SRE has a failover script in their home directory used in case of a problem.
    • The database master is manually failed over if there's a problem.

    Secure Automation

    • Secure automation in pipelines helps prevent insecure manual steps.
    • Generated artifacts are checked and validated for compliance to ensure security.
    • DevSecOps is introduced into the build, test, and deploy cycle to promote security.
    • SRE emphasizes extra security measures in production.

    Secure Build

    • Application, infrastructure, and configuration code changes are run through code analysis tools to check for security problems.
    • Digitally signing build artifacts prevents "fake" code.
    • Secure coding practices are widely published and embraced.
    • Secure code repositories with access control are used in development.
    • Open-source coding with community feedback is implemented.
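
The idea of signing a build artifact and checking it before deployment can be sketched as follows. This is a minimal illustration only: it uses HMAC-SHA256 from the Python standard library as a stand-in for a real asymmetric signature, and `BUILD_KEY`, `sign_artifact`, and `verify_artifact` are hypothetical names, not part of any tool named in these notes.

```python
import hashlib
import hmac


def sign_artifact(artifact: bytes, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for the artifact bytes."""
    return hmac.new(key, artifact, hashlib.sha256).hexdigest()


def verify_artifact(artifact: bytes, key: bytes, signature: str) -> bool:
    """Recompute the tag and compare in constant time before a deploy."""
    expected = sign_artifact(artifact, key)
    return hmac.compare_digest(expected, signature)


# Hypothetical values for illustration only.
BUILD_KEY = b"example-build-key"
artifact = b"service-1.0.0.tar.gz contents"
signature = sign_artifact(artifact, BUILD_KEY)
```

A pipeline would refuse to deploy when `verify_artifact` returns `False`, which is how a modified or "fake" artifact gets caught.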

    Secure Test

    • Test environments are immutable, which prevents drift during configuration changes and ensures they match the code repository.
    • Properly built testing data along with testing scenarios helps test security.

    Secure Staging

    • Staging environments remain immutable to maintain consistency.
    • The same artifacts are deployed to staging for a pre-production environment.
    • Testing data in security contexts addresses security considerations, such as GDPR and PCI.
    • Dedicated security scanning is used to find security vulnerabilities.
    • Dependencies and integrations with other services are checked for vulnerabilities.

    Secure Production

    • Immutable production environments are maintained for consistency.
    • Production data must meet security standards (e.g., GDPR, PCI, and SOX).
    • Security scanning is used for recognizing vulnerabilities.
    • Regulatory compliance is critically important.
    • Failure testing can be helpful for demonstrating audit procedures related to compliance.

    Automation Tools

    • Discussions on tools often focus on favorite technological solutions.
    • Tools evolve constantly.
    • Organizations often display bias towards certain tools, from open-source to big IT solutions.
    • Automating jobs is usually best delegated to individuals with deep expertise in their respective tasks.
    • Engineers are more productive using the tools they are familiar with.

    Case Story: Standard Chartered

    • Fundamental SRE principles are implemented throughout the organization.
    • These principles are: everything as code, everything via APIs, one pipeline, self-test and self-heal.
    • 28 person-years of effort have been saved through these principles.
    • Over 13,000 manual reviews have been avoided.
    • Time to repair has decreased by 25 minutes on average.
    • The organization avoided 200 self-inflicted operations incidents.

    How Much Automation Do You Have?

    • A variety of tools are used for managing the software development lifecycle, including planning, creating, verifying, packaging, securing, and releasing.
    • Development tools cover the stages Manage (Audit Management, Authentication and Authorization, DevOps Score, Value Stream Management), Plan (Issue Tracking, Kanban Boards, Time Tracking, Agile Portfolio Management), Create (Source Code Management, Code Review, Wiki, Web IDE), Verify (Continuous Integration, Code Quality, Performance Testing, Usability Testing), Package (Package Registry, Helm Charts, Dependency Firewall), Secure (SAST, DAST, IAST), and Release (Continuous Delivery, Release Orchestration, Review Apps, Incremental Rollout).

    Manage (Audit Management)

    • Automated tools are used to ensure products and services are auditable.
    • Audit logging of build, test, deploy activities, configurations and users is stored alongside production operation logs.
    • Secure processes for authentication and authorization are implemented, including user and password management and two-factor authentication.
    • Tools like AWS IAM (Identity and Access Management) are used along with cloud user tools.

    Manage (DevOps Score)

    • This is a metric that shows DevOps adoption throughout the organization and its impact on delivery velocity.
    • Value stream management shows the flow of value delivery within the DevOps lifecycle.
    • Tools like GitLab CI, Jenkins extensions, and other DevOps tools such as DevOptics provide the visualization.

    Plan (1)

    • Tools like Jira, Trello, CA's Agile Central and VersionOne are used to capture issues or backlogs of work.
    • Kanban boards map out workflows for managing delivery flows, which support issue tracking.
    • Tools like Jira or Trello show time spent and effort related to issues and tasks.
    • Agile Portfolio Management evaluates in-flight projects and future initiatives.

    Plan (2)

    • ServiceNow acts as a platform for managing the lifecycle of services including interactions with stakeholders, both inside and outside the organization.
    • Requirements Management tools are used to define, track, and manage requirements, ensuring traceability and handling dependencies.
    • Quality Management tools handle planning, execution, and tracking of testing to assist in identifying and addressing issues.

    Create (Source Code Management)

    • Tools are used to securely maintain source code in a multi-user environment, such as Git and SVN.
    • Code reviews can verify quality using tools such as Gerrit, TFS, Crucible, and GitLab.
    • Confluence is used for content-rich Wiki style knowledge sharing.

    Create (Web IDE)

    • Web IDE tools use web-based clients integrated with development environments that boost developer efficiency and productivity without a need for local tools.
    • Snippets for code are stored and shared to support collaboration around specific code pieces.

    Verify (1)

    • Continuous Integration (CI) refers to integrating, building, and testing code in development environments.
    • Sonar and Checkmarx perform code analysis, reviewing comments, architecture, duplication, unit test coverage, complexity, potential defects, and language rules.

    Verify (2)

    • Performance testing helps measure the speed, responsiveness and stability of applications under load. Tools like Gatling can be used.
    • Usability testing assesses how easily users interact with a service. Tools like Crazy Egg and Optimizely help analyze user flows.

    Package (1)

    • Software packages, artifacts, and metadata are managed in a repository called Package Registry.
    • Popular examples include Artifactory and Nexus.
    • Container Registry is a secure storage for Container images, supporting upload and download. Docker Hub, Artifactory, and Nexus are examples.
    • Dependency Proxy allows for proxying calls to other sources for frequently used images and packages.

    Package (2)

    • Helm Charts are used to represent or describe Kubernetes resources.
    • Dependency Firewall scans dependencies to prevent security threats.

    Secure (1)

    • Static Application Security Testing (SAST) examines application code to find problems.
    • Dynamic Application Security Testing (DAST) scans applications from the outside to look for vulnerabilities.
    • Interactive Application Security Testing (IAST) combines SAST and DAST approaches to discover vulnerabilities in real time.
    • Secret detection tools prevent sensitive information like passwords and tokens from getting unintentionally released.
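
Secret detection can be pictured as pattern matching over text before it is committed or released. The sketch below is a toy version under stated assumptions: the three rules in `SECRET_PATTERNS` and the function `find_secrets` are hypothetical, and real scanners ship far larger rule sets plus entropy checks.

```python
import re

# Hypothetical rules for illustration; not the rule set of any real scanner.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "password_assignment": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

A pre-commit hook or pipeline step could fail the build whenever `find_secrets` returns a non-empty list.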

    Secure (2)

    • Dependency Scanning is used to find known vulnerabilities in dependencies while developing applications. Tools such as Synopsys, Gemnasium, Retire.js and bundler-audit are often used.
    • Tools like Black Duck, Synopsys, Snyk, Clair, and Klar are used for container scanning to look for known vulnerabilities and problems in containers before they are deployed.
    • License Compliance tools like Black Duck and Synopsys ensure your dependency licenses are appropriate for the application.

    Secure (3)

    • Vulnerability Database tools gather and maintain records of vulnerabilities for checking throughout the deployment pipeline.
    • Fuzzing techniques automatically check software for weakness by inputting unexpected data and observing crashes.
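
The fuzzing loop described above can be sketched in a few lines. Everything here is hypothetical: `parse_percentage` is a toy target with a deliberate bug, and `fuzz` is a bare-bones random fuzzer, far simpler than coverage-guided tools.

```python
import random


def parse_percentage(text: str) -> int:
    """Toy target: parse '42%' into 42. Deliberately buggy on empty input."""
    if text[-1] != "%":  # IndexError when text is empty: the bug to be found
        raise ValueError("missing % sign")
    return int(text[:-1])


def fuzz(target, runs: int = 200, seed: int = 0, seeds=("", "%")) -> list[str]:
    """Feed seed inputs and random strings to target; collect crashing inputs."""
    rng = random.Random(seed)
    alphabet = "0123456789%-+ abc"
    candidates = list(seeds)
    for _ in range(runs):
        length = rng.randint(0, 6)
        candidates.append("".join(rng.choice(alphabet) for _ in range(length)))
    crashes = []
    for candidate in candidates:
        try:
            target(candidate)
        except ValueError:
            pass  # documented rejection of bad input, not a crash
        except Exception:
            crashes.append(candidate)  # unexpected exception type
    return crashes
```

Calling `fuzz(parse_percentage)` reports the empty string as a crashing input, since it raises `IndexError` rather than the documented `ValueError`.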

    Release (4)

    • The practice of Continuous Delivery keeps software in a state where it can be released to production reliably at any time.
    • Release Orchestration tools like Jenkins and GitLab CI can be used for detecting, orchestrating, and testing changes to ensure there are no negative impacts on applications.
    • Pages can be created automatically as part of a CI/CD pipeline.
    • Review Apps let developers test committed changes in real time, using environments spun up specifically for this task.
    • Incremental Rollouts gradually deploy changes to services instead of releasing larger changes at once.

    Release (5)

    • Canary deployment releases the program to a small subset of users to test updates before a full-blown release.
    • Feature flags modify system behavior without code changes. Tools like LaunchDarkly support these flags.
    • Release Governance processes create reliable control and automated processes (for security, compliance, etc) to meet organizational needs for understanding changes.
    • Secrets Management tools and methods securely manage sensitive credentials like passwords, keys, APIs, and tokens.
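
Canary deployments and percentage-based feature flags both need a stable way to decide which users see a change. One common approach, shown here as a sketch with the hypothetical function `in_rollout`, is to hash the user and flag name into a bucket from 0 to 99.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically place a user in or out of a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < percent
```

Because the bucket depends only on the flag and user, a user admitted at 10% stays admitted when the rollout widens to 50%, and raising `percent` to 100 completes the release without a code change.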

    Configure (1)

    • Auto DevOps automatically configures the software development lifecycle: it detects, builds, tests, deploys, and monitors applications.
    • ChatOps lets DevOps actions such as builds and deployments be triggered from chat in real time.
    • Runbooks help create procedures for efficient service operations.
    • Serverless execution removes infrastructure dependencies: a piece of code is run on demand by the cloud service provider.

    Monitor (3)

    • Metrics tools gather and display performance data for applications.
    • Logging captures and stores system activity. Logs contain information about process calls, events, user data, responses, and errors.
    • Tracing tools provide detailed analysis of application performance, identifying and investigating issues within systems.
    • Cluster monitoring provides updates regarding health of deployment environments and clusters.

    Monitor (4)

    • Error tracking tools discover and show errors.
    • Incident management handles events and captures the details of issues (who, what, when).
    • Synthetic monitoring simulates user actions to monitor service behavior.
    • Status pages provide users with information on service availability and status.
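
Synthetic monitoring can be thought of as replaying a scripted user journey and reporting the first step that fails. The sketch below assumes each step is a plain callable returning `True` or `False`; `run_journey` and the step names are hypothetical, and a real check would issue HTTP requests or drive a browser.

```python
import time
from typing import Callable


def run_journey(steps: list[tuple[str, Callable[[], bool]]]) -> dict:
    """Run scripted user steps in order; stop at the first one that fails."""
    started = time.monotonic()
    for name, step in steps:
        if not step():
            return {"ok": False, "failed_step": name,
                    "seconds": time.monotonic() - started}
    return {"ok": True, "failed_step": None,
            "seconds": time.monotonic() - started}
```

The result could feed both alerting (a failed step) and a status page (overall `ok` plus elapsed time).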

    Defend (1)

    • Runtime Application Self Protection (RASP) tools monitor for and block security threats immediately as they occur.
    • Web Application Firewall (WAF) examines traffic to block malicious traffic.
    • Threat detection systems detect, report, and provide response capabilities for threats like DoS (denial-of-service) attacks.
    • User and Entity Behavior Analytics (UEBA) uses machine learning to look for abnormal user behavior and alert security teams to potential issues.

    Defend (2)

    • Vulnerability Management helps identify, record, manage, and mitigate vulnerabilities in assets and applications automatically.
    • Data Loss Prevention (DLP) tools prevent data from being shared beyond authorized channels within the environment.
    • Data is protected from unauthorized access from outside organizations by securing storage systems and data storage ecosystems.
    • Container network security protects connections between containers to stop unintentional data movement or traffic.

    Case Story: Netflix

    • Simian Army is a suite of tools developed by Netflix to test the reliability, security, or resiliency of their infrastructure.
    • Results of these tests can minimize significant negative impact on the consumer experience and staff response.

    Chaos Engineering Next Steps

    • One of the first steps is to segregate the system so that it can be tested without individual key components.
    • Testing without key components can happen in non-production environments first to minimize impact on real customers.
    • Introduce failures at a component level in production as a test-and-learn strategy.
    • Simulate database failure in production to discover how the system handles it.
    • Simulate total system failure.
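
Component-level failure injection can be sketched by wrapping a dependency so that it fails at a configured rate, then checking that the caller degrades gracefully. `FlakyDependency` and `call_with_fallback` are hypothetical names for illustration and are not part of Chaos Monkey or any other tool.

```python
import random


class FlakyDependency:
    """Wrap a callable and inject failures at a configured rate."""

    def __init__(self, func, failure_rate: float, seed: int = 0):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so experiments are repeatable

    def __call__(self, *args):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return self.func(*args)


def call_with_fallback(dependency, arg, fallback, retries: int = 2):
    """Retry the dependency a few times, then degrade to a fallback value."""
    for _ in range(retries + 1):
        try:
            return dependency(arg)
        except ConnectionError:
            continue
    return fallback
```

Setting `failure_rate` to 1.0 simulates losing the component entirely, which is the case the fallback path exists for.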

    Chaos Engineering Next Steps (2)

    • Examine holistic logging to find the root cause of outages.
    • Identify and improve dependencies between services.
    • Automate the process of error handling.
    • Learning from "real" failures is an essential part of the improvement process.

    Chaos Engineering Next Steps (3)

    • Chaos engineering helps contain the effects of issues to smaller portions of services and deployments.

    Chaos Engineering Next Steps (4)

    • Automating recovery processes
    • Logging
    • Monitoring
    • Alerting systems
    • Security

    Typical Responsibilities

    • Engineering across the full lifecycle of a service or product.
    • Providing consultation regarding service design, development, and platform creation.
    • Maintaining services through ongoing monitoring.
    • Scaling systems through automation.
    • Providing incident response and performing blameless post-mortems.

    Incident Response: On Call

    • Providing support and handling issues for services during working and non-working hours is an important aspect of on-call support.
    • Support involves handling issues such as outages, incidents, and similar issues, and requires availability to address important problems.

    On Call by Numbers (1)

    • Google advocates capping on-call responsibility at 25% of an SRE's time, spreading the load across enough people to prevent single points of failure.
    • Providing 24/7 support requires several SREs to cover various time zones and ensure service availability.
    • Spreading responsibility for on-call across several sites or locations can be useful against issues from under-utilization, or over-utilization of on-call staff.
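
The 25% figure translates into a minimum team size with simple arithmetic. The sketch below assumes each on-call role must be staffed continuously and that the load is shared evenly; `min_engineers` is a hypothetical helper, not a formula quoted from the source.

```python
import math


def min_engineers(concurrent_roles: int, max_on_call_fraction: float) -> int:
    """Smallest team size that keeps each person at or under the on-call cap."""
    return math.ceil(concurrent_roles / max_on_call_fraction)


single_role = min_engineers(1, 0.25)          # one on-call role at a 25% cap
primary_and_secondary = min_engineers(2, 0.25)  # two roles at a 25% cap
```

Under these assumptions one continuously staffed role needs at least 4 people, and a primary plus secondary rotation needs at least 8.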

    Checklist for Effective On Call

    • Clearly assigning responsibilities related to "on-call", or "on-call time".
    • Appropriate monitoring of devices/tools to allow teams to identify, isolate, and fix events or issues.
    • Ensuring well-documented procedures for handling incidents, and providing context to staff to enable efficient resolution without repeat issues.
    • Creating a safe environment where employees feel comfortable documenting issues without fear of being blamed.

    Using Automation to Replace On-Call

    • Automated self-healing processes replace the need for manual intervention by operators when problems occur, reducing downtime.
    • Automation tools like Kubernetes and AWS auto-scaling groups automate recovery, which increases service reliability and decreases the need for on-call support.
    • Processes for self-healing capabilities can be reviewed to improve incident response and safety.
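
A self-healing loop can be sketched as "restart what is unhealthy, and escalate to a human only when automation has given up". The `Service` class and `heal` function are hypothetical stand-ins; they are not how Kubernetes or AWS auto-scaling groups are implemented.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical stand-in for a process that can be probed and restarted."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def restart(self) -> None:
        self.restarts += 1
        self.healthy = True


def heal(services: list[Service], max_restarts: int = 3) -> list[str]:
    """Restart unhealthy services; return names that still need a human."""
    escalate = []
    for service in services:
        if service.healthy:
            continue
        if service.restarts >= max_restarts:
            escalate.append(service.name)  # automation gave up: page on-call
        else:
            service.restart()
    return escalate
```

Only the names returned by `heal` would reach the on-call engineer, which is how automation reduces the number of pages.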

    Reasons for a Blameless Post Mortem

    • User visible downtime or service degradation beyond a specified service level objective (SLO)
    • Data Loss
    • On-call engineer intervention for resolving the issue.
    • Resolution times that exceed a set threshold.
    • Monitoring failures (requiring manual intervention).
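
The criteria above lend themselves to an automated check that decides whether an incident warrants a post mortem. This is a sketch only: the `Incident` fields, the thresholds, and `post_mortem_reasons` are hypothetical names chosen to mirror the list.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    downtime_minutes: float
    data_lost: bool
    on_call_intervened: bool
    resolution_minutes: float
    monitoring_failed: bool


def post_mortem_reasons(incident: Incident,
                        slo_downtime_minutes: float,
                        resolution_threshold_minutes: float) -> list[str]:
    """Return every criterion the incident matched; an empty list means none."""
    checks = [
        (incident.downtime_minutes > slo_downtime_minutes, "downtime beyond SLO"),
        (incident.data_lost, "data loss"),
        (incident.on_call_intervened, "on-call intervention"),
        (incident.resolution_minutes > resolution_threshold_minutes,
         "resolution time over threshold"),
        (incident.monitoring_failed, "monitoring failure"),
    ]
    return [reason for matched, reason in checks if matched]
```

Returning the matched reasons, rather than a bare yes or no, gives the post mortem a starting point for its summary.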

    Blameless Post Mortems

    • Detailed accounts of actions taken during incidents should be captured for improving the process.
    • The steps and causes that contributed to the issue should be identified and captured.
    • Expectations and assumptions should also be documented to help team members better anticipate and avoid future problems.

    Culture and the Flow of Information

    • Information flows vary in pathological, bureaucratic, and generative organizations.
    • Pathological cultures hide information and discourage communication; bureaucratic cultures may ignore or compartmentalize information; generative cultures actively seek information and reward communication.

    Creating a Safe Environment

    • Engineers should have the authority to explain their work to address contributing factors to issues and failures.
    • Individuals accountable for mistakes are also good at teaching the rest of the team members what to do in similar situations.

    Changing Failure Reactions

    • Organizational culture influences how incidents are handled and addressed.
    • An organizational culture where people feel safe to discuss their mistakes and learn from them is useful for fostering a positive and effective working environment.

    Post Mortem Outputs

    • Post mortems should outline the details of the incident or failure: summary, impact, triggers, detection, resolution, and participants.
    • Lists of follow-up actions are helpful in avoiding future similar incidents.
    • Lessons learned are important to prevent future issues.
    • Timelines of events from the incident are valuable in helping the team understand what happened.
    • Supporting data or documentation should always be included when recording lessons learned and similar activities.

    Case Story: Sage Group

    • They use an automated incident reporting process to improve handling of the incidents.
    • Relevant information about the issue is provided in a shared channel for timely communication and coordination.

    SRE - A Reminder

    • SRE is a system of activities that handle the proper implementation of systems and applications.
    • The need for an organization that addresses the operation of large-scale services is covered by the principles of SRE.
    • Understanding the implications of issues and making sure there is an adequate response procedure is a core methodology of SRE.

    Key Success Factors for SRE Adoption

    • Executive support is critical for a company to implement SRE initiatives.
    • Strong relationships between engineers who work within the same organization or on similar types of tasks are also important.
    • An organization that is constantly growing can benefit greatly from the implementation and use of SRE.

    An organization that is growing...

    • Platform growth (large volumes of users, irregular data flows)
    • Scope growth (introduction of new products or services)
    • Ticket growth (higher volume of incidents, requests, and toil)

    Use Engineering Approaches to Scale

    • By incorporating engineering approaches, the rate of service operations can be increased and made more scalable.

    SRE Approaches to Platform Growth

    • Techniques for handling and managing growing platforms
    • Auto-scaling
    • Containerization
    • Clustering
    • Public/private cloud
    • NoSQL, MongoDB, and similar "as-a-service" solutions
    • Tools, Automation, and similar techniques will help handle increased systems needs.

    SRE Approaches to Scope Growth

    • Shared responsibility for tools and platforms.
    • Expanding reach of SRE expertise into development teams.
    • Efficient toil automation can allow SRE teams to handle growth demands.

    SRE Approaches to Ticket Growth

    • Toil reduction is encouraged using automated responses for tickets and self-service features.
    • Prioritization of toil reduction efforts.
    • DRY ("Don't Repeat Yourself") - avoid the repetition of effort/problem resolution.

    Module 7: Exercise

    • Organizations should develop a plan for organizational implementation to adopt SRE.

    Case Story: Department for Work & Pensions

    • The organization is focused on implementing processes that help eliminate the recurrence of issues and failures and automate resolutions.
    • An SRE initiative can help improve the stability, reliability, and performance of services across an organization.

    Module Seven Quiz

    • Questions cover the benefits, processes, and responsibilities of SREs and organizations.

    Module Eight Quiz

    • Questions cover SRE trends, frameworks, and other related topics.

    Summary

    • Site Reliability Engineering (SRE) is a set of activities for developing and operating systems, addressing issues with the quality and quantity of service that an organization can provide to users.
    • SRE principles such as automation, monitoring, and engineering help provide reliable operations for applications and services.
