Untitled Quiz

Questions and Answers

What does MTTR stand for in the context of performance metrics?

  • Mean Time to Detect
  • Mean Time to React
  • Mean Time to Repair/Recover (correct)
  • Mean Time to Restore Service

Which metric indicates the maximum acceptable amount of data loss in a service?

  • Recovery Point Objective (correct)
  • Service Level Objective
  • Mean Time to Recover
  • Mean Time to Detect

What is a primary goal of introducing 'Chaos engineering' in a system?

  • To test how systems handle unexpected failures (correct)
  • To eliminate the need for backups
  • To ensure a 100% uptime
  • To prevent any service disruptions

What concept states that organizations must learn from failures to remain competitive?

  • Antifragility (correct)

Which metric is used to measure how quickly an organization can detect failures or incidents?

  • Mean Time to Detect (correct)

What is the primary focus of the 'The Third Way' in a learning organization?

  • Continual experimentation and learning (correct)

Which statement best describes the concept of Service Level Objective (SLO)?

  • It specifies the expected performance level of a service. (correct)

What is the disadvantage of excessive data loss in a messaging queue?

  • It indicates exceeding the Recovery Point Objective. (correct)

Who developed the Chaos Monkey service?

  • Netflix (correct)

Which metric primarily focuses on the detection of early failures?

  • MTTD (correct)

What approach can help analyze the implications of relying on a key person?

  • Value stream map (correct)

What does it indicate if an organization consistently covers up service failures?

  • Fragile (correct)

What is the main purpose of integrating with monitoring services?

  • To leverage existing tools and platforms (correct)

What is the primary purpose of tracing in application performance management?

  • To track the performance and health of an application. (correct)

Which tool type is specifically designed to report and support responsive actions against attacks?

  • Threat Detection systems (correct)

What does synthetic monitoring simulate to evaluate service behavior?

  • Customer or end-user interactions. (correct)

What component is essential for capturing details of service incidents in incident management?

  • Who, what, when of service incidents. (correct)

What is the function of a Web Application Firewall (WAF)?

  • To examine traffic and block malicious content. (correct)

What technique does User and Entity Behavior Analytics (UEBA) employ?

  • Machine learning to analyze user behavior. (correct)

What is the primary function of error tracking tools?

  • To discover and show application errors. (correct)

What do status pages provide to users?

  • Real-time status communication of services. (correct)

What is emphasized as crucial for creating a safe environment in an organization?

  • Balancing safety and accountability (correct)

According to the content, how should engineers be treated during a failure analysis?

  • They should be given respect and learned from (correct)

What is the main purpose of allowing engineers to contribute to failure discussions?

  • To improve safety and educate others (correct)

What does the quote from John Allspaw imply about understanding failures?

  • To understand failures, one must analyze reactions to them (correct)

What is the consequence of reacting negatively to failure, as suggested in the content?

  • It creates a fear-based environment (correct)

What is conveyed by the statement regarding access to production?

  • More access to production leads to better safety (correct)

What does the phrase 'to be effective we need more people to access production' suggest?

  • Fostering engagement improves safety (correct)

What is a key belief expressed in the content regarding mistakes made by individuals?

  • Everyone tries their best given the circumstances (correct)

What is the primary focus when creating a blameless environment?

  • Encouraging open discussion without fear of punishment (correct)

What outcome is associated with appropriate Service Level Objectives (SLOs)?

  • Greater alignment between service availability and business needs (correct)

Which practice can help prevent SREs from experiencing burnout while on call?

  • Setting realistic on-call limits, such as 25% (correct)

What is the best approach to facilitate a geographically spread team in resolving a live issue?

  • Using remote collaborative tools like ChatOps (correct)

What should be avoided during the implementation of post mortem meetings?

  • Assigning disciplinary actions based on incident outcomes (correct)

How does the scalability of digital services benefit large user bases?

  • It ensures services can handle increases in user demand (correct)

What is a primary goal of sharing knowledge among teams?

  • To enhance collaboration and improve problem-solving capabilities (correct)

What should be done when an incident matches pre-set criteria for a post mortem?

  • Conduct a thorough analysis to prevent future occurrences (correct)

What is a key focus of Site Reliability Engineering (SRE)?

  • Enhancing the reliability and performance of applications (correct)

How does ITIL 4 aim to ensure value for stakeholders?

  • Through improved service quality and consistency (correct)

Which of the following trends in Site Reliability Engineering highlights the importance of adapting to failures?

  • Failure as the new normal (correct)

What role does a Network Reliability Engineer (NRE) perform?

  • Measures and automates network reliability (correct)

What is the primary purpose of DevOps in relation to software delivery?

  • To integrate various teams and processes across development and delivery (correct)

What methodology emphasizes user-centered design within the software development process?

  • Agile Development (correct)

Which of the following best describes the goal of Continuous Delivery?

  • To ensure software can be released any time reliably (correct)

What does automation as a service entail according to emerging trends in SRE?

  • Providing automation tools on-demand (correct)

How does SRE relate to ITIL and DevOps?

  • SRE functions as an aid for transformations like ITIL and DevOps (correct)

What indicates the evolution of the network engineer as a trend in SRE?

  • A shift away from traditional networking roles (correct)

Flashcards

Safe Environment in SRE

A culture where engineers feel comfortable admitting mistakes and learning from failures.

Engineer Responsibility in SRE

Engineers take ownership of their contributions to failures and share knowledge on how to avoid them in the future.

Accountability & Safety

The delicate balance between holding individuals accountable for their actions and creating a safe space for learning from mistakes.

Production Access

Giving more people access to production systems improves safety and makes it possible to fix problems quickly.

Analyzing Failures

Understanding how failures happen and looking at the process, not assuming incompetence.

Incompetence Assumption

The wrong approach to handling failures by assuming the cause is lack of skill, which results in blaming instead of learning.

Incident Investigation

A process of examining failures to understand how they occurred and what can be learned to prevent them in the future.

Respectful Handling in Failure

Treat engineers with respect during incident analysis. Learning from events involves constructive feedback.

Tracing

Provides insights into application performance and health by tracking requests through functions/microservices.

Cluster Monitoring

Tools for assessing the health of deployments in clustered environments like Kubernetes.

Error Tracking

Tools for identifying and displaying application errors with associated data.

Incident Management

Capturing incident details (who, what, when) to ensure service level objectives are met.

Synthetic Monitoring

Simulates user actions to monitor service behavior and expected outcomes.

Status Page

Communicates the status of services to customers and users.

RASP

Runtime Application Self Protection; proactively monitors and blocks threats in production before exploitation.

WAF

Web Application Firewall; examines incoming traffic and blocks malicious activity.

MTTR

Mean Time to Recover/Repair - measures how long it takes to fix a failed component.

MTRS

Mean Time to Restore Service - measures how long it takes to restore a service after a failure.

MTTD

Mean Time to Detect (Failure/Incidents) - measures how long it takes to identify a failure.
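
As a rough illustration of how these three metrics differ, the sketch below computes them from a small set of made-up incident records. The field names (`failed`, `detected`, `restored`) and the timestamps are assumptions for the example, and repair time is taken here as the gap between detection and restoration, which is only one of several definitions in use.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (all timestamps are invented for the example).
incidents = [
    {"failed": datetime(2024, 1, 1, 10, 0),
     "detected": datetime(2024, 1, 1, 10, 5),
     "restored": datetime(2024, 1, 1, 10, 35)},
    {"failed": datetime(2024, 1, 8, 14, 0),
     "detected": datetime(2024, 1, 8, 14, 15),
     "restored": datetime(2024, 1, 8, 15, 0)},
]

def mean_minutes(records, start_key, end_key):
    """Average gap in minutes between two timestamps across all records."""
    gaps = [(r[end_key] - r[start_key]).total_seconds() / 60 for r in records]
    return mean(gaps)

mttd = mean_minutes(incidents, "failed", "detected")    # time to notice the failure
mttr = mean_minutes(incidents, "detected", "restored")  # assumed: repair starts at detection
mtrs = mean_minutes(incidents, "failed", "restored")    # failure until service is back

print(f"MTTD={mttd} min, MTTR={mttr} min, MTRS={mtrs} min")
```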

Introducing Failure

Deliberately causing failures to test systems and identify weaknesses.

Culture of Learning

An environment where failures are seen as opportunities for improvement.

The Third Way

A DevOps principle emphasizing continuous experimentation and learning.

Chaos Monkey

A tool that simulates random service failures to test system resilience and identify weaknesses.

Squad Health Check

A process to assess the impact of key person dependence on a team's performance.

Hedonistic Organization

An organization that hides service failures to avoid negative consequences, hindering learning from mistakes.

Leveraging Existing Tools

Reusing existing tools and platforms before building new ones to save time and resources.

SRE & ITIL 4

ITIL 4 focuses on service quality and stakeholder satisfaction, while SRE emphasizes reliability and performance of applications. They both aim for improved service delivery through different approaches.

Value Stream in SRE & ITIL

The value stream outlines the complete journey from idea conception to service operation. Both ITIL 4 and SRE contribute to different stages of this stream, ensuring efficient service delivery.

DevOps & SRE

DevOps focuses on the integration of development and operations teams, aiming for faster and more frequent deployments. SRE plays a vital role by ensuring the reliability and performance of the deployed systems.

ITSM & SRE

ITSM manages IT services through processes and tools to ensure service quality. SRE complements ITSM by focusing on operational excellence and ensuring high-quality service delivery through reliability and performance.

New Trends in SRE

New trends in SRE include accepting failure as normal, automating tasks through services, adopting cloud technologies, focusing on learning from failures, and evolving the role of network engineers.

Failure as the New Normal

Accepting that failures are inevitable and focusing on mitigating their impact and learning from them is a key principle in SRE.

Automation as a Service

Automating tasks like monitoring, deployment, and recovery to reduce manual errors and increase efficiency is becoming increasingly crucial in SRE.

Cloud is King

Cloud technologies are becoming increasingly important for SRE, providing scalability, flexibility, and cost-effectiveness for infrastructure and services.

Observe & Learn

Analyzing data and learning from failures is key in SRE, helping to improve system reliability and performance over time.

Network Reliability Engineer (NRE)

NREs apply engineering principles to enhance network reliability, automate tasks, and integrate software-defined networks (SDNs) using SDLC principles.

Blameless Post Mortem

A structured process of examining an incident without assigning blame to individuals. Focuses on identifying root causes, implementing corrective actions, and learning from failures.

SRE On-Call Burn Out

The exhaustion and stress experienced by SREs who are frequently on call to handle incidents, leading to decreased productivity and potential burnout.

ChatOps for Distributed Teams

Using a chat platform to coordinate and collaborate on incident response, enabling efficient communication and knowledge sharing between geographically dispersed teams.

Incident Post Mortem Outputs

Outputs should include documented root causes, corrective actions, and lessons learned to prevent similar incidents in the future.

SLOs for Business Criticality

Defining Service Level Objectives (SLOs) that align with the business criticality of services, ensuring appropriate levels of availability and performance.

Reduced Outages with SRE

Implement SRE practices to minimize high-impact outages, reducing the impact on business operations and user experience.

Scaling Digital Services

Building and maintaining systems that can handle increasing load and user traffic, ensuring sustainable service performance.

Creating a Blameless Culture

Promoting a work environment where mistakes are viewed as opportunities for learning, encouraging open communication and collaboration to improve processes.

Study Notes

SRE-Led Service Automation

  • Environments are provisioned using Infrastructure/Config as Code.
  • Automated functional and non-functional tests are conducted in production.
  • Versioned and signed artifacts are used to deploy system components.
  • Instrumentation is in place to view the service externally.
  • Future growth is anticipated.
  • Clear anti-fragility is evident.
  • Tools used include Fire drills, Chaos Monkey, PagerDuty, VictorOps, and Squadcast.

SRE-Led Service Automation (Flowchart)

  • The flow starts with Commit ID: 113, followed by Build, Run Unit Tests, Code Analysis, Create Test Env, Deploy Code.
  • Then it moves to Load Test Data, Run Tests, Create Pre-Prod, Deploy Code, Run Perf Test
  • The next steps are Run Security Test, Check Monitors, Create Prod, Prod Deploy, Run Tests, Run NFTs, Check Monitors, and Failure Tests.
  • The overall emphasis is on production. DevOps gains “wisdom of production”. SRE can say “No”.
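
The gated flow described above can be sketched as a list of stages that run in order and stop at the first failure, which is one way to picture how "SRE can say No" before code reaches production. The `run_pipeline` helper and the simulated failing security gate are hypothetical and stand in for real CI/CD tooling.

```python
from typing import Callable

def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run stages in order and stop at the first one that reports failure."""
    completed = []
    for name, step in stages:
        if not step():
            print(f"Stage '{name}' failed; pipeline stopped")
            break
        completed.append(name)
    return completed

# A shortened version of the flow; each lambda stands in for real work.
stages = [
    ("Build", lambda: True),
    ("Run Unit Tests", lambda: True),
    ("Code Analysis", lambda: True),
    ("Run Security Test", lambda: False),  # simulated failing gate
    ("Prod Deploy", lambda: True),
]

completed = run_pipeline(stages)
print(completed)
```

Because the security gate fails, the production deploy stage is never reached.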

SRE Automation is Not Just Service Automation

  • The team supporting the platform was inundated with "toil," reaching a point where they could do little else.
  • 30% of respondents said maintenance tasks are their main source of toil.
  • Automation has been used to reduce toil by 8.5%.

Hierarchy of Automation Types

  • The database notices problems and automatically fails over without human intervention.
  • The database comes with its own failover script for handling issues.
  • SREs add database support to a "generic failover" script used by everyone for problems.
  • An SRE has a failover script in their home directory used in case of a problem.
  • The database master is manually failed over if there's a problem.

Secure Automation

  • Secure automation in pipelines helps prevent insecure manual steps.
  • Generated artifacts are checked and validated for compliance to ensure security.
  • DevSecOps is introduced into the build, test, and deploy cycle to promote security.
  • SRE emphasizes extra security measures in production.

Secure Build

  • Application, infrastructure, and configuration code changes are run through code analysis tools to check for security problems.
  • Digitally signing build artifacts prevents "fake" code.
  • Secure coding practices are widely published and embraced.
  • Secure code repositories with access control are used in development.
  • Open-source coding with community feedback is implemented.
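
To make the idea of signed build artifacts concrete, here is a minimal sketch that uses an HMAC from the Python standard library. This is a simplification: real pipelines typically use asymmetric signatures so that verifiers never hold the signing key. The key and artifact bytes are made up for the example.

```python
import hashlib
import hmac

def sign_artifact(artifact: bytes, key: bytes) -> str:
    """Return a hex signature that binds the artifact bytes to the key."""
    return hmac.new(key, artifact, hashlib.sha256).hexdigest()

def verify_artifact(artifact: bytes, key: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_artifact(artifact, key)
    return hmac.compare_digest(expected, signature)

key = b"demo-signing-key"              # hypothetical; never hard-code real keys
artifact = b"app-1.0.0 build output"   # stand-in for a real build artifact
signature = sign_artifact(artifact, key)

print(verify_artifact(artifact, key, signature))                 # untouched artifact
print(verify_artifact(artifact + b" tampered", key, signature))  # modified artifact
```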

Secure Test

  • Test environments are immutable, which prevents drift during configuration changes and ensures they match the code repository.
  • Properly built testing data along with testing scenarios helps test security.

Secure Staging

  • Staging environments remain immutable to maintain consistency.
  • The same artifacts are deployed to staging for a pre-production environment.
  • Testing data in security contexts addresses security considerations, such as GDPR and PCI.
  • Dedicated security scanning is used to find vulnerabilities.
  • Dependencies and integrations with other services are checked for vulnerabilities.

Secure Production

  • Immutable production environments are maintained for consistency.
  • Production data must meet security standards (e.g., GDPR, PCI, and SOX).
  • Security scanning is used for recognizing vulnerabilities.
  • Regulatory compliance is critically important.
  • Failure testing can be helpful for demonstrating audit procedures related to compliance.

Automation Tools

  • Discussions on tools often focus on favorite technological solutions.
  • Tools are often constantly evolving.
  • Organizations often display bias towards certain tools, from open-source to big IT solutions.
  • Automating jobs is usually best delegated to individuals with deep expertise in their respective tasks.
  • Engineers are more productive using the tools they are familiar with.

Case Story: Standard Chartered

  • Fundamental SRE principles are implemented throughout the organization.
  • These principles are: everything as code, everything via APIs, one pipeline, self-test and self-heal.
  • 28 person-years of effort have been saved through these principles.
  • Over 13,000 manual reviews have been avoided.
  • Time to repair has decreased by 25 minutes on average.
  • The organization avoided 200 self-inflicted operations incidents.

How Much Automation Do You Have?

  • A variety of tools are used for managing the software development lifecycle, including planning, creating, verifying, packaging, securing, and releasing.
  • Development tools are grouped by lifecycle stage:
  • Manage: Audit Management, Authentication and Authorization, DevOps Score, Value Stream Management.
  • Plan: Issue Tracking, Kanban Boards, Time Tracking, Agile Portfolio Management.
  • Create: Source Code Management, Code Review, Wiki, Web IDE.
  • Verify: Continuous Integration, Code Quality, Performance Testing, Usability Testing.
  • Package: Package Registry, Helm Charts, Dependency Firewall.
  • Secure: SAST, DAST, IAST.
  • Release: Continuous Delivery, Release Orchestration, Review Apps, Incremental Rollout.

Manage (Audit Management)

  • Automated tools are used to ensure products and services are auditable.
  • Audit logging of build, test, deploy activities, configurations and users is stored alongside production operation logs.
  • Secure processes for authentication and authorization including User and password management and two-factor authentication are implemented.
  • Tools like AWS IAM (Identity and Access Management) are used along with cloud user tools.

Manage (DevOps Score)

  • This is a metric that shows DevOps adoption throughout the organization and its impact on delivery velocity.
  • Value stream management shows the flow of value delivery within the DevOps lifecycle.
  • Tools like GitLab CI, Jenkins extensions, and other DevOps tools such as DevOptics provide the visualization.

Plan (1)

  • Tools like Jira, Trello, CA's Agile Central and VersionOne are used to capture issues or backlogs of work.
  • Kanban boards map out workflows for managing delivery flows, which support issue tracking.
  • Tools like Jira or Trello show time spent and effort related to issues and tasks.
  • Agile Portfolio Management evaluates in-flight projects and future initiatives.

Plan (2)

  • ServiceNow acts as a platform for managing the lifecycle of services including interactions with stakeholders, both inside and outside the organization.
  • Requirements Management tools are used to define, track, and manage requirements, ensuring traceability and handling dependencies.
  • Quality Management tools handle planning, execution, and tracking of testing to assist in identifying and addressing issues.

Create (Source Code Management)

  • Tools such as Git and SVN are used to securely maintain source code in a multi-user environment.
  • Code reviews can verify quality using tools such as Gerrit, TFS, Crucible, and GitLab.
  • Confluence is used for content-rich Wiki style knowledge sharing.

Create (Web IDE)

  • Web IDE tools use web-based clients integrated with development environments that boost developer efficiency and productivity without a need for local tools.
  • Snippets for code are stored and shared to support collaboration around specific code pieces.

Verify (1)

  • Continuous Integration (CI) refers to integrating, building, and testing code in development environments.
  • Sonar and Checkmarx perform code analysis, reviewing comments, architecture, duplication, unit test coverage, complexity, potential defects, and language rules.

Verify (2)

  • Performance testing helps measure the speed, responsiveness and stability of applications under load. Tools like Gatling can be used.
  • Usability testing assesses how easily users interact with a service. Tools like Crazy Egg and Optimizely offer ways to support user flows.

Package (1)

  • Software packages, artifacts, and metadata are managed in a repository called Package Registry.
  • Popular examples include Artifactory and Nexus.
  • Container Registry is a secure storage for Container images, supporting upload and download. Docker Hub, Artifactory, and Nexus are examples.
  • Dependency Proxy allows for proxying calls to other sources for frequently used images and packages.

Package (2)

  • Helm Charts are used to represent or describe Kubernetes resources.
  • Dependency Firewall scans dependencies to prevent security threats.

Secure (1)

  • Static Application Security Testing (SAST) examines application code to find problems.
  • Dynamic Application Security Testing (DAST) scans applications from the outside to look for vulnerabilities.
  • Interactive Application Security Testing (IAST) combines SAST and DAST approaches to discover vulnerabilities in real time.
  • Secret detection tools prevent sensitive information like passwords and tokens from getting unintentionally released.
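
A secret detection step can be pictured as a pattern scan over text before it is committed or released. The sketch below is deliberately small: the two patterns and the `find_secrets` helper are illustrative assumptions, and real scanners ship far larger rule sets plus entropy checks.

```python
import re

# Two illustrative patterns only; the sample values below are fake.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "password_assignment": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
}

def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for every suspicious line."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings

sample = (
    'region = "eu-west-1"\n'
    'password = "hunter2"\n'
    'key = "AKIAABCDEFGHIJKLMNOP"\n'
)

findings = find_secrets(sample)
print(findings)
```

A pipeline would typically fail the build when `findings` is non-empty.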

Secure (2)

  • Dependency Scanning is used to find known vulnerabilities in dependencies while developing applications. Tools such as Synopsys, Gemnasium, Retire.js and bundler-audit are often used.
  • Tools like Black Duck, Synopsys, Snyk, Clair, and Klar are used for container scanning to look for known vulnerabilities and problems in containers before they are deployed.
  • License Compliance tools like Black Duck and Synopsys ensure your dependency licenses are appropriate for the application.

Secure (3)

  • Vulnerability Database tools gather and maintain records of vulnerabilities for checking throughout the deployment pipeline.
  • Fuzzing techniques automatically check software for weakness by inputting unexpected data and observing crashes.
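
As a minimal sketch of the fuzzing idea, the code below throws edge cases and random strings at a toy parser and records any exception other than the documented rejection. The `parse_percentage` target and its deliberate bug are invented for the example; production fuzzers are coverage-guided and far more sophisticated.

```python
import random

def parse_percentage(text: str) -> int:
    """Toy target with a deliberate bug: it indexes into possibly empty input."""
    if text[-1] == "%":          # IndexError when text is empty
        text = text[:-1]
    value = int(text)            # ValueError is the documented rejection
    if not 0 <= value <= 100:
        raise ValueError("percentage out of range")
    return value

def fuzz(target, runs: int = 200, seed: int = 0) -> list[tuple[str, str]]:
    """Feed edge cases and random strings to target; collect unexpected crashes."""
    rng = random.Random(seed)
    alphabet = "0123456789-% abc"
    # Start from a small seed corpus of edge cases, then add random strings.
    candidates = ["", "%", "100%"]
    for _ in range(runs):
        length = rng.randint(1, 6)
        candidates.append("".join(rng.choice(alphabet) for _ in range(length)))
    crashes = []
    for candidate in candidates:
        try:
            target(candidate)
        except ValueError:
            continue  # expected way of rejecting bad input
        except Exception as exc:
            crashes.append((candidate, type(exc).__name__))
    return crashes

crashes = fuzz(parse_percentage)
print(crashes)
```

The empty string exposes the unguarded index access, which is the kind of weakness fuzzing is meant to surface.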

Release (4)

  • Continuous Delivery keeps software in a state where it can be released to production reliably at any time.
  • Release Orchestration tools like Jenkins and GitLab CI can be used for detecting, orchestrating, and testing changes to ensure there are no negative impacts on applications.
  • Pages can be created automatically as part of a CI/CD pipeline.
  • Review Apps let developers preview and test committed changes in real time, using environments spun up specifically for this task.
  • Incremental Rollouts are used to deploy changes to services gradually instead of releasing larger changes at once.

Release (5)

  • Canary deployment releases the program to a small subset of users to test updates before a full-blown release.
  • Feature flags modify system behavior without code changes. Tools like LaunchDarkly support these flags.
  • Release Governance processes create reliable control and automated processes (for security, compliance, etc) to meet organizational needs for understanding changes.
  • Secrets Management tools and methods securely manage sensitive credentials like passwords, keys, APIs, and tokens.
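
Canary releases and percentage-based feature flags both need a stable way to decide which users see a change. One common approach, sketched here under the assumption that users are identified by a string ID, is to hash the user and feature name into one of 100 buckets. The `in_rollout` function and the feature name are hypothetical and not taken from any particular tool.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Place the user in a stable bucket from 0 to 99 and compare with the rollout size."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

users = [f"user-{n}" for n in range(1000)]
canary = [u for u in users if in_rollout(u, "new-checkout", 5)]
wider = [u for u in users if in_rollout(u, "new-checkout", 50)]

print(len(canary), len(wider))
```

Because the bucket depends only on the user and feature, a user who is in the 5% canary stays included when the rollout widens to 50%.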

Configure (1)

  • Auto DevOps automatically configures the software development lifecycle: it detects, builds, tests, deploys, and monitors applications.
  • ChatOps lets DevOps actions, such as builds and deployments, be triggered and followed from chat in real time.
  • Runbooks help create procedures for efficient service operations.
  • Serverless paradigms remove the dependency on managing servers: a piece of code is executed on demand by a cloud service provider.

Monitor (3)

  • Metrics tools gather and display performance data for applications.
  • Logging captures and stores system activity. Logs contain information about process calls, events, user data, responses, and errors.
  • Tracing tools provide detailed analysis of application performance, identifying and investigating issues within systems.
  • Cluster monitoring provides updates regarding health of deployment environments and clusters.

Monitor (4)

  • Error tracking tools discover and show errors.
  • Incident management handles events and captures issues (who what when).
  • Synthetic monitoring simulates user actions to monitor service behavior.
  • Status pages provide users with information on service availability and status.
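
Synthetic monitoring can be sketched as a scripted user step whose outcome is compared with expectations. In the example below, `fake_fetch` stands in for a real HTTP client so the code runs offline; the URL, page content, and two-second threshold are all assumptions.

```python
import time
from typing import Callable

def run_synthetic_check(
    fetch: Callable[[str], tuple[int, str]],
    url: str,
    expected_text: str,
    max_seconds: float = 2.0,
) -> dict:
    """Replay one scripted user step and compare the outcome with expectations."""
    started = time.perf_counter()
    try:
        status, body = fetch(url)
    except Exception as exc:  # any transport error counts as a failed check
        return {"ok": False, "reason": f"request failed: {exc}"}
    elapsed = time.perf_counter() - started

    if status != 200:
        return {"ok": False, "reason": f"unexpected status {status}"}
    if expected_text not in body:
        return {"ok": False, "reason": "expected content missing"}
    if elapsed > max_seconds:
        return {"ok": False, "reason": "response too slow"}
    return {"ok": True, "reason": "all expectations met"}

def fake_fetch(url: str) -> tuple[int, str]:
    """Stand-in for a real HTTP client so the sketch runs offline."""
    pages = {"https://shop.example/login": (200, "<h1>Welcome back</h1>")}
    return pages.get(url, (404, "not found"))

result = run_synthetic_check(fake_fetch, "https://shop.example/login", "Welcome back")
print(result)
```

In practice such a check would run on a schedule from several locations and feed its result into alerting or a status page.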

Defend (1)

  • Runtime Application Self Protection (RASP) tools monitor for and block security threats immediately as they occur.
  • Web Application Firewall (WAF) examines traffic to block malicious traffic.
  • Threat detection systems detect, report, and provide response capabilities for threats like DoS (denial-of-service) attacks.
  • User and Entity Behavior Analytics (UEBA) uses machine learning to look for abnormal user behavior and alert security teams to potential issues.

Defend (2)

  • Vulnerability Management helps identify, record, manage, and mitigate vulnerabilities in assets and applications automatically.
  • Data Loss Prevention (DLP) tools prevent data from being shared beyond authorized channels within the environment.
  • Storage security protects data from unauthorized access from outside the organization by securing storage systems and the ecosystems around them.
  • Container network security protects connections between containers to stop unintentional data movement or traffic.

Case Story: Netflix

  • Simian Army is a suite of tools developed by Netflix to test the reliability, security, or resiliency of their infrastructure.
  • Acting on the results of these tests helps minimize negative impact on the consumer experience and on staff response.

Chaos Engineering Next Steps

  • Segregating the system so that it can be tested with key components removed is one of the first steps.
  • Testing without key components can happen in non-production environments first to minimize impact to real customers.
  • Introduce failures at a component level in production as a test and learn strategy.
  • Simulate database failure in production to discover how the system handles it.
  • Simulate total system failure.
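
A component-level failure test in the spirit of these steps can be modelled in a few lines: pick a random instance, terminate it, and check whether the service still answers. The `ServicePool` class is a toy model invented for the example and is not how Chaos Monkey itself is implemented.

```python
import random

class ServicePool:
    """Toy model of a service backed by several interchangeable instances."""

    def __init__(self, instance_names):
        self.healthy = set(instance_names)

    def terminate(self, name: str) -> None:
        self.healthy.discard(name)

    def handle_request(self) -> bool:
        # The service answers as long as at least one instance is healthy.
        return len(self.healthy) > 0

def chaos_experiment(pool: ServicePool, rng: random.Random) -> tuple[str, bool]:
    """Terminate one random instance and report whether the service survived."""
    victim = rng.choice(sorted(pool.healthy))
    pool.terminate(victim)
    return victim, pool.handle_request()

redundant = ServicePool(["web-1", "web-2", "web-3"])
victim, survived = chaos_experiment(redundant, random.Random(42))
print(f"terminated {victim}, service survived: {survived}")

single = ServicePool(["db-1"])
_, single_survived = chaos_experiment(single, random.Random(42))
print(f"single-instance service survived: {single_survived}")
```

The second run shows the kind of finding such an experiment is after: a component with no redundancy is a single point of failure.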

Chaos Engineering Next Steps (2)

  • Examine holistic logging to find the root cause of outages.
  • Identify and improve dependencies between services.
  • Automate the process of error handling.
  • Learning from "real" failures is an essential part of the improvement process.

Chaos Engineering Next Steps (3)

  • Chaos engineering helps contain the effects of issues to smaller portions of services and deployments.

Chaos Engineering Next Steps (4)

  • Automating recovery processes
  • Logging
  • Monitoring
  • Alerting systems
  • Security

Typical Responsibilities

  • Engineering across the full lifecycle of a service or product.
  • Providing consultation regarding service design, development, and platform creation.
  • Maintaining services through ongoing monitoring.
  • Scaling systems through automation.
  • Providing incident response and performing blameless post-mortems.

Incident Response: On Call

  • Providing support and handling issues for services during working and non-working hours is an important aspect of on-call support.
  • Support involves handling issues such as outages, incidents, and similar issues, and requires availability to address important problems.

On Call by Numbers (1)

  • Google advocates limiting on-call responsibility to 25% of an engineer's time, which helps prevent burnout and single points of failure.
  • Providing 24/7 support requires several SREs to cover various time zones and ensure service availability.
  • Spreading responsibility for on-call across several sites or locations can be useful against issues from under-utilization, or over-utilization of on-call staff.
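
The 25% figure translates directly into a minimum team size. The arithmetic below is a simplification that ignores holidays, sickness, and shift handovers, and the `min_engineers` helper is a name chosen for the example.

```python
import math

def min_engineers(concurrent_roles: int, max_on_call_fraction: float) -> int:
    """Smallest team that fills every on-call role without exceeding the cap."""
    return math.ceil(concurrent_roles / max_on_call_fraction)

# One engineer on call at all times, each capped at 25% of their time.
print(min_engineers(1, 0.25))

# A primary and a secondary on call at all times, same cap.
print(min_engineers(2, 0.25))
```

Splitting the rotation across two sites keeps the total headcount the same but lets each site cover its own daytime hours.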

Checklist for Effective On Call

  • Clearly assigning responsibilities related to "on-call", or "on-call time".
  • Appropriate monitoring of devices/tools to allow teams to identify, isolate, and fix events or issues.
  • Ensuring well-documented procedures for handling incidents, and providing context to staff to enable efficient resolution without repeat issues.
  • Creating a safe environment where employees feel comfortable documenting issues without fear of being blamed.

Using Automation to Replace On-Call

  • Automated self-healing processes replace the need for manual intervention by operators when problems occur, reducing downtime.
  • Tools like Kubernetes and AWS Auto Scaling groups automate recovery, which increases service reliability and decreases the need for on-call support.
  • Processes for self-healing capabilities can be reviewed to improve incident response and safety.
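
A self-healing process of the kind described above can be sketched as a check-and-restart loop that only escalates to a human when automation runs out of attempts. This is a minimal illustration, not how Kubernetes or AWS implement it; `self_heal`, `FlakyService`, and the restart limit are hypothetical names and values chosen for the example.

```python
from typing import Callable


def self_heal(check: Callable[[], bool],
              restart: Callable[[], None],
              max_restarts: int = 3) -> bool:
    """Return True if the service is healthy (possibly after restarts).

    Return False when the restart budget is exhausted, which is the point
    where an on-call engineer would be paged.
    """
    for attempt in range(max_restarts + 1):
        if check():
            return True
        if attempt < max_restarts:
            restart()
    return False


class FlakyService:
    """Toy service that becomes healthy after a given number of restarts."""

    def __init__(self, restarts_needed: int) -> None:
        self.restarts_needed = restarts_needed
        self.restarts = 0

    def healthy(self) -> bool:
        return self.restarts >= self.restarts_needed

    def restart(self) -> None:
        self.restarts += 1


if __name__ == "__main__":
    svc = FlakyService(restarts_needed=2)
    print("recovered" if self_heal(svc.healthy, svc.restart) else "page on-call")
```

Capping the number of restarts keeps the automation from masking a persistent fault: once the budget is spent, the problem is handed to a person.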

Reasons for a Blameless Post Mortem

  • User-visible downtime or service degradation beyond a specified service level objective (SLO).
  • Data loss.
  • On-call engineer intervention for resolving the issue.
  • Resolution times that exceed a set threshold.
  • Monitoring failures (requiring manual intervention).
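
The triggers listed above can be expressed as a simple check run when an incident is closed. The sketch below is one hypothetical encoding: the `Incident` fields and the 60-minute resolution threshold are assumptions for illustration, since the notes only say "a set threshold".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    slo_breached: bool = False          # user-visible impact beyond the SLO
    data_lost: bool = False
    on_call_intervened: bool = False
    resolution_minutes: float = 0.0
    monitoring_failed: bool = False     # issue was found manually


def postmortem_reasons(incident: Incident,
                       max_resolution_minutes: float = 60.0) -> List[str]:
    """List every trigger the incident meets; empty means none apply."""
    reasons = []
    if incident.slo_breached:
        reasons.append("SLO breached")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.on_call_intervened:
        reasons.append("on-call intervention")
    if incident.resolution_minutes > max_resolution_minutes:
        reasons.append("resolution time over threshold")
    if incident.monitoring_failed:
        reasons.append("monitoring failure")
    return reasons


if __name__ == "__main__":
    incident = Incident(slo_breached=True, resolution_minutes=90)
    print(postmortem_reasons(incident))
```

Returning the list of reasons, rather than a bare yes or no, gives the post mortem a starting point for its "triggers" section.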

Blameless Post Mortems

  • Detailed accounts of the actions taken during an incident should be captured so that the process can be improved.
  • The steps and causes that contributed to the issue should be identified and recorded.
  • Expectations and assumptions should also be documented to help team members better anticipate and avoid future problems.

Culture and the Flow of Information

  • Information flows vary in pathological, bureaucratic, and generative organizations.
  • Pathological cultures hide information and discourage communication; bureaucratic cultures may ignore or compartmentalize information; generative cultures actively seek information and reward communication.

Creating a Safe Environment

  • Engineers should be given the authority to explain their work in detail so that the factors contributing to issues and failures can be addressed.
  • The individuals who were involved in a mistake are often the best placed to teach the rest of the team what to do in similar situations.

Changing Failure Reactions

  • Organizational culture influences how incidents are handled and addressed.
  • A culture in which people feel safe to discuss their mistakes and learn from them fosters a positive and effective working environment.

Post Mortem Outputs

  • Post mortems should record the details of the incident or failure: summary, impact, triggers, detection, resolution, and participants.
  • Lists of follow-up actions are helpful in avoiding future similar incidents.
  • Lessons learned are important to prevent future issues.
  • Timelines of events from the incident are valuable in helping the team understand what happened.
  • Supporting data or documentation should always be included when recording lessons learned and similar activities.
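
The outputs listed above map naturally onto a fixed template. The following sketch models that template as a data class and reports which sections are still empty; the class and field names are illustrative choices, not a standard format.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class PostMortem:
    summary: str = ""
    impact: str = ""
    triggers: str = ""
    detection: str = ""
    resolution: str = ""
    participants: List[str] = field(default_factory=list)
    timeline: List[str] = field(default_factory=list)
    follow_up_actions: List[str] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    supporting_data: List[str] = field(default_factory=list)


def missing_sections(report: PostMortem) -> List[str]:
    """Names of sections that are still empty, in template order."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


if __name__ == "__main__":
    draft = PostMortem(summary="Checkout errors", impact="Orders failed")
    print(missing_sections(draft))
```

A completeness check like this can run before a post mortem is published, so that follow-up actions and lessons learned are not silently omitted.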

Case Story: Sage Group

  • Sage Group uses an automated incident reporting process to improve incident handling.
  • Relevant information about each issue is posted in a shared channel for timely communication and coordination.

SRE - A Reminder

  • SRE is a set of engineering activities for implementing and running systems and applications reliably.
  • The principles of SRE address the need for an organization that can operate large-scale services.
  • A core part of the SRE methodology is understanding the implications of issues and making sure an adequate response procedure is in place.

Key Success Factors for SRE Adoption

  • Executive support is critical for a company to implement SRE initiatives.
  • Strong relationships between engineers who work within the same organization or on similar types of tasks.
  • An organization that is constantly growing can benefit greatly from the implementation and use of SRE.

An organization that is growing...

  • Platform growth (large volumes of users, irregular data flows)
  • Scope growth (introduction of new products or services)
  • Ticket growth (higher volume of incidents, requests, and toil)

Use Engineering Approaches to Scale

  • Engineering approaches make service operations faster and more scalable.

SRE Approaches to Platform Growth

  • Techniques for handling and managing growing platforms
  • Auto-scaling
  • Containerization
  • Clustering
  • Public/private cloud
  • NoSQL databases (for example, MongoDB) and similar "as-a-service" solutions
  • Tools, automation, and similar techniques help handle increased system needs.
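
To make auto-scaling concrete, the sketch below computes a target replica count from a load metric. It follows the basic ratio formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric)) and leaves out details such as tolerance bands and stabilization windows. The function name and the replica bounds are assumptions made for this example.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 20) -> int:
    """Scale replicas in proportion to how far the metric is from target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp so a metric spike cannot scale the service without limit.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas at 90% average CPU with a 60% target
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

The upper bound matters as much as the formula: it caps cost and protects downstream dependencies when the metric spikes.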

SRE Approaches to Scope Growth

  • Shared responsibility for tools and platforms.
  • Expanding reach of SRE expertise into development teams.
  • Efficient toil automation can allow SRE teams to handle growth demands.

SRE Approaches to Ticket Growth

  • Toil is reduced through automated ticket responses and self-service features.
  • Prioritization of toil reduction efforts.
  • DRY ("Don't Repeat Yourself") - avoid the repetition of effort/problem resolution.
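
One way to picture automated ticket responses is a lookup from ticket category to a single reusable handler, with everything else routed to a human queue. The categories, handler functions, and message strings below are invented for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Ticket = Tuple[str, str]  # (category, requesting user)


def reset_password(user: str) -> str:
    return f"Password reset link sent to {user}"


def increase_quota(user: str) -> str:
    return f"Quota increased for {user}"


# One handler per recurring request type: the fix is written once (DRY).
HANDLERS: Dict[str, Callable[[str], str]] = {
    "password_reset": reset_password,
    "quota_increase": increase_quota,
}


def triage(tickets: List[Ticket]) -> Tuple[List[str], List[Ticket]]:
    """Auto-resolve known ticket types; return the rest for manual work."""
    resolved, manual = [], []
    for category, user in tickets:
        handler = HANDLERS.get(category)
        if handler is None:
            manual.append((category, user))
        else:
            resolved.append(handler(user))
    return resolved, manual


if __name__ == "__main__":
    done, todo = triage([("password_reset", "ana"), ("disk_failure", "raj")])
    print(done, todo)
```

Tickets that land in the manual queue repeatedly are natural candidates for the next handler, which is how toil reduction efforts can be prioritized.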

Module 7: Exercise

  • Organizations should develop an implementation plan for adopting SRE.

Case Story: Department for Work & Pensions

  • The organization is focused on implementing processes that help eliminate the recurrence of issues and failures and automate resolutions.
  • An SRE initiative can help improve the stability, reliability, and performance of services across an organization.

Module Seven Quiz

  • Questions cover the benefits, processes, and responsibilities of SREs and organizations.

Module Eight Quiz

  • Questions cover SRE trends, frameworks, and other related topics.

Summary

  • Site Reliability Engineering (SRE) is a set of activities for developing and operating systems, and for handling issues that affect the quality and quantity of service an organization can provide to its users.
  • SRE principles such as automation, monitoring, and engineering help provide reliable operations for applications and services.
