Questions and Answers
What does MTTR stand for in the context of performance metrics?
Which metric indicates the maximum acceptable amount of data loss in a service?
What is a primary goal of introducing 'Chaos engineering' in a system?
What concept states that organizations must learn from failures to remain competitive?
Which metric is used to measure how quickly an organization can detect failures or incidents?
What is the primary focus of 'The Third Way' in a learning organization?
Which statement best describes the concept of Service Level Objective (SLO)?
What is the disadvantage of excessive data loss in a messaging queue?
Who developed the Chaos Monkey service?
Which metric primarily focuses on the detection of early failures?
What approach can help analyze the implications of relying on a key person?
What does it indicate if an organization consistently covers up service failures?
What is the main purpose of integrating with monitoring services?
What is the primary purpose of tracing in application performance management?
Which tool type is specifically designed to report and support responsive actions against attacks?
What does synthetic monitoring simulate to evaluate service behavior?
What component is essential for capturing details of service incidents in incident management?
What is the function of a Web Application Firewall (WAF)?
What technique does User and Entity Behavior Analytics (UEBA) employ?
What is the primary function of error tracking tools?
What do status pages provide to users?
What is emphasized as crucial for creating a safe environment in an organization?
According to the content, how should engineers be treated during a failure analysis?
What is the main purpose of allowing engineers to contribute to failure discussions?
What does the quote from John Allspaw imply about understanding failures?
What is the consequence of reacting negatively to failure, as suggested in the content?
What is conveyed by the statement regarding access to production?
What does the phrase 'to be effective we need more people to access production' suggest?
What is a key belief expressed in the content regarding mistakes made by individuals?
What is the primary focus when creating a blameless environment?
What outcome is associated with appropriate Service Level Objectives (SLOs)?
Which practice can help prevent SREs from experiencing burnout while on call?
What is the best approach to facilitate a geographically spread team in resolving a live issue?
What should be avoided during the implementation of post mortem meetings?
How does the scalability of digital services benefit large user bases?
What is a primary goal of sharing knowledge among teams?
What should be done when an incident matches pre-set criteria for a post mortem?
What is a key focus of Site Reliability Engineering (SRE)?
How does ITIL 4 aim to ensure value for stakeholders?
Which of the following trends in Site Reliability Engineering highlights the importance of adapting to failures?
What role does a Network Reliability Engineer (NRE) perform?
What is the primary purpose of DevOps in relation to software delivery?
What methodology emphasizes user-centered design within the software development process?
Which of the following best describes the goal of Continuous Delivery?
What does automation as a service entail according to emerging trends in SRE?
How does SRE relate to ITIL and DevOps?
What indicates the evolution of the network engineer as a trend in SRE?
Study Notes
SRE-Led Service Automation
- Environments are provisioned using Infrastructure/Config as Code.
- Automated functional and non-functional tests are conducted in production.
- Versioned and signed artifacts are used to deploy system components.
- Instrumentation is in place to view the service externally.
- Future growth is anticipated.
- Clear anti-fragility is evident.
- Tools used include Fire drills, Chaos Monkey, PagerDuty, VictorOps, and Squadcast.
SRE-Led Service Automation (Flowchart)
- The flow starts with Commit ID: 113, followed by Build, Run Unit Tests, Code Analysis, Create Test Env, Deploy Code.
- Then it moves to Load Test Data, Run Tests, Create Pre-Prod, Deploy Code, and Run Perf Test.
- The next steps are Run Security Test, Check Monitors, Create Prod, Prod Deploy, Run Tests, Run NFTs, Check Monitors, and Failure Tests.
- The overall emphasis is on production. DevOps gains “wisdom of production”. SRE can say “No”.
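The flow above is a sequence of gated stages, where a failing stage stops everything after it. A minimal sketch of that idea follows; the `run_pipeline` helper and the stand-in `ok`/`fail` steps are hypothetical names used only for illustration, not part of any CI tool.

```python
from typing import Callable, List, Tuple

# A stage is a (name, step) pair; the step returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that were executed,
    including the one that failed (if any).
    """
    executed: List[str] = []
    for name, step in stages:
        executed.append(name)
        if not step():
            break  # later stages never run after a failure
    return executed


def ok() -> bool:
    return True  # stand-in for a stage that passes


def fail() -> bool:
    return False  # stand-in for a stage that fails


stages = [
    ("Build", ok),
    ("Run Unit Tests", ok),
    ("Code Analysis", fail),
    ("Create Test Env", ok),
]
ran = run_pipeline(stages)
print(ran)
```

Because Code Analysis fails in this example, Create Test Env is never reached, which is the gating behavior the flowchart implies.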
SRE Automation is Not Just Service Automation
- The team supporting the platform was inundated with "toil," reaching a point where they could do little else.
- 30% of respondents said maintenance tasks are their main source of toil.
- Automation has been used to reduce toil by 8.5%.
Hierarchy of Automation Types
- The database notices problems and automatically fails over without human intervention.
- The database comes with its own failover script for handling issues.
- SREs add database support to a "generic failover" script used by everyone for problems.
- An SRE has a failover script in their home directory used in case of a problem.
- The database master is manually failed over if there's a problem.
Secure Automation
- Secure automation in pipelines helps prevent insecure manual steps.
- Generated artifacts are checked and validated for compliance to ensure security.
- DevSecOps is introduced into the build, test, and deploy cycle to promote security.
- SRE emphasizes extra security measures in production.
Secure Build
- Application, infrastructure, and configuration code changes are run through code analysis tools to check for security problems.
- Digitally signing build artifacts prevents "fake" code.
- Secure coding practices are widely published and embraced.
- Secure code repositories with access control are used in development.
- Open-source coding with community feedback is implemented.
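To illustrate how signing a build artifact lets a later stage reject "fake" code, here is a minimal sketch using a shared-secret HMAC from the Python standard library. This is a simplification chosen to keep the example self-contained: real pipelines typically use asymmetric signatures, and the function names and key below are assumptions for illustration only.

```python
import hashlib
import hmac


def sign_artifact(artifact: bytes, key: bytes) -> str:
    """Return a hex signature for the artifact (HMAC-SHA256 stand-in)."""
    return hmac.new(key, artifact, hashlib.sha256).hexdigest()


def verify_artifact(artifact: bytes, key: bytes, signature: str) -> bool:
    """Check the signature using a constant-time comparison."""
    expected = sign_artifact(artifact, key)
    return hmac.compare_digest(expected, signature)


# Hypothetical key and artifact contents, for illustration only.
signing_key = b"example-signing-key"
artifact = b"contents of the build artifact"
signature = sign_artifact(artifact, signing_key)

print(verify_artifact(artifact, signing_key, signature))
print(verify_artifact(b"tampered contents", signing_key, signature))
```

The deploy stage would call `verify_artifact` before installing anything and refuse artifacts whose signature does not match.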
Secure Test
- Test environments are immutable, which prevents mutation during configuration changes and ensures compliance with the code repository.
- Properly built testing data along with testing scenarios helps test security.
Secure Staging
- Staging environments remain immutable to maintain consistency.
- The same artifacts are deployed to staging for a pre-production environment.
- Testing data in security contexts addresses security considerations, such as GDPR and PCI.
- Dedicated security scanning is used to find security vulnerabilities.
- Dependencies and integrations with other services are checked for vulnerabilities.
Secure Production
- Immutable production environments are maintained for consistency.
- Production data must meet security standards (e.g., GDPR, PCI, and SOX).
- Security scanning is used for recognizing vulnerabilities.
- Regulatory compliance is critically important.
- Failure testing can be helpful for demonstrating audit procedures related to compliance.
Automation Tools
- Discussions on tools often focus on favorite technological solutions.
- Tools are often constantly evolving.
- Organizations often display bias towards certain tools, from open-source to big IT solutions.
- Automating jobs is usually best delegated to individuals with deep expertise in their respective tasks.
- Engineers are more productive using the tools they are familiar with.
Case Story: Standard Chartered
- Fundamental SRE principles are implemented throughout the organization.
- These principles are: everything as code, everything via APIs, one pipeline, self-test and self-heal.
- 28 person-years of effort have been saved through these principles.
- Over 13,000 manual reviews have been avoided.
- Time to repair has decreased by 25 minutes on average.
- The organization avoided 200 self-inflicted operations incidents.
How Much Automation Do You Have?
- A variety of tools are used for managing the software development lifecycle, including planning, creating, verifying, packaging, securing, and releasing.
- These tools cover the stages Manage (Audit Management, Authentication and Authorization, DevOps Score, Value Stream Management); Plan (Issue Tracking, Kanban Boards, Time Tracking, Agile Portfolio Management); Create (Source Code Management, Code Review, Wiki, Web IDE); Verify (Continuous Integration, Code Quality, Performance Testing, Usability Testing); Package (Package Registry, Helm Charts, Dependency Firewall); Secure (SAST, DAST, IAST); and Release (Continuous Delivery, Release Orchestration, Review Apps, Incremental Rollout).
Manage (Audit Management)
- Automated tools are used to ensure products and services are auditable.
- Audit logging of build, test, deploy activities, configurations and users is stored alongside production operation logs.
- Secure processes for authentication and authorization, including user and password management and two-factor authentication, are implemented.
- Tools like AWS IAM (Identity and Access Management) are used along with cloud user tools.
Manage (DevOps Score)
- This is a metric that shows DevOps adoption throughout the organization and its impact on delivery velocity.
- Value stream management shows the flow of value delivery within the DevOps lifecycle.
- Tools like GitLab CI, Jenkins extensions, and other DevOps tools such as DevOptics provide the visualization.
Plan (1)
- Tools like Jira, Trello, CA's Agile Central and VersionOne are used to capture issues or backlogs of work.
- Kanban boards map out workflows for managing delivery flows, which support issue tracking.
- Tools like Jira or Trello show time spent and effort related to issues and tasks.
- Agile Portfolio Management evaluates in-flight projects and future initiatives.
Plan (2)
- ServiceNow acts as a platform for managing the lifecycle of services including interactions with stakeholders, both inside and outside the organization.
- Requirements Management tools are used to define, track, and manage requirements, ensuring traceability and handling dependencies.
- Quality Management tools handle planning, execution, and tracking of testing to assist in identifying and addressing issues.
Create (Source Code Management)
- Tools are used to securely maintain source code in a multi-user environment, such as Git and SVN.
- Code reviews can verify quality using tools such as Gerrit, TFS, Crucible, and GitLab.
- Confluence is used for content-rich Wiki style knowledge sharing.
Create (Web IDE)
- Web IDE tools use web-based clients integrated with development environments that boost developer efficiency and productivity without a need for local tools.
- Snippets for code are stored and shared to support collaboration around specific code pieces.
Verify (1)
- Continuous Integration (CI) refers to integrating, building, and testing code in development environments.
- Sonar and Checkmarx perform code analysis, reviewing comments, architecture, duplication, unit test coverage, complexity, potential defects, and language rules.
Verify (2)
- Performance testing helps measure the speed, responsiveness and stability of applications under load. Tools like Gatling can be used.
- Usability testing assesses how easily users interact with a service. Tools like Crazy Egg and Optimizely offer ways to support user flows.
Package (1)
- Software packages, artifacts, and metadata are managed in a repository called Package Registry.
- Popular examples include Artifactory and Nexus.
- Container Registry is a secure storage for Container images, supporting upload and download. Docker Hub, Artifactory, and Nexus are examples.
- Dependency Proxy allows for proxying calls to other sources for frequently used images and packages.
Package (2)
- Helm Charts are used to represent or describe Kubernetes resources.
- Dependency Firewall scans dependencies to prevent security threats.
Secure (1)
- Static Application Security Testing (SAST) examines application code to find problems.
- Dynamic Application Security Testing (DAST) scans applications from the outside to look for vulnerabilities.
- Interactive Application Security Testing (IAST) combines SAST and DAST approaches to discover vulnerabilities in real time.
- Secret detection tools prevent sensitive information like passwords and tokens from getting unintentionally released.
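As a rough illustration of secret detection, the sketch below scans text line by line for patterns that look like credentials. The two patterns and the `find_secrets` name are assumptions for this example; real tools ship far larger rule sets.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; a real scanner uses many more rules.
PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "hardcoded_password": re.compile(
        r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE
    ),
}


def find_secrets(text: str) -> List[Tuple[int, str]]:
    """Return (line number, rule name) for every suspicious line."""
    findings: List[Tuple[int, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


# Made-up sample content; the key below is not a real credential.
sample = (
    'user = "alice"\n'
    'password = "hunter2"\n'
    'key = "AKIAABCDEFGHIJKLMNOP"\n'
)
print(find_secrets(sample))
```

Run as a pre-commit or pipeline step, a non-empty result would block the change before the secret is released.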
Secure (2)
- Dependency Scanning finds known vulnerabilities in dependencies while applications are being developed. Tools such as Synopsys, Gemnasium, Retire.js and bundler-audit are often used.
- Tools like Black Duck, Synopsys, Snyk, Clair, and Klar are used for container scanning, which looks for known vulnerabilities and problems in containers before they are deployed.
- License Compliance tools like Black Duck and Synopsys ensure your dependency licenses are appropriate for the application.
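At its core, dependency scanning compares the versions an application pins against a list of versions known to be vulnerable. A minimal sketch of that comparison follows; the package names, versions, and advisory data are entirely made up for illustration.

```python
from typing import Dict, List, Set


def scan_dependencies(
    pinned: Dict[str, str], advisories: Dict[str, Set[str]]
) -> List[str]:
    """Return 'name==version' for each pinned dependency with an advisory."""
    findings: List[str] = []
    for name, version in sorted(pinned.items()):
        if version in advisories.get(name, set()):
            findings.append(f"{name}=={version}")
    return findings


# Hypothetical data: neither the packages nor the advisories are real.
pinned_versions = {"examplelib": "1.2.0", "otherlib": "3.1.4"}
known_vulnerable = {"examplelib": {"1.1.0", "1.2.0"}}

print(scan_dependencies(pinned_versions, known_vulnerable))
```

Real scanners also resolve transitive dependencies and version ranges, which this sketch leaves out.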
Secure (3)
- Vulnerability Database tools gather and maintain records of vulnerabilities for checking throughout the deployment pipeline.
- Fuzzing techniques automatically check software for weakness by inputting unexpected data and observing crashes.
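The fuzzing idea can be sketched in a few lines: generate unexpected inputs, feed them to the code under test, and record any input that causes an unplanned exception. The `parse_percentage` function below is a hypothetical target with a deliberate bug, and the alphabet and input lengths are arbitrary choices for the example.

```python
import random
from typing import Callable, List


def parse_percentage(text: str) -> int:
    """Hypothetical target: parses '42%' into 42.

    Deliberate bug: indexing text[-1] raises IndexError on empty input.
    """
    if text[-1] != "%":
        raise ValueError("missing % sign")
    return int(text[:-1])


def fuzz(target: Callable[[str], object], runs: int, seed: int = 0) -> List[str]:
    """Feed random strings to target and collect inputs that crash it."""
    rng = random.Random(seed)
    alphabet = "0123456789%-+ ab"
    crashes: List[str] = []
    for _ in range(runs):
        length = rng.randint(0, 6)
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            target(candidate)
        except ValueError:
            pass  # a clean rejection of bad input is expected behavior
        except Exception:
            crashes.append(candidate)  # anything else counts as a crash
    return crashes


crashing_inputs = fuzz(parse_percentage, runs=200)
print(sorted(set(crashing_inputs)))
```

Here a `ValueError` is treated as correct handling of bad input, while any other exception is recorded as a weakness to fix.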
Release (4)
- Continuous Delivery practice promotes releasing software to production continuously.
- Release Orchestration tools like Jenkins and GitLab CI can be used for detecting, orchestrating, and testing changes to ensure there are no negative impacts on applications.
- Pages (static sites) can be created automatically as part of a CI/CD pipeline.
- Review Apps let developers preview and test committed changes in real time, using environments spun up specifically for this task.
- Incremental Rollouts gradually deploy changes to services instead of releasing one large change.
Release (5)
- Canary deployment releases the program to a small subset of users to test updates before a full-blown release.
- Feature flags modify system behavior without code changes. Tools like LaunchDarkly support these flags.
- Release Governance processes create reliable control and automated processes (for security, compliance, etc) to meet organizational needs for understanding changes.
- Secrets Management tools and methods securely manage sensitive credentials like passwords, keys, APIs, and tokens.
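One common way to pick the "small subset of users" for a canary or a percentage-based feature flag is to hash a stable user identifier into a bucket from 0 to 99. The sketch below shows that approach; the `in_canary` name and the choice of SHA-256 are assumptions for illustration and do not describe how any particular tool works.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage.

    The same user always lands in the same bucket, so raising
    percent only ever adds users to the canary group.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# Hypothetical user IDs for illustration.
users = [f"user-{n}" for n in range(1000)]
canary_users = [u for u in users if in_canary(u, 10)]
print(len(canary_users), "of", len(users), "users see the new release")
```

Raising `percent` step by step turns the same function into an incremental rollout, and setting it to 0 acts as a kill switch.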
Configure (1)
- Auto DevOps integrates DevOps practices by automatically configuring the software development lifecycle: it detects, builds, tests, deploys, and monitors applications.
- ChatOps lets DevOps actions, such as builds and deployments, be triggered and followed in real time from a chat tool.
- Runbooks help create procedures for efficient service operations.
- Serverless paradigms remove the dependency on managing servers: a piece of code is executed on demand by the cloud service provider.
Monitor (3)
- Metrics tools gather and display performance data for applications.
- Logging captures and stores system activity. Logs contain information about process calls, events, user data, responses, and errors.
- Tracing tools provide detailed analysis of application performance, identifying and investigating issues within systems.
- Cluster monitoring provides updates regarding health of deployment environments and clusters.
Monitor (4)
- Error tracking tools discover and show errors.
- Incident management handles events and captures the details of issues (who, what, when).
- Synthetic monitoring simulates user actions to monitor service behavior.
- Status pages provide users with information on service availability and status.
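Synthetic monitoring can be pictured as a scripted user action that is run on a schedule and judged on success and speed. In this sketch the "user actions" are plain Python functions standing in for real requests; `run_probe`, `fake_login`, and `broken_checkout` are hypothetical names for illustration.

```python
import time
from typing import Callable, Dict


def run_probe(
    name: str, action: Callable[[], bool], max_seconds: float
) -> Dict[str, object]:
    """Run one simulated user action and report whether it passed in time."""
    started = time.perf_counter()
    try:
        succeeded = bool(action())
    except Exception:
        succeeded = False  # any error counts as a failed user journey
    elapsed = time.perf_counter() - started
    return {
        "probe": name,
        "ok": succeeded and elapsed <= max_seconds,
        "seconds": elapsed,
    }


def fake_login() -> bool:
    return True  # stand-in for a scripted login that works


def broken_checkout() -> bool:
    raise ConnectionError("simulated outage")


login_result = run_probe("login", fake_login, max_seconds=1.0)
checkout_result = run_probe("checkout", broken_checkout, max_seconds=1.0)
print(login_result["ok"], checkout_result["ok"])
```

The results of such probes are also a natural data source for a status page.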
Defend (1)
- Runtime Application Self Protection (RASP) tools monitor for and block security threats immediately as they occur.
- Web Application Firewall (WAF) examines traffic to block malicious traffic.
- Threat detection systems detect, report, and provide response capabilities for threats like DoS (denial-of-service) attacks.
- User and Entity Behavior Analytics (UEBA) uses machine learning techniques to look for abnormal user behavior and alert security teams to potential issues.
Defend (2)
- Vulnerability Management helps identify, record, manage, and mitigate vulnerabilities in assets and applications automatically.
- Data Loss Prevention (DLP) tools prevent data from being shared beyond authorized channels within the environment.
- Data is protected from unauthorized access from outside organizations by securing storage systems and data storage ecosystems.
- Container network security protects connections between containers to stop unintentional data movement or traffic.
Case Story: Netflix
- Simian Army is a suite of tools developed by Netflix to test the reliability, security, or resiliency of their infrastructure.
- Results of these tests can minimize significant negative impact on the consumer experience and staff response.
Chaos Engineering Next Steps
- Segregate the system so that individual components can be tested with key components removed; this is one of the first steps.
- Test without key components in non-production environments first to minimize impact on real customers.
- Introduce failures at a component level in production as a test-and-learn strategy.
- Simulate database failure in production to discover how the service handles it.
- Simulate total system failure.
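Component-level failure injection can be sketched as a wrapper that makes a dependency fail on purpose, so the surrounding code's fallback path gets exercised. The names below (`with_chaos`, `InjectedFailure`, `read_with_fallback`) are hypothetical and not taken from Chaos Monkey or any other tool.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFailure(RuntimeError):
    """Raised when the chaos wrapper deliberately fails a call."""


def with_chaos(
    call: Callable[[], T], failure_rate: float, rng: random.Random
) -> Callable[[], T]:
    """Wrap call so that it fails with the given probability."""
    def wrapped() -> T:
        if rng.random() < failure_rate:
            raise InjectedFailure("injected by chaos experiment")
        return call()
    return wrapped


def read_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Use the primary source, falling back when it has been failed."""
    try:
        return primary()
    except InjectedFailure:
        return fallback()


def read_database() -> str:
    return "fresh value from database"  # stand-in for a real query


def read_cache() -> str:
    return "stale value from cache"  # stand-in for a degraded mode


rng = random.Random(42)
always_failing_db = with_chaos(read_database, failure_rate=1.0, rng=rng)
print(read_with_fallback(always_failing_db, read_cache))
```

Starting with a low `failure_rate` in a non-production environment and raising it gradually mirrors the progression in the steps above.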
Chaos Engineering Next Steps (2)
- Examine holistic logging to find the root cause of outages.
- Identifying and improving dependencies in services.
- Automate the process of error handling.
- Learning from "real" failures is an essential part of the improvement process.
Chaos Engineering Next Steps (3)
- Chaos engineering helps contain the effects of issues to smaller portions of services and deployments.
Chaos Engineering Next Steps (4)
- Automating recovery processes
- Logging
- Monitoring
- Alerting systems
- Security
Typical Responsibilities
- Engineering across the full lifecycle of a service or product.
- Providing consultation regarding service design, development, and platform creation.
- Maintaining services through ongoing monitoring.
- Scaling systems through automation.
- Providing incident response and performing blameless post-mortems.
Incident Response: On Call
- Providing support and handling issues for services during working and non-working hours is an important aspect of on-call support.
- Support involves handling issues such as outages, incidents, and similar issues, and requires availability to address important problems.
On Call by Numbers (1)
- Google advocates limiting on-call responsibility to a maximum of 25% of an SRE's time and staffing rotations so that there are no single points of failure.
- Providing 24/7 support requires several SREs to cover various time zones and ensure service availability.
- Spreading responsibility for on-call across several sites or locations can be useful against issues from under-utilization, or over-utilization of on-call staff.
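The 25% figure translates directly into a minimum team size. The arithmetic below is a sketch under stated assumptions: every on-call role is staffed around the clock, the work is shared evenly, and the number of roles (for example a primary and a secondary) is an input chosen for illustration rather than something these notes specify.

```python
import math


def min_engineers(roles: int, max_percent: int) -> int:
    """Smallest team in which nobody is on call more than max_percent.

    Assumes each role is always filled and the load is shared evenly.
    """
    return math.ceil(roles * 100 / max_percent)


# One always-staffed role with a 25% cap.
print(min_engineers(roles=1, max_percent=25))
# Two always-staffed roles (assumed: primary and secondary).
print(min_engineers(roles=2, max_percent=25))
```

Splitting the rotation across sites in different time zones keeps the same arithmetic but lets each site cover only its daytime hours.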
Checklist for Effective On Call
- Clearly assigning responsibilities related to "on-call", or "on-call time".
- Appropriate monitoring of devices/tools to allow teams to identify, isolate, and fix events or issues.
- Ensuring well-documented procedures for handling incidents, and providing context to staff to enable efficient resolution without repeat issues.
- Creating a safe environment where employees feel comfortable documenting issues without fear of being blamed.
Using Automation to Replace On-Call
- Automated self-healing processes replace the need for manual intervention by operators when problems occur, reducing downtime.
- Automation tools like Kubernetes and AWS auto-scaling groups automate the process and increase service reliability and decrease the need for on-call support.
- Processes for self-healing capabilities can be reviewed to improve incident response and safety.
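A self-healing process boils down to a loop: check health, attempt an automated fix, and only involve a human if the fix does not work. The sketch below models that loop; `self_heal` and `FlakyService` are hypothetical names, and the restart limit is an arbitrary assumption.

```python
from typing import Callable


def self_heal(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int = 3,
) -> bool:
    """Try to recover automatically.

    Returns True if the service ends up healthy, or False if the
    problem should be escalated to the on-call engineer.
    """
    for _ in range(max_restarts):
        if is_healthy():
            return True
        restart()
    return is_healthy()


class FlakyService:
    """Stand-in service that becomes healthy after enough restarts."""

    def __init__(self, restarts_needed: int) -> None:
        self.restarts_needed = restarts_needed
        self.restarts = 0

    def is_healthy(self) -> bool:
        return self.restarts >= self.restarts_needed

    def restart(self) -> None:
        self.restarts += 1


recoverable = FlakyService(restarts_needed=2)
stubborn = FlakyService(restarts_needed=5)
print(self_heal(recoverable.is_healthy, recoverable.restart))
print(self_heal(stubborn.is_healthy, stubborn.restart))
```

Logging every automated restart keeps the self-healing behavior reviewable, in line with the last point above.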
Reasons for a Blameless Post Mortem
- User visible downtime or service degradation beyond a specified service level objective (SLO)
- Data Loss
- On-call engineer intervention for resolving the issue.
- Resolution times that exceed a set threshold.
- Monitoring failures (requiring manual intervention).
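Because these criteria are agreed in advance, they can be encoded so that the decision to hold a post mortem is automatic rather than debated. A minimal sketch follows; the `Incident` fields and the 60-minute default threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    slo_breached: bool
    data_lost: bool
    on_call_intervened: bool
    minutes_to_resolve: int
    monitoring_failed: bool


def post_mortem_reasons(incident: Incident, max_minutes: int = 60) -> List[str]:
    """List every pre-set criterion the incident meets."""
    reasons: List[str] = []
    if incident.slo_breached:
        reasons.append("downtime or degradation beyond SLO")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.on_call_intervened:
        reasons.append("on-call engineer intervention")
    if incident.minutes_to_resolve > max_minutes:
        reasons.append("resolution time above threshold")
    if incident.monitoring_failed:
        reasons.append("monitoring failure")
    return reasons


# Hypothetical incident for illustration.
outage = Incident(
    slo_breached=True,
    data_lost=False,
    on_call_intervened=True,
    minutes_to_resolve=95,
    monitoring_failed=False,
)
print(post_mortem_reasons(outage))
```

Any non-empty result means a post mortem should be scheduled.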
Blameless Post Mortems
- Detailed accounts of actions taken during incidents should be captured for improving the process.
- Identifying steps contributing to the issue and causes should be captured.
- Expectations and assumptions should also be documented to help team members better anticipate and avoid future problems.
Culture and the Flow of Information
- Information flows vary in pathological, bureaucratic, and generative organizations.
- Pathological cultures hide information and discourage communication; bureaucratic cultures may ignore or compartmentalize information; generative cultures actively seek information and reward communication.
Creating a Safe Environment
- Engineers should have the authority to explain their work to address contributing factors to issues and failures.
- Individuals accountable for mistakes are also good at teaching the rest of the team members what to do in similar situations.
Changing Failure Reactions
- Organizational culture influences how incidents are handled and addressed.
- An organizational culture where people feel safe to discuss their mistakes and learn from them is useful for fostering a positive and effective working environment.
Post Mortem Outputs
- Post mortems should outline details of incidents or failures, summary, impact, triggers, detection, resolution, and participants.
- Lists of follow-up actions are helpful in avoiding future similar incidents.
- Lessons learned are important to prevent future issues.
- Timelines of events from the incident are valuable in helping the team understand what happened.
- Supporting data or documentation should always be included when recording lessons learned and similar activities.
Case Story: Sage Group
- They use an automated incident reporting process to improve handling of the incidents.
- Relevant information about the issue is provided in a shared channel for timely communication and coordination.
SRE - A Reminder
- SRE is a system of activities that handle the proper implementation of systems and applications.
- The need for an organization that addresses the operation of large-scale services is covered by the principles of SRE.
- Understanding the implications of issues and making sure there is an adequate response procedure is a core methodology of SRE.
Key Success Factors for SRE Adoption
- Executive support is critical for a company to implement SRE initiatives.
- Strong relationships between engineers who work within the same organization or on similar types of tasks.
- An organization that is constantly growing can benefit greatly from the implementation and use of SRE.
An organization that is growing...
- Platform growth (large volumes of users, irregular data flows)
- Scope growth (introduction of new products or services)
- Ticket growth (higher volume of incidents, requests, and toil)
Use Engineering Approaches to Scale
- By incorporating engineering approaches, the rate of service operations can be increased and made more scalable.
SRE Approaches to Platform Growth
- Techniques for handling and managing growing platforms
- Auto-scaling
- Containerization
- Clustering
- Public/private cloud
- NoSQL, MongoDB, and similar "as-a-service" solutions
- Tools, Automation, and similar techniques will help handle increased systems needs.
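Auto-scaling usually rests on a simple proportional rule: scale the replica count by the ratio of observed load to target load. The sketch below follows the formula documented for the Kubernetes Horizontal Pod Autoscaler; the function name, the bounds, and the example numbers are assumptions for illustration.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 20,
) -> int:
    """Proportional scaling rule, clamped to the allowed range."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% CPU with a 60% target: scale out.
print(desired_replicas(4, current_metric=90, target_metric=60))
# 3 replicas at 50% CPU with a 60% target: no scale-in yet.
print(desired_replicas(3, current_metric=50, target_metric=60))
```

The clamp to `max_replicas` matters for cost control, and the one to `min_replicas` keeps some capacity available when load drops to zero.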
SRE Approaches to Scope Growth
- Shared responsibility for tools and platforms.
- Expanding reach of SRE expertise into development teams.
- Efficient toil automation can allow SRE teams to handle growth demands.
SRE Approaches to Ticket Growth
- Toil reduction is encouraged using automated responses for tickets and self-service features.
- Prioritization of toil reduction efforts.
- DRY ("Don't Repeat Yourself") - avoid the repetition of effort/problem resolution.
Module 7: Exercise
- Organizations should develop a plan for organizational implementation to adopt SRE.
Case Story: Department for Work & Pensions
- The organization is focused on implementing processes that help eliminate the recurrence of issues and failures and automate resolutions.
- An SRE initiative can help improve the stability, reliability, and performance of services across an organization.
Module Seven Quiz
- Questions cover the benefits, processes, and responsibilities of SREs and organizations.
Module Eight Quiz
- Questions cover SRE trends, frameworks, and other related topics.
Summary
- Site Reliability Engineering (SRE) is a set of activities for developing and operating systems, addressing issues that affect the quality and quantity of service an organization can provide to its users.
- SRE principles such as automation, monitoring, and engineering help provide reliable operations for applications and services.