Summary

This document covers the main aspects of monitoring, including application performance management, infrastructure monitoring, and the challenges created by distributed architectures. Key concepts such as latency, traffic, errors, and saturation are presented, along with a discussion of monitoring tools and the challenges posed by microservices, containers, and polyglot technologies.

Full Transcript


Application/Infrastructure Monitoring

Metrics represent the raw measurements of resource usage or behavior that can be observed and collected throughout your systems. Monitoring is the process of collecting, aggregating, and analyzing those values to improve awareness of your components' characteristics and behavior. Alerting is the responsive component of a monitoring system that performs actions based on changes in metric values.

Application Performance Monitoring/Management crosses many disciplines and domains: software development, applications in production, desktops, mainframes, web, mobile, cloud services, virtualization, application testing, network, databases, and storage. When applications are virtualized and collapsed inside a single piece of hardware, the "one-to-one" relationship between applications and hardware becomes a "many-to-one" relationship, and legacy monitoring solutions lose their analytical capabilities.

Application Performance Management (APM)

Key challenges include:
- Classic APM technologies do not translate to the cloud.
- Limited visibility into transactions, especially between VMs on the same hypervisor host.
- Limited visibility into the physical-to-virtual relationship between hardware and applications.
- Difficulty understanding the performance impact of Virtual Machine Managers (VMMs).
- Mobility
- BYOD

What type of data should you monitor/collect?

- Host-based metrics: CPU, memory, disk space
- Application metrics:
  - Error and success rates
  - Service failures and restarts
  - Performance and latency of responses
  - Resource usage
- Network and connectivity metrics:
  - Connectivity
  - Error rates and packet loss
  - Latency
  - Bandwidth utilization (saturation: % of capacity)

The Golden Signals of Monitoring

In the highly influential Google SRE Book, the chapter on monitoring distributed systems introduces a useful framework called the four golden signals of monitoring, which represent the most important factors to measure in a user-facing system (see the instrumentation sketch at the end of this section):
- Latency
- Traffic
- Errors
- Saturation

https://sre.google/sre-book/monitoring-distributed-systems/

What Challenges Do Highly Distributed Architectures Create?

- Work is decoupled from underlying resources
- Increase in network-based communication
- Functionality and responsibility partitioned to a greater degree
- Short-lived and ephemeral units

https://www.digitalocean.com/community/tutorials/monitoring-for-distributed-and-microservices-deployments

Monitoring and Microservices: Granularity and locality

It is an architectural challenge to define the scope and boundaries of individual microservices within an application. Monitoring can help to identify tightly coupled or chatty services by looking at the frequency of service calls based on user or system behavior. Architects might then decide to combine two services into one or use platform mechanisms to guarantee colocation (for example, Kubernetes pods).

Monitoring and Microservices: Impacts of remote function calls

In-memory function calls in monoliths turn into remote service calls in the cloud, where the payload needs to include actual data rather than only in-memory object references. This raises a number of issues that depend heavily on the locality and chattiness of a service: how large is the payload compared to the in-memory reference? How many resources are used for marshalling/unmarshalling and for encoding? How should large responses be handled? Should there be caching or pagination, and is this done locally or remotely? Monitoring provides valuable empirical baselines and trend data on all these questions.
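As a concrete illustration of the four golden signals listed above, the following is a minimal sketch using the Python prometheus_client library for a hypothetical HTTP service. The metric names, the endpoint label, and the simulated failure rate are illustrative assumptions, not part of the original material.

```python
# A minimal sketch (assumed setup): instrumenting the four golden signals for a
# hypothetical HTTP service with prometheus_client, so a Prometheus server can
# scrape them from /metrics.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Traffic: how much demand is placed on the service.
REQUESTS = Counter("app_requests_total", "Total requests received", ["endpoint"])
# Errors: how many of those requests fail.
ERRORS = Counter("app_request_errors_total", "Requests that failed", ["endpoint"])
# Latency: how long it takes to service a request.
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds", ["endpoint"])
# Saturation: how "full" the service is (approximated here by in-flight requests).
IN_FLIGHT = Gauge("app_in_flight_requests", "Requests currently being processed")


def handle_request(endpoint: str) -> None:
    """Simulate handling one request while recording the golden signals."""
    REQUESTS.labels(endpoint=endpoint).inc()
    IN_FLIGHT.inc()
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.2))  # simulated work
        if random.random() < 0.05:             # simulated 5% failure rate
            raise RuntimeError("backend unavailable")
    except RuntimeError:
        ERRORS.labels(endpoint=endpoint).inc()
    finally:
        LATENCY.labels(endpoint=endpoint).observe(time.time() - start)
        IN_FLIGHT.dec()


if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at http://localhost:8000/metrics
    while True:
        handle_request("/checkout")
```

Saturation is the most service-specific of the four signals; in-flight requests is only one possible proxy, and gauges for queue depth or thread-pool utilization are equally valid choices.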
Monitoring and Microservices: Network monitoring

Network monitoring moves into the limelight because calls between microservices usually traverse the network. Software-Defined Networks (SDNs) and overlay networks become more important with PaaS and dynamic deployments. Although maintenance and administration requirements for the underlying physical network components (cables, routers, and so on) decline, virtual networks need more attention because they come with network and computing overhead.

Monitoring and Microservices: Polyglot technologies

Monitoring solutions need to be able to cover polyglot technologies as microservice developers move away from a one-language-fits-all approach. They need to trace transactions across different technologies such as a mobile front end, a Node.js API gateway, a Java or .NET backend, and a MongoDB database.

Monitoring and Microservices: Container monitoring

With the move toward container technologies, monitoring needs to become "container-aware"; that is, it needs to cover and monitor containers and the services inside them automatically. Because containers are started dynamically (for example, by orchestration tools like Kubernetes), static configuration of monitoring agents is no longer feasible.

Monitoring and Microservices: Platform-aware monitoring

Monitoring solutions need the capability to distinguish between the performance of the application itself and the performance of the dynamic infrastructure. For example, microservice calls over the network have latency, but the control plane (e.g., the Kubernetes master) also uses the network. This could be dismissed as background noise, but it is there and can have an impact. In general, cloud platform technologies are still in their infancy, and emerging technologies need to be monitored very closely because they have the potential for catastrophic failures.

Popular Monitoring Tools

- Ganglia
- Nagios
- Prometheus
- Graphite
- AWS CloudWatch
- Grafana (popular visualization tool)
- Logstash, Fluentd, Splunk (log collectors)

ELK Stack (ELK): Elasticsearch, Logstash, and Kibana

- Elasticsearch is an open-source, full-text search and analysis engine based on the Apache Lucene search engine.
- Logstash is a log aggregator that collects data from various input sources, executes different transformations and enhancements, and then ships the data to various supported output destinations.
- Kibana is a visualization layer that works on top of Elasticsearch, providing users with the ability to analyze and visualize the data.

A sketch of emitting logs in a structure these tools can ingest appears at the end of this section.

https://aws.amazon.com/what-is/elk-stack/

Observability

Observability is a term derived from control theory: it is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Service infrastructures used on a daily basis are becoming more and more complex, and proactive monitoring alone is not sufficient to quickly resolve issues causing application failures. With monitoring, you can keep known past failures from recurring, but with a complex service architecture, many unknown factors can cause potential problems. To address such cases, you can make the service observable. An observable system provides highly granular insights into its implicit failure modes. In addition, an observable system furnishes ample context about its inner workings, which unlocks the ability to uncover deeper systemic issues. Monitoring enables failure detection; observability helps in gaining a better understanding of the system.

https://linkedin.github.io/school-of-sre/level101/metrics_and_monitoring/observability/
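To make log lines easy for collectors such as Logstash or Fluentd to ship into Elasticsearch, applications often emit structured JSON rather than free text. The sketch below assumes the python-json-logger package; the service name and the fields are hypothetical, not a required schema.

```python
# A minimal sketch (assumed setup): emitting structured JSON log records that a
# collector such as Logstash or Fluentd could ship to Elasticsearch and explore
# in Kibana. Field names and values are purely illustrative.
import logging

from pythonjsonlogger import jsonlogger  # provided by the python-json-logger package

handler = logging.StreamHandler()
handler.setFormatter(jsonlogger.JsonFormatter("%(asctime)s %(name)s %(levelname)s %(message)s"))

logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Extra key-value pairs become top-level JSON fields, which Elasticsearch can index.
logger.info("payment processed", extra={"order_id": "A-1042", "latency_ms": 87})
logger.error("payment failed", extra={"order_id": "A-1043", "latency_ms": 3012, "reason": "timeout"})
```

Structured fields such as order_id can then be indexed and aggregated in Kibana instead of being parsed out of free-text messages.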
Observability: The three pillars

The three core pillars of observability are traces, metrics, and logs.

Tracing: A distributed trace is a record of a single service call corresponding to a request from an individual user. The trace starts with an initial span, called a "parent span." The request also triggers downstream subcalls to other services, generating a tree structure of multiple "child" spans (see the sketch at the end of this transcript). One of the top challenges in tracing is data sampling: you cannot collect a trace of every single transaction, because your application can receive millions of requests and this would result in too much data.

https://www.redhat.com/en/blog/observability-introduction

Metrics are the measurement of a specific activity over a specified interval of time. They are used to gain insight into the performance of a system or application. A metric typically consists of a timestamp, a name, a value, and dimensions; the dimensions are a set of key-value pairs that describe additional metadata about the metric.

Logs are records of events that happen at discrete points in time on specific systems. Logs can monitor system and application health and provide a historical record to support troubleshooting, such as during a system outage or application downtime.

https://www.redhat.com/en/blog/observability-introduction

Prometheus Architecture

Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate and record new time series from existing data or generate alerts. Grafana or other API consumers can be used to visualize the collected data.

https://opensource.com/article/18/9/prometheus-operational-advantage
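As an illustration of the parent/child span structure described under Tracing above, here is a minimal sketch assuming the opentelemetry-sdk Python package and a console exporter; the service and span names are hypothetical.

```python
# A minimal sketch (assumed setup): creating a parent span with two child spans
# using the OpenTelemetry Python SDK and printing them to the console. In a real
# deployment the exporter would send spans to a tracing backend, and a sampling
# strategy would limit how many requests are traced.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

# The parent span represents the user-facing request; the nested spans model
# downstream subcalls, forming the tree structure described above.
with tracer.start_as_current_span("POST /checkout"):
    with tracer.start_as_current_span("inventory-service: reserve items"):
        pass  # remote call would happen here
    with tracer.start_as_current_span("payment-service: charge card"):
        pass  # remote call would happen here
```

In production, a sampling strategy (head-based or tail-based) would decide which of the millions of requests actually produce exported spans, which addresses the data-volume challenge mentioned above.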

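For reference, the metric shape described above (timestamp, name, value, dimensions) can be pictured as a small record. This is an illustrative data structure only, not the wire format of any particular monitoring system.

```python
# A minimal sketch (illustrative only) of a metric sample: a name, a value, a
# timestamp, and a set of key-value dimensions (often called labels or tags).
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Metric:
    name: str                                             # e.g. "app_request_latency_seconds"
    value: float                                          # the measured sample
    timestamp: float = field(default_factory=time.time)   # when it was observed
    dimensions: Dict[str, str] = field(default_factory=dict)  # metadata key-value pairs


sample = Metric(
    name="app_request_latency_seconds",
    value=0.087,
    dimensions={"endpoint": "/checkout", "region": "eu-west-1", "status": "200"},
)
print(sample)
```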