Cloud Monitoring
Michael Gerndt, Technische Universität München

Why Monitoring?
Make the best use of your rented resources to reduce your costs and increase the satisfaction of the users of your services.
https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e

"An observable system is one that exposes enough data about itself so that generating information (finding answers to questions yet to be formulated) and easily accessing this information becomes simple."
https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e

Monitoring
Definition: Monitoring. Monitoring in the cloud is the process of collecting status information about applications and resources. The data can be used to observe the application and the infrastructure.
Definition: Monitoring System. The monitoring system consists of all components for gathering monitoring data at runtime.
Definition: Monitoring Data. All (raw) data captured by the monitoring system.

Information
Definition: Information is gained by processing, interpreting, organizing and visualizing raw data. It increases knowledge about the observed system.
Example: the raw data are CPU and memory utilization; the information is that there is a trend towards an overload or a memory leak.
The required information is not always clear in the cloud, so collect any data available. Information is produced in two ways:
- Proactively: continuous analysis for triggering alarms or for giving an overview of the status of the system.
- Reactively: triggered through events such as an incident, e.g. root cause analysis and autoscaling.
Information is produced by observability frontends.

Purposes
Infrastructure level:
- Resource management
- Incident detection
- Root cause analysis
- Accounting or metering for payment
- Intrusion detection
- Auditing
Application level:
- Performance analysis
- Resource management, e.g. scaling decisions
- Failure detection and resolution
- SLA verification
- Auditing

Three Pillars of Monitoring
- Metrics
- Logs
- Traces

Monitoring Metrics
Metric: e.g.
execution time.
A metric has:
- Semantics
- A unit
- A context: server, application service, ...
- A representation
- An aggregation: sum, min, max, mean, percentiles, histogram
- A measurement frequency: every second, minute, 5 minutes

Important Metrics
- Latency: the time it takes to service a request. Measure successful and error requests separately.
- Throughput or traffic: for a web service, requests/second; for a streaming system, the network I/O rate or concurrent sessions; for a database, transactions/second or retrievals per second.
- Error rate: the rate of requests that fail, either explicitly (HTTP 500), implicitly (wrong reply contents), or by violating an SLA.
- Utilization or saturation: the percentage of capacity of CPU, memory, I/O.

Monitoring and Cloud Layers
Typical contexts, metrics and purposes per layer (aggregation: none, min, max, mean, percentiles):
- Client (requests): context: request type; metrics: #requests, latency, availability; purposes: SLA check, alerting.
- Application (microservices): context: service name, service id; metrics: #requests, request rate, latency, #replicas, CPU time, memory usage; purposes: autoscaling, performance tuning.
- Platform (Kubernetes, VM cluster, Docker): context: container id; metrics: CPU & memory quota, CPU & memory utilization, incoming & outgoing bytes; purposes: container distribution, autoscaling.
- Infrastructure (VMs, volumes, queueing services): context: VM id, volume id, service name; metrics: CPU & memory, #read/write, I/O latency, #requests and size of requests of the infrastructure service; purpose: root cause analysis.
- Hardware (servers, network, SAN, disks): context: server id, switch id; metrics: disk utilization, traffic; purpose: management of VMs.

Monitoring System Requirements
- Comprehensiveness
- Low intrusion
- Extensibility
- Scalability
- Elasticity
- Accuracy
- Resilience

Blackbox and Whitebox Monitoring
Blackbox monitoring: the monitored system is handled as a black box; no data are gained from the inside of the system. E.g. only the request interface of a service is visible, nothing about the internal structure.
Whitebox monitoring: data also come from the inside of the system, which gives more context and more detailed insights. E.g.
Internal organization of a service is visible, e.g. asynchronous internal handling of requests, load balancing, backend services.

Overheads
Overheads lead to intrusion. There are many sources of overhead:
- Instrumentation
- Computation for aggregations
- Memory overhead for buffering
- Time to push to disk or to transfer to a collector
- Storage overhead for long-term storage
Reduction techniques: reduce the number of metrics, the measurement frequency, or the representation; batching; sampling; long-term coarsening.

Amazon CloudWatch
A monitoring and management service. It collects:
- Metrics
- Logs: CloudWatch Logs Insights
Accessible via the CloudWatch Management Console.

Amazon CloudWatch Metrics
A preselected set of metrics per service, e.g.:
- CPU utilization, data transfer and disk usage for EC2 instances
- Read/write latency for EBS volumes
- Request counts and latency for load balancers
- Number of messages sent for Simple Queue Service (SQS) queues
- Freely available memory and storage for RDS DB instances
Plus custom application and system metrics.

Amazon CloudWatch
Data are provided online. Frequency: every second to every 5 minutes, depending on the data and your account.
- Data with less than 1 minute granularity are stored for three hours.
- Data with 1 minute granularity are stored for two weeks.
- Data with 1 hour granularity are stored for 15 months.
Actions: view graphs and statistics, set alarms.
Access:
- Management Console web interface
- Command line interface
- Libraries for Java, scripting languages, Windows .NET
- Web service API

Amazon CloudWatch: view a graph and create an alarm.

Amazon CloudWatch Command Line Interface
Provides commands to manage monitoring, e.g. list metrics, get statistics, define user metrics, add data for user metrics.
Usage: install the command line interface, provide your Amazon credentials, and use the provided commands.

Amazon CloudWatch Example: User-Defined Metrics
Define a user metric and specify data points with mon-put-data:

  mon-put-data -m RequestLatency -n "GetStarted" -t 2010-10-29T20:30:00Z -v 87 -u Milliseconds

You can also specify multiple
aggregated data points in the form of sum, minimum, maximum, SampleCount.
Get statistics with mon-get-stats:

  mon-get-stats -n GetStarted -m RequestLatency -s "Average" --start-time 2010-10-29T00:00:00Z --headers

  Time                 Average  Unit
  2010-10-29 20:30:00  24.5     Milliseconds
  2010-10-29 21:30:00  15.4     Milliseconds
  2010-10-29 22:17:00  134.333  Milliseconds

Amazon CloudWatch Pricing
Basic monitoring is free: standard metrics every 5 minutes. Charged are detailed monitoring of EC2 instances every minute, each alarm, each custom metric, and CloudWatch API calls.
Free tier (2022):
- Metrics: basic monitoring metrics (at 5-minute frequency), 10 detailed monitoring metrics (at 1-minute frequency), 1 million API requests (not applicable to GetMetricData and GetMetricWidgetImage)
- Dashboards: 3 dashboards for up to 50 metrics per month
- Alarms: 10 alarm metrics (not applicable to high-resolution alarms)
- Logs: 5 GB of data (ingestion, archive storage, and data scanned by Logs Insights queries)
- Events: all events except custom events are included

Prometheus for Metrics
Prometheus is an open-source monitoring system (https://prometheus.io), initially built by soundcloud.com and now a Cloud Native Computing Foundation project.
Features:
- Metric collection in the form of time series
- Storage in a time series database
- A query language for accessing the time series
- Alerting
- Visualization

Cloud Native Computing Foundation
- Promotes the concept of Cloud Native Computing and pushes for a sustainable ecosystem around it.
- Hosts several fast-growing open source projects, including Kubernetes, Prometheus and Envoy.
- Projects are Sandbox, Incubating or Graduated, depending on their adoption and stability.
- Runs KubeCon + CloudNativeCon (2024: Paris).
CNCF.io

Prometheus Architecture

Prometheus Scraping
Metrics are retrieved through the /metrics endpoint.
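For illustration (not shown on the slides): a scrape of the /metrics endpoint returns plain text in the Prometheus exposition format, with optional HELP/TYPE comments followed by samples; the help text below is an invented example around the metric used later in this section:

```
# HELP api_http_requests_total Total number of HTTP requests handled.
# TYPE api_http_requests_total counter
api_http_requests_total{method="POST",handler="/messages"} 34
```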
Metric types:
- Counter: a cumulative metric, monotonically increasing
- Gauge: a numerical value that can arbitrarily go up and down
- Histogram: counts for buckets, the total sum, the number of events
A metric has a name and labels giving the context. The metric name and labels define a time series, e.g. api_http_requests_total{method="POST", handler="/messages"}. Samples are a float64 value plus a millisecond timestamp.

Prometheus Instrumentation
Client libraries exist for many programming languages:

  const client = require('prom-client');

  const counter = new client.Counter({
    name: 'metric_name',
    help: 'metric_help',
  });
  counter.inc();   // Increment by 1
  counter.inc(10); // Increment by 10

  // Set up a server endpoint for Prometheus to scrape:
  server.get('/metrics', async (req, res) => {
    try {
      res.set('Content-Type', client.register.contentType);
      res.end(await client.register.metrics());
    } catch (ex) {
      res.status(500).end(ex);
    }
  });

Prometheus Exporters
Exporters provide metrics for services that cannot be instrumented (https://prometheus.io/docs/instrumenting/exporters/). For example, the PostgreSQL exporter is an application running in parallel to the database; it connects to the database and publishes standard metrics:

  sudo -u postgres DATA_SOURCE_NAME="port=5432" postgres_exporter

Prometheus Alerting
Alerting is separated into the Prometheus server and the Alertmanager. Rules determine an alert:

  groups:
  - name: example
    rules:
    - alert: HighRequestLatency
      expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
      for: 10m

The Alertmanager manages those alerts, including silencing, inhibition, grouping and sending out notifications via methods such as email, on-call notification systems, and chat platforms.

Prometheus Scalability
Hierarchical federation of servers.
https://wikitech.wikimedia.org/wiki/Prometheus

Visualization with Grafana
An open-source platform for visualization; Prometheus is one of the possible data sources. Connect through IP address and port, then create your own dashboard.
https://grafana.com/grafana/dashboards/10578

Logs
Definition: Log. A log is a sequence of immutable records of discrete events, generated by applications, the system level, the infrastructure, any devices, ...
Event logs come in two forms:
- Plaintext: the most common format of logs
- Structured: much evangelized, typically JSON
There is typically a huge amount of log data. Logging can be configured with levels, which allows drilling down, but logs are difficult to analyze.
Representation:
- ASCII: easily readable, but inefficient with respect to space and time
- Binary: more efficient; an example is protobuf from Google

Protobuf
What is it?
- A binary format for logs: a language-neutral, platform-neutral, extensible mechanism for serializing structured data.
- Backward-compatible: old trace formats can be processed with new tools.
- Forward-compatible: new trace formats can be processed with old tools.
Who owns it? From 2001 to 2008 it was internally developed and used by Google; since 2008 it is open source and maintained by Google.
Workflow: create a .proto file to define the data structure; generate code using the protoc compiler; compile the generated code (.java, .py, .cc or others) with your project; use the generated classes to serialize, share, and deserialize data.

ELK Stack for Log Processing
- ELK stack: Elasticsearch, Logstash, Kibana
- Open source and free
- Suited for logs and metrics
- Elastic Stack: ELK + Beats and X-Pack

Elasticsearch
A distributed data store, search and analytics engine, based on Apache Lucene, a text search engine.
- A NoSQL database, organized in indexes.
- Each index is a collection of documents, i.e. JSON objects.
- Schema-free: the schema of an index is automatically extended as new documents are added.
Distributed:
- Scales horizontally by using multiple servers (nodes) for availability and fault tolerance.
- All nodes form a cluster.
- Indexes are stored in shards (primary and replicas) on the nodes.
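The search side of Elasticsearch rests on Lucene's inverted index. As a toy illustration (my own sketch, not Elasticsearch code; the documents are invented), an inverted index maps each term to the ids of the JSON documents containing it:

```python
# Toy inverted index: term -> set of ids of documents containing the term.
from collections import defaultdict

index = defaultdict(set)
docs = {
    1: {"msg": "disk latency high on vm-7"},
    2: {"msg": "request latency normal"},
}
for doc_id, doc in docs.items():   # index every JSON document, schema-free
    for term in doc["msg"].split():
        index[term].add(doc_id)

assert index["latency"] == {1, 2}  # a term query hits both documents
assert index["disk"] == {1}
```

Real Elasticsearch additionally tokenizes and normalizes text through analyzers and scores hits for relevance, but the lookup structure is the same idea.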
Access:
- RESTful API
- An aggregator framework performing complex data analysis and summarization

Logstash
Supports parsing for predefined patterns (grok patterns), transforms and filters:
- Derive structure
- Anonymize personal data
- Geo-location lookups
Filter plugins: https://www.elastic.co/guide/en/logstash/current/filter-plugins.html

Kibana

X-Pack
Extensions to ELK:
- Authentication and authorization
- Monitoring of ELK
- Alerting
- Report generation of Kibana contents
- Machine learning, e.g. anomaly detection and forecasting
- An SQL interface for Elasticsearch

Beats
Agents to collect data:
- Filebeat for logs
- Metricbeat for metrics
- Heartbeat for health
The Elastic Stack also provides integrations for data ingestion from major systems.

Application of Log Processing
Logging has other uses besides debugging and performance monitoring. Properly structured logs help you to perform:
- Incident root cause analysis
- Anomaly detection
- Fault prediction and predictive maintenance
- Detection of and response to data breaches and other security incidents
- Compliance with security policies, regulations and audits
Best practices for log analysis: pattern detection and recognition, log normalization, classification and tagging, correlation analysis, artificial ignorance.

AWS CloudWatch Logs
Allows storing and analyzing logs from Amazon services and from your application services.
- Log groups cluster several log streams from multiple sources.
- Metrics can be extracted from the log statements automatically and inserted into CloudWatch metrics.
- The retention period can be controlled; by default, logs are never deleted.
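The idea of deriving metrics from structured log statements can be shown with a small, self-contained Python sketch (the log records and field names are invented; this is not CloudWatch's metric-filter syntax):

```python
import json
import statistics

# Structured (JSON) log lines, as an application might emit them.
log_lines = [
    '{"level": "INFO",  "msg": "handled request", "latency_ms": 120}',
    '{"level": "ERROR", "msg": "backend timeout", "latency_ms": 3000}',
    '{"level": "INFO",  "msg": "handled request", "latency_ms": 80}',
]

records = [json.loads(line) for line in log_lines]

# Derive metric values from the log stream:
error_count = sum(1 for r in records if r["level"] == "ERROR")
mean_latency = statistics.mean(r["latency_ms"] for r in records)

assert error_count == 1
assert round(mean_latency, 1) == 1066.7
```

CloudWatch does the equivalent with metric filters applied to a log group, publishing the extracted values as ordinary CloudWatch metrics.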
Analysis via CloudWatch Logs Insights:
- A special query language for analysis, with support for log entry fields from AWS service logs and from JSON-based application logs
- Support for visualization of query results
Real-time processing:
- Streaming to the Kinesis analytics service or to Lambda
- Streaming to your Amazon OpenSearch Service cluster, supporting OpenSearch and Elasticsearch

CloudWatch Logs Insights

Distributed Tracing: Google Dapper
Tracing:
- Captures the interaction of different services: the life of a request.
- Captures the individual events, e.g. submit a request, receive the request, start processing, ..., submit the answer, receive the answer.
- Associates events with a given request, to be able to analyze the execution of this request.
Google designed Dapper for:
- Continuous and ubiquitous tracing
- Low overhead
- Application transparency
- Scalability
Check the Dapper paper.

Tracing a Request
Events are annotated with a unique request id:
- The trace id is created at the frontend service.
- It has to be passed to subrequests.
- This requires manual or automatic instrumentation.
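How a trace id is created at the frontend and passed along to subrequests can be sketched in Python (the helper names and the X-Trace-Id header are illustrative assumptions, not Dapper's actual API; real systems propagate the id inside an instrumented RPC library):

```python
import contextvars
import uuid

# Current trace context, analogous to Dapper's thread-local storage.
_trace_id = contextvars.ContextVar("trace_id", default=None)

def start_request(headers):
    """Service entry point: join the incoming trace or create a new one."""
    tid = headers.get("X-Trace-Id") or uuid.uuid4().hex
    _trace_id.set(tid)
    return tid

def outgoing_headers():
    """Attach the trace id to every subrequest so spans can be correlated."""
    return {"X-Trace-Id": _trace_id.get()}

# A frontend receives an uninstrumented request and creates a trace id ...
root = start_request({})
# ... which is propagated to a backend subrequest:
child = start_request(outgoing_headers())  # backend joins the same trace
assert child == root
```

Using contextvars rather than plain thread-local storage also covers the asynchronous case the slides mention: the context travels with the callback.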
A request is represented as a Dapper trace tree:
- Nodes are called spans: the lifetime of a (sub)request.
- Edges indicate the temporal relationship.
Spans:
- A span represents an RPC.
- Attributes: a span id identifying the span, a parent id (the span id of the triggering span), and a trace id identifying the triggering request.
- The root span has no parent id.
- A span captures the events related to that RPC and can also include annotations, i.e. application-level events.
Be aware of clock skew, because events are created on different systems.

Handling the Trace Id
- The trace context is stored in thread-local storage.
- Asynchronous execution: callbacks store the trace context; when a callback is invoked, the trace context is copied to the executing thread.
- Interprocess communication: span and trace id are automatically transmitted.
Google's advantage: all applications use the same control flow and RPC library, so instrumentation is automatic.

Annotations
Annotations are added by the application owner as additional timestamped events:

  // C++:
  const string& request = ...;
  if (HitCache())
    TRACEPRINTF("cache hit for %s", request.c_str());
  else
    TRACEPRINTF("cache miss for %s", request.c_str());

Trace Collection
Trace collection is distributed and out-of-band.

Security and Privacy
- Dapper does not collect any payload data; payload can be added by the application developer via annotations.
- Dapper can be used to enforce security policies, e.g. the proper use of authentication or encryption, or policy-based isolation.
- Such runtime verification provides greater assurance than source code audits.

Managing Overheads
Overheads:
- Trace generation and collection overheads
- The amount of resources needed to store and analyze trace data
Coalescing events: multiple trace events are coalesced into one log file write operation.
Asynchronous writes: writes are asynchronous to the traced application.

Managing Overheads
Adaptive sampling at the application:
- Sampling: only a certain rate of requests per second is captured.
- High-throughput applications use a lower sampling rate than low-throughput apps.
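The application-side adaptive sampling described above can be sketched as follows (a simplified illustration, not Dapper's implementation): the sampler aims at a fixed budget of sampled traces per second, so a high-throughput service gets a proportionally lower sampling probability:

```python
import random

class AdaptiveSampler:
    """Keep roughly `target_per_sec` sampled traces regardless of load."""

    def __init__(self, target_per_sec):
        self.target = target_per_sec
        self.seen_last_sec = 0  # updated once per second in a real system

    def update_rate(self, requests_last_sec):
        self.seen_last_sec = requests_last_sec

    @property
    def probability(self):
        if self.seen_last_sec <= self.target:
            return 1.0
        return self.target / self.seen_last_sec

    def should_sample(self):
        return random.random() < self.probability

s = AdaptiveSampler(target_per_sec=10)
s.update_rate(1000)           # high-throughput service ...
assert s.probability == 0.01  # ... is sampled at 1% of requests
s.update_rate(5)              # low-throughput service keeps every trace
assert s.probability == 1.0
```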
Adaptive sampling at collection time:
- Used to reduce the overall size and to meet the Bigtable throughput limit.
- The trace id is used to decide whether data is collected; thus, although spans are distributed, entire traces are collected or discarded.

Dapper Tools
The Dapper Depot API provides access to traces via trace id, bulk access based on MapReduce, and indexed access.
Indexed access: the common request trace features, service name and host name, are combined into a single index. Storing an index requires significant storage; therefore only this single index is captured.

Dapper User Interface

Open Source Alternative
OpenTelemetry (OTel, opentelemetry.io):
- A merger of OpenTracing and OpenCensus
- A project of the Cloud Native Computing Foundation

Trace Visualization
- Jaeger (jaegertracing.io)
- Zipkin (zipkin.io)

Commercial Tracing Solutions
- AppDynamics (https://www.appdynamics.de), acquired by Cisco in 2017
- Datadog (www.datadoghq.com)
- Instana (www.instana.com), acquired by IBM in 2021

Summary
Cloud monitoring:
- Gathering data as the source for processing; processing creates information and insight.
- The basis for many purposes on the provider and the client side.
Three pillars: logs, metrics, traces.
Collection:
- Collect what is available; future usage is unpredictable.
- Be aware of the overheads and of the techniques for overhead reduction.
Additional challenge: creating information from raw data.