Monitor Virtual Machines with Azure Monitor: Collect Data PDF
Document Details
Tags
Summary
This document provides guidance on collecting data from virtual machines using Azure Monitor. It details data collection rules, cost considerations, and best practices. It's helpful for monitoring and managing Azure virtual machine environments.
Full Transcript
Monitor virtual machines with Azure Monitor: Collect data learn.microsoft.com/en-us/azure/azure-monitor/vm/monitor-virtual-machine-data-collection In this article This article is part of the guide Monitor virtual machines and their workloads in Azure Monitor. It describes how to configure collect...
Monitor virtual machines with Azure Monitor: Collect data learn.microsoft.com/en-us/azure/azure-monitor/vm/monitor-virtual-machine-data-collection In this article This article is part of the guide Monitor virtual machines and their workloads in Azure Monitor. It describes how to configure collection of data after you deploy Azure Monitor Agent to your Azure and hybrid virtual machines in Azure Monitor. This article provides guidance on collecting the most common types of telemetry from virtual machines. The exact configuration that you choose depends on the workloads that you run on your machines. Included in each section are sample log search alerts that you can use with that data. For more information about analyzing telemetry collected from your virtual machines, see Monitor virtual machines with Azure Monitor: Analyze monitoring data. For more information about using telemetry collected from your virtual machines to create alerts in Azure Monitor, see Monitor virtual machines with Azure Monitor: Alerts. Note This scenario describes how to implement complete monitoring of your Azure and hybrid virtual machine environment. To get started monitoring your first Azure virtual machine, see Monitor Azure virtual machines. Data collection rules Data collection from Azure Monitor Agent is defined by one or more data collection rules (DCRs) that are stored in your Azure subscription and associated with your virtual machines. For virtual machines, DCRs define data such as events and performance counters to collect and specify the Log Analytics workspaces where data should be sent. The DCR can also use transformations to filter out unwanted data and to add calculated columns. A single machine can be associated with multiple DCRs, and a single DCR can be associated with multiple machines. DCRs are delivered to any machines they're associated with where Azure Monitor Agent processes them. View data collection rules 1/14 You can view the DCRs in your Azure subscription from Data Collection Rules on the Monitor menu in the Azure portal. DCRs support other data collection scenarios in Azure Monitor, so all of your DCRs aren't necessarily for virtual machines. Create data collection rules There are multiple methods to create DCRs depending on the data collection scenario. In some cases, the Azure portal walks you through the configuration. Other scenarios require you to edit a DCR directly. When you configure VM insights, it creates a preconfigured DCR for you automatically. The following sections identify common data to collect and how to configure data collection. In some cases, you might need to edit an existing DCR to add functionality. For example, you might use the Azure portal to create a DCR that collects Windows or Syslog events. You then want to add a transformation to that DCR to filter out columns in the events that you don't want to collect. As your environment matures and grows in complexity, you should implement a strategy for organizing your DCRs to help their management. For guidance on different strategies, see Best practices for data collection rule creation and management in Azure Monitor. Control costs Because your Azure Monitor cost is dependent on how much data you collect, ensure that you're not collecting more than you need to meet your monitoring requirements. Your configuration is a balance between your budget and how much insight you want into the operation of your virtual machines. Tip For strategies to reduce your Azure Monitor costs, see Cost optimization and Azure Monitor. 2/14 A typical virtual machine generates between 1 GB and 3 GB of data per month. This data size depends on the configuration of the machine, the workloads running on it, and the configuration of your DCRs. Before you configure data collection across your entire virtual machine environment, begin collection on some representative machines to better predict your expected costs when deployed across your environment. Use Log Analytics workspace insights or log queries in Data volume by computer to determine the amount of billable data collected for each machine and adjust accordingly. Evaluate collected data and filter out any that meets the following criteria to reduce your costs. Each data source that you collect may have a different method for filtering out unwanted data. See the sections below for the details each of the common data sources. Not used for alerting. No known forensic or diagnostic value. Not required by regulators. Not used in any dashboards or workbooks. You can also use transformations to implement more granular filtering and also to filter data from columns that provide little value. For example, you might have a Windows event that's valuable for alerting, but it includes columns with redundant or excessive data. You can create a transformation that allows the event to be collected but removes this excessive data. Filter data as much as possible before it's sent to Azure Monitor to avoid a potential charge for filtering too much data using transformations. Use transformations for record filtering using complex logic and for filtering columns with data that you don't require. Default data collection Azure Monitor automatically performs the following data collection without requiring any other configuration. Platform metrics Platform metrics for Azure virtual machines include important host metrics such as CPU, network, and disk utilization. They can be: Viewed on the Overview page. Analyzed with metrics explorer for the machine in the Azure portal. Used for metric alerts. Activity log The activity log is collected automatically. It includes the recent activity of the machine, such as any configuration changes and when it was stopped and started. You can view the platform metrics and activity log collected for each virtual machine host in the Azure portal. 3/14 You can view the activity log for an individual machine or for all resources in a subscription. Create a diagnostic setting to send this data into the same Log Analytics workspace used by Azure Monitor Agent to analyze it with the other monitoring data collected for the virtual machine. There's no cost for ingestion or retention of activity log data. VM availability information in Azure Resource Graph With Azure Resource Graph, you can use the same Kusto Query Language used in log queries to query your Azure resources at scale with complex filtering, grouping, and sorting by resource properties. You can use VM health annotations to Resource Graph for detailed failure attribution and downtime analysis. For information on what data is collected and how to view it, see Monitor virtual machines with Azure Monitor: Analyze monitoring data. VM insights When you enable VM insights, it creates a DCR with the MSVMI- prefix that collects the following information. You can use this same DCR with other machines as opposed to creating a new one for each VM. Common performance counters for the client operating system are sent to the InsightsMetrics table in the Log Analytics workspace. Counter names are normalized to use the same common name regardless of the operating system type. If you specified processes and dependencies to be collected, the following tables are populated: VMBoundPort: Traffic for open server ports on the machine VMComputer: Inventory data for the machine VMConnection: Traffic for inbound and outbound connections to and from the machine VMProcess: Processes running on the machine By default, VM insights won't enable collection of processes and dependencies to save data ingestion costs. This data is required for the Map feature and also deploys the dependency agent to the machine. Enable this collection if you want to use this feature. Collect Windows and Syslog events The operating system and applications in virtual machines often write to the Windows event log or Syslog. You might create an alert as soon as a single event is found or wait for a series of matching events within a particular time window. You might also collect events for later analysis, such as identifying particular trends over time, or for performing troubleshooting after a problem occurs. 4/14 For guidance on how to create a DCR to collect Windows and Syslog events, see Collect data with Azure Monitor Agent. You can quickly create a DCR by using the most common Windows event logs and Syslog facilities filtering by event level. For more granular filtering by criteria such as event ID, you can create a custom filter by using XPath queries. You can further filter the collected data by editing the DCR to add a transformation. Use the following guidance as a recommended starting point for event collection. Modify the DCR settings to filter unneeded events and add other events depending on your requirements. Source Strategy Windows Collect at least Critical, Error, and Warning events for the System and events Application logs to support alerting. Add Information events to analyze trends and support troubleshooting. Verbose events are rarely useful and typically shouldn't be collected. Syslog Collect at least LOG_WARNING events for each facility to support alerting. events Add Information events to analyze trends and support troubleshooting. LOG_DEBUG events are rarely useful and typically shouldn't be collected. Sample log queries: Windows events Query Description Event All Windows events Event | where EventLevelName == "Error" All Windows events with severity of error Event | summarize count() by Source Count of Windows events by source Event | where EventLevelName == "Error" | Count of Windows error summarize count() by Source events by source Sample log queries: Syslog events Query Description Syslog All Syslogs Syslog | where SeverityLevel == "error" All Syslog records with severity of error Syslog | summarize AggregatedValue = count() Count of Syslog records by by Computer computer 5/14 Query Description Syslog | summarize AggregatedValue = count() Count of Syslog records by by Facility facility Collect performance counters Performance data from the client can be sent to either Azure Monitor Metrics or Azure Monitor Logs, and you typically send them to both destinations. If you enabled VM insights, a common set of performance counters is collected in Logs to support its performance charts. You can't modify this set of counters, but you can create other DCRs to collect more counters and send them to different destinations. There are multiple reasons why you would want to create a DCR to collect guest performance: You aren't using VM insights, so client performance data isn't already being collected. Collect other performance counters that VM insights isn't collecting. Collect performance counters from other workloads running on your client. Send performance data to Azure Monitor Metrics where you can use them with metrics explorer and metrics alerts. For guidance on creating a DCR to collect performance counters, see Collect events and performance counters from virtual machines with Azure Monitor Agent. You can quickly create a DCR by using the most common counters. For more granular filtering by criteria such as event ID, you can create a custom filter by using XPath queries. Note You might choose to combine performance and event collection in the same DCR. Destination Description Metrics Host metrics are automatically sent to Azure Monitor Metrics. You can use a DCR to collect client metrics so that they can be analyzed together with metrics explorer or used with metrics alerts. This data is stored for 93 days. Logs Performance data stored in Azure Monitor Logs can be stored for extended periods. The data can be analyzed along with your event data by using log queries with Log Analytics or log search alerts. You can also correlate data by using complex logic across multiple machines, regions, and subscriptions. Performance data is sent to the following tables: - VM insights: InsightsMetrics - Other performance data: Perf 6/14 Sample log queries The following samples use the Perf table with custom performance data. Query Description Perf All Performance data Perf | where Computer == "MyComputer" All Performance data from a particular computer Perf | where CounterName == "Current Disk Queue Length" All Performance data for a particular counter Perf | where ObjectName == "Processor" and CounterName == Average CPU "% Processor Time" and InstanceName == "_Total" | summarize Utilization AVGCPU = avg(CounterValue) by Computer across all computers Perf | where CounterName == "% Processor Time" | summarize Maximum AggregatedValue = max(CounterValue) by Computer CPU Utilization across all computers Perf | where ObjectName == "LogicalDisk" and CounterName == Average "Current Disk Queue Length" and Computer == Current Disk "MyComputerName" | summarize AggregatedValue = Queue length avg(CounterValue) by InstanceName across all the instances of a given computer Perf | where CounterName == "Disk Transfers/sec" | 95th summarize AggregatedValue = percentile(CounterValue, 95) by Percentile of Computer Disk Transfers/Sec across all computers Perf | where CounterName == "% Processor Time" and Hourly InstanceName == "_Total" | summarize AggregatedValue = average of avg(CounterValue) by bin(TimeGenerated, 1h), Computer CPU usage across all computers 7/14 Query Description Perf | where Computer == "MyComputer" and CounterName Hourly 70 startswith_cs "%" and InstanceName == "_Total" | summarize percentile of AggregatedValue = percentile(CounterValue, 70) by every % bin(TimeGenerated, 1h), CounterName percent counter for a particular computer Perf | where CounterName == "% Processor Time" and Hourly InstanceName == "_Total" and Computer == "MyComputer" | average, summarize ["min(CounterValue)"] = min(CounterValue), minimum, ["avg(CounterValue)"] = avg(CounterValue), ["percentile75(CounterValue)"] = percentile(CounterValue, maximum, 75), ["max(CounterValue)"] = max(CounterValue) by and 75- bin(TimeGenerated, 1h), Computer percentile CPU usage for a specific computer Perf | where ObjectName == "MSSQL$INST2:Databases" and All InstanceName == "master" Performance data from the Database performance object for the master database from the named SQL Server instance INST2. Perf | where TimeGenerated >ago(5m) | where ObjectName == Average of "Process" and InstanceName != "_Total" and InstanceName != CPU over last "Idle" | where CounterName == "% Processor Time" | 5 min for summarize cpuVal=avg(CounterValue) by Computer,InstanceName | join (Perf| where TimeGenerated >ago(5m)| where each Process ObjectName == "Process" and CounterName == "ID Process" | ID. summarize arg_max(TimeGenerated,*) by ProcID=CounterValue ) on Computer,InstanceName | sort by TimeGenerated desc | summarize AvgCPU = avg(cpuVal) by InstanceName,ProcID Collect text logs Some applications write events written to a text log stored on the virtual machine. Create a custom table and DCR to collect this data. You define the location of the text log, its detailed configuration, and the schema of the custom table. There's a cost for the ingestion and retention of this data in the workspace. Sample log queries 8/14 The column names used here are examples only. The column names for your log will most likely be different. Query Description MyApp_CL | summarize count() by code Count the number of events by code. MyApp_CL | where status == "Error" | summarize Create an alert AggregatedValue = count() by Computer, rule on any error bin(TimeGenerated, 15m) event. Collect IIS logs IIS running on Windows machines writes logs to a text file. Configure IIS log collection by using Collect IIS logs with Azure Monitor Agent. There's a cost for the ingestion and retention of this data in the workspace. Records from the IIS log are stored in the W3CIISLog table in the Log Analytics workspace. There's a cost for the ingestion and retention of this data in the workspace. Sample log queries Query Description W3CIISLog | where Count the IIS log entries by URL csHost=="www.contoso.com" | summarize for the host www.contoso.com. count() by csUriStem W3CIISLog | summarize sum(csBytes) by Review the total bytes received by Computer each IIS machine. Monitor a service or daemon To monitor the status of a Windows service or Linux daemon, enable the Change Tracking and Inventory solution in Azure Automation. Azure Monitor has no ability on its own to monitor the status of a service or daemon. There are some possible methods to use, such as looking for events in the Windows event log, but this method is unreliable. You can also look for the process associated with the service running on the machine from the VMProcess table populated by VM insights. This table only updates every hour, which isn't typically sufficient if you want to use this data for alerting. Note The Change Tracking and Analysis solution is different from the Change Analysis feature in VM insights. This feature is in public preview and not yet included in this scenario. 9/14 For different options to enable the Change Tracking solution on your virtual machines, see Enable Change Tracking and Inventory. This solution includes methods to configure virtual machines at scale. You have to create an Azure Automation account to support the solution. When you enable Change Tracking and Inventory, two new tables are created in your Log Analytics workspace. Use these tables for logs queries and log search alert rules. Table Description ConfigurationChange Changes to in-guest configuration data ConfigurationData Last reported state for in-guest configuration data Sample log queries List all services and daemons that have recently started. Kusto ConfigurationChange | where ConfigChangeType == "Daemons" or ConfigChangeType == "WindowsServices" | where SvcState == "Running" | sort by Computer, SvcName Alert when a specific service stops. Use this query in a log search alert rule. Kusto ConfigurationData | where SvcName == "W3SVC" | where SvcState == "Stopped" | where ConfigDataType == "WindowsServices" | where SvcStartupType == "Auto" | summarize AggregatedValue = count() by Computer, SvcName, SvcDisplayName, SvcState, bin(TimeGenerated, 15m) 10/14 Alert when one of a set of services stops. Use this query in a log search alert rule. Kusto let services = dynamic(["omskd","cshost","schedule","wuauserv","heathservice","efs","wsusser vice","SrmSvc","CertSvc","wmsvc","vpxd","winmgmt","netman","smsexec","w3svc", "sms_site_vss_writer","ccmexe","spooler","eventsystem","netlogon","kdc","ntds ","lsmserv","gpsvc","dns","dfsr","dfs","dhcp","DNSCache","dmserver","messenge r","w32time","plugplay","rpcss","lanmanserver","lmhosts","eventlog","lanmanwo rkstation","wnirm","mpssvc","dhcpserver","VSS","ClusSvc","MSExchangeTransport ","MSExchangeIS"]); ConfigurationData | where ConfigDataType == "WindowsServices" | where SvcStartupType == "Auto" | where SvcName in (services) | where SvcState == "Stopped" | project TimeGenerated, Computer, SvcName, SvcDisplayName, SvcState | summarize AggregatedValue = count() by Computer, SvcName, SvcDisplayName, SvcState, bin(TimeGenerated, 15m) Monitor a port Port monitoring verifies that a machine is listening on a particular port. Two potential strategies for port monitoring are described here. Dependency agent tables If you're using VM insights with Processes and dependencies collection enabled, you can use VMConnection and VMBoundPort to analyze connections and ports on the machine. The VMBoundPort table is updated every minute with each process running on the computer and the port it's listening on. You can create a log search alert similar to the missing heartbeat alert to find processes that have stopped or to alert when the machine isn't listening on a particular port. Review the count of ports open on your VMs to assess which VMs have configuration and security vulnerabilities. Kusto VMBoundPort | where Ip != "127.0.0.1" | summarize by Computer, Machine, Port, Protocol | summarize OpenPorts=count() by Computer, Machine | order by OpenPorts desc 11/14 List the bound ports on your VMs to assess which VMs have configuration and security vulnerabilities. Kusto VMBoundPort | distinct Computer, Port, ProcessName Analyze network activity by port to determine how your application or service is configured. Kusto VMBoundPort | where Ip != "127.0.0.1" | summarize BytesSent=sum(BytesSent), BytesReceived=sum(BytesReceived), LinksEstablished=sum(LinksEstablished), LinksTerminated=sum(LinksTerminated), arg_max(TimeGenerated, LinksLive) by Machine, Computer, ProcessName, Ip, Port, IsWildcardBind | project-away TimeGenerated | order by Machine, Computer, Port, Ip, ProcessName Review bytes sent and received trends for your VMs. Kusto VMConnection | summarize sum(BytesSent), sum(BytesReceived) by bin(TimeGenerated,1hr), Computer | order by Computer desc | render timechart Use connection failures over time to determine if the failure rate is stable or changing. Kusto VMConnection | where Computer == | extend bythehour = datetime_part("hour", TimeGenerated) | project bythehour, LinksFailed | summarize failCount = count() by bythehour | sort by bythehour asc | render timechart 12/14 Link status trends to analyze the behavior and connection status of a machine. Kusto VMConnection | where Computer == | summarize dcount(LinksEstablished), dcount(LinksLive), dcount(LinksFailed), dcount(LinksTerminated) by bin(TimeGenerated, 1h) | render timechart Connection Manager The Connection Monitor feature of Network Watcher is used to test connections to a port on a virtual machine. A test verifies that the machine is listening on the port and that it's accessible on the network. Connection Manager requires the Network Watcher extension on the client machine initiating the test. It doesn't need to be installed on the machine being tested. For more information, see Tutorial: Monitor network communication using the Azure portal. There's an extra cost for Connection Manager. For more information, see Network Watcher pricing. Run a process on a local machine Monitoring of some workloads requires a local process. An example is a PowerShell script that runs on the local machine to connect to an application and collect or process data. You can use Hybrid Runbook Worker, which is part of Azure Automation, to run a local PowerShell script. There's no direct charge for Hybrid Runbook Worker, but there's a cost for each runbook that it uses. The runbook can access any resources on the local machine to gather required data. It can't send data directly to Azure Monitor or create an alert. To create an alert, have the runbook write an entry to a custom log. Then configure that log to be collected by Azure Monitor. Create a log search alert rule that fires on that log entry. Next steps Additional resources Documentation Monitor virtual machines with Azure Monitor - Azure Monitor Learn how to use Azure Monitor to monitor the health and performance of virtual machines and their workloads. 13/14 Best practices for monitoring virtual machines in Azure Monitor - Azure Monitor Best practices organized by the pillars of the Well-Architected Framework (WAF) for monitoring virtual machines in Azure Monitor. Monitor virtual machines with Azure Monitor: Deploy agent - Azure Monitor Learn how to deploy the Azure Monitor agent to your virtual machines for monitoring in Azure Monitor. Monitor virtual machines and their workloads with an Azure Monitor guide. Training Module Collect guest operating system monitoring data from Azure and hybrid virtual machines using Azure Monitor Agent - Training Discover how to set up and integrate a Log Analytics agent with a workspace in Defender for Cloud using the Azure portal, enhancing security data analysis capabilities. 14/14