RD-@Timed Guide - Removing and Replacing Timed Annotations

Summary

This document is a guide to removing @Timed annotations from your code and replacing them with APM-based alternatives that are cheaper and produce higher quality data. It covers confirming whether the metrics are used, identifying where they are used, and replacing @Timed metrics with trace metrics in monitors and dashboards.

Full Transcript

@Timed: why you should avoid it, how to remove it, and what to use instead

The @Timed annotation was, for a while, the primary way of collecting latency and count metrics for what's going on inside services at Toast; as a result, it's found in an awful lot of our services, our libraries, and our templates. However, the SRE team believes its use is expensive, redundant, and produces lower quality data than the APM. Hundreds of thousands of unused metrics, which we are billed for, come from its use. Recent changes like the Dropwizard 3 migration have made @Timed metrics even more of a nuisance, so now is the time to drop it.

The short version: SRE recommends removing the @Timed annotation from your code, once you've found and modified any Datadog resources that depend on it.

How do I migrate off of @Timed?

1. Confirm whether you're actually using the metrics anywhere.
2. Identify how the metrics are being used.
3. Replace @Timed-originating metrics in your monitors and dashboards with APM-based alternatives… or just use the APM dashboard.
4. Once your use cases are covered, search through your codebase and remove @Timed annotations.
5. On a new deploy, use the metrics summary with a version number tag for your service to confirm that the count of metrics has decreased.

Why stop using @Timed? How does the APM replace it?

- @Timed is expensive, and most of the data it generates is never used.
- @Timed data is almost always completely redundant:
  - trace.servlet.request handles resource request metrics.
  - trace.* metrics already exist for many other use cases - database queries, GraphQL operations and more.
  - If you need something custom, @Trace is as easy to use as @Timed.
  - Trace metrics are significantly more searchable and extensible.
- @Timed metrics are unintuitive and less accurate than APM data.

How do I migrate off of @Timed?

1. Confirm whether you're actually using the metrics anywhere.

Go to the metrics summary and filter by the tag for your service, e.g. service:ds-model. Then use the sidebar to filter to only Actively Queried metrics. Ignore any ServiceStatusResource metrics. @Timed metrics look like this: a prefix (usually app), then the Java classpath (usually starting with com.toasttab...), ending with the function name and suffixes like .count, .15MinuteRate, .request.filtering.median, etc. You should know them when you see them. If you don't see any metrics in use, you can skip ahead to removing them - step 4 of this section.

2. Identify how the metrics are being used.

If you do see a few, it's helpful to know where they're being used so that you can be confident you can update any related monitors or dashboards. Clicking on any metric brings up its information; scroll to the bottom to see which dashboards or monitors use the metric in the Related Assets section. Generally speaking, if you have a lot of these, they're probably only being used in the same handful of dashboards. To search for monitors in bulk, once you know the format of the timed metrics emitted by your service, you can use the metric: query with a wildcard in the Monitors section. If you don't see any related assets, it's possible that the metrics show up as having been queried because they were used ad hoc, for example in the Metrics Explorer. Those metrics can also be safely removed; one reason metrics might sit in this category is that they were used to research this article!
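To make the naming pattern from step 1 concrete before moving on to replacements, here is a hedged sketch of where those names come from. The class and path are taken from the ds-model example discussed later in this document; the method body and the javax.ws.rs imports are illustrative only (newer Dropwizard versions use jakarta.ws.rs), and the exact set of suffixes varies with the annotation version and metrics reporter.

import com.codahale.metrics.annotation.Timed;

import javax.ws.rs.GET;
import javax.ws.rs.Path;

@Path("/v1/jam")
public class ToastJamResource {

    // @Timed registers a Timer named after the class and method. The metrics
    // reporter then fans that single Timer out into a family of Datadog metrics:
    //   app.com.toasttab.service.dsmodel.resources.ToastJamResource.models.count
    //   app.com.toasttab.service.dsmodel.resources.ToastJamResource.models.p99
    //   app.com.toasttab.service.dsmodel.resources.ToastJamResource.models.15MinuteRate
    //   ...and so on for the other percentile and rate suffixes.
    @GET
    @Path("/models")
    @Timed
    public String models() {
        return "[]";
    }
}

Each of those names is a separate custom metric, which is why a single annotation can account for dozens of billed metrics.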
3. Replace @Timed-originating metrics in your monitors and dashboards with APM-based alternatives… or just use the APM dashboard.

The APM dashboard for any given service (found in the Datadog Service Catalog; see the example for ds-model) is a feature-rich way to drill down into any given resource and see detailed interactions between it and its dependencies. It also displays JVM metrics and other key metrics, and it may work as a replacement for most service dashboards. Use the APM dashboard for your service for a while and reconsider whether you even need the widgets you created for @Timed metrics. You can also bring data from the APM dashboard into your own dashboard quite easily: click on a widget, press ⌘+C, and then paste it into your own dashboard with ⌘+V. There are also APM summary widgets that can be included in any Datadog dashboard.

If you still need to replace metrics in use within an existing widget, or in the case of monitors, you should be able to use trace.servlet.request, another trace.* metric, or the Datadog APM Metrics dropdowns to find a query matching the one you're replacing. If you're struggling to find a replacement, ask in #sre on Slack and we'd be happy to assist. The "APM Metrics" dropdowns in Datadog's widget editors make it remarkably easy to select metrics for any given endpoint.

One particular caveat to note when transforming these metrics: @Timed-originating metrics like .p99 are measured in milliseconds, while request time metrics from the APM are measured in seconds (a 500 threshold on a .p99 metric becomes 0.5 on its APM counterpart). Make sure to adjust your monitor thresholds accordingly to avoid accidentally paging anyone.

4. Once your use cases are covered, search through your codebase and remove @Timed annotations.

This is pretty straightforward; for most services, you can just remove them in bulk using something like project search in IntelliJ, and run spotlessApply to take care of removing imports. See the sketch after step 5 for what the change typically looks like.

5. On a new deploy, use the metrics summary with a version number tag for your service to confirm that the count of metrics has decreased.

At that point, you're done!
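As an illustration of step 4, here is a hedged sketch of the change, using the same hypothetical resource sketched earlier; the removal is a pure deletion, and spotlessApply (or your IDE) cleans up the now-unused import.

import javax.ws.rs.GET;
import javax.ws.rs.Path;

// import com.codahale.metrics.annotation.Timed;   // unused after the change; removed by spotlessApply

@Path("/v1/jam")
public class ToastJamResource {

    @GET
    @Path("/models")
    // @Timed                                       // deleted in step 4; the HTTP annotations stay
    public String models() {
        return "[]";
    }
}

Request latency and counts for this endpoint now come from the APM's trace.servlet.request data rather than a hand-registered Timer.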
Why stop using @Timed? How does the APM replace it?

@Timed is expensive, and most of the data it generates is never used.

We are billed for metric ingest. @Timed metrics account for an enormous portion of our Datadog budget, and there's an extremely strong chance that you are generating them and not using them, or using a tiny, tiny subset of the data you asked for. At the time of writing, if you go to the Datadog metrics summary you can filter the metrics we generate by whether or not they actually receive any use, and only about 2% of the metrics we generate throughout Toast actually receive any use from Datadog. If you use @Timed once, you might be surprised to see that it creates up to 60 different metrics. The exact number seems to vary depending on which version of the annotation you use, whether your service has migrated to Dropwizard 3, etc.; the random example below has 60 resulting metrics and none of them get used. So @Timed produces an enormous amount of data, and the low rate of actual use is universal - that isn't a cherry-picked example. If you type in some of the keywords used in @Timed metrics, you'll see a pattern: if you search for just 1MinuteRate metrics, one of the types created by @Timed by default, you'll see that only 1% get accessed. If you search for 15MinuteRate, it's 0.1%.

@Timed data is almost always completely redundant.

At Toast, we have an extremely strong APM implementation; the Datadog Java Agent plugs directly into our services and automatically creates rich data for incoming Dropwizard requests, GraphQL queries, and so on. The result is that nearly every single thing we use @Timed for is already redundant.

trace.servlet.request handles resource request metrics

The vast majority of places where @Timed shows up in our codebase are there to instrument incoming requests from users. It shows up all the time in the stack of annotations on resource functions, next to HTTP method and path annotations. Let's take a look at that example again - it's the /models path on ToastJamResource in the ds-model service, which generates 60 different metrics. Every single one of those metrics can be replaced with trace.servlet.request, which we already get included with our APM spend. That metric can be used to get counts, arbitrary percentiles, sorting by errors, etc. If you want the fine detail on how that works, it's what Datadog calls a distribution metric.

trace.* metrics exist for many other use cases already - database queries, GraphQL operations and more

Use the Metrics Summary and filter to your service to see what APM data is already collected and turned into trace.* metrics. When you filter to a specific service in that view, make sure to put a wildcard at the end of your service name. The cards service outputs 8 kinds of tracing metrics, but by searching for service:cards* you can also find metrics tagged under cards-aws-sdk, cards-postgresql, cards-hibernate, etc., allowing you to use the APM to get accurate database interaction times as well.

If you need something custom, @Trace is as easy to use as @Timed

If you have a specific type of operation that you are instrumenting with @Timed and it is not already captured by the Datadog Java Agent's auto-instrumentation, it's easy to migrate - see Datadog's own documentation, Java Custom Instrumentation using Datadog API, and in particular take note of the @Trace annotation, which lets you create new operation types with almost the same ease of use as @Timed; all you need to do is provide an operation name and resource name. With tracing, we are charged based on the amount of data ingested. This is more scalable than creating new metrics, avoiding the costs associated with custom metrics and tag cardinality. There's no such thing as free observability, and you are encouraged to think carefully about what should be instrumented, but the cost of APM data is easier to understand and doesn't carry the immediate extra cost of a new metric each time.
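For example, a minimal sketch of a custom operation instrumented with @Trace, assuming the dd-trace-api dependency is on the classpath; the class, operation name, and resource name here are made up for illustration.

import datadog.trace.api.Trace;

public class ModelRefreshJob {

    // Creates a span with the given operation and resource names, which then
    // shows up in the APM alongside the auto-instrumented request spans.
    @Trace(operationName = "model.refresh", resourceName = "ModelRefreshJob.run")
    public void run() {
        // ... the work you previously wrapped with @Timed ...
    }
}

As with the rest of the APM data, the resulting spans can be filtered by service, resource, and error status.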
Trace metrics are significantly more searchable and extensible

You just need the name of the service and the path, and you can select the specific metric you're looking for. The trace-generated metric count

trace.servlet.request{service:ds-model, resource_name:get_/v1/jam/models, env:preproduction}.as_count()

is a lot easier to recognise and understand than

monotonic_diff(sum:app.com.toasttab.service.dsmodel.resources.ToastJamResource.models.count{env:preproduction})

That's why Scalable Observability uses APM data; it provides a straightforward way of thinking about every service's basic request metrics without having to mess around with discovering arbitrary class names. It also means that if you wanted all the different endpoints in ToastJamResource, you could do that with a wildcard (resource_name:*/v1/jam/*) and not have to seek out ten different metrics and keep updating the query each time you add a new endpoint. These metrics can be combined with Datadog functions like moving rollup to get time-bound metrics comparable to 1MinuteRate and 5MinuteRate. You can split out errors by specifying error:false if you only care about the latency of requests without errors, and there's a separate Hits by HTTP Status metric if you want finer detail on response types.

@Timed metrics are unintuitive and less accurate than APM data.

You would think that a metric ending in .count generated by @Timed would show you the number of requests a service received within a specific timeframe if you put it into a Datadog graph. It doesn't. Take the example from the /models endpoint: over a week, on a service that gets very little traffic, the graph is shaped strangely, with jagged drops, and constantly shows values in the hundreds; the .count metric doesn't do anything meaningful by itself. That metric is actually a gauge which returns the cumulative sum of the number of requests since the metric registry was created. The fall-off you see in the back half of the graph is instances of the service going offline. If you actually want to count the number of requests the service receives, you need to use Datadog's Monotonic Diff function to force the count metric to show something closer to actual counts. There are still obvious issues with this metric - what does it mean to get 50.17 requests in a time window? If you use a real count metric, like the one provided by the APM, it looks much like the Monotonic Diff version of the graph, albeit with whole numbers. However, I've seen a couple of dashboards whose authors didn't realise this unintuitive distinction and ended up full of nonsense data because something like Monotonic Diff wasn't applied. It's important to note that trace.* metrics are generated before any trace sampling takes place, so they are accurate regardless of your service's trace sampling settings.
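To illustrate that cumulative behaviour, here is a minimal sketch using the Dropwizard Metrics Timer that backs @Timed; the registry setup and timer name are made up for illustration.

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

import java.util.concurrent.TimeUnit;

public class TimedCountSketch {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Timer timer = registry.timer(
                "com.toasttab.service.dsmodel.resources.ToastJamResource.models");

        // Simulate three handled requests.
        for (int i = 0; i < 3; i++) {
            timer.update(5, TimeUnit.MILLISECONDS);
        }

        // Prints 3: the cumulative number of events since this registry/instance
        // started, not a per-interval count. When an instance restarts the value
        // drops back to zero, which is the jagged fall-off seen in the graphs and
        // why Monotonic Diff is needed to approximate real request counts.
        System.out.println(timer.getCount());
    }
}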
