Getting Started with SRE – Step 1 – Instrumentation

Introduction

After we set ourselves the goal of system reliability, our first step must be awareness. 

Simply put, we must be fully aware of how our service performs.

By performance, we are not referring to benchmark tests written and executed in isolation but rather how well our service delivers value to its users. The four golden signals (Traffic, Latency, Saturation, and Errors) described in the Google SRE book are a great starting point for understanding how our system performs.

To translate these concepts into action, we must be able to quickly and easily answer the following questions:

  1. How much is our service being used?
  2. What is the error rate our users are experiencing?
  3. Is our service responding to requests in a reasonable amount of time?
  4. Is our service fundamentally healthy?

For us to be able to answer these questions, we need sufficient instrumentation in our service. This article briefly discusses the what, why, and how of system instrumentation, with concrete examples.

A quick note before we begin

Examples in this article are written in Golang and use StatsD and Datadog for instrumentation. However, the underlying concepts are directly transferable to other programming languages and instrumentation options.

How to add instrumentation

This section will discuss how to add instrumentation and the various types of instrumentation we can use with Golang and StatsD. If you are familiar with these technologies, skip to the next section.

Counting Events

The first and perhaps most common metric we will add is an event counter. Events can be anything; examples include the usage of an API, the number of errors, or the number of results generated from an operation. 

When tracking events, it is important to remember that they are occurrences or things that have happened. We will use the occurrence of these events, or the rate at which they happen, to trigger alarms.

The following is an example of how to track a single event:

statsD.Count("login", 1, “result:error”, "issue:not_found")

The above code shows a call to the Count() method on our StatsD client. The first parameter, “login”, denotes this metric’s unique key or name. 

The second parameter, 1, denotes how many of these events we are recording. This is the most common usage of Count(): counting a single event. 

The remaining two parameters, "result:error" and "issue:not_found", are a variadic list of tags that we are adding to the metric. By adding tags to our metrics, we can record multiple similar or related events under the same metric key. Using tags also allows us to efficiently graph and monitor these events both as a group and separately.

For the remainder of this article, we will use the Count1() method, which is provided by some libraries as a shortcut for calling Count(key, 1).
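
If your client library does not provide Count1(), it is trivial to add yourself. The following is a minimal sketch, assuming a package-level client that exposes the Count() call shown above:

// Count1 records a single occurrence of an event under the given key.
// It assumes the client exposes Count(name string, value int64, tags ...string);
// adjust the value type to match your library.
func Count1(name string, tags ...string) {
	statsD.Count(name, 1, tags...)
}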

The following is a more realistic example, with three different events being tracked under two separate keys:

func Login(username, password string) (*User, error) {
	user, err := loadUserFromDB(username, password)
	if err != nil {
		if err == ErrNotFound {
			statsD.Count1("login.err", "issue:not_found")
		} else {
			statsD.Count1("login.err", "issue:db_error")
		}

		return nil, err
	}

	statsD.Count1("login.success")

	return user, nil
}

There are two key points to note from this example. Firstly, all the errors are recorded under the "login.err" key but with separate tags. This allows us to track the total number of errors (using just the key) while still being able to quickly understand their cause (by using the tags).

Secondly, successful logins are recorded under a different key, "login.success". This is intentionally separate from the errors because, in most cases, we will want to chart and monitor success and error events separately. While combining them under a single key and using tags to differentiate the outcomes is possible, it makes charting more awkward.

Tracking Values over time

In the previous section, we used Count() to track the total number or rate of events over time. But sometimes, we want to track a particular value as it changes over time; for this, we use the Gauge() method.

Gauge() tracks only the last reported value for a key; thus, it is ideal for tracking resources like CPU and RAM usage, the number of goroutines currently executing, or the number of connections in a connection pool.

The usage of Gauge() is very similar to Count() and looks like this:

func MonitorPool(connPool ConnectionPool, hostname string) {
	statsD.Gauge("pool.active", float64(connPool.ActiveCount()), "host:"+hostname)
	statsD.Gauge("pool.idle", float64(connPool.IdleCount()), "host:"+hostname)
}

As you might expect, the first parameter is the key for this metric. The second parameter is the current value, and the last parameter is again metric tags as a variadic slice of strings.

When using Gauge(), we should always include sufficient tags to adequately graph the metric. For example, drawing a graph that shows the highest number of goroutines running across all servers is not very useful. However, graphing the number of goroutines running on each server is. The same idea applies to other common uses of Gauge(), like CPU and memory use.
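
Because Gauge() only keeps the last reported value, it needs to be reported periodically for its graphs to stay current. A minimal sketch of how we might poll the MonitorPool() example above on a timer (the 10-second interval is an arbitrary choice):

// StartPoolMonitor reports the connection pool gauges every 10 seconds until the program exits.
func StartPoolMonitor(connPool ConnectionPool, hostname string) {
	go func() {
		ticker := time.NewTicker(10 * time.Second)
		defer ticker.Stop()

		for range ticker.C {
			MonitorPool(connPool, hostname)
		}
	}()
}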

Tracking the distribution of events

The last metric type we are going to examine is Histogram(). Like Count(), Histogram() tracks the individual occurrences of events but takes it much further; it also generates summaries of those events.

Histogram() generates five different values for each metric: the average, the median, the maximum, the 95th percentile (p95), and the count. This makes Histogram() ideal for tracking time-based events, like latency.

The following is a common use-case for Histogram():

func LoginAPI(resp http.ResponseWriter, req *http.Request) {
	start := time.Now()

	doLogin()
	
	statsD.Histogram("api", float64(time.Since(start).Nanoseconds()), "endpoint:login")
}

The parameters in our Histogram() call are very similar to Count() and Gauge(). The first parameter is the metric name; the second is the value we are tracking (in our example, the function’s duration); and the final parameter is the metric tags.

In some StatsD libraries, the UX of Histogram() has been improved with a Duration() method that takes care of the time calculation we see in the previous example. This Duration() method enables us to shorten the above example to the following:

func LoginAPIImproved(resp http.ResponseWriter, req *http.Request) {
	defer statsD.Duration("api", time.Now(), "endpoint:login")

	doLogin()
}

In this version, we are leveraging the fact that defer runs when the function returns and that the deferred call’s arguments are evaluated when the defer statement is encountered, not when it is executed. The result is much simpler and cleaner.
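
If your StatsD library does not provide Duration(), it can be written as a thin wrapper around Histogram(). A minimal sketch, assuming the Histogram() signature used earlier:

// Duration records the time elapsed since start as a Histogram() value (in nanoseconds).
func Duration(name string, start time.Time, tags ...string) {
	statsD.Histogram(name, float64(time.Since(start).Nanoseconds()), tags...)
}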

There are two additional points to remember when using Histogram(). Firstly, as Histogram() automatically counts occurrences, we do not need to add a separate metric for that. 

Secondly, when tracking the timing, particularly latency, p95 is the most significant value. This is because p95 gives us a reasonable approximation of what most users are experiencing. It is a better metric than the median because the median only represents what half the users have experienced. And it is better than the maximum because the maximum can be significantly skewed by a single extreme value.

General Tips Regarding Metric Keys and Tags

Tip 1: Add context

As we have previously mentioned, all metric keys within a particular application must be unique. A simple way to achieve this is to prefix all metrics for a particular package with that package’s name. This has the additional advantage of making it easier to find the code generating a particular metric. For example, we can define a private constant in the package called logTag like this:

package api

const logTag = "api."

Then when emitting metrics for this package, we can do something similar to:

statsD.Count1(logTag+"errors", "type:db_failed")

While we could define constants for all the metric keys, this quickly gets unwieldy; given that logTag and "errors" are both constants, the compiler performs the string concatenation for us at compile time.

Tip 2: Be wary of metric cardinality caused by tags

When a metric has a high number of tag and tag value combinations, it is considered to have high cardinality. Due to the complexity and cost of managing metrics with high cardinality, Datadog may refuse to fully process these metrics (and may charge extra for high-cardinality data that it does process). To avoid these problems, we should limit the number of tags associated with a metric to as few as possible, ideally fewer than 10.

Similarly, we should limit tag values to a predictable set of values, ideally 20 or fewer.

Please do not do something like this:

func MyAPI(requestUUID string) {
	statsD.Count1("error", "requestUUID:"+requestUUID)
}

As the requestUUID is unique, every value we instrument will result in a new tag value, and the result will be a metric with very high cardinality. 

Another similar and common mistake is to tag all metrics per-user, which leads to the same disastrous results.
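
Instead, tag values should come from a small, fixed set. Here is a hedged sketch of the earlier example reworked with a bounded tag; the errorType values shown are purely illustrative:

func MyAPIFixed(errorType string) {
	// errorType must come from a small, fixed set of values (e.g. "validation", "db_error"),
	// never from request-specific data such as UUIDs or usernames.
	statsD.Count1("error", "type:"+errorType)
}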

What should we be instrumenting?

As mentioned previously, the minimum we need to understand is how our system is performing from our users’ perspective. The metrics we add to our system to track this performance are referred to as Key Business Metrics, or KBMs.

Ideally, though, it is not enough to know how the system is performing; we also want to understand, or at least be able to hypothesize about, why it is performing the way it is. To achieve this, we should also add metrics for the following:

  • The performance and utilization of all external dependencies
  • The performance and utilization of the underlying servers and infrastructure
  • The performance and utilization of any databases or caches
  • The occurrence of any significant events that could impact the performance of the system

Let’s examine each of these in a little more detail.

Instrumenting Key Business Metrics (KBMs)

For those of us instrumenting an HTTP-based API service, the simplest and easiest KBMs to track are our API statistics, specifically, our API’s Usage, Timing, and Error Rate. Achieving this is straightforward thanks to the Duration() method.

For HTTP-based services, we should add the following code (or something similar) to all of our HTTP handlers:

func LoginHandler(resp http.ResponseWriter, req *http.Request) {
	defer statsD.Duration("api", time.Now(), "endpoint:login")

	// call to the business logic/actually perform the login
	doLogin()
}

There are some interesting things to note here. The metric key does not need to be "api"; it can be whatever you like, but it should be the same for every handler in your system. This will make it much more efficient to generate graphs and monitors, as we will see in an upcoming article.
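
Rather than repeating this defer line in every handler, we could wrap our handlers in a small piece of middleware. The following is a sketch; instrumentedHandler and its endpoint parameter are illustrative names, not part of any library:

// instrumentedHandler wraps an http.HandlerFunc so that every request is timed
// under the shared "api" key, tagged with the endpoint name.
func instrumentedHandler(endpoint string, next http.HandlerFunc) http.HandlerFunc {
	return func(resp http.ResponseWriter, req *http.Request) {
		defer statsD.Duration("api", time.Now(), "endpoint:"+endpoint)

		next(resp, req)
	}
}

// usage:
// http.HandleFunc("/login", instrumentedHandler("login", LoginHandler))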

Also, we should differentiate the endpoints by defining a different tag value ("login" in the example) for each handler. This approach, using the same metric key and tag name but different tag values, makes generating the following graphs very easy:

This one metric allows us to answer three of our original questions regarding our system health, namely:

  • How much is our service being used?
  • Is our service responding to requests in a reasonable amount of time?
  • Is our service fundamentally healthy?

OK, we could use some more detail regarding the last question, but it is an excellent start.

We only have one question remaining for our KBMs:

  • What is the error rate our users are experiencing?

The metric we have already added uses Duration(), which you will remember is based on Histogram(). Histogram() tracks and summarizes our timing values and keeps a count of the total number of requests. This count is the first of the two things we need to calculate our error rate; the missing piece is to track the number of errors. Let’s expand our previous example to include a metric that does exactly that:

func LoginHandler(resp http.ResponseWriter, req *http.Request) {
	defer statsD.Duration("api", time.Now(), "endpoint:login")

	err := doLogin()
	if err != nil {
		statsD.Count1("api.error", "endpoint:login")
		
		resp.WriteHeader(http.StatusInternalServerError)
		return
	}
	
	_, _ = resp.Write([]byte(`OK`))
}

The significant change here is the addition of a Count1() call with a new metric called "api.error".

We do not need to add a metric to track success, as we can calculate this from the total number of requests and the error count. A typical improvement to this approach is to add another tag that tracks either the type of error or the response code; anything that gives us a better indication of the cause of the error without reading the logs.
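
For example, here is a sketch of the same error metric with the response code added as a tag; trackAPIError is an illustrative helper, and because status codes form a small, bounded set, cardinality stays low:

// trackAPIError records an API error tagged with the endpoint and the response code.
func trackAPIError(endpoint string, statusCode int) {
	statsD.Count1("api.error", "endpoint:"+endpoint, "code:"+strconv.Itoa(statusCode))
}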

Now that we have our two metrics, we can combine them and generate the following chart:

We will discuss how to construct this chart in an upcoming article.

With just these two metrics in our HTTP handlers, we have our basic KBMs covered.

KBMs will not always be limited to API-related values; if your service has other significant outputs, like publishing to a stream or uploading files (basically anything that stakeholders rely on), then these should also be instrumented. The main point to remember is to track our application’s usage and quality of service.
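
For example, publishing to a stream could be instrumented using exactly the same pattern as our handlers. In the following sketch, EventPublisher, Event, and the "stream:orders" tag are hypothetical stand-ins for whatever your service actually publishes:

func PublishOrder(publisher EventPublisher, event Event) error {
	// time and count every publish under a single key, tagged by stream
	defer statsD.Duration("publish", time.Now(), "stream:orders")

	err := publisher.Publish(event)
	if err != nil {
		statsD.Count1("publish.error", "stream:orders")
	}

	return err
}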

Action Items:

Here is a list of action items for this section that you should consider adding to your backlog:

  1. Add Duration() metrics for all KBMs
  2. Add usage metrics for all KBMs (if not already handled by the use of Duration() for point 1)
  3. Add error tracking metrics for all KBMs

Instrumenting External Dependencies

For most services, the performance and quality of service that we receive from our dependencies significantly impacts the performance and quality of service that our service can provide. For this reason, it is advantageous to instrument and monitor these dependencies.

Yes, the owners of these services should also be monitoring their services; however, this does not account for any network issues between the services, nor does it mean that we will be informed when those services are experiencing issues.

Given that the most common dependencies are other HTTP-based services, the way we instrument our interactions with these services is similar to how we instrumented our HTTP handlers with usage, timing, and error rates.

First, we will use a Duration() call to track the timing and the number of times we call our dependency, like this:

func CallMyDependencyV1() error {
	req := buildRequestToMyDep()

	start := time.Now()
	resp, err := http.DefaultClient.Do(req)
	statsD.Duration("external.mydep", start)

	if err != nil {
		return err
	}

	// always close the response body to avoid leaking connections
	defer resp.Body.Close()

	// TODO: process the response body

	return nil
}

You will notice a couple of things here. Firstly, we are not using defer. 

We could, but this would mean that the tracked duration would also include the time taken to process the response, not just the call to the dependency.

Secondly, we are tracking the time taken regardless of the result from Do(). This ensures we are tracking all usage (errors and successes). However, it also means that if Do() returns an error, the timing values will be affected. Typically, this call only returns an error for client-side problems (like an invalid URL or a network failure); HTTP error responses from the server do not cause Do() to return an error.

Now that we are tracking the timing and amount of calls to the dependency, we can expand our example to include tracking errors using Count1(). Here is the resulting code:

func CallMyDependencyV2() error {
	req := buildRequestToMyDep()

	start := time.Now()
	resp, err := http.DefaultClient.Do(req)
	statsD.Duration("external.mydep", start)

	if err != nil {
		statsD.Count1("external.mydep.error", "cause:do")

		return err
	}

	// always close the response body to avoid leaking connections
	defer resp.Body.Close()

	// process the response
	if resp.StatusCode != http.StatusOK {
		statsD.Count1("external.mydep.error", "cause:status_code")

		return fmt.Errorf("unexpected error code. code: %d", resp.StatusCode)
	}

	// TODO: process the response body

	return nil
}

As you can see, we have not added one Count1() call but two. Both calls use the same metric key and tag name but different tag values. This gives us a fast and easy way to create dashboards and monitors while still allowing us to quickly see the cause of the errors.

With these two metrics, we can generate the following three charts using similar methods to those we used for the KBMs:

Action Items:

Here is a list of action items for this section that you should consider adding to your backlog:

  1. Add Duration() metrics for all external dependency calls
  2. Add usage metrics for all external dependency calls (if not already handled by the use of Duration() for point 1)
  3. Add error tracking metrics for all external dependency calls

Instrumenting Hardware

Another likely cause of degradation to our service’s performance is the hardware on which the service is running. Fortunately, these metrics are almost always available to us without having to add any code. Datadog has integrations with all the major cloud providers, and the Datadog agent (the application responsible for forwarding metrics from the server to Datadog) can also monitor the local system.

At a minimum, we should be monitoring the CPU and memory usage of our applications, as you can see in the following charts:

There are many other system and network resource metrics available to us; you may wish to track metrics like disk usage, network throughput, and even the number of instances currently running the service.

One useful low-level metric that is not automatically provided is the number of running goroutines. Using the NumGoroutine() function from Go’s runtime package and StatsD’s Gauge() method, we can use a long-running goroutine to periodically poll and publish this value.
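
A minimal sketch of such a poller (the metric name, tag, and 30-second interval are arbitrary choices):

// MonitorGoroutines publishes the current goroutine count every 30 seconds.
func MonitorGoroutines(hostname string) {
	go func() {
		ticker := time.NewTicker(30 * time.Second)
		defer ticker.Stop()

		for range ticker.C {
			statsD.Gauge("goroutines", float64(runtime.NumGoroutine()), "host:"+hostname)
		}
	}()
}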

Action Items:

Here is a list of action items for this section that you should consider adding to your backlog:

  1. Investigate the environment that hosts your application to see what automatic instrumentation is available.
  2. Add any system-level metrics (like goroutines or garbage collection) that are significant to your system.

Instrumenting DBs and Caches

The performance of databases and caches also greatly impacts the quality of service we can provide. As such, we should instrument them closely too. There are many ways to instrument databases and caches, but the underlying ideas are the same as for the KBMs and external dependencies we have already considered.

Firstly, we need to track total usage. With this, we can quickly notice sudden spikes or drops in usage that likely indicate a problem. Ideally, we should also track usage by type or purpose. The most efficient way to do this is by using the same metric for all usage and differentiating with tags, as we did with our API usage. This enables us to generate these two charts:

Secondly, we need to track the performance of our database. Most of the time, if the database becomes slow, our KBM response times will also become slow. Similarly, if the usage of our service increases, database usage often increases too, which can lead to database performance degradation depending on the capacity of the database.

We can use Duration() to generate the usage charts and this timing chart:

Finally, we should track the amount and type of errors using Count1() as we have done with the KBMs and dependencies. This will allow us to generate a chart like this:

When adding these metrics, please remember to add sufficient tags and tag values so that we can quickly identify the type of query and type of error without digging deeper into the logs.
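
As a sketch, a single database query might be instrumented like this; the metric names, the "query" tag values, and the User fields are illustrative, and the same pattern applies to cache calls:

func LoadUser(db *sql.DB, username string) (*User, error) {
	// track usage and timing of this query type under a shared "db" key
	defer statsD.Duration("db", time.Now(), "query:load_user")

	user := &User{}
	row := db.QueryRow("SELECT id, name FROM users WHERE username = ?", username)
	if err := row.Scan(&user.ID, &user.Name); err != nil {
		statsD.Count1("db.error", "query:load_user")
		return nil, err
	}

	return user, nil
}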

Action Items:

Here is a list of action items for this section that you should consider adding to your backlog:

  1. Investigate if the database and cache libraries you use already contain instrumentation
  2. Add any missing timing metrics for all database and cache calls
  3. Add any missing usage metrics for all database and cache calls (if not already handled by the use of Duration() for point 1)
  4. Add any missing error tracking metrics for all database and cache calls

Instrumenting Significant Events

There are always events that are likely to impact the performance of the application or our instrumentation. I cannot provide a definitive list of these events as they differ widely between systems, teams, and environments. However, events like deployments, application start, application stop, and configuration changes are good starting points, as are environmental issues like hardware auto-scaling or database backups.
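
A sketch of the simplest of these, instrumenting application start and stop; the metric name and runService() are illustrative:

func main() {
	statsD.Count1("app.event", "event:start")

	// ensure the stop event is emitted on a normal shutdown
	defer statsD.Count1("app.event", "event:stop")

	runService() // blocks until the service shuts down
}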

Action Items:

Here is a list of action items for this section that you should consider adding to your backlog:

  1. Add instrumentation of application start and stop.
  2. Add instrumentation for any other key events you can identify.
  3. Develop a practice whereby any events discovered while investigating significant changes in metrics (e.g., sudden spikes or drops) or downtime cause you to add more event metrics, giving you greater visibility of your system over time.

Not all errors are created equal

I have one final optimization to offer you. Often we (programmers) instrument all the different types of errors in the same way. Our instinct is that everything that is not successful is an error.

However, this perspective overlooks the fact that how we want and need to respond to different types of errors can be quite different.

For example, assume you are on call, and when your system emits more than ten errors in a minute, a page (phone call) is sent to your phone. If a user incorrectly fills out the form submitted to your API and then gets frustrated and spams the submit button, this is probably something you don’t want to get woken up at 4 am for. On the other hand, if the database crashes at 4 am, that is something you need to get woken up to fix.

At a minimum, user errors should be tagged separately from other errors so that we can tune our alerts to allow for more user errors or exclude them from triggering pagers at bad times. I prefer to use a separate metric instead of a tag. This can make charting a little more troublesome, but I don’t mind if it results in uninterrupted sleep.
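
As a sketch, the login handler could separate user errors from system errors like this; ErrInvalidPassword is an assumed sentinel error returned by doLogin():

func LoginHandlerV2(resp http.ResponseWriter, req *http.Request) {
	defer statsD.Duration("api", time.Now(), "endpoint:login")

	err := doLogin()
	switch {
	case errors.Is(err, ErrInvalidPassword):
		// a user error: tracked under a separate metric that never triggers the pager
		statsD.Count1("api.user_error", "endpoint:login")
		resp.WriteHeader(http.StatusBadRequest)
	case err != nil:
		// a system error: this is the metric our alerts are built on
		statsD.Count1("api.error", "endpoint:login")
		resp.WriteHeader(http.StatusInternalServerError)
	default:
		_, _ = resp.Write([]byte(`OK`))
	}
}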

Summary

This article has examined why we want to add instrumentation to our systems and how we should do it. We have defined Key Business Metrics and explained why it is imperative to track them closely. We have also discussed how to add instrumentation to the parts of our service that are most likely to negatively impact the quality of service we deliver.

I hope you have enjoyed this article and found it helpful. Stay tuned for the next part, where we discuss how to visualize our instrumentation into a dashboard.


If you like this content and would like to be notified when there are new posts, or would like to be kept informed regarding the upcoming book launch, please join my Google Group (very low traffic and no spam).

References