Introduction
In Part 1 of this series, we introduced the goal of understanding how our system performs by adding instrumentation. This article expands on that goal by taking the captured instrumentation data and visualizing it on a dashboard.
The goal of SRE dashboards is to give us a simple and efficient means of understanding how our application is performing and why.
As highlighted in Part 1, we must be able to quickly and easily answer the following questions:
- How much is our service being used?
- What is the error rate our users are experiencing?
- Is our service responding to requests in a reasonable amount of time?
- Is our service fundamentally healthy?
This article will first examine the aspects that make for a highly effective dashboard. We will then discuss a series of tricks and techniques we consider when constructing our dashboard. With the theory out of the way, we will then examine a typical dashboard for a REST service that will serve you well as a base for your dashboards.
Designing a Highly Effective Dashboard
The best way to design and build a dashboard is to approach it like any other product design; start by thinking about the dashboard users. Who are they? What are they trying to achieve? How and when will they be using the dashboard?
The answer to the first question, who are the users, is easy: you and your teammates. Because of this, some amount of tribal knowledge can be assumed, and the use of abbreviations, technical jargon, and company-specific terminology is acceptable. It is best not to take this too far, but it is essential to remember that one of the driving goals is ease of use.
What are we trying to achieve with the dashboard? We have already covered this in the introduction: we want to know if our system is doing what we built it to do, and if not, why? With this in mind, we should organize and optimize the dashboard so that the most significant metrics can be quickly and easily seen and understood; this often means placing them at the top of the dashboard. Similarly, we should strive to reduce or remove any need to interpret the dashboard. We can do this by separating or filtering out expected events. We can also do this by providing comparison data; we will see more on this in a later section.
The last question is the most important: how will we be using the dashboard?
The most critical time we will use the dashboard is when our system has a problem, i.e., during an outage or incident.
During an incident, it is natural for folks to be panicky, distracted, and perhaps not thinking clearly. The military has a term for this; it's called the Fog of War.
The best way to address this problem is through preparation and simplicity. We prepare by constantly tweaking and improving the dashboard and by focusing on the clarity of the charts.
Additionally, we should treat every incident, however minor, as an opportunity and ask ourselves: what metric, chart, or other improvement would have made detecting and diagnosing the issue faster and more straightforward? Sometimes even simple changes like reordering the charts, adding a scale, or changing the colors can have a significant impact.
Tricks and Techniques
Historical Comparisons
One of the most valuable features of Datadog dashboards is the ability to plot the current value of a stat side-by-side with its value from the day, week, or month before.
For many of the services we build, the usage and, therefore, the metrics differ throughout the day. However, they tend to be consistent from one day to the next or between now and the same time last week. By charting both the current value and the value at the same time last week, we can quickly recognize disparities and, by extension, discover potential aberrations in how our system is performing or being used.
Consider the following queries-per-second (QPS) chart:
This chart is accurate and informative, but without prior experience, we cannot tell whether our QPS is good or bad. We could be running at 10% or 200% of last week's usage without any idea. By adding last week's usage as a dotted, smoothed line, the chart is significantly improved:
Now, at a glance, we can see that our usage pattern is consistent with last week and, by extension, typical.
The code required to generate this graph is:
{
  "viz": "timeseries",
  "requests": [
    {
      "q": "sum:my.service.api.count{$commit,$services,$environment}.as_rate()",
      "type": "line",
      "style": {
        "palette": "dog_classic",
        "type": "solid",
        "width": "normal"
      }
    },
    {
      "q": "autosmooth(week_before(sum:my.service.api.count{$commit,$services,$environment}.as_rate()))",
      "type": "line",
      "style": {
        "palette": "dog_classic",
        "type": "dotted",
        "width": "normal"
      },
      "on_right_yaxis": false
    }
  ],
  "yaxis": {
    "max": "auto",
    "scale": "linear",
    "min": "auto",
    "label": "",
    "includeZero": true
  },
  "markers": []
}
In the above code, you will notice the following:
- We have used the same metric my.service.api.count for both requests.
- We have applied week_before() to the second request, making it the historical line.
- We have applied autosmooth() to the historical line to convert the actual data into a trend line, reducing the noise on the chart.
- We have changed the historical request line style to dotted; this makes the historical line less visually significant and allows us to differentiate between the current and historical values.
A word of warning: historical comparisons are not always useful. For sparse data, like errors, there is often no value in comparing the current errors with the same time last week.
Timing (Average vs. p95 vs. p99 vs. Max)
In the previous article, we used the Duration() and Histogram() methods to record timing events. Consequently, we can chart multiple timing values for every stat we recorded. All histogram-based stats automatically calculate the min, max, average, median, p95, and sometimes a p99 value.
Before we go any further, we need to explain what p95 and p99 are. The p95 is the 95th percentile of all the values: if we were to sort all the values from smallest to largest, 95% of them would fall at or below the p95. For example, if we record 100 request durations, the p95 is the 95th smallest; only five requests took longer. The following is a handy visualization:
When plotting timing on a chart, I strongly recommend using the p95 (or p99) value, as this reflects what the vast majority of users are experiencing. We could use the maximum value, but it would be skewed by any random or aberrant request and lead us to believe the system performed poorly when there was only one odd value. Similarly, we should not use the average or the median: the average is also skewed by these odd values, and the median only tells us what half of our users are experiencing.
When charting timing, it is vital to ensure that we are comparing like values. For example, the response times of two different APIs are unlikely to relate to each other. Consider the following chart:
In this chart, we have plotted the response times of two different endpoints separately. It doesn't make much sense to compare the performance of these two endpoints with each other as they have completely different performance profiles. Instead, we have included a historical comparison for each endpoint. Here is the code for the above chart (with the historical comparison removed for brevity):
{
  "viz": "timeseries",
  "requests": [
    {
      "q": "max:my.service.api.95percentile{$commit,$host,$environment} by {endpoint}.fill(zero)/1000000",
      "type": "line",
      "style": {
        "palette": "dog_classic",
        "type": "solid",
        "width": "normal"
      }
    }
  ],
  "yaxis": {
    "max": "auto",
    "scale": "linear",
    "min": "auto",
    "label": "",
    "includeZero": true
  },
  "markers": []
}
In the above code, you will notice:
- We are using the max() function – this plots the biggest value for the chart period (i.e., the highest p95 for the minute or second, depending on the scale of the chart)
- To make the chart more relatable, we divide all the values by 1,000,000, converting the recorded values from nanoseconds to milliseconds.
- We have tagged the API timing values with an “endpoint” tag, which allows us to plot all the endpoints on one chart with only 1 line of code.
When charting timing for sparse data, a line chart will not produce a practical result; you will need to switch to a bar chart instead and forego any historical comparison.
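A minimal sketch of such a bar chart, reusing the my.service.api.95percentile metric from the code above; the only meaningful changes are the bars request type, the removal of the zero-fill, and, as noted, no historical comparison:
{
  "viz": "timeseries",
  "requests": [
    {
      "q": "max:my.service.api.95percentile{$commit,$host,$environment} by {endpoint}/1000000",
      "type": "bars",
      "style": {
        "palette": "dog_classic"
      }
    }
  ],
  "yaxis": {
    "max": "auto",
    "scale": "linear",
    "min": "auto",
    "label": "",
    "includeZero": true
  },
  "markers": []
}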
Rate vs. Count
When charting metrics, you need to decide if you want the data to be displayed as a rate (i.e., how many events per second) or as a count (i.e., the total number of events in the period).
In the following image, we are displaying the number of queries to our service in two very different ways:
On the left, we are using rate, and we can see we have roughly 2 queries per second. On the right, we are using count, and we can see that our total requests for each minute are roughly 120. Now, look at what happens in the chart on the right if we change the time period on the dashboard from 1 hour to 15 minutes:
The scale of our chart has changed, making it difficult for the reader to get an accurate appreciation of how much our service is being used. For this reason, I strongly recommend using rate for frequently occurring (aka non-sparse) data like QPS. Conversely, with infrequent (aka sparse) data, it is better to use count instead of rate. Consider the following image:
The chart on the left tends to be much more straightforward than the one on the right.
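In the Datadog query language, the choice comes down to the modifier at the end of the query. Reusing the my.service.api.count metric from earlier, the rate and count versions look like this:
sum:my.service.api.count{$commit,$services,$environment}.as_rate()
sum:my.service.api.count{$commit,$services,$environment}.as_count()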
One of the most common sources of sparse data is errors. With error charts, we should aim for a situation where a blank chart (i.e., no errors) is the norm, and anything on the chart needs investigating. This often means separating user errors from system errors. User errors are something over which we have only a slight influence.
Consider a situation where a user makes a couple of invalid requests to our service at 4 am. Given the early hour, our service will likely have minimal usage. As a result, these few bad requests could mean that a high percentage (perhaps all) of the requests to our system are errors. If we had an alert on this error rate, we would be woken up for something we cannot fix. Separating user and system errors and setting different alert thresholds can save us from unnecessary sleep interruptions.
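As a sketch, a system-error-only chart might look like the following. The my.service.api.error.count metric name and the status_family tag are assumptions; use whatever names your instrumentation from Part 1 emits. Because the data is sparse, it is charted as a count with bars:
{
  "viz": "timeseries",
  "requests": [
    {
      "q": "sum:my.service.api.error.count{$environment,status_family:5xx}.as_count()",
      "type": "bars",
      "style": {
        "palette": "dog_classic"
      }
    }
  ],
  "yaxis": {
    "min": "auto",
    "max": "auto",
    "scale": "linear",
    "includeZero": true
  },
  "markers": []
}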
Template Variables
The previous article discussed applying tags to our metrics to make charting easier. This is where we see the payoff of this approach. The following image shows a QPS chart that includes both templates and a historical comparison:
In the middle of the image, you can see $commit, $environment, and $host. These are template variables. We have configured these variables as follows:
- $commit allows us to filter the chart to only data from a particular version of the application.
- $environment allows us to filter between the production and staging environments.
- $host allows us to filter to a particular server.
To add template variables, we click the pen icon at the top of the dashboard and are presented with a form that looks like this:
With this form, we map each template variable name to a tag and set a default value. Using default values, as we have here, means the dashboard shows the production environment data as soon as it loads.
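The same mapping can also be expressed in the dashboard's JSON definition. A minimal sketch, assuming the environment, commit, and host tags from Part 1, with production as the default environment and * (no filter) for the others:
"template_variables": [
  { "name": "environment", "prefix": "environment", "default": "production" },
  { "name": "commit", "prefix": "commit", "default": "*" },
  { "name": "host", "prefix": "host", "default": "*" }
]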
Use of Tags
While tags are great, charting “by tag” sometimes adds complexity or obscures a chart. Consider the following two charts:
If we want to see the total usage of our database, the chart on the left, which does not use tags, is superior. However, if we want to see how we use the database, the chart on the right is better as it shows the same data split by type.
For perhaps a better example of how complex charts can get when using tags, consider the following:
With this chart, it is tough to see an individual metric. However, we can still use this chart to see sudden spikes and drops.
We could also filter this chart by applying a template variable. This provides a quick way to "drill down" into the data without creating many charts.
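As a sketch of the difference between the two styles of chart, only the by clause changes (the my.service.db.query.count metric name is an assumption; substitute your own). Because the scope still contains the template variables, selecting a value for $host or $environment filters both charts, which is the drill-down described above:
sum:my.service.db.query.count{$commit,$host,$environment}.as_rate()
sum:my.service.db.query.count{$commit,$host,$environment} by {type}.as_rate()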
Using Formulas
Sometimes our data needs a little help in the UX department. This is never more true than when recording duration values in Go: durations are recorded in nanoseconds, which is not particularly human-friendly. A much better experience can be achieved by converting this data from nanoseconds into milliseconds or seconds. This is where formulas come in.
You might have noticed in the previous image that we are charting database query timing in milliseconds. The following image shows how to apply a formula to achieve this conversion:
As you can see, we are dividing the raw value (nanoseconds) by 1,000,000 to convert it into milliseconds.
Another typical use of formulas is to calculate percentages. In the following image, you can see that we are calculating the system error rate of our API:
In this case, we compare the total number of requests with those that did not result in either success (200) or a user error (4xx) and then multiply the result by 100 to convert it to a percentage.
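A sketch of that system error rate as a single classic graph query, reusing the my.service.api.count metric from earlier. The status_family tag (with 2xx and 4xx values) is an assumption; use whatever status tag your instrumentation emits. The query is wrapped across lines for readability; in the graph editor it is entered as one line:
(
  sum:my.service.api.count{$environment}.as_count()
  - sum:my.service.api.count{$environment,status_family:2xx}.as_count()
  - sum:my.service.api.count{$environment,status_family:4xx}.as_count()
) / sum:my.service.api.count{$environment}.as_count() * 100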
Convert Relationships into Percentages
The trick of converting values to percentages that we used for the error rate can be applied in other compelling ways. Take, for example, the following chart:
In this chart, we are comparing our current API usage with that at the same time last week. This chart is exceedingly handy as it quickly informs the reader of our usage level without them needing to have the context of what usage is “normally” like at this time.
These sorts of comparisons provide a powerful signal as to the state of the system. We can achieve this chart using the following configuration:
As you can see, this time we are subtracting last week's usage from this week's and dividing the result by last week's usage. After multiplying by 100, we get the percentage increase or decrease in usage compared to last week.
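The underlying query is a sketch along these lines, reusing the my.service.api.count metric and the week_before() function from earlier (wrapped across lines for readability; in the graph editor it is entered as one line):
(
  sum:my.service.api.count{$environment}.as_count()
  - week_before(sum:my.service.api.count{$environment}.as_count())
) / week_before(sum:my.service.api.count{$environment}.as_count()) * 100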
Add Sensible Markers and Y-Axis Controls
As with many of these tricks and tips, it is important to remember that our goal is usability: the ability to see "at a glance" that something is significant or out of place. For this tip, I recommend adding markers (the dotted horizontal lines in the previous chart) to charts wherever they improve readability. Consider the following pair of charts:
These charts display the same information, but the chart on the left does not have the Y-Axis set to a range of 0 to 100 and has no markers added.
When glancing at the chart on the left, you could become concerned about a sudden and significant spike. However, with the scale set, we can see that while there was a spike, it did not come close to exceeding our capacity.
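In the graph JSON, the fixed scale and the marker live in the yaxis and markers fields. A minimal sketch, assuming an 80% capacity threshold; the display_type value shown is one of Datadog's preset marker styles:
"yaxis": {
  "min": "0",
  "max": "100",
  "scale": "linear",
  "label": "",
  "includeZero": true
},
"markers": [
  {
    "value": "y = 80",
    "display_type": "error dashed",
    "label": "capacity"
  }
]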
A Typical REST Service Dashboard
Now that we have discussed a series of techniques we should use to build a dashboard, let’s take a quick look at the components of a REST service dashboard.
I like to organize my dashboard into groups of metrics in a very particular order. I ensure that the most important (read impactful) metrics are at the top and the charts we need for debugging are further down.
My six metric groups are:
- Key Business Metrics (KBM)
- Other Metrics
- External Dependencies
- Data Store Usage
- Infrastructure
- Data Store Infrastructure
Let’s dig into each of these.
KBM
In this group, our primary goal is to see "at a glance" the overall state of the system and the quality of the service we are providing. We are intentionally limiting the number of charts in this section to as few as possible to limit distractions. For most of these metrics, we will want to apply historical comparisons so that the user can quickly see what the metrics should be and how they might be wrong.
Here are the standard KBM charts for a REST service:
The source for these charts is here.
The following is a quick explanation of these charts:
- QPS – The queries per second or usage of the APIs that make up this service
- API Usage (by Endpoint) – This expands on the QPS chart by showing the usage separated by endpoint.
- API Timing (by Endpoint) – Shows the p95 duration by endpoint.
- API System Error Rate – This shows the percentage of API requests that result in system-only errors.
- API Error (by Endpoint) – Breaks down the system errors by endpoint.
- Service Events – Displays significant events (like system start and stop).
- API Usage % (week/week) – This compares the current usage of our service with the usage at the same time last week.
Other Metrics
In this group, we will include any business-related metrics our system produces that are not significant enough to be considered KBMs. Charts in this section will be particular to the system itself and its environment.
I can offer a few examples of charts commonly found in this group. The first example is a chart of user errors. In the KBMs, we intentionally filtered out user errors and only monitored system-generated errors.
While it is often impossible for us to fix all user errors, and they only sometimes indicate that something in our system needs fixing, we can often do things that either cause these errors or could alleviate them.
The second example is a usage chart for another REST service. Often, our REST service is one of many services that exist to serve our customers. Therefore, it can be helpful to add a chart from another service to our dashboard so we can see whether a sudden drop or spike in our usage also happened to other services. When this occurs, it is a strong signal that the problem lies outside our service (like a network-related issue).
The final example is charts showing errors generated by the inner modules of our system. In the KBM group, we charted the errors generated by the API package. These errors can have many causes; the inner-module error charts help us debug the cause of the errors we saw emitted by the API.
External Dependencies
In this group, we monitor our usage of our dependencies and the quality of service we receive from them. The performance of an external dependency can directly impact our own performance: any sudden spike in latency or errors will likely degrade our quality of service.
Charts in this section are, of course, heavily dependent on the dependencies of the service and environment. However, it is common to use the same quality-of-service charts that we used for our KBMs. For all dependencies, we should have charts for Usage and Latency, as you can see in the following image:
As you can see in the requests chart, we compare the current usage with a smoothed version from the previous week. With this comparison, we are looking for changes in usage that do not correlate with similar changes in our QPS. The usage chart is also helpful in displaying sudden spikes or drops in usage, which likely indicate an issue with either the dependency or our connection to it.
You will also notice that we are running the same week-on-week comparison on the latency chart. This comparison is primarily used to inform us of the typical latency, but it can also indicate other potential problems like degradation in performance.
Also, similar to the usage chart, sudden spikes in the latency chart can indicate a dependency or network issue. Note that network issues will often cause multiple, perhaps all, dependency latencies to spike simultaneously.
We can also include other charts in this section, such as resiliency events (like retries or the circuit breaker opening) and unexpected responses (like 4xx or 5xx HTTP response codes). For charts like these, we should aim for them to be empty most of the time; when there are any values, we should investigate.
Data Store Usage
Databases and caches (like Redis) are external dependencies and could have been included in the previous section, but given their importance and the fact that they are under our control, I recommend charting them separately. As with external dependencies, any latency issues or errors with our data stores are extremely likely to cause issues with our KBMs.
Typical charts in this section include:
- DB Query Usage Rate – This will show the current QPS against the database and a week-on-week comparison. The comparison gives us an indication of what is “normal”. Sudden spikes or drops will likely indicate a problem with the DB or our code.
- DB Query Usage Rate by Type – This chart is a more detailed version of the previous one. It is often jam-packed with information and hard to read, but it can provide more information about the type of usage spikes or drops when such issues occur.
- DB Timing by Type – Like the previous chart, this one is likely very dense and hard to read at a glance. However, different DB operations take different amounts of time, so a summarized version has limited value. This chart’s value is in tracking database contention or lock issues through noticeable spikes or increases in latency.
- DB Errors – Like many of our error charts, this one should be optimized to be empty most of the time, and any data in this chart should be investigated and fixed.
- Redis Usage Rate – This chart has the same goals, usage, and issues as the DB Query Usage chart.
- Redis Timing – This chart has the same goals, usage, and issues as the DB Timing chart; however, splitting this chart by type is often difficult. Typically, this is less of a problem as Redis latency is more consistent between the different use cases.
- Redis Errors – This chart has the same goals, usage, and issues as the DB Errors chart.
Infra
When it comes to infrastructure, there are likely way more metrics available than we could ever find a need for. For this reason, it is advisable to only focus on those that will directly impact the quality of service we can provide. In most cases, these will be the resources on our servers, such as the CPU, memory, and hard disk.
We should use infra metrics in two main ways. Firstly, to help diagnose or explain unusual behavior in our other metrics, particularly the KBMs.
In this case, we are looking for sudden changes corresponding to the event we are investigating.
Secondly, we should use these metrics to look for trends that might result in future issues. For example, CPU usage reaching 80% during our busiest period is unlikely to be an immediate problem, but as our usage grows, that 80% will likely grow with it. In this case, we should attempt to get ahead of the potential problem and address the CPU issue through scaling or improvements in our CPU utilization.
Given that we are monitoring finite resources, all charts should plot the max value for a period rather than averages or medians.
Typical charts in this section include:
- CPU Usage % – This shows the CPU usage as a percentage of the available CPU. This chart should be displayed per host as averages have less meaning, especially when we have a single abnormal process or an uneven load distribution. We should also include an averaged and smoothed comparison with the same time last week to have a simple reference of "normal" usage. A sketch of this chart follows this list.
- Memory Usage % – This shows the memory usage as a percentage of the total available memory. Similar to the CPU, this should be displayed per host to see abnormalities. We should look for a slow upward-only trend that might indicate a memory leak. We should also include an averaged and smoothed comparison with the same time last week to have a simple reference for "normal" usage.
- Instance count – This chart is only helpful in an environment with auto-scaling and shows the number of instances currently in use. We should also include a comparison of the same time last week as a baseline.
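As mentioned above, here is a sketch of the per-host CPU Usage % chart. It assumes the standard Datadog agent metric system.cpu.user (user-space CPU only; you may prefer to also include system.cpu.system or chart 100 minus system.cpu.idle), plotted per host with a smoothed week-before overlay:
{
  "viz": "timeseries",
  "requests": [
    {
      "q": "max:system.cpu.user{$environment} by {host}",
      "type": "line",
      "style": {
        "palette": "dog_classic",
        "type": "solid",
        "width": "normal"
      }
    },
    {
      "q": "autosmooth(week_before(avg:system.cpu.user{$environment}))",
      "type": "line",
      "style": {
        "palette": "dog_classic",
        "type": "dotted",
        "width": "normal"
      }
    }
  ],
  "yaxis": {
    "min": "0",
    "max": "100",
    "scale": "linear",
    "includeZero": true
  },
  "markers": []
}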
Other charts you might want to consider include:
- Network I/O – For services that ingest or publish large amounts of data, a network I/O chart can be helpful to monitor cases where usage exceeds the host’s capabilities.
  - Typically you will want a week-on-week comparison to provide context.
- Disk I/O / IOPS – For services that use the HDD or produce a significant amount of logs, it might be helpful to monitor HDD usage.
  - This is especially true in cloud environments where resources are limited or chargeable.
  - Typically you will want a week-on-week comparison to provide context.
- Disk throttling, CPU steal, and Firewall Dropped Packets – Charts like these can help diagnose cases where usage exceeds provisioned capacity.
  - Typically week-on-week comparisons are unnecessary as our goal is to ensure these events seldom or never occur.
Data Store Infra
When monitoring the data store infrastructure, all of the points in the Infra section (above) apply. However, some of these metrics take on a different significance when it comes to the performance of our data stores, and there are also some additional metrics that we need to monitor.
For databases, we want to pay closer attention to hard disk usage and performance than we might for application servers. Database server performance is often bounded by the capability and capacity of the HDD. We should be looking for situations where our usage is approaching the maximum capacity and trying to get ahead of the issue before it creates a problem.
For services like Redis, which store all their data in memory, monitoring memory use is critical.
Typical charts in this section include:
- CPU Usage %, Memory Usage % – These should be very similar, if not identical, to the charts in the Infra section
- Storage Usage % – This shows the current HDD usage compared with the total available. Displaying this as a percentage allows us to see that we are running out of HDD space before it happens (and at what rate we are consuming the disk). A sketch of this chart follows this list.
- Connection Counts – For databases and caches (e.g., Redis), a high number of connections can result in low performance due to thread thrashing and contention. Monitoring these values over the day, with comparisons to last week, allows us to identify potential issues, especially those caused by bad deployments.
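As mentioned above, here is a sketch of the Storage Usage % chart. It assumes the Datadog agent's system.disk.in_use metric (a fraction between 0 and 1) is reported from the database hosts; the 90% marker threshold is an arbitrary example:
{
  "viz": "timeseries",
  "requests": [
    {
      "q": "max:system.disk.in_use{$environment} by {host} * 100",
      "type": "line",
      "style": {
        "palette": "dog_classic",
        "type": "solid",
        "width": "normal"
      }
    }
  ],
  "yaxis": {
    "min": "0",
    "max": "100",
    "scale": "linear",
    "includeZero": true
  },
  "markers": [
    {
      "value": "y = 90",
      "display_type": "warning dashed",
      "label": "90% full"
    }
  ]
}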
Other charts you might want to consider include:
- Replication Lag – This chart or value tracks the DB replication lag.
- Cache Hit % – With this chart, we are tracking the effectiveness of our cache.
Summary
Seeing as you got this far, congratulations. This topic seems simple from the outside, but there are a lot of nuances. The effectiveness of the dashboard can vary significantly based on the effort and skill we put into it.
The best thing you can do to improve your dashboard’s effectiveness is to constantly tweak it.
Continue to ask yourself:
- What information am I missing?
- How can I present this data to make it easy to understand at a glance?
- After an incident, ask yourself, what chart would have made the issue obvious?
Action Items
The following are the steps that you should take (if you haven’t already) to get started with dashboards:
- Create a dashboard
- Add the template variables – typically environment, commit and host.
- Add the sections – You can copy those from this document and add your own as needed.
- Add the typical charts into each of the sections.
  - In some cases, you may need help from your infrastructure team. Sometimes the metrics you want are not synchronized into Datadog or are not tagged to your service correctly.
  - This is often true with infrastructure metrics; you will want to ensure they are tagged to your service for easy filtering.
- Add a calendar event (once every 2 weeks or so) to review and tweak the dashboard.
If you like this content and would like to be notified when there are new posts, or would like to be kept informed regarding the upcoming book launch, please join my Google Group (very low traffic and no spam).