At TrueLayer, we operate a platform that allows developers to securely access their customers' financial data and initiate payments. Many businesses therefore depend on our reliability for their own products to function.
We follow engineering best practices like testing, pull requests, and autoscaling to ensure that changes we make do not break our services and that we are able to handle the traffic coming into our systems. However, in the real world machines fail, and things break.
We have an on-call team ready to respond to incidents at any time. As we scale, we focus on continuous improvement to make our platform even more reliable for our customers, and to make our on-call engineers' lives easier. In this post, we will share some of our lessons, highlight the importance of dashboard design, and give some suggestions on how to do it well.
It’s 3am and you are woken from your sleep by a phone call. Who could be calling at such a time? You reach for your phone and realise that the call is from PagerDuty, the system we use to notify our on-call team 🔥. An emotionless computer-generated voice tells you that a core service is misbehaving with a high error rate and clients are being affected, so it’s time to start investigating. 🔎
Still dazed and confused, you fetch your laptop and open up the service’s dashboard. You are faced with a plethora of graphs and tables telling you everything except what you want to know: how healthy the service is. You let out a groan; you still aren’t sure which century it is, and now you’re going to have to pull this information together yourself.
Fast forward 30 minutes, and the sleep fog has mostly cleared. Getting the data you needed involved adding some new graphs to the dashboard, going through logs in Kibana, and running commands in Kubernetes. You realise that the service wasn’t scaling properly, so it couldn’t handle the traffic it was receiving. You fix the issue by reconfiguring the autoscaler, and the error rate drops back to normal. After adding a note to the incident to explain what went wrong, you head back to sleep.
Going from being asleep one minute, to debugging production issues the next is not easy. Responding to incidents means investigating issues under pressure, sometimes involving unfamiliar services.
The scenario above was far from ideal. In this case, a better-designed dashboard would have allowed the engineer to gain situational awareness faster, lessening the time and effort required to resolve the incident. Below we will share what we have learned creating dashboards that provide context to a problem as quickly as possible.
To design great dashboards you need to put yourself into the shoes of an engineer responding to an incident. Let’s go through the various stages of responding to an alert.
When attempting to diagnose a problem, it helps to first understand how the problem could manifest itself. A doctor will ask what symptoms you are experiencing, but how do we ask a service how it is feeling? We can do this by understanding how the problem is affecting our customers/other services. The most common ways of understanding the health of a service are as follows:
What percentage of requests made to the service are being successfully served?
This is possibly the most basic metric to measure, and it gives you an indication of how badly users are affected: is the service failing half of the time, or every time?
How long the service is taking to respond to requests, broken down by percentiles.
A service could be returning successful responses 100% of the time, but if it is taking 15 seconds to do so your product is unusable for customers.
How much load the service is under (usually measured by requests per second).
This can help you understand why your service is struggling — someone famous may have tweeted about your product leading to millions of signups overnight.
These three main indicators should sit at the top of your dashboards, so that engineers can orient themselves quickly.
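To make the three indicators concrete, here is a minimal sketch of how they could be computed from raw request records. The data shape (status code, duration in milliseconds) and the time window are assumptions for illustration; in practice these numbers would come from your metrics system rather than code like this.

```python
from statistics import quantiles

# Hypothetical request records: (status_code, duration_ms). Illustrative only.
requests = [(200, 45), (200, 60), (500, 30), (200, 120), (200, 52),
            (200, 48), (503, 2000), (200, 70), (200, 55), (200, 90)]
window_seconds = 10  # assumed length of the observation window

# 1. Success rate: share of requests with a 2xx status.
successes = sum(1 for status, _ in requests if 200 <= status < 300)
success_rate = 100 * successes / len(requests)

# 2. Latency percentiles: p50 and p95 of response times.
durations = sorted(d for _, d in requests)
cuts = quantiles(durations, n=100, method="inclusive")
p50, p95 = cuts[49], cuts[94]

# 3. Traffic: requests per second over the window.
rps = len(requests) / window_seconds

print(f"success rate: {success_rate:.0f}%")
print(f"p50: {p50:.0f} ms, p95: {p95:.0f} ms")
print(f"traffic: {rps:.1f} req/s")
```

Note how the single slow failure (2000 ms) barely moves the median but dominates the p95, which is exactly why latency should be broken down by percentiles rather than averaged.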
Now that we have a basic understanding of how the issue is presenting, we can begin asking follow up questions. Similarly, a doctor might ask you how your sleep has been, or if you have been under any stress lately. This is where we can put data that may explain the negative symptoms.
This is by no means an exhaustive list, but just a few suggestions for metrics to answer these basic questions:
How many replicas of your service are running.
A low replica count may explain why the error rate or latency is high: the remaining instances may be unable to handle the load they are under.
The amount of CPU/Memory/Disk Space being used by the service.
High CPU usage, for example, can leave the service unable to respond to requests, driving up both latency and the error rate.
The information above is definitely important, but resist the urge to feature it at the forefront of your dashboards.
At this stage, the engineer has used a combination of symptoms and supporting information to think of some possible causes of the issue. Just like a doctor will try to eliminate/prove a cause by asking for tests to be conducted, the engineer will look to more detailed sources of information.
It is important not to try to include all of this information in dashboards, as this leads to clutter and more cognitive overload for the engineer. Sometimes it’s better to rely on other systems that are better suited to answer those questions.
A list of recent deployments for each service (at TrueLayer, we post these into a Slack channel).
Database & Queue Metrics
Health metrics from databases/queues used by the service.
Tracking a request across different services — we use Jaeger.
Logs related to the service — we check these in Kibana.
Service Specific Metrics
Metrics that are specific to the service and useful for debugging can be included further down the dashboard.
Sometimes a service can struggle because the services it depends on are unhealthy. It’s helpful in those cases to examine the dashboards of those other services.
Use descriptive titles — A clear and concise name on every graph helps engineers find the metrics they are looking for.
Specify units of measurement — Every graph/table in your dashboard should have a unit of measurement specified, otherwise the numbers can be difficult to interpret (e.g. requests per second, error rate as a percentage, memory usage in GB).
Allow for “zooming in” — Some dashboard systems such as Grafana let you include dropdown menus in your dashboard that allow the engineer to view metrics related to a single instance or region. This can be very powerful, letting the engineer quickly identify issues only impacting one region or instance of a service.
Responding to incidents isn’t easy, so it’s critical that engineers have well-designed dashboards to help restore service as quickly as possible. Remember that less is more: keep the most critical information at the top of the dashboard, with more service-specific metrics at the bottom.
Following these guidelines has improved our customer experience, and we hope you find them useful.
If you have ever solved these kinds of problems before or would like to, we would love to hear from you. Check out our openings. We are hiring!