After the launch of PayDirect, we learned a lot about how customers see and use our instant payment product. By listening to feedback, we found that we weren't tracking something they really cared about: how “instant” our instant payments actually are.
This article walks through how we introduced a new service level indicator, in this case for payment settlement time. It also gives an overview of our observability stack and philosophy.
Quantifying what good looks like
Is the service up?
Are we responding to HTTP requests in a reasonable amount of time?
We want an answer to these questions in real time, and if the answer is alarming (for example, the service is down), we want to send an alert to the relevant engineering team.
Service level indicators (SLIs)
We start by defining and measuring a set of service level indicators (SLIs) – a quantifiable representation of the product properties that users care about.
Usually, we start by tracking a few standard metrics. These include availability, error rate, throughput, latency percentiles and more.
Service level objectives (SLOs)
Measuring is not enough. Is it good enough if your availability is 90.5%? What about 99.9%?
For each SLI, you want to define a threshold and a time period. For example, 99.9% of incoming requests in the last seven days were handled successfully. That is a service level objective (SLO).
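To make the relationship between an SLI and an SLO concrete, here is a minimal sketch in Python. The `Request` record and the function names are hypothetical and chosen only for illustration. The 99.9% target and seven-day window come from the example above. Treating a window with no traffic as compliant is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    timestamp: datetime
    success: bool


def availability(requests, now, window):
    """SLI: fraction of requests inside the window that were handled successfully."""
    recent = [r for r in requests if now - window <= r.timestamp <= now]
    if not recent:
        return None  # no traffic in the window, so the SLI is undefined
    return sum(r.success for r in recent) / len(recent)


def meets_slo(requests, now, target=0.999, window=timedelta(days=7)):
    """SLO: the SLI must reach the target over the window.

    Assumption: a window without any traffic counts as meeting the SLO.
    """
    sli = availability(requests, now, window)
    return sli is None or sli >= target
```

The SLI is the measurement (`availability`), while the SLO is the measurement plus a threshold and a time period (`meets_slo`).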
Defining (and being accountable for!) your SLOs might be scary, but baking those SLOs into the design of your services can also be the most enjoyable part of the job! Plus, operating a service that is always available under heavy load is extremely satisfying.
Service level agreements (SLAs)
Saying that your service is “always” up and running might be too much, but it should at least satisfy the service level agreements (SLAs), which are basically SLOs written into the contract you signed with your customers. Failing to meet your SLAs can lead to unpleasant financial consequences.
Let’s get back to our problem. We started to receive questions from our clients that were variations of the following:
Can you check why the transaction with id 904c09e2-5a55-4ed6-a169-f8dc886c8d67 is not settled yet?
Why are those transactions taking so long?
It’s a bad sign when clients spot a problem before you do. That’s when we realised that we were missing a key SLI: transaction settlement time.
The rise of a new SLI
Why are payments not instant?
Moving money is a complicated business. In fact, each payment goes through a multi-status lifecycle:
Booked: TrueLayer checks that there are enough funds in the account
Submitted: TrueLayer instructs the payment scheme to process the payment
Settled: the payment scheme notifies us that money has landed in the beneficiary account
We were already monitoring the success of each state change in isolation, that is, the percentage of messages handled successfully and how long it took to process each of them. But we weren’t monitoring the payment lifecycle as a whole: how long it took to move from booked (the payment is authorised) to settled (the money has landed in the bank account of the beneficiary).
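The difference between per-transition metrics and the lifecycle metric can be sketched as follows. This is an illustration only: the dictionary shape (status name mapped to the time the payment entered that status) and the function name are assumptions, not the payment ledger's actual data model.

```python
from datetime import datetime, timedelta


def settlement_time(transitions):
    """Time elapsed between 'booked' and 'settled' for one payment.

    `transitions` maps a status name to the datetime at which the payment
    entered that status (hypothetical shape). Returns None while the payment
    has not completed the whole lifecycle.
    """
    booked = transitions.get("booked")
    settled = transitions.get("settled")
    if booked is None or settled is None:
        return None
    return settled - booked


# Usage: each step is fast on its own, but the lifecycle took four seconds.
payment = {
    "booked": datetime(2021, 11, 1, 10, 0, 0),
    "submitted": datetime(2021, 11, 1, 10, 0, 1),
    "settled": datetime(2021, 11, 1, 10, 0, 4),
}
elapsed = settlement_time(payment)  # timedelta(seconds=4)
```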
It was time to fix it.
Tracking the new SLI
We use Prometheus to scrape and query realtime metrics exposed by our services.
In a nutshell, Prometheus periodically sends a GET request to the /metrics HTTP endpoint that all our microservices expose and stores the results, which lets us query the collected metrics. We use these metrics both to notify teams when an alert is triggered and to power our dashboards in Grafana.
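For readers unfamiliar with what a /metrics endpoint returns, the sketch below renders one metric in the Prometheus text exposition format using only the Python standard library. The metric name `http_requests_total`, its label, and the `render_metric` helper are illustrative assumptions. A real service would normally rely on a Prometheus client library rather than hand-rolling this.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels, value) pairs, where `labels` is a dict
    that may be empty.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            # Labels are rendered as key="value" pairs inside curly braces.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Usage: the body a scrape of /metrics might return for a single counter.
body = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"status": "200"}, 1027)],
)
```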
All the data required to compute the transaction settlement time is stored in our payment ledger.
The simplest way to start tracking the new SLI would have been to add a new metric to its /metrics endpoint. However, that isn’t the approach we chose.
The payment ledger is one of the most critical components in our entire stack. It's a fairly complex project on its own and, to keep it maintainable, we try to limit its scope as much as possible.
There was also a performance concern. We didn’t want to increase the load on the database that is used as the source of truth for payment statuses. Computing the transaction settlement time every 15 seconds would have impacted our maximum throughput (#payments/s), which is one of the key SLIs for the payment ledger.
There is a way forward, though. Our payment ledger emits an event every time a payment is created or its status changes. We decided to use those events as the integration point for tracking the new SLI.
The rise of a new microservice
We created a new microservice called transaction-settlement-tracker. It consumes the payment ledger events and maintains a view over payment state transitions in a separate Postgres database.
The view does not store all information about a payment. It just keeps track of statuses and how long each transition took. This helps to keep the microservice focused and easy to maintain.
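The idea of an event-driven view can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the event shape (`payment_id`, `status`, `occurred_at`), the table layout, and the use of an in-memory SQLite database as a stand-in for Postgres so that the example is self-contained.

```python
import sqlite3
from datetime import datetime

# Hypothetical schema: one row per (payment, status) transition and nothing else.
SCHEMA = """
CREATE TABLE IF NOT EXISTS payment_transitions (
    payment_id  TEXT NOT NULL,
    status      TEXT NOT NULL,
    occurred_at TEXT NOT NULL,
    PRIMARY KEY (payment_id, status)
)
"""


def open_view():
    """Open an in-memory database (stand-in for Postgres) and create the view table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def handle_event(conn, event):
    """Record one status change. A redelivered event for the same status is ignored.

    INSERT OR IGNORE is SQLite syntax; in Postgres the equivalent is
    INSERT ... ON CONFLICT DO NOTHING.
    """
    conn.execute(
        "INSERT OR IGNORE INTO payment_transitions VALUES (?, ?, ?)",
        (event["payment_id"], event["status"], event["occurred_at"]),
    )
    conn.commit()


def settlement_seconds(conn, payment_id):
    """Seconds from 'booked' to 'settled', or None if the payment is not settled yet."""
    rows = dict(
        conn.execute(
            "SELECT status, occurred_at FROM payment_transitions WHERE payment_id = ?",
            (payment_id,),
        ).fetchall()
    )
    if "booked" not in rows or "settled" not in rows:
        return None
    booked = datetime.fromisoformat(rows["booked"])
    settled = datetime.fromisoformat(rows["settled"])
    return (settled - booked).total_seconds()
```

Because the view stores only statuses and timestamps, the consumer never needs to query the ledger's own database.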
We later realised, looking at our view, that we could expose another key metric we were missing: the total number of in-flight transactions (transactions that are not settled yet).
The in-flight transactions metric has proven extremely helpful during our periodic load tests: it shows whether the payment ledger is struggling and how quickly it recovers after a massive request spike.
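Counting in-flight transactions from the same stream of status changes could look like the sketch below. The `InFlightTracker` class is hypothetical, and it assumes that events for a given payment arrive in order. A production consumer would also have to cope with late or duplicated events.

```python
class InFlightTracker:
    """Tracks payments that have been created but are not settled yet."""

    def __init__(self):
        self._in_flight = set()

    def apply(self, payment_id, status):
        """Update the set of in-flight payments with one status change.

        Assumption: events for a payment arrive in lifecycle order, so no
        earlier status is seen after 'settled'.
        """
        if status == "settled":
            self._in_flight.discard(payment_id)
        else:
            self._in_flight.add(payment_id)

    @property
    def count(self):
        """Current number of in-flight payments, suitable for exposing as a gauge."""
        return len(self._in_flight)


# Usage: two payments are booked, then one of them settles.
tracker = InFlightTracker()
tracker.apply("p1", "booked")
tracker.apply("p2", "booked")
tracker.apply("p1", "submitted")
tracker.apply("p1", "settled")
```

During a load test, a count that keeps growing suggests payments are entering the lifecycle faster than they settle, and the time it takes to fall back shows how quickly the system recovers.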
We’ve now been monitoring transaction settlement time for over two months and we have a precise understanding of what normal and abnormal look like.
In the next few weeks, we will roll out alerts to page our on-call team when the metric fails to meet the threshold we have identified.
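As an illustration of the kind of check such an alert might perform, the sketch below compares a percentile of recent settlement times against a threshold. The choice of the 95th percentile, the nearest-rank method, and both function names are assumptions for the example. The actual threshold and alerting rule are not specified here.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def should_page(settlement_seconds, threshold_seconds, p=95):
    """True when the chosen percentile of settlement times exceeds the threshold.

    Assumption: with no settled payments in the window there is nothing to alert on.
    """
    if not settlement_seconds:
        return False
    return percentile(settlement_seconds, p) > threshold_seconds
```

Alerting on a percentile rather than on the mean keeps a handful of unusually slow payments from being hidden by a large number of fast ones.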
In 2022, we’ll be able to proactively notify our customers of potential issues around settlement delays before they reach out to us!