How to keep track of what your customers really care about
We created a new service level indicator to track payment settlement – here's how.
After the launch of PayDirect, we learned a lot about how customers see and use our instant payment product. By listening to feedback, we found that we weren't tracking something they really cared about: how “instant” our instant payments actually are.
This article will walk you through the process of introducing a new service level indicator: in this case, for payment settlement time. We will also provide an overview of our observability stack and philosophy.
Quantifying what good looks like
- Is the service up?
- Are we responding to HTTP requests in a reasonable amount of time?
- Can you check why the transaction with id 904c09e2-5a55-4ed6-a169-f8dc886c8d67 is not settled yet?
- Why are those transactions taking so long?
The rise of a new SLI
- Why are payments not instant?
- Booked: TrueLayer checks that there are enough funds in the account
- Submitted: TrueLayer instructs the payment scheme to process the payment
- Settled: the payment scheme notifies us that money has landed in the beneficiary account
We were already monitoring the success of each state change in isolation, i.e. the percentage of messages handled successfully and how long it took to process each of them. But we weren’t monitoring the payment lifecycle as a whole: how long it took to move from booked (the payment is authorised) to settled (the money has landed in the bank account of the beneficiary).
It was time to fix it.
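To make the quantity concrete, here is a minimal Python sketch of the settlement time for a single payment: the elapsed time between the booked and settled state changes. The function name and the shape of the input (a mapping from status name to timestamp) are assumptions made for illustration, not our production code.

```python
from datetime import datetime
from typing import Dict, Optional


def settlement_seconds(transitions: Dict[str, datetime]) -> Optional[float]:
    """Return the seconds elapsed between 'booked' and 'settled'.

    `transitions` maps a status name to the time the payment entered it
    (hypothetical shape). Returns None while the payment is not settled.
    """
    booked = transitions.get("booked")
    settled = transitions.get("settled")
    if booked is None or settled is None:
        return None
    return (settled - booked).total_seconds()
```

A payment that has been booked and submitted but not yet settled yields None, so it does not contribute to the indicator until the scheme confirms settlement.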
Tracking the new SLI
We use Prometheus to scrape and query real-time metrics exposed by our services. In a nutshell, Prometheus sends a GET request to the /metrics HTTP endpoint, which all our microservices expose. This allows us to query the collected metrics. These metrics are used both to notify teams when an alert is triggered and to power our dashboards in Grafana.
All the data required to compute the transaction settlement time is stored in our payment ledger.
The simplest way to start tracking the new SLI would have been to add a new metric to the payment ledger's /metrics endpoint. However, that isn’t the approach we chose.
The payment ledger is one of the most critical components in our entire stack. It's a fairly complex project on its own and, to keep it maintainable, we try to limit its scope as much as possible.
There was also a performance concern. We didn’t want to increase the load on the database that is used as the source of truth for payment statuses. Computing the transaction settlement time every 15 seconds would have impacted our maximum throughput (payments per second), which is one of the key SLIs for the payment ledger.
There is a way forward, though. Our payment ledger emits an event every time a payment is created or its status changes. We decided to use those events as the integration point for tracking the new SLI.
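For readers unfamiliar with Prometheus, a /metrics endpoint returns plain text in the Prometheus exposition format. The sketch below renders a settlement-time histogram in that format using only the Python standard library. The metric name, help text and bucket boundaries are hypothetical, and a real service would normally rely on a Prometheus client library rather than formatting the text by hand.

```python
from typing import Iterable, List


def render_histogram(
    name: str,
    help_text: str,
    buckets: Iterable[float],
    observations: List[float],
) -> str:
    """Render observations as a Prometheus histogram in text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} histogram"]
    # Histogram buckets are cumulative: each one counts every
    # observation less than or equal to its upper bound.
    for upper in sorted(buckets):
        count = sum(1 for value in observations if value <= upper)
        lines.append(f'{name}_bucket{{le="{upper}"}} {count}')
    # The +Inf bucket always equals the total number of observations.
    lines.append(f'{name}_bucket{{le="+Inf"}} {len(observations)}')
    lines.append(f"{name}_sum {sum(observations)}")
    lines.append(f"{name}_count {len(observations)}")
    return "\n".join(lines) + "\n"
```

Calling `render_histogram("settlement_seconds", "Time from booked to settled", [1.0, 5.0], [0.5, 2.0, 12.0])` produces the text that Prometheus would scrape, from which percentiles of settlement time can then be estimated at query time.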
The rise of a new microservice
We created a new microservice called transaction-settlement-tracker. It consumes the payment ledger events and maintains a view over payment state transitions in a separate Postgres database.
The view does not store all information about a payment. It just keeps track of statuses and how long each transition took. This helps to keep the microservice focused and easy to maintain.
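The idea of the view can be illustrated with a small in-memory sketch. It assumes each ledger event carries a payment id, the new status and a timestamp; these field names, the class names and the use of a dictionary in place of Postgres are all assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict


@dataclass
class PaymentView:
    """What the view keeps per payment: current status and transition timings."""
    status: str
    last_changed_at: datetime
    # Seconds spent in each transition, keyed as "from->to".
    durations: Dict[str, float] = field(default_factory=dict)


class SettlementTracker:
    """In-memory stand-in for the view over payment state transitions."""

    def __init__(self) -> None:
        self.view: Dict[str, PaymentView] = {}

    def handle(self, payment_id: str, status: str, occurred_at: datetime) -> None:
        """Apply one status-change event (hypothetical event shape)."""
        current = self.view.get(payment_id)
        if current is None:
            # First event seen for this payment: start tracking it.
            self.view[payment_id] = PaymentView(status, occurred_at)
            return
        elapsed = (occurred_at - current.last_changed_at).total_seconds()
        current.durations[f"{current.status}->{status}"] = elapsed
        current.status = status
        current.last_changed_at = occurred_at
```

The sketch assumes events arrive in order and exactly once. A real consumer would also need to handle duplicates and out-of-order delivery, which this example deliberately leaves out.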
We later realised, looking at our view, that we could expose another key metric we were missing: the total number of in-flight transactions (transactions that are not settled yet).
In-flight transactions have proven to be extremely helpful during our periodic load tests to understand if the payment ledger was struggling and how fast it recovered after a massive request spike.
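Given a view of current statuses, the in-flight count is a simple aggregation. The sketch below assumes the view can be read as a mapping from payment id to current status, and that "settled" is the only terminal status. Both are simplifying assumptions; failed or rejected payments, for instance, are not modelled here.

```python
from typing import Dict


def in_flight_count(statuses: Dict[str, str]) -> int:
    """Count payments whose current status is anything other than 'settled'."""
    return sum(1 for status in statuses.values() if status != "settled")
```

Exposed as a gauge, this number should rise during a request spike and fall back as the ledger catches up, which is what makes it useful during load tests.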