Managing state in a concurrent, distributed, dynamically and massively scalable environment can be challenging. One strategy is to just eliminate state (basically just stop things from changing)! Anyone building software today will know that this is not always possible. Whilst the languages and systems we use to build modern software are improving in this regard (immutability by default, immutable infrastructure etc), issues still crop up.
One such issue we faced at TrueLayer was with our refresh tokens. They were the very definition of change and became a frequent visitor in our ticketing systems.
We spent quite a few cycles changing and improving our systems to mitigate these issues before we decided to do the most obvious thing and fix the actual refresh token itself. If it doesn’t change, we will probably have fewer change-related issues. Like, duh?
However, it wasn’t as if we didn’t know this; we just held the view that a one-time use refresh token was better. To understand why, you need to know a little about us and freshen up (pun intended) on refresh tokens.
Refresh Tokens — A Refresher
At TrueLayer we work hard to abstract away the different authentication mechanisms used by the numerous banks we integrate with.
One of the most visible touchpoints is our implementation of OAuth2. No matter what authentication scheme we need to deal with upstream, our clients always get a consistent set of authentication tokens. Lifetimes of the tokens can vary when they are close to expiration or if the upstream provider has further restrictions — otherwise, these lifetimes are also predictable.
If you integrate with TrueLayer, or know OAuth anyway, you will know that a refresh token is a long-lived token that, together with additional authentication, is used to obtain a short-lived access token. Our refresh token lifetimes slide within the absolute length of the consent: the lifetime is reset whenever the token is used. The sliding window is 30 days or the remaining consent window, whichever is less. So even if an authorisation consent lasts, say, 90 days, the refresh token must be used within a 30-day window in order to “refresh” the access, at which point the refresh token lifetime is reset once again.
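The sliding window rule can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the constant and the example dates are assumptions made for this sketch, not TrueLayer's implementation.

```python
from datetime import datetime, timedelta, timezone

SLIDING_WINDOW = timedelta(days=30)  # the 30-day sliding window described above


def refresh_token_expiry(now: datetime, consent_expires_at: datetime) -> datetime:
    """Return when the refresh token expires after being used at `now`."""
    if now >= consent_expires_at:
        raise ValueError("consent has already expired")
    # The window is 30 days or the remaining consent, whichever is less.
    return min(now + SLIDING_WINDOW, consent_expires_at)


# Hypothetical 90-day consent starting on 1 January 2024.
consent_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
consent_end = consent_start + timedelta(days=90)

# Used on day 0: the token gets the full 30-day window.
first = refresh_token_expiry(consent_start, consent_end)
# Used on day 75: only 15 days of consent remain, so the window is capped.
late = refresh_token_expiry(consent_start + timedelta(days=75), consent_end)
```

Here `first` falls 30 days after the start of consent, while `late` is capped at the end of the consent itself.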
This is all well and good. So far no surprises. Except that our refresh token could be used exactly once. And this is where things become challenging.
A long time ago…
Refresh tokens are probably the most valuable artefact our clients will need to manage. In fact, they are so important we are going to do whatever we can to make them as secure as possible. Ensuring refresh tokens can be used only once will significantly limit the opportunity for them to be (mis)used egregiously or maliciously.
This is not a direct quote but it does sum up our thinking. An industry-leading posture on security will always be a compelling argument to use TrueLayer rather than one of our competitors. We will make our products as easy to use as our security will allow us. When faced with a choice between easier use vs stronger security, we will always lean towards stronger security. Having a one-time use refresh token was a no-brainer.
The problem with a one-time refresh token, though, can be summed up with one word:
This seemingly innocent error message has a very special meaning within the world of OAuth2:
And this is the error message we return in all those cases. From a security standpoint, such a broad error condition helps protect against enumeration attacks: a potential attacker cannot discover anything useful about a refresh token. This is fine, because we log the actual reason in our systems to help us identify the root cause should an issue be raised.
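The "one generic error outside, precise reason inside" approach might look something like the sketch below. It assumes the error in question is `invalid_grant`, which RFC 6749 defines for refresh tokens that are invalid, expired, revoked or issued to another client. The failure reasons, logger name and function name are hypothetical.

```python
import logging
from enum import Enum

logger = logging.getLogger("token_endpoint")  # hypothetical logger name


class RefreshFailure(Enum):
    # Hypothetical internal reasons; a real system will have its own list.
    UNKNOWN_TOKEN = "unknown_token"
    EXPIRED = "expired"
    REVOKED = "revoked"
    ALREADY_USED = "already_used"
    WRONG_CLIENT = "wrong_client"


def error_response(reason: RefreshFailure, request_id: str) -> dict:
    """Log the precise reason internally, return one generic error externally."""
    logger.warning("refresh failed: request_id=%s reason=%s", request_id, reason.value)
    # Same body for every reason, so the caller learns nothing about the token.
    return {"error": "invalid_grant"}
```

Because every failure produces an identical response body, probing the endpoint with guessed tokens cannot distinguish "never existed" from "expired" or "already used".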
But for a one-time use token, the issue goes deeper.
What happens if a refresh token is somehow “lost”? Let’s say you have issued a refresh request but there was a network partition, or a bug in exception handling, or a database write that didn’t commit, or a space particle that changed a zero to a one somewhere… etc. and the refresh token you now hold is incorrect. In this case, access has been lost. Unfortunately, there is absolutely nothing TrueLayer can do to restore the access. It’s gone! The only resolution is to have your customer re-authenticate.
The complexity required to maintain a constantly changing refresh token coupled with the poor user experience when it goes wrong is a case where security is too strong.
Complexity leads to mistakes.
A system where random re-authentication is tolerated will either cause customers to lower their guard or get frustrated and leave.
Conversely, a refresh token that does not change is easy to secure and will ensure re-authentication occurs in a predictable way.
Trying to mitigate the issues with constant refresh token change will eventually lead to lower security as mistakes are introduced and workarounds implemented.
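The difference between a one-time use token and one that does not change can be illustrated with a toy in-memory token server. This is a sketch under stated assumptions: the `TokenServer` class and the "lost response" helper are invented for illustration and do not reflect how TrueLayer's systems work.

```python
import secrets


class TokenServer:
    """Toy in-memory token server; purely illustrative."""

    def __init__(self, rotate: bool):
        self.rotate = rotate
        self.valid = secrets.token_hex(8)  # the currently valid refresh token

    def refresh(self, refresh_token: str) -> dict:
        if refresh_token != self.valid:
            raise PermissionError("invalid_grant")
        if self.rotate:
            # One-time use: the old token dies as soon as a new one is issued.
            self.valid = secrets.token_hex(8)
        return {"access_token": secrets.token_hex(8), "refresh_token": self.valid}


def refresh_with_lost_response(server: TokenServer, stored_token: str) -> bool:
    """Refresh once, lose the response, then retry with the stored token."""
    server.refresh(stored_token)  # the server commits, but the reply never arrives
    try:
        server.refresh(stored_token)  # retry with the only token the client has
        return True
    except PermissionError:
        return False


rotating = TokenServer(rotate=True)
durable = TokenServer(rotate=False)
rotating_ok = refresh_with_lost_response(rotating, rotating.valid)
durable_ok = refresh_with_lost_response(durable, durable.valid)
```

With rotation enabled, a single lost response leaves the client holding a dead token and the retry fails. With a token that does not change, the retry simply succeeds.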
So, our refresh tokens are now durable. If you recall from the refresher, we still require a client secret in order to use a refresh token, which is already really secure!
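As a rough sketch, a refresh request that carries the client secret could be built as below. The parameter names come from RFC 6749; sending the client credentials in the form body is one option that the RFC permits (a given provider may require HTTP Basic authentication instead). The function name and example values are placeholders.

```python
from urllib.parse import parse_qs, urlencode


def build_refresh_body(client_id: str, client_secret: str, refresh_token: str) -> str:
    """Build the form-encoded body for an OAuth2 refresh_token grant."""
    return urlencode(
        {
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
            # Client authentication: the refresh token alone is not enough.
            "client_id": client_id,
            "client_secret": client_secret,
        }
    )


# Placeholder values for illustration only.
body = build_refresh_body("my-client", "my-secret", "stored-refresh-token")
fields = parse_qs(body)
```

The body would be sent as `application/x-www-form-urlencoded` in a POST to the provider's token endpoint. With a durable token, the same stored `refresh_token` value can be reused on every call.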
So, the refresh token you get now is permanent for the life of the consent. At least that is how it looks from the outside. Things still change in our system: refresh token expirations need maintaining, customer credentials change and require encryption and safe handling, infrastructure-related issues are inevitable, and things will still go wrong from time to time. But these are our problems to manage, and we will continue improving to ensure they never become our clients' problems.