Building Spend Management: Slashing latencies from 3.5s to 600ms
Streamlining Spend
Rippling offers Spend Management, a product that handles expense requests and corporate card needs. Every time a customer charges a Rippling card, our authorization API is called to approve or decline the transaction. This API runs arbitrarily complex policies, evaluates risk and fraud models, and verifies the charge against the card and company limits. All of this needs to happen within four seconds, or the transaction is declined automatically, which has real-world consequences for our customers, like the inability to pay for AWS bills, client dinners, and more.
In this post, we'll describe the key changes we made to bring the 99th percentile of this endpoint from around 3.5 seconds down to 600 milliseconds.
Application level improvements
In any deep dive into latency, profiling code and pinpointing latency hotspots are the key tools in an engineer's arsenal for application-level improvements. These deep dives resulted in the following learnings.
Properly using database indices
Databases are powerful engines that can store, fetch, and process large quantities of data. Mismanaging the query patterns and indices within the database can lead to unacceptably high latencies. This article explains how indices are constructed and optimized for insertions, deletions, and lookups. Even though Rippling uses Mongo for our database needs, these learnings span SQL/NoSQL implementations.
We faced two major challenges.
Composite indices work left to right
We encountered an issue where one of our queries did not supply all the necessary fields required to use an index. For instance, we have a data model for Transactions that an employee (role) has made with their corporate card. We want to aggregate the total amount spent across all the transactions for a given employee and card. The index that was created for this table was:
["company", "role", "card", "transaction_date"]
The expectation was to use the above index on a query similar to:
SELECT SUM(amount)
FROM transactions
WHERE role = R AND card = C AND transaction_date BETWEEN T1 AND T2;
The gotcha here is that the database engine could not use the index we created because "company" was not provided in the WHERE clause. This is due to how B-Tree indices store data: a composite index can only be used when the query constrains a leftmost prefix of its fields. The fix was straightforward: include "company" in the WHERE clause. A corollary is that a query providing only "company" and "role" in the WHERE clause can still reuse the above index.
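As a minimal sketch in PyMongo, the corrected access pattern would look something like the following. The collection and field names are hypothetical, mirroring the index above; the only change that matters is that "company" is now part of the filter, so the query matches the index's leftmost field.

from pymongo import MongoClient, ASCENDING

# Hypothetical connection, collection, and field names for illustration.
client = MongoClient("mongodb://localhost:27017")
transactions = client["spend"]["transactions"]

# The composite index from above, left to right.
transactions.create_index([
    ("company", ASCENDING),
    ("role", ASCENDING),
    ("card", ASCENDING),
    ("transaction_date", ASCENDING),
])

def total_spend(company_id, role_id, card_id, start, end):
    # Including "company" lets the query constrain the index's leftmost field,
    # so the engine can walk the index instead of scanning the collection.
    pipeline = [
        {"$match": {
            "company": company_id,
            "role": role_id,
            "card": card_id,
            "transaction_date": {"$gte": start, "$lte": end},
        }},
        {"$group": {"_id": None, "total": {"$sum": "$amount"}}},
    ]
    result = list(transactions.aggregate(pipeline))
    return result[0]["total"] if result else 0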
Query planner can choose incorrectly
The query planner for the database engine runs statistics on the table to probabilistically determine the best index to use. As the shape of the data changes over time, the query planner may decide that a different index is better suited for the query. In our case, we had created a specific index for a query, but it was not being picked ~1% of the time.
To discover this, we looked at multiple flame graphs, compared them with requests from around the same time that didn't have slowdowns, and pinpointed a function that intermittently took longer. The function performed aggregations similar to the query in the previous section. There were three categories of possible reasons for the intermittent spikes: the slow queries processed much more data, the structure of the query or indices was not performant, or the application was slowing down around that time (i.e. garbage collection, higher load volumes). At this point, we created a performance testing function that runs the function 10,000 times with the same parameters to build a latency distribution and see whether we could recreate the spikes.
We confirmed that we were able to recreate the 99th-percentile spikes. This ruled out the possibility that they were caused by the amount of data the database was processing, and being able to reproduce the latency distribution consistently made an application-layer slowdown less likely. So we stored the query planner results and associated them with the latencies. Analyzing the differences in the query planner results revealed that the slow queries were using a different index than the one we expected.
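A harness along these lines (a sketch with placeholder names, not our production code) can reproduce the distribution and pair each run's latency with the plan the optimizer actually chose:

import time
import statistics
from pymongo import MongoClient

# Hypothetical connection and collection names for illustration.
client = MongoClient("mongodb://localhost:27017")
transactions = client["spend"]["transactions"]

def profile(filter_doc, runs=10_000):
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        list(transactions.find(filter_doc))          # the query under test
        elapsed_ms = (time.perf_counter() - start) * 1000

        # Record which plan the optimizer picked so slow runs can be
        # correlated with the index that was actually used.
        plan = transactions.find(filter_doc).explain()["queryPlanner"]["winningPlan"]
        samples.append((elapsed_ms, plan))

    latencies = sorted(ms for ms, _ in samples)
    print(f"p50={statistics.median(latencies):.1f}ms "
          f"p99={latencies[int(len(latencies) * 0.99)]:.1f}ms")
    return samples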
The solution was easy at this point. For Mongo, we added a hint so the database engine chooses the index we specify. One should be cautious with hints: the shape of the data can change over time, and the hinted index may no longer be optimal in the future. Certain SQL implementations also allow hinting at index selection.
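For example, in PyMongo (reusing the hypothetical transactions collection from the sketches above), a hint can be attached to either a find or an aggregation:

# Illustrative filter values.
filter_doc = {"company": "acme", "role": "emp_42", "card": "card_7"}

# .hint() pins the query to the composite index instead of letting the planner choose.
cursor = transactions.find(filter_doc).hint(
    [("company", 1), ("role", 1), ("card", 1), ("transaction_date", 1)]
)

# Aggregations accept the same override by index name (Mongo's default name for this index).
transactions.aggregate(
    [{"$match": filter_doc}, {"$group": {"_id": None, "total": {"$sum": "$amount"}}}],
    hint="company_1_role_1_card_1_transaction_date_1",
)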
Pooling connections and warming them up
The cluster hosting the authorization API would create a database connection just in time and with a short expiry. Whenever we needed to query the database, we would open a connection, authenticate, make the query, and then close the connection after a specified idle period. If the application didn't receive another request within that window, it would have to repeat the connection process on the next request. Creating a connection per database cluster scales linearly with the number of dependencies needing a connection, and we saw it take upwards of 125 milliseconds per connection.
The solution was to increase the connection expiration time and have a pool of connections ready to be used at any time. This allows the re-use of connections and limits the number of just-in-time database connections created.
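In PyMongo terms, for instance, this behavior is controlled through the client's pool options; the values below are illustrative, not our production settings:

from pymongo import MongoClient

client = MongoClient(
    "mongodb://db-cluster.example.internal",  # placeholder URI
    minPoolSize=10,         # keep warm connections around even when traffic is idle
    maxPoolSize=100,        # cap concurrent connections per cluster
    maxIdleTimeMS=300_000,  # raise the expiry so pooled connections are reused, not re-created
)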
Even with the connection pool, the first request would still pay the connection penalty for each dependency, since the application started cold. That cold-start latency scales linearly with the number of connections required, and coupled with a large replica set of pods hosting our endpoint, it meant multiple requests would hit a cold start.
Rippling has automated CI/CD that runs hourly, which means every pod refreshes its connection pools every hour. This made it fairly unlikely that the load balancer would hit an already-warmed application, so these penalties occurred regularly.
To fix the cold start problem, a warmup script was created that would run a set of queries on all dependent clusters during application start-up. Only after the calls succeeded would the application signal that it could start handling requests.
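A simplified sketch of that warmup step might look like the following, with placeholder cluster URIs and an assumed readiness signal:

from pymongo import MongoClient

# Placeholder URIs for the clusters the endpoint depends on.
DEPENDENT_CLUSTERS = [
    "mongodb://transactions.example.internal",
    "mongodb://cards.example.internal",
    "mongodb://policies.example.internal",
]

def warm_up():
    clients = []
    for uri in DEPENDENT_CLUSTERS:
        client = MongoClient(uri, minPoolSize=10)
        # A cheap command forces connection establishment and authentication now,
        # rather than during the first real request.
        client.admin.command("ping")
        clients.append(client)
    return clients

# Only signal readiness (e.g., to the load balancer's health check)
# once every dependent cluster has answered.
clients = warm_up()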
With this work in place, just-in-time database connections happen before any requests come in rather than during a request's lifecycle.
Multithreading and timing out third-party APIs
By this point, the authorization API had a latency breakdown of about 300 milliseconds from Rippling-specific code and about 400 milliseconds to 2.5+ seconds from a third-party API. The third-party API would eat away most of the allotted latency budget. Since this API wasn’t under our control, we had to utilize a timeout method for the API.
The Python requests library has a timeout capability that lets the developer set the time after which a timeout exception is raised. We had initially used this mechanism, set to 1.5 seconds, to short-circuit runaway requests. The timeout worked in the sense that we could cap the call at 1.5 seconds in the worst case. However, the call still executed synchronously within Python, blocking any independent work from taking place in the meantime.
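For instance (the URL and payload here are placeholders):

import requests

try:
    # Raises requests.exceptions.Timeout if the server does not respond within 1.5 seconds.
    response = requests.post("https://risk-provider.example.com/score",
                             json={"transaction_id": "txn_123"},
                             timeout=1.5)
except requests.exceptions.Timeout:
    response = None  # fall back to a default decision path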
Generally, I/O-bound work is a great candidate for multithreading. Python's multithreading isn't true parallelism: the GIL (global interpreter lock) allows only one thread to execute Python bytecode at a time, although it is released while a thread waits on I/O. Multiprocessing is a standard Python library that achieves true parallelism by spinning up a separate process with its own interpreter and GIL. Because the authorization API was only making a single network call, we didn't want to incur the latency cost of starting a new Python process, so we went with the threading library.
We kick off the network call first. While it is in flight, the application executes any independent work. Once that work completes, the main thread checks whether the third-party call has finished or the timeout has been breached, and returns in either case. At that point, all dependent work can take place.
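Here's a minimal sketch of that pattern using the standard library's concurrent.futures on top of the same requests timeout; the endpoint and helper functions are hypothetical stand-ins for our actual policy, limit, and risk checks:

import requests
from concurrent.futures import ThreadPoolExecutor

THIRD_PARTY_TIMEOUT_S = 1.5

def call_risk_provider(payload):
    # Placeholder for the third-party call; the requests timeout caps it at 1.5 seconds.
    return requests.post("https://risk-provider.example.com/score",
                         json=payload, timeout=THIRD_PARTY_TIMEOUT_S)

def evaluate_policies(transaction):
    return {"approved": True}          # stand-in for policy evaluation

def check_limits(transaction):
    return {"within_limits": True}     # stand-in for card/company limit checks

def decide(transaction, policy, limits, risk_response):
    # Dependent work: combine everything into an approve/decline decision.
    return policy["approved"] and limits["within_limits"] and risk_response is not None

def authorize(transaction):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 1. Kick off the third-party call; the GIL is released while it waits on I/O.
        future = pool.submit(call_risk_provider, {"transaction_id": transaction["id"]})

        # 2. Execute everything that does not depend on the third-party response.
        policy = evaluate_policies(transaction)
        limits = check_limits(transaction)

        # 3. Wait for the response, bounded by the timeout.
        try:
            risk_response = future.result(timeout=THIRD_PARTY_TIMEOUT_S)
        except Exception:
            risk_response = None       # timed out or failed: fall back to a safe decision

    # 4. Only now run the work that depends on the third-party result.
    return decide(transaction, policy, limits, risk_response)

Because the underlying request carries its own 1.5-second timeout, the worst case stays bounded even when the independent work finishes early.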
Parallelizing this work brought the 99th percentile latencies down from 800 milliseconds to 600 milliseconds. This gain is because a majority of the code could be executed in parallel. The maximum latency topped out at 1.5 seconds because the third-party API is typically the slowest portion of the code path.
One layer at a time
Rippling developed a highly reliable, low-latency API whose 99th percentile latency is 600 milliseconds. The majority of that time stems from third-party calls, which tend to be what causes slowdowns in our APIs. All of the work above focuses solely on the application layer and the improvements made there.
Stay tuned for future posts that will cover how Spend Management prevents regressions and increases stability outside the core application layer.