You're reading the Ruby/Rails performance newsletter by Speedshop.

Looking for a performance audit of your Rails app? I've partnered with Ombu Labs to do just that.

Do you speak Japanese? Subscribe to the Japanese edition of this newsletter.

Scaling web applications is really about one thing: managing queues.

When scaling a web application, our goal is to serve an increasing number of requests per second, while maintaining a high quality of service (no dropped requests) and keeping latency low (more on that later).

I've talked many times in this newsletter, and in my courses and workshops, about Little's Law.

Little's Law is:

Average number of requests processed in parallel = average request arrival rate × average request service time

The average number of requests processed in parallel is also sometimes called the "number of requests in the system". It means that if we have 10 requests arriving each second, and it takes 1 second to process each request, then with unlimited capacity, on average we will be processing 10 requests at any given moment. 
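Here's that arithmetic as a quick Ruby sketch, using the numbers from the example above:

    # Little's Law: requests in system = arrival rate * service time
    arrival_rate = 10.0 # requests arriving per second
    service_time = 1.0  # seconds to process one request

    requests_in_parallel = arrival_rate * service_time
    puts requests_in_parallel # => 10.0 requests in flight, on average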

When we "scale" a system, what we means is that the request arrival rate increases. More requests arrive per second. As you can see from Little's Law, this has a direct relationship to the number of requests we process in parallel.

Does service time increase as request arrival rate increases? Sometimes. In general, the two are only loosely correlated: below a certain saturation point, there isn't much of a relationship at all. Past that point, service times increase exponentially. The saturation point usually comes when an underlying, non-automatically-scaling resource is overwhelmed. 99 times out of 100, that resource is your SQL database.

Of course, your database is a queueing system too, just like your web application as a whole. Requests (database queries) come in and are serviced within a particular service time. Eventually, this system becomes overloaded as request arrival rate increases. Service times can also increase after a certain saturation point; this is usually caused by CPU resources running out.

Let's summarize what we know about scaling a web application so far from Little's Law:
  • The number of requests we process in parallel is directly related to the arrival rate of requests and how long it takes to service them.
  • As request arrival rate increases, the number of requests we're processing in parallel increases in direct proportion.
  • At a certain point, service times will also increase, usually exponentially (system overload).

What happens as the number of requests we process in parallel increases?

Let's say you've got 5 Unicorn processes running a Ruby web app. Requests arrive at the rate of 3 per second, and each takes 1 second to process.

We talked at the beginning about how one of the main goals of scaling is keeping latency low. Latency increases as request arrival rate increases - we all know this from hard experience. But why is latency increasing?

It isn't because service time is increasing. Service times only increase when the underlying systems (DB, CPU) are overloaded, which only happens at the extremes of usage. 

What is the utilization of our hypothetical 5-process Unicorn deployment? How would you measure it? 

CPU utilization won't tell you very much. It's quite possible for CPU utilization to be hovering at 20% and for latency to go through the roof. 

Instead, we can use Little's Law to calculate average utilization of our Unicorn processes over time. If we have 3 requests arriving per second and each takes 1 second to process, then on average, we're processing 3 requests in parallel in each moment. That means on average, 60% of our Unicorn processes are busy, and 40% are free.
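As a sanity check, here's that calculation in Ruby, with the same numbers as above:

    # Average utilization of a pool of Unicorn processes, via Little's Law
    arrival_rate = 3.0 # requests per second
    service_time = 1.0 # seconds per request
    processes    = 5   # Unicorn workers

    in_flight   = arrival_rate * service_time # => 3.0 requests in parallel
    utilization = in_flight / processes       # => 0.6
    puts "#{(utilization * 100).round}% busy, #{(100 - utilization * 100).round}% free"
    # => "60% busy, 40% free"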

This kind of utilization - of the Unicorn process, of what is actually doing the work - is what correlates with latency. 

That's because all scaling is about managing queues. 

When you increase the request arrival rate without increasing resources to do the work, all you're doing is making it more likely that work will have to wait. And that applies the whole way down the chain - to the CPU and the database as well. 

If we increase our request arrival rate from 3 to 4 in my previous example, we've increased utilization from 60% to 80%. But we've also cut the average number of free/unused Unicorn processes in half, from two to one. As a result, average request queue times double. Note how this relationship means that latency increases exponentially as we approach 100% utilization.
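To see the shape of that curve, here's a back-of-the-envelope Ruby sketch using the same five-process pool and the deliberately crude heuristic that queue time scales with 1 / (free capacity). Real queueing models are less forgiving than this, but the blow-up near 100% is the point:

    # Relative queue time as arrival rate approaches capacity,
    # using the rough 1 / (free capacity) heuristic from above.
    processes    = 5
    service_time = 1.0 # seconds per request

    [3.0, 4.0, 4.5, 4.9, 4.99].each do |arrival_rate|
      in_flight     = arrival_rate * service_time
      utilization   = in_flight / processes
      free_capacity = processes - in_flight
      puts format("%.2f req/s: %2d%% busy, relative queue time %5.1fx",
                  arrival_rate, utilization * 100, 1.0 / free_capacity)
    end
    # 3.00 req/s: 60% busy, relative queue time   0.5x
    # 4.00 req/s: 80% busy, relative queue time   1.0x
    # 4.50 req/s: 90% busy, relative queue time   2.0x
    # 4.90 req/s: 98% busy, relative queue time  10.0x
    # 4.99 req/s: 99% busy, relative queue time 100.0x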

I see client after client get tripped up on which utilization they should be tracking: many watch CPU and wonder why the system gets slow when a flash sale starts.

Pay attention to all the queues in your system: request queueing, background job queueing, CPU load (the queue of tasks running on or waiting for a CPU core). Managing these queues is how you manage scale.
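The first of those, request queueing, is the one many apps don't measure at all. Here's a minimal Rack middleware sketch for doing so, assuming your load balancer or router stamps each request with an X-Request-Start header (Heroku does this, and nginx can be configured to). Header formats vary across proxies, so treat the parsing below as illustrative:

    # Sketch: measure how long a request sat in the queue before a
    # worker picked it up. Assumes an upstream X-Request-Start header;
    # formats vary by proxy, so the parsing here is illustrative.
    class QueueTimeLogger
      def initialize(app)
        @app = app
      end

      def call(env)
        if (stamp = env["HTTP_X_REQUEST_START"])
          start = stamp.sub("t=", "").to_f             # some proxies prefix "t="
          start /= 1000.0 while start > 10_000_000_000 # normalize ms/us to seconds
          env["queue_time_ms"] = ((Time.now.to_f - start) * 1000).round(1)
        end
        @app.call(env)
      end
    end

In a Rails app, you'd insert it at the front of the middleware stack (config.middleware.insert_before 0, QueueTimeLogger) and report the measured value to your metrics service.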

Until next time,

-Nate
You can share this email with this permalink: https://mailchi.mp/railsspeed/scaling-is-queue-management
