You're reading the Ruby/Rails performance newsletter by Speedshop.

Looking for a performance audit of your Rails app? I've partnered with Ombu Labs to do just that.

Do you speak Japanese? Subscribe to the Japanese edition of this newsletter.

It can be confusing: should you be running 1 process per CPU? 3? Maybe even less than 1?

Everyone who deploys a Rails application has to make a decision: how many processes will you deploy per CPU core? That is, your deployment target (container, host, your laptop, whatever) has a certain number of available CPUs. You have to decide how many Rails processes will serve traffic from this machine. In Unicorn, this is the worker_processes setting. In Puma, it's simply called workers.
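
For reference, that's a single line in each server's config file. A minimal sketch (the count of 4 is just a placeholder; read on for how to pick the real number):

    # config/unicorn.rb
    worker_processes 4

    # config/puma.rb
    workers 4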

When making this decision, we're optimizing two competing goals:
  • Utilization. We want utilization to be high, so that we can reduce costs. If we have very low utilization, we're paying for resources we don't use.
  • Latency. We don't want to add a lot of additional latency caused by processes waiting for CPU resources. Web applications need to be fast.
The problem is that these goals have an inherent tradeoff: the closer utilization gets to 100%, the higher our added latency will be. And that tradeoff is sharply nonlinear. For example, at 50% utilization our app might be 1% slower than it was at 1% utilization. At 90% CPU utilization, our application might be 10% slower. And at 99% utilization, it might be 100% or even 400% slower than the unloaded case.
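
To see the shape of that curve, here's a rough sketch using the textbook single-server queueing model (M/M/1), where time-in-system scales with 1 / (1 - utilization). This model exaggerates the effect for a real multi-worker machine, so treat the multipliers as illustrative only:

    # Illustrative only: M/M/1 slowdown factor at various utilizations.
    # Real multi-core servers fare better than this, but the
    # hockey-stick shape near 100% utilization is the same.
    [0.50, 0.80, 0.90, 0.99].each do |utilization|
      slowdown = 1.0 / (1.0 - utilization)
      puts format("%2d%% utilized -> %.0fx unloaded response time",
                  utilization * 100, slowdown)
    end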

You may also be wondering: what counts as a CPU core for the purposes of this discussion? If we're on Intel, is it a single hyperthread (logical core) or a single physical core? AWS has made this quite clear for us: hyperthreads are, in most respects, treated as equal to a single physical core. That's why you pay for each "vCPU" on AWS rather than per physical core. Since we've all been running 1 process per hyperthread in production for over 10 years, I think treating each logical core as a full core is a safe assumption. ARM doesn't have this distinction: a core is a core.

We always want to deploy at least one worker per CPU core. If we don't, some of our cores will sit idle, because traditional (non-Ractor) Ruby processes only utilize one CPU core at a time. We'll come back to this in more detail later, but in general you can think of a Ruby process as being "pinned" to a single CPU core.
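
Ruby's standard library can report that core count for you. A minimal sketch for a Puma config (Unicorn's worker_processes takes the same value):

    # config/puma.rb
    require "etc"

    # Etc.nprocessors reports logical cores (vCPUs/hyperthreads),
    # giving us the 1-worker-per-core floor.
    workers Etc.nprocessors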

As for higher ratios of processes-to-CPU-cores, some of you will have this decision made for you, because you are memory-limited. For example, if you have a 4-CPU machine with 8GB of memory, but your Ruby processes use 2GB of memory each, you can't deploy more than 1 process per CPU core: you don't have enough memory!

If you can't deploy even 1 worker per CPU core due to memory limitations, you have a problem: you're wasting money, because you're not using all the CPU you're paying for. Consider switching to an AWS instance type with a higher memory-to-CPU ratio (r-series instances are 8:1 and can be cheap, depending on region), or work on reducing memory usage (see my many other posts on the topic!).
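
As a back-of-the-envelope check, using the hypothetical numbers from the example above (in practice, measure your own per-process memory use and leave headroom for the OS):

    # How many workers fit in memory vs. how many the cores could use?
    cores                = 4
    total_memory_gb      = 8
    memory_per_worker_gb = 2 # hypothetical; measure your own app

    max_by_memory = total_memory_gb / memory_per_worker_gb # => 4
    workers       = [max_by_memory, cores].min             # => 4

    puts "Memory permits #{max_by_memory} workers across #{cores} cores"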

Let's say memory isn't a limiting factor. Should you ever use more than 1 worker per CPU core?

Return to our goal: utilization. If we deploy 1 worker per core, and then fully "load" our machine by sending it infinite traffic, what will our CPU utilization be?

The answer is "it depends on the application". If our application, for example, spends 0.25 seconds running Ruby code and then 0.75 seconds waiting on a database response, in the long run our CPU utilization will be 25%.

In general, when under full load (i.e. infinite traffic), we want CPU utilization to be around 80%. If we go higher than that, we'll start to see too much additional latency. 

So, for my example application that spends 75% of its time waiting on I/O, we might try running 3 processes per CPU core. In the long run, CPU utilization will be about 75%, right around our target.
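
As starting-point arithmetic only (see the caveat below), divide the target utilization by the fraction of request time each process spends on CPU:

    # Rule-of-thumb starting point, using the example app's numbers.
    cpu_fraction       = 0.25 # 0.25s of Ruby out of a 1s request
    target_utilization = 0.80

    processes_per_core = (target_utilization / cpu_fraction).floor # => 3
    puts "Start with #{processes_per_core} processes per core, then measure"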

However, it's usually not so simple. Most web applications serve hundreds or even thousands of different URLs, each one spending a different percentage of time waiting on I/O versus running on the CPU. That makes it dangerous to work purely from averages.

The solution is to treat these averages as a starting point to tinker with, and then watch changes in production as you make them. If, when switching from 1 to 2 processes per CPU, you see application response time increase, you may want to roll that change back. Everyone will have a different amount of added latency they will accept. Will you accept an additional 10 milliseconds per request in exchange for 2x better utilization? Maybe, maybe not. It depends on the product.

Puma provides an additional configuration setting here: threads per process. The correct number of threads per process is an interesting question, one which recently consumed a huge amount of discussion on the Rails issue tracker (tl;dr: the answer is 3 threads per process).
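
In a Puma config, that looks like the sketch below; pinning the minimum and maximum to the same value keeps the thread pool size predictable:

    # config/puma.rb
    require "etc"

    workers Etc.nprocessors # 1 process per core, per the advice below
    threads 3, 3            # 3 threads per worker, min and max pinned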

However, one thing I want to make clear: if you are using multithreading (e.g. Puma or Sidekiq), you should not run more than 1 process per core.

In a sense, you need to choose which concurrency model you're going to use. Will you use process-based concurrency (Unicorn), where multiple processes contend for the same CPU? Or will you use thread-based concurrency, where multiple threads contend for the same CPU? If you mix them, you're going to end up with a confusing and probably unoptimized mishmash. It'll work, but I doubt you'll get the ratios and settings correct.

I hope this has provided a bit of clarity around a very important deployment setting in all Rails applications. Happy hacking!


Copyright © 2024 Nate Berkopec, All rights reserved.

