You're reading the Ruby/Rails performance newsletter by Speedshop.

Understanding if CPU or memory is the bottleneck on your deploys can unlock cost savings

Hey Rubyists,

One implication of last week's post on the Global VM Lock concerns horizontal scaling for a Ruby web app: how does the GVL affect the decision to adopt Puma or Sidekiq over a single-threaded, non-concurrent alternative such as Unicorn or Resque?

The question is: what prevents you from adding 1 more Ruby process to your web servers or background job servers - memory or CPU?

This is what we mean when we say a service is "bound" by a particular resource: if we cannot increase throughput because utilization of that resource is already too high, then our throughput is "bound" by that resource.

So, is your servers' throughput CPU-bound or memory-bound?

If you know the answer to this question, you can answer these questions too:
  • Should you change your server type? (e.g., from m5 to c5 instances on AWS?)
  • Should you investigate switching to Sidekiq or Puma from a single-threaded, non-concurrent library?

Consider the grocery store example laid out in the GVL post. Say our grocery store wishes to reduce checkout waiting time. To do this, they must add more checkout counters. Two constraints on a grocery store adding more checkout counters would probably be a) employee count and b) floor space. These map very nicely to CPU and memory, respectively.

If the grocery store is out of floor space and has filled all available areas where they could put a checkout counter, there's no point in hiring additional employees, and they can't add more checkout counters until they increase floor space. They are floor-space-bound.

If the grocery store has all available employees working checkout counters, then there's no point in adding additional floor space, because there wouldn't be anyone available to staff the checkout counters they would put there. They're employee-bound.

Consider this in the context of an ordinary Performance-M Heroku dyno. These dynos have 2 CPU cores and 2.5GB of memory. Let's say you have 2 Unicorn processes running on this dyno right now, each taking up 350MB of memory. Using Librato, you determine that your average CPU load is 1.5.

Try to think of the answer to the following questions before moving on:
  • Should you add more processes to this dyno?
  • Would throughput be increased if you switched to Puma?

Let's start with memory. Understanding memory saturation is generally very simple. Are you swapping or not? If you have more than (Current Memory Usage / # of Processes Running) memory left, you could add 1 more process. Simple enough.

Ruby memory usage is generally independent of traffic volume, but varies with the age of the process. That is, memory usage is about the same whether you're serving 100 req/sec or 1,000 req/sec, but it generally starts low (say, 256MB at boot) and grows slowly over time (maybe 350MB after 24 hours of continuous service). So, take your memory measurements on old processes that have been alive for a while, and you'll be fine with the simple algorithm I just described.
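As a rough sketch of that memory check, using the hypothetical numbers from the dyno example above (measured on mature processes):

```ruby
# Hypothetical numbers from the Performance-M example above.
# Measure RSS on old processes, since Ruby memory usage grows with age.
total_memory_mb  = 2560                 # 2.5GB dyno
process_count    = 2
current_usage_mb = 350 * process_count  # per-process RSS of mature processes

avg_per_process_mb = current_usage_mb.fdiv(process_count)
headroom_mb        = total_memory_mb - current_usage_mb

# If the remaining memory exceeds the average per-process footprint,
# there's room for one more process.
can_add_process = headroom_mb > avg_per_process_mb
puts "Headroom: #{headroom_mb}MB, room for one more process? #{can_add_process}"
```

By this measure, memory is nowhere near the constraint in our example - the interesting question is CPU.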

But for CPU, things get more complicated. It's the opposite scenario: CPU usage varies directly with traffic, and has nothing to do with how old a process is. This means you can make a big mistake when provisioning for CPU: if you take CPU load measurements when the box is not under high traffic, you can add too many processes to the box, and then during a period of higher traffic your box will slow to a crawl as CPU utilization hits 100%.

To get an accurate measurement of "how much CPU will this dyno utilize in the worst case scenario", you need to get the processes 100% busy and then see what happens to CPU. This can be difficult to achieve in production without a synthetic load test.

Instead, I offer the following rule of thumb: under maximum request load (i.e. infinite req/sec), CPU load will tend towards (# of Application Threads * % of Request Time Spent in CPU-bound Work).
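Expressed as a tiny Ruby helper (the method name is mine, just for illustration):

```ruby
# Rule of thumb: worst-case CPU load tends toward
# (total application threads) * (fraction of request time spent on-CPU).
def worst_case_cpu_load(total_threads:, cpu_fraction:)
  total_threads * cpu_fraction
end

# 2 single-threaded Unicorn processes where 75% of request time is CPU work:
puts worst_case_cpu_load(total_threads: 2, cpu_fraction: 0.75) # 1.5

# The same app with 3 processes would tend toward a load of 2.25,
# exceeding the 2 cores on our example dyno:
puts worst_case_cpu_load(total_threads: 3, cpu_fraction: 0.75) # 2.25
```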

The latter statistic is easily obtainable from your APM dashboard. Here's CodeTriage's dashboard on Scout:

[Image: CodeTriage's response-time breakdown in Scout]
I realize that's hard to read in an email, but about 15% of CodeTriage's total response time is spent waiting on I/O. This is on the low end - Rails app response times are usually 10-50% I/O.

Given my simplistic rule of thumb, we can work backwards and estimate how much I/O the example app is doing: if CPU load is 1.5 and there are 2 threads, each thread spends 1.5 / 2 = 75% of its time in CPU-bound work, so about 25% of the average response time is I/O.
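That back-of-the-envelope inversion looks like this (again a sketch, with a helper name of my own invention):

```ruby
# Invert the rule of thumb: given a measured CPU load and total thread count,
# estimate the fraction of average response time spent waiting on I/O.
def io_fraction(cpu_load:, total_threads:)
  1.0 - cpu_load.fdiv(total_threads)
end

puts io_fraction(cpu_load: 1.5, total_threads: 2) # 0.25, i.e. 25% I/O
```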

So, I'll ask again for my example app with a CPU load of 1.5 and using 700MB out of 2.5GB available memory:
  • Should you add more processes to this dyno?
  • Would throughput be increased if you switched to Puma?

Think about it one more time.

My answer: No, you should not add more processes to this dyno. Given what we know about CPU-usage-per-thread (0.75 load per thread), adding 1 more process pushes worst-case CPU load to about 2.25, beyond the 2 cores available. It may work when you first deploy it or when traffic is low, but during high traffic this app will slow to a crawl with 3 processes on a dyno. As for Puma, the answer is also _no_: if the CPU is already saturated, the app is CPU-bound, and switching to Puma provides no meaningful gain, because Puma's advantage is reducing memory usage per unit of concurrency - and memory isn't the constraint here.

Let's imagine what numbers would have to change to answer "yes" to both of these questions on our 2-CPU, 2.5GB-memory dyno. If memory usage per process were 900MB and CPU load were 1 (with 2 processes, as before), you should switch to Puma. With Puma, you could probably squeeze 2 processes into the dyno with 2-3 threads each. CPU load would rise, memory usage would stay about the same, and you'd get more throughput out of the dyno, as there are now ~5 threads available to take requests rather than just 2. As an exercise, consider Amdahl's Law and estimate how much more throughput we should get from this new configuration.
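Here's one rough way to attack that exercise - my own simplification rather than a full Amdahl's Law treatment, assuming throughput scales with how much CPU work the dyno can actually perform:

```ruby
# The "yes" scenario: CPU load of 1 with 2 threads means each thread is
# on-CPU 50% of the time. Puma gives us ~5 threads on the same 2 cores.
cpu_fraction = 1.0 / 2            # measured load / current thread count
old_threads  = 2
new_threads  = 5
cores        = 2

projected_load = new_threads * cpu_fraction       # 2.5 worst case...
usable_load    = [projected_load, cores.to_f].min # ...but capped at 2 cores

# Throughput scales with the CPU work we can actually get done:
speedup = usable_load / (old_threads * cpu_fraction)
puts speedup # roughly 2x the old throughput before the CPU saturates
```

In other words, the extra threads help right up until the 2 cores are fully busy, which is why the worst-case load projection matters so much.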

I hope this sheds some light on provisioning web app servers. As always, you can reply to this email with questions.

Until next week,

Nate
You can share this email with this permalink: https://mailchi.mp/railsspeed/understanding-if-your-deployment-is-cpu-or-memory-bound?e=[UNIQID]

Copyright © 2020 Nate Berkopec, All rights reserved.

