You're reading the Ruby/Rails performance newsletter by Speedshop.

If you enjoy this newsletter, you'll enjoy The Ruby On Rails Performance Apocrypha, a collection of past newsletter mailings in a convenient-to-read format. Buy it for $10 on Gumroad.

Is there such a thing as a dyno, instance or k8s node that is "too large"?

It's a simple question with a long answer.

We typically have to decide how many Ruby processes to run on a single unit of computing: either an entire VPS (e.g. EC2 instance), a Heroku dyno, or a Kubernetes pod. These units of computing have a limit on their CPU and memory resources, determined either by the physical hardware underneath or by a virtualization framework. Virtualization frameworks don't really differ meaningfully in how they divide up CPU time. Generally, when a framework says we "get one CPU", the resources are similar to that of having a physical machine with one physical CPU. For the purposes of this discussion, we're going to assume memory is unlimited, so we can deal with it another time.

This tuning is about parallelism and cost. We're trying to minimize the cost of a given highly-parallel workload by choosing an instance size that maximizes parallelism-per-dollar.

Parallelism is just "how many things you can do in parallel". If I've got a grocery store with one checkout counter, then I have a parallelism of 1. If I have two checkout counters, 2, and so on. Recall the difference between parallelism and concurrency.

Parallelism on a one-CPU machine is pretty easy: your maximum parallelism is one process, doing one thing at a time. You could add concurrency to this setup with threads or fibers, but fundamentally, it's got one CPU core, so it can only do one thing at a time.

What about on a machine with 4 CPUs? 8? 16? 64? Does this scale linearly? On a 64-CPU machine, can I do 64 things in parallel (that is, a "parallelism of 64")?

First, we need to talk about the differences between CPU architectures and how threading can work differently in each.

When it comes to CPU virtualization, it is helpful to understand whether the resources allocated are hyperthreads or 1-to-1 hardware threads. Hyperthreading is Intel's proprietary term for simultaneous multithreading: basically, running 2 threads in parallel on the same physical core. I'll use "hyperthreading" and "simultaneous multithreading" interchangeably, like Kleenex and tissues. On Intel and AMD processors, simultaneous multithreading means that "one thread" may not always have 100% of the processor time it needs. On a hyperthreaded processor, you would generally get better single-thread performance if you disabled hyperthreading, at the cost of roughly halving your multi-core performance. This effect means that a single thread on a simultaneous-multithreading processor isn't quite as "powerful" as a thread on an architecture without simultaneous multithreading, such as ARM (after you take the single-thread speed of both processors into account, of course!).

This is important when evaluating what "one CPU" means. Amazon sells instances based on "vCPUs", each of which actually corresponds to a single hyperthread. This is the opposite of how we talk about desktop CPUs, where we describe the number of physical cores rather than logical cores. So while your Intel desktop processor may have 8 cores, it has 16 hyperthreads, and would be sold on AWS as a 16 vCPU processor. On ARM architectures, there is no such distinction: the number of physical and logical cores is equal.
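One quick way to see which of these numbers your Ruby process is looking at: Ruby's standard library reports the logical count. (A minimal sketch; the output depends on your hardware.)

```ruby
require 'etc'

# Etc.nprocessors reports *logical* cores -- on x86, that means
# hyperthreads, matching AWS's vCPU count rather than physical cores.
puts Etc.nprocessors # => 16 on an 8-core hyperthreaded Intel desktop
```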

So, threads sharing processor time on the same physical core is one limit to parallelism in a CPU. Our 16-core machine, for example, will not serve 16x the throughput of a 1-core machine, no matter what we do!

There are two more important limits to parallelism on a single CPU die: shared memory access (particularly L3 cache) and temperature. A thread that has to wait on another thread to access shared memory, or that throttles because the die has become too hot, hits limits on parallelism that don't exist across two separate single-CPU machines.

Because of these limitations (hyperthreading, shared memory access, and temperature), a 64-core machine does not have the parallel processing power of eight 8-core machines, despite having the same total core count.

One way to see this in practice is via Geekbench results. Consider this table:
 

| AWS Instance Name | vCPU count | Geekbench Multicore Score | Score per vCPU | Loss versus smallest equivalent |
|-------------------|-----------:|--------------------------:|---------------:|--------------------------------:|
| m5.large          |          2 |                      1031 |          515.5 |                              0% |
| m5.xlarge         |          4 |                      2000 |            500 |                            3.0% |
| m5.2xlarge        |          8 |                      3988 |          498.5 |                            3.3% |
| m5.4xlarge        |         16 |                      7500 |         468.75 |                            9.1% |
| m5.8xlarge        |         32 |                     14500 |        453.125 |                           12.1% |
| m5.16xlarge       |         64 |                     22000 |         343.75 |                           33.3% |
| c6g.medium        |          1 |                       730 |            730 |                            0.0% |
| c6g.large         |          2 |                      1467 |          733.5 |                           -0.5% |
| c6g.xlarge        |          4 |                      2850 |          712.5 |                            2.4% |
| c6g.2xlarge       |          8 |                      5600 |            700 |                            4.1% |
| c6g.4xlarge       |         16 |                     10900 |         681.25 |                            6.7% |
| c6g.8xlarge       |         32 |                     19800 |         618.75 |                           15.2% |
| c6g.16xlarge      |         64 |                     34000 |         531.25 |                           27.2% |


When I say "loss versus smallest equivalent", I mean the loss in multicore score versus an equivalent number of vCPUs in a smaller per-vCPU configuration. For example, the smallest equivalent configuration for an m5.16xlarge would be 32 m5.larges.

What we can see here is that the loss is almost negligible until about 16 cores, and at 64 cores the losses in power are substantial. 

What you'll also notice is that the per-vCPU score loss is smaller on the c6g series. There are probably plenty of reasons for this, but one is likely that the ARM-powered c6g isn't hyperthreaded, leaving one fewer place where threads can block each other and contend for shared resources.

Frequent readers of my material will realize at this point that we're just talking about Amdahl's Law. In fact, we can work backward from the table above to estimate the value of "p" in the Amdahl's Law equation, getting an estimate of what percentage of the work can be parallelized across multiple cores. If you do the high-school algebra, it turns out that number is about 0.995, which means 99.5% of what threads do can be parallelized.
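Here's that algebra sketched in Ruby. Amdahl's Law says speedup(n) = 1 / ((1 - p) + p/n), so we can solve for p given an observed speedup. I'm using the table's smallest instances as 1-vCPU baselines; the results land a touch under 0.995, since the Geekbench figures are approximate.

```ruby
# Solve Amdahl's Law for p, the parallelizable fraction of work:
#   speedup(n) = 1 / ((1 - p) + p / n)
#   =>  p = (1 - 1 / speedup) / (1 - 1 / n)
def amdahl_p(baseline_score, multicore_score, n)
  speedup = multicore_score.to_f / baseline_score
  (1 - 1.0 / speedup) / (1 - 1.0 / n)
end

amdahl_p(730, 34_000, 64)   # c6g, vs. the 1-vCPU c6g.medium       => ~0.994
amdahl_p(515.5, 22_000, 64) # m5, vs. the m5.large's per-vCPU score => ~0.992
```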

One of the big insights of Amdahl's Law is that the marginal gains from adding parallelism diminish whenever work cannot be done 100% in parallel. Using our empirically-observed parallelism value of 99.5%, we can create an idealized chart of the increase in throughput on multicore processors for any number of threads:

[Chart: idealized Amdahl's Law speedup versus thread count, at p = 0.995]
In Amdahl's Law parlance, the y-axis is the "speedup" (we might think of this as increase in throughput of our web application) and the x-axis is the number of threads. We can see that a theoretical processor running 200 threads would only increase our application's throughput by 100x, half that of 200 1-core machines! 
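You can check that 200-thread figure yourself with the equation above:

```ruby
# Idealized Amdahl's Law speedup for n threads at p = 0.995
def speedup(n, p = 0.995)
  1.0 / ((1 - p) + p / n)
end

speedup(200) # => ~100.3 -- 200 threads, but only ~100x the throughput
```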

So, we should all use the smallest possible instance/dyno/node size, right?

There are other factors that go into the decision of what instance size to choose besides "maximizing parallelism per dollar":
  • Autoscaling. Let's say you need 48 Ruby processes (Unicorn, Puma, etc.) to serve a given load. If all 48 of those are on a single compute node, you can't really autoscale up or down to meet changes in traffic.
  • Minimizing request queue time. Request queue time varies with the inverse of the number of workers pulling from a queue. 1 unicorn process per node with 4 nodes will, on average, have 4 times more request queueing than 4 unicorn processes on 1 node, despite the same number of processes being deployed.
Most applications will want to balance request queue time against autoscaling and cost. While minimizing request queue time encourages us to put more and more processes on each instance, the Amdahl's Law effects caused by CPU coordination push us in the opposite direction. The sketch below illustrates the queueing half of that tradeoff.
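Here's a toy simulation of that queueing claim. The arrival and service rates are made up for illustration (3 requests/second, ~1 second of work each, so 4 workers run at 75% utilization); only the relative waits matter.

```ruby
# Compare 4 workers pulling from one shared queue (4 processes on 1 node)
# against 4 separate single-worker queues (1 process on each of 4 nodes).
# Same total capacity, very different average queue time.

def exp_sample(rate)
  -Math.log(1 - rand) / rate
end

jobs = []
now = 0.0
10_000.times do
  now += exp_sample(3.0)           # Poisson arrivals, 3 requests/sec
  jobs << [now, exp_sample(1.0)]   # each request needs ~1 sec of work
end

def mean_wait(jobs, servers:, route:)
  free = Array.new(servers, 0.0)   # when each worker next becomes idle
  total = jobs.sum do |arrival, service|
    i = route.call(free)
    wait = [free[i] - arrival, 0.0].max
    free[i] = arrival + wait + service
    wait
  end
  total / jobs.size
end

# Shared queue: the next request goes to whichever worker frees up first.
shared = mean_wait(jobs, servers: 4, route: ->(f) { f.index(f.min) })
# Split queues: each request is routed to a random node's only worker.
split = mean_wait(jobs, servers: 4, route: ->(f) { rand(4) })

puts "shared queue: #{shared.round(3)}s, split queues: #{split.round(3)}s"
```

On a typical run, the split configuration waits several times longer than the shared one, which is why fewer, bigger nodes look attractive from the queueing side.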

Most applications find a happy medium where they can deploy 4 to 16 Ruby processes per compute node, using the usual pre-forking configuration (master -> child workers, like Puma's cluster mode). These configurations deliver the most throughput per dollar at the highest quality of service (low request queue time).
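In Puma, that setup looks something like this (the numbers are illustrative starting points, not a recommendation for your app):

```ruby
# config/puma.rb
workers 8      # pre-fork 8 child processes, often sized to the vCPU count
threads 3, 3   # a fixed, modest thread pool in each worker
preload_app!   # boot the app once in the master, then fork the workers
```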

Until next time,

-Nate



 