You're reading the Rails Performance newsletter, written by Nate Berkopec of Speedshop.
Last week, I talked about how automated "memory killer" process monitors, like puma_worker_killer, Unicorn Worker Killer, or even monit, can be a drag on performance. Go read that one again if you'd like to review the problem.

The problem comes when these resource thresholds are misconfigured and the process is restarted too often. So, this week, I wanted to talk about how to use these tools sanely and safely.

There is no single "kill the Rails process if it uses more than X amount of memory" number that will make sense for everyone. You'll have to adjust everything to your own circumstances. Please, never copy-paste configuration directly from a project's README. You need to think through each setting critically - understand what the setting does, how it applies to you, and then decide what you should set it to.

For example, here's a configuration from puma_worker_killer's README:

PumaWorkerKiller.config do |config|
  config.ram           = 1024 # mb
  config.frequency     = 5    # seconds
  config.percent_usage = 0.98
  config.rolling_restart_frequency = 12 * 3600 # 12 hours in seconds, or 12.hours if using Rails
  config.reaper_status_logs = true # setting this to false will not log lines like:
  # PumaWorkerKiller: Consuming 54.34765625 mb with master and 2 workers.

  config.pre_term = -> (worker) { puts "Worker #{worker.inspect} being killed" }
end
PumaWorkerKiller.start


If you copy-paste this into your config/puma.rb, I can almost guarantee you'll have problems. Why?

puma_worker_killer's ram setting is a cluster-wide setting. That is, it counts the memory usage of the master process and all of the child processes, then kills the worker with the most memory usage.

This means that the ram setting for puma_worker_killer is completely dependent on how many Puma workers you are running. Hard-coding a number in here is a bug waiting to happen, especially if you use an environment variable, such as WEB_CONCURRENCY, to tune the number of puma workers you're using.
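Rather than hard-coding a number, you can derive the cluster-wide figure from WEB_CONCURRENCY at boot. Here's a minimal sketch of that arithmetic; the 300 MB per-worker budget and 150 MB master-process overhead are illustrative assumptions, not measurements (profile your own app to find real numbers):

# Derive a cluster-wide memory budget from WEB_CONCURRENCY instead of
# hard-coding it. The per-worker and master figures below are assumptions
# for illustration only -- measure your own app's steady-state RSS.
workers            = Integer(ENV.fetch("WEB_CONCURRENCY", 2))
per_worker_mb      = 300 # assumed steady-state RSS of one Puma worker
master_overhead_mb = 150 # assumed RSS of the Puma master process

cluster_ram_mb = master_overhead_mb + workers * per_worker_mb
# You would then pass cluster_ram_mb to config.ram inside
# PumaWorkerKiller.config, so the threshold scales with your worker count.

Now changing WEB_CONCURRENCY can't silently turn your threshold into a hair trigger.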

So, let's think critically about killing Ruby processes for excessive memory usage - what would be a good, sane behavior that would fix the issue while not causing new performance issues along the way?

Ideally, we would restart a process only if:
  • Its memory usage is abnormally high, and
  • We haven't restarted it recently

puma_worker_killer unfortunately can't really be configured to work this way. Maybe someone should open a patch to enable such a configuration :) puma_worker_killer can only kill the largest worker after the entire cluster exceeds a certain threshold, but I'd rather configure it to kill any worker that exceeds a certain threshold.
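To make the ideal behavior concrete, here's a stand-alone sketch of that two-condition check. This is not a puma_worker_killer API; the constant names, the threshold, and the cooldown are all hypothetical, and rss_mb shells out to ps, which exists on most Unixes:

# Restart a process only when (a) its own memory is abnormally high AND
# (b) it hasn't been restarted recently. Illustrative values only.
PER_WORKER_LIMIT_MB = 400 # assumed "abnormally high" per-worker threshold
RESTART_COOLDOWN    = 600 # don't restart the same pid twice in 10 minutes

# Resident set size of a single process, in megabytes.
def rss_mb(pid)
  `ps -o rss= -p #{pid}`.to_i / 1024.0
end

def should_restart?(pid, last_restart, now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
  rss_mb(pid) > PER_WORKER_LIMIT_MB &&
    (now - last_restart[pid]) > RESTART_COOLDOWN
end

last_restart = Hash.new(0) # pid => monotonic time of last restart

# A real monitor would loop over the worker pids on a timer:
# if should_restart?(pid, last_restart)
#   Process.kill(:TERM, pid)
#   last_restart[pid] = Process.clock_gettime(Process::CLOCK_MONOTONIC)
# end

The cooldown is the important part: it caps how often any single worker can be killed, no matter how badly the threshold is tuned.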

However, it's still possible to come up with a puma_worker_killer config that's more sane than copy-pasting what you see in the README:

PumaWorkerKiller.config do |config|
  config.frequency = 60 * 10 # Kill at maximum every 10 minutes.
  config.percent_usage = 0.90 # Start killing the largest worker only when the cluster reaches 90% of available memory
end
PumaWorkerKiller.start


There are cases when this configuration will not reduce memory usage enough to keep you from using swap. That is intentional. If this configuration (kill every 10 minutes, kill only largest worker) doesn't fix your memory issues, killing workers more often is only trading one problem for another. You'll have reduced memory usage at the cost of far worse (up to 10x worse!) performance. This configuration intentionally fails loudly in this case so that you understand how serious that problem is.

Remember that automated resource-thresholds like puma_worker_killer are a temporary measure. You still need to fix the problem underneath.

-Nate
Copyright © 2019 Nate Berkopec, All rights reserved.

