You're reading the Ruby/Rails performance newsletter by Speedshop.

It's the one mistake that I see over and over again.

Almost every service that we work on as Rails developers has a queue.

Of course, many things in life have queues, not just Rails applications. You go to the grocery store, there's a queue. Need a new ID card? Certainly a queue there. And our Rails apps have queues too: if all of your Unicorn or Puma workers (or background job workers) are busy, the work has to queue until one is free and can pick it up.

Now imagine if your grocery store had absolutely no visibility into the state of the queues at their checkout. The manager can't just look over and see how many people are waiting in line and if they need to get on the intercom and tell people to come up to the front to work additional checkout stations. No, imagine instead that this grocery store just had a black hole in front of the checkout stations that all the customers walked into and then popped out of whenever a checkout was ready. How would the manager know when to add more checkout personnel? They wouldn't, and, being mostly cheapskates, they would probably just leave one person up there manning a single lane, leaving you to float in eternal discombobulation in the Kroger Black Hole.

That's ludicrous of course. So why do so many of us run our Rails applications this way?!

Time after time, client after client, I find that they have no instrumentation on their web or background work queues, sometimes (usually?) both. Autoscaling is accomplished, if at all, by the equivalent of wetting one's thumb and checking the wind direction.

Of course, Daddy Bezos absolutely loves it when you overscale your service because you have no idea that your queues are all empty. He'll be gold-plating his yacht's thirty-seventh toilet because you scaled up to a hundred servers even though most of your background jobs start processing in less than a second.

And your customers are absolutely delighted when their password reset email takes 6 hours to send, or when your website has a time to first byte of 10 seconds on every page load.

There is no reason to scale a service up or down except to control queue times. If you don't know what your queue time distribution is, you are absolutely just guessing.

I think a lot of people get tricked by their APM here. When New Relic or Datadog says "your response time is 500 milliseconds", you think, ah, well, that's what it is. But these services, while they can instrument queue time, do not do it by default! They almost always only display the time it takes to service a request or job, not the time it spent queueing. So when New Relic says "SomeJobClass" executes in 30 seconds, you have absolutely no idea how long it spent queueing before that 30 seconds even started.

Here's how to set up queue time instrumentation for your web application:
  • If you're not on Heroku, you will need to add a header (usually X-Request-Start, see your APM's docs) to each request _before_ your Rails app sees it (so, in your reverse proxy like Nginx or at the load balancer). This is impossible in some setups, e.g. Amazon ALB with no reverse proxy on the server.
  • New Relic has docs.
  • Datadog has a built-in integration, but I don't really like how it reports by default. At Gusto, we built a custom Rack middleware that creates a Distribution metric, which gives us percentiles too (75/90/95/99/max).
  • Scout: On by default.
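If you go the custom-middleware route (like the Gusto setup described above), the shape of it looks roughly like this. This is a sketch, not a drop-in: `QueueTimeMiddleware`, the `reporter` object, and the metric name are all stand-ins for whatever your APM client actually provides, and the exact header format depends on your proxy (Nginx is often configured with something like `proxy_set_header X-Request-Start "t=${msec}";`).

```ruby
# Sketch of a Rack middleware that records request queue time.
# Assumes the reverse proxy stamps X-Request-Start as "t=<timestamp>"
# before the request reaches the app server.
class QueueTimeMiddleware
  def initialize(app, reporter:)
    @app = app
    @reporter = reporter # stand-in for your APM/StatsD client
  end

  def call(env)
    if (header = env["HTTP_X_REQUEST_START"])
      start = header.delete_prefix("t=").to_f
      # Proxies vary: some send seconds, some milliseconds. This
      # normalization heuristic is an assumption; check your proxy's docs.
      start /= 1_000.0 if start > 10_000_000_000
      queue_time_ms = [(Time.now.to_f - start) * 1_000, 0].max
      # A distribution/histogram metric gives you percentiles, not just
      # an average - that's the whole point.
      @reporter.distribution("rack.queue_time_ms", queue_time_ms)
    end
    @app.call(env)
  end
end
```

In a Rails app you'd insert this near the top of the middleware stack (e.g. `config.middleware.insert_before 0, QueueTimeMiddleware, reporter: ...`) so the measurement happens before any other app work.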
Here's how to set up queue time instrumentation for your background jobs:
  • New Relic: Build it yourself, doing something like this.
  • Datadog: Build it yourself using a custom metric - see the New Relic example.
  • Scout: On by default.
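For the build-it-yourself options, a Sidekiq server middleware is one way to do it. Sidekiq puts an `enqueued_at` Unix timestamp in the job payload, so the worker can compute how long the job sat in the queue before being picked up. Again, a sketch under assumptions: the `reporter` and metric name are placeholders for your metrics client, and the timestamp unit has varied across Sidekiq versions (older versions use a float of seconds), so verify against the version you run.

```ruby
# Sketch of a Sidekiq server middleware that reports job queue time.
# Registered via Sidekiq.configure_server { |c| c.server_middleware { |chain| ... } }.
class JobQueueTimeMiddleware
  def initialize(reporter:)
    @reporter = reporter # stand-in for your metrics client
  end

  def call(worker, job, queue)
    if (enqueued_at = job["enqueued_at"])
      # enqueued_at is set by Sidekiq when the job is pushed; the
      # difference from "now" is the time spent waiting in the queue.
      queue_time_ms = [(Time.now.to_f - enqueued_at.to_f) * 1_000, 0].max
      @reporter.distribution("sidekiq.queue_time_ms", queue_time_ms,
                             tags: ["queue:#{queue}"])
    end
    yield # run the job itself
  end
end
```

Tagging by queue name matters: a 5-minute queue time might be fine for a `low` queue and a disaster for `mailers`.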
Until next time - keep the queueing under control, my friends.

-Nate
You can share this email with this permalink: https://mailchi.mp/railsspeed/never-ever-ever-deploy-without-measuring-queue-time?e=[UNIQID]

Copyright © 2022 Nate Berkopec, All rights reserved.

