10 April 2022

Redis & Sidekiq

Dan Mayer
Dan Mayer @danmayer

Redis & Sidekiq

A collection of notes about working with Sidekiq and Redis. A previous post about Ruby & Redis, briefly touched on some things, but I will get into more specifics in this post.

Redis and Background Jobs

A common usage of Redis for Rubyists is for background jobs. Two popular libraries for jobs are Sidekiq and Resque. At this point, I highly recommend Sidekiq over Resque as it is more actively maintained and has more community support around it. I am not going to get into too many specifics of Sidekiq and Resque, but talk a bit more about how they use Redis. There are always some gotchas when working with Redis, ask folks about sometimes an incident that occurred because of a keys, flushall, or flushdb command. Some of these commands are destructive which is always something to be careful with, but they also all have very slow performance characteristics. It is worth noting how some of the calls in Resque and Sidekiq scale with queue depth, which is critical to understand.

A Incident Caused by our Sidekiq/Redis Observability

UPDATE: We no longer think the line below is the culprite, we observe latency growth and decline with queue size, but we are unsure of the cause and unable to reproduce. As seen, in the NOTE above the Sidekiq latency call is O(1+1) and therefor fast and predictable.

We got into trouble when moving from Resque to Sidekiq because our observability instrumentation was frequently making an O(S+N) call (Sidekiq’s queue latency). It wasn’t much of a problem until we had one of our common traffic spikes that results in a brief deep queue depth. Our previous Resque code didn’t have any issues and had some similar instrumentation being sent to Datadog. While our Sidekiq code had been live for days, this behavior where our processing speed decreased with queue depth hadn’t been observed or noticed. The problem came to light when on a weekend (of course) a small spike caused a background job backlog, as we an expected common case. The latency went way up due to our instrumentation and we started processing jobs slower than we enqueued them. This fairly quickly filled our entire Redis leading to OOM errors.

Redis Sidekiq Analytics

Analytics Monitoring our Recovery

These charts are from after the incident. We moved to a new Redis to get things back up and running during the incident, and after things were under control worked on draining the old full Redis, in a isolated way that couldn’t impact production load. In this graph, you can see as we reduce the queue size the latency of our Redis calls also reduces in step. I included CPU to show how hard we were taxing our Redis, this chart isn’t 1:1 as we were adding and removing workers and making some other tweaks, but the queue size -> latency is a direct correlation.

Code Culprite

NOTE: Update Mike responded that he doesn’t think the latency call is the issue so we are further investigating

UPDATE: We no longer think the line below is the culprite, we observe latency growth and decline with queue size, but we are unsure of the cause and unable to reproduce. As seen, in the NOTE above the Sidekiq latency call is O(1+1) and therefor fast and predictable.

As mentioned it wasn’t any of our normal code that was really the problem, it was this line that was part of our instrumentation and observability tooling. Sidekiq::Queue.new(queue_name).latency. As with any incident, there were a ton of other related things, but it is worth noting that this seemingly simple line could have some hidden gotchas or an outsized impact on your Redis performance. As that latency call scales linearly with queue size, it is calling Redis’s Lrange under the hood which is an O(S+N) operation.

Sidekiq / Redis Performance

A colleague @samsm, helped dig into this incident by putting together the queue size -> latency charts above as well as all the helpful tables I am sharing below. These showe how Sidekiq calls translate into their Redis implementation operation details and the operational costs.

Redis / Sidekiq Math

Doing the math on various Sidekiq operations: how much will they impact Redis?

Big O Complexity Notation 101

Big O Notation

Redis Big O

Sidekiq Operation Complexity

Redis Sidekiq Mapping

Additional Sidekiq / Redis Reading

Some additional reading if you want to dig in further on working with Sidekiq and Redis

Categories

Ruby