You're reading the Ruby/Rails performance newsletter by Speedshop.

Looking for an audit of the perf of your Rails app? I've partnered with Ombu Labs to do just that.

Do you speak Japanese? Subscribe to the Japanese edition of this newsletter.

Making a network call to a third party during your web response: a common pitfall. Here are a few ways to mitigate the risk.

It's a common story.

It's the middle of the business day. Your customer service team (or your personal email inbox!) is overwhelmed with complaints that the site is down. What's going on?

A quick look through your dashboards tells the story clearly enough: a third party service, which you call over HTTP, has gone down completely. Web requests wait up to 60 seconds for the remote service to respond, and when it doesn't, they either time out or return a 500. Each of those requests now takes 60 seconds. Your autoscaler is going haywire, running up your AWS bill for no reason. Your own website now looks down. Even requests which don't use this 3rd party service are failing.

This is how DevOps-naive organizations become battle-hardened operations machines. It's incidents like this which start to build the battle scars and the resulting playbook for solving these problems.

But you don't have to learn every lesson the hard way. Often, you will anyway. As human beings, we're quite bad at investing in the present to prevent an uncertain future negative. I want to do the best that I can in this newsletter to give you some fast and effective, low-cost tips.

Why is it bad to call out to 3rd parties during a web request?

I've seen this repeated so many times at so many client companies.

When we are making a response to a web request, we are on a very tight timeline. Most web responses should take a few hundred milliseconds or less. After that, the user becomes frustrated.

Making a network call to a 3rd party during a web request creates three major risks:

1. It couples your response time to something out of your control
2. It couples your scalability to a service not in your control
3. It couples your downtime to a service not in your control

These three things (response time, scalability, and downtime) are the three reasons organizations work on performance. A 3rd party service going down destroys each goal.

When a 3rd party, which you are making calls to during a request, goes down, the following things happen:

1. Some responses (the ones which use the 3rd party) get very slow. This is the first customer impact.
2. As response times get slower, your compute infrastructure starts to run out of capacity. If 10% of your requests start taking 30 seconds to process, for most applications that represents a 5-10x increase in total load.
3. Your autoscaler tries to compensate, but hits a limit or doesn't react fast enough. You do have autoscaling set up based on request queue times, right? CPU-based autoscaling will fail in this scenario, because the increase in response times is due to I/O wait, which means CPU utilization will actually fall.
4. Your service goes down for all requests, as request queues fill up and everyone's requests time out.
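
To see where a multiplier like the one in step 2 comes from, here is a back-of-the-envelope sketch. The 500 ms baseline response time is my assumption for illustration, not a number from any real app, so plug in your own.

```ruby
# Back-of-the-envelope: how much more worker time does the same traffic need
# when a slice of requests suddenly blocks on a dead 3rd party?
# All inputs below are illustrative assumptions.
def load_multiplier(baseline_seconds:, slow_fraction:, slow_seconds:)
  # Average time a worker is occupied per request during the incident.
  degraded = (1 - slow_fraction) * baseline_seconds + slow_fraction * slow_seconds
  degraded / baseline_seconds
end

multiplier = load_multiplier(
  baseline_seconds: 0.5,  # assumed normal response time
  slow_fraction: 0.10,    # 10% of requests touch the 3rd party
  slow_seconds: 30.0      # each of those now blocks for 30 seconds
)
puts format("Same traffic now needs about %.1fx the worker time", multiplier)
```

With these inputs the multiplier comes out just under 7x. A faster baseline makes it worse, because the stuck requests dominate the average even more.
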

So, should you never depend on a 3rd party HTTP API in a web application? Of course you can. What you shouldn't do is call it while a web request is waiting on the answer.

This fault usually makes it into production because of an over-focus on shipping the next feature at the expense of code review by senior engineers. Integrating with 3rd party APIs is powerful, no doubt, and can greatly enhance an application's usefulness for little cost and little time. But the simple act of making an HTTP call to a 3rd party during a web request is quite easy to catch in the software design process: it isn't hidden or buried in the bowels of a library. Someone usually is paying a bill for this, anyway!

The first and best solution is to never do this in the first place. 3rd party APIs should only be interacted with in background jobs, which are far more fault tolerant. If you restrict your application to interacting with 3rd parties via background job only, you will by necessity build in systems that can deal with downtime, because no one is allowed to interact with the third party in real time. You have to build a "pending" state into the system to allow remote state to update. In this case, downtime simply looks like "everyone is stuck in pending", not "our site is down".
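
Here is a rough plain-Ruby sketch of that "pending" pattern. Every name in it (Payment, ChargeJob, FlakyGateway, GatewayDown) is hypothetical and made up for illustration. In a real Rails app the job would run on your job system and the record would be a database row; an Array stands in for the queue here.

```ruby
# Sketch of "talk to 3rd parties only from background jobs".
# All class names are hypothetical; an Array stands in for the job queue.

Payment = Struct.new(:id, :status)

class GatewayDown < StandardError; end

# Stand-in for a 3rd party client that is down for its first few calls.
class FlakyGateway
  def initialize(failures_before_recovery)
    @failures_left = failures_before_recovery
  end

  def charge(_payment)
    if @failures_left > 0
      @failures_left -= 1
      raise GatewayDown, "3rd party unavailable"
    end
    :ok
  end
end

class ChargeJob
  def initialize(gateway, queue)
    @gateway = gateway
    @queue = queue
  end

  def perform(payment)
    @gateway.charge(payment)
    payment.status = :paid
  rescue GatewayDown
    # The outage only costs time: the record stays pending and the job is
    # put back on the queue (a real job system would add backoff here).
    @queue << payment
  end
end

queue = []
job = ChargeJob.new(FlakyGateway.new(2), queue)

# The "web request": record intent, enqueue, respond. No network call.
payment = Payment.new(1, :pending)
queue << payment

# The "worker": drain the queue until the work is done.
attempts = 0
until queue.empty?
  attempts += 1
  job.perform(queue.shift)
end

puts "status=#{payment.status} after #{attempts} attempts"
```

While the gateway is down, the payment just sits in :pending and the web tier never notices. The user sees "processing" instead of a 60 second spinner.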

I'm not saying this is easy. I'm saying it's necessary.

Band-aids and mitigations

The second best solution is inventory and quarantine. Find where you are making calls to 3rd party APIs in your application, and then mitigate these risks.

Many APMs (Datadog's is best at this) allow you to view external HTTP calls in a separate external services tab. Do so, and keep a close eye on this inventory.

These tools work by monkeypatching Net::HTTP and other Ruby HTTP libraries to figure out when you're making network calls. If you don't have these integrations on and installed, make sure you do!
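
If you're curious what that kind of patch looks like, here is a heavily simplified sketch. It is not any vendor's actual code: ExternalCallLog is a hypothetical module of mine, and real agents also handle tracing context, sampling, and other HTTP libraries.

```ruby
require "net/http"

# Simplified sketch of an APM-style hook on Net::HTTP.
# ExternalCallLog is a hypothetical name, not part of any real agent.
module ExternalCallLog
  CALLS = []

  def request(req, body = nil, &block)
    # When the session isn't started yet, Net::HTTP#request opens the
    # connection and then calls itself again. Skip that outer call so each
    # HTTP request is only recorded once.
    return super unless started?

    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    begin
      super
    ensure
      # Record the call even if it raised (for example, on a timeout).
      CALLS << {
        host: address,
        verb: req.method,
        path: req.path,
        seconds: Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
      }
    end
  end
end

# Put the hook in front of Net::HTTP's own #request.
Net::HTTP.prepend(ExternalCallLog)
```

With this loaded, every request made through Net::HTTP appends a hash to ExternalCallLog::CALLS, which you could group by host to build your inventory. Libraries built on top of Net::HTTP get picked up for free; anything else would need its own hook.
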

Once you know where the calls are coming from (inside the house!), there are a few steps you can take before you move them into background jobs:

1. Set aggressive timeouts. Most libraries default to timeouts of 30 to 60 seconds (Ruby's Net::HTTP uses 60 seconds each for opening a connection and for reading), which is far too long for a web request. 5 seconds is usually more appropriate. You can set longer timeouts in background jobs if you like.
2. Set up circuit breakers. There are many gems that do this in the Ruby ecosystem, such as cb2. A circuit breaker allows a piece of code to fail a limited number of times, and then when the circuit is "broken" or "tripped", avoids that codepath completely and immediately transitions to the failure state. This pattern is extremely valuable for preventing 3rd party API calls from causing outages.
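
Here is a rough sketch of both mitigations in plain Ruby. The timeout options passed to Net::HTTP.start come from the standard library. The CircuitBreaker class is a hypothetical, in-process toy I wrote for illustration; it is not the API of cb2 or any other gem. The timeout values, the threshold of 3 failures, and the 30 second cool-off are assumptions to tune for your own app.

```ruby
require "net/http"
require "uri"

# 1. Aggressive timeouts on a Net::HTTP call. The values are assumptions.
def fetch_with_timeouts(url)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: 1,   # seconds to establish the connection
                  read_timeout: 5,   # seconds to wait on each read
                  write_timeout: 5) do |http|
    http.get(uri.request_uri)
  end
end

# 2. Minimal circuit breaker (hypothetical toy, state lives in one process).
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold:, cool_off:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @threshold = threshold   # consecutive failures before tripping
    @cool_off = cool_off     # seconds to stay open before a trial call
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def open?
    !@opened_at.nil? && (@clock.call - @opened_at) < @cool_off
  end

  def run
    # While open, fail immediately without touching the 3rd party.
    raise OpenError, "circuit open, skipping call" if open?

    begin
      result = yield
    rescue StandardError
      @failures += 1
      @opened_at = @clock.call if @failures >= @threshold
      raise
    end

    # A success closes the circuit and resets the count.
    @failures = 0
    @opened_at = nil
    result
  end
end

# Demo with a simulated timeout instead of a real network call.
# Real usage would be: breaker.run { fetch_with_timeouts("https://...") }
breaker = CircuitBreaker.new(threshold: 3, cool_off: 30)
attempted = 0
5.times do
  begin
    breaker.run do
      attempted += 1
      raise Timeout::Error, "simulated 3rd party timeout"
    end
  rescue Timeout::Error, CircuitBreaker::OpenError
    # In a web request, this is where you'd render the fallback.
  end
end
puts "3rd party was actually called #{attempted} times out of 5"
```

After the third failure the breaker trips, so the last two attempts fail instantly instead of each tying up a worker for the full timeout. A production breaker would also need to share its state across processes and servers, which is a good reason to reach for a gem.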

Protect your uptime and your response times. Don't outsource your availability to third parties.

For any two systems which depend on each other, to get the uptime of the full system you have to multiply the uptime of the two sub-systems together. Adding any network call into your infrastructure - 3rd party or 1st party - has this effect. At least in the case of a 1st party dependency, someone is getting paid by you to fix it. With 3rd parties, well, good luck with that support ticket.
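
To put numbers on that multiplication, here is a quick sketch. The 99.9% figures are made up for illustration, and the math assumes the two systems fail independently and that you can't serve anything without the dependency.

```ruby
# Illustrative numbers only: two systems, each at "three nines".
HOURS_PER_YEAR = 24 * 365

# Uptime of a chain of hard dependencies, assuming independent failures.
def combined_uptime(*uptimes)
  uptimes.reduce(1.0) { |product, uptime| product * uptime }
end

def downtime_hours_per_year(uptime)
  (1 - uptime) * HOURS_PER_YEAR
end

yours  = 0.999 # assumed uptime of your app on its own
theirs = 0.999 # assumed uptime of the 3rd party
both   = combined_uptime(yours, theirs)

puts format("alone:    %.3f%% up, about %.1f hours down per year",
            yours * 100, downtime_hours_per_year(yours))
puts format("together: %.3f%% up, about %.1f hours down per year",
            both * 100, downtime_hours_per_year(both))
```

Under those assumptions, one in-request dependency roughly doubles your expected downtime, from about 8.8 hours a year to about 17.5.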

Until next week,
Nate

You can share this email with this permalink: https://mailchi.mp/railsspeed/mitigating-risk-during-the-request-http-to-3rd-parties?e=[UNIQID]

Copyright © 2023 Nate Berkopec, All rights reserved.
