Unexpected performance characteristics when exploring migrating a Rails app to Heroku

I work at a small non-profit research institute. I work on a Rails app that is a “digital collections” or “digital asset management” app. Basically it manages and provides access (public as well as internal) to lots of files and descriptions of those files, mostly images.

It’s currently deployed on some self-managed Amazon EC2 instances (one for web, one for bg workers, one on which postgres is installed, etc). It gets pretty low traffic in general web/ecommerce/Rails terms. The app is definitely not very optimized — we know it’s kind of a RAM hog, and we know it has many actions whose response times are undesirable. But it works “good enough” on its current infrastructure for current use, such that optimizing it hasn’t been the highest priority.

We are considering moving it from self-managed EC2 to heroku, largely because we don’t really have the capacity to manage the infrastructure we currently have, especially after some recent layoffs.

Our Rails app is currently served by passenger on an EC2 t2.medium (4G of RAM).

I expected the performance characteristics moving to heroku “standard” dynos would be about the same as they are on our current infrastructure, but I was surprised to see some degradation:

  • Responses seem much slower to come back when deployed on heroku, mainly for our slowest actions. Quick actions are just as quick, but slower ones (or perhaps actions that involve more memory allocations?) are much slower on heroku.
  • The application instances seem to take more RAM running on heroku dynos than they do on our EC2 (this one in particular mystifies me).

I am curious whether anyone with more heroku experience has any insight into what’s going on here. I know how to do profiling and performance optimization (I’m more comfortable profiling CPU time with ruby-prof than I am trying to profile memory allocations with, say, derailed_benchmarks). But it’s difficult work, and I wasn’t expecting to have to do more of it as part of a migration to heroku, when performance characteristics were acceptable on our current infrastructure.
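
(For concreteness, the kind of CPU profiling I mean is roughly this — a minimal ruby-prof sketch, where SomeSlowReport is a hypothetical stand-in for whatever a slow action actually calls:)

  # Gemfile: gem "ruby-prof"
  require "ruby-prof"

  # Profile one invocation of a slow code path. SomeSlowReport is a
  # hypothetical stand-in for whatever the slow controller action does.
  result = RubyProf.profile do
    SomeSlowReport.new.generate
  end

  # Flat report of where time went, sorted by self time.
  RubyProf::FlatPrinter.new(result).print($stdout, min_percent: 1)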

Response Times (CPU)

Again, yep, I know these are fairly slow response times. But they are “good enough” on current infrastructure (EC2 t2.medium); I wasn’t expecting them to get worse on heroku (standard-1x dyno, backed by heroku pg standard-0).

Fast pages are about the same, but slow pages (that create a lot of objects in memory?) are a lot slower.

This is not load testing; I am not testing under high traffic or with concurrent requests. This is just accessing demo versions of the app manually, one page at a time, to see response times when the app is only handling one response at a time. So it’s not about how many web workers are running or fit into RAM or anything; one is sufficient.
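
(Concretely, “testing” here is nothing fancier than timing single, serial requests — something like this sketch, where the staging URLs are placeholders:)

  require "net/http"
  require "uri"

  # Placeholder staging URLs; each is fetched serially, several times,
  # so there is never more than one in-flight request.
  urls = [
    "https://staging.example.org/works/small-item",
    "https://staging.example.org/works/huge-item",
  ]

  urls.each do |url|
    uri = URI(url)
    3.times do
      start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      response = Net::HTTP.get_response(uri)
      ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round
      puts "#{response.code} #{url}: #{ms}ms"
    end
  end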

Response times, existing EC2 t2.medium vs. heroku standard-1x dyno:

  • Slow reporting page that does a few very expensive SQL queries, but they do not return a lot of objects (Rails logging reports Allocations: 8704): ~3800ms on EC2, ~3200ms on heroku (faster pg?)
  • Fast page with a few AR/SQL queries returning just a few objects each, a few partials, etc. (Rails logging reports Allocations: 8205): 81-120ms on EC2, ~120ms on heroku
  • A fairly small “item” page (Rails logging reports Allocations: 40210): ~200ms on EC2, ~300ms on heroku
  • A medium size item page that loads a lot more AR models and has a larger byte-size page response (Allocations: 361292): ~430ms on EC2, 600-700ms on heroku
  • One of our largest pages, which fetches a lot of AR instances, does a lot of allocations, and returns a very large page response (Allocations: 1983733): 3000-4000ms on EC2, 5000-7000ms on heroku

Fast-ish responses (and, from this limited sample, actually responses with few allocations even if they are slow waiting on IO?) are about the same. But our slowest/highest-allocating actions are ~50% slower on heroku? Again, I know these allocations and response times are not great even on our existing infrastructure; but why do they get so much worse on heroku? (No, there were no heroku memory errors or swapping happening.)
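
(The “Allocations” figures above are what Rails logs for each request; you can get a comparable count around any block of code from GC.stat — a rough sketch:)

  # Count Ruby object allocations across a block — similar in spirit to
  # the "Allocations: NNN" line Rails logs for each request.
  def count_allocations
    before = GC.stat(:total_allocated_objects)
    yield
    GC.stat(:total_allocated_objects) - before
  end

  # Hypothetical usage against whatever a slow action renders:
  #   puts count_allocations { SomeSlowReport.new.generate }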

RAM use of an app instance

We currently deploy with passenger (free), running 10 workers on our 4GB t2.medium.

To compare apples to apples, I deployed using passenger on a heroku standard-1x, with just one worker instance (because that’s actually all I can fit on a standard-1x!), to compare the size of a single worker from one infrastructure to the other.

On our legacy infrastructure, on a server that’s been up for 8 days of production traffic, passenger-status looks something like this:

  Requests in queue: 0
  * PID: 18187   Sessions: 0       Processed: 1074398   Uptime: 8d 23h 32m 12s
    CPU: 7%      Memory  : 340M    Last used: 1s
  * PID: 18206   Sessions: 0       Processed: 78200   Uptime: 8d 23h 32m 12s
    CPU: 0%      Memory  : 281M    Last used: 22s
  * PID: 18225   Sessions: 0       Processed: 2951    Uptime: 8d 23h 32m 12s
    CPU: 0%      Memory  : 197M    Last used: 8m 8
  * PID: 18244   Sessions: 0       Processed: 258     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 161M    Last used: 1h 2
  * PID: 18261   Sessions: 0       Processed: 127     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 158M    Last used: 1h 2
  * PID: 18278   Sessions: 0       Processed: 105     Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 169M    Last used: 3h 2
  * PID: 18295   Sessions: 0       Processed: 96      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 163M    Last used: 3h 2
  * PID: 18312   Sessions: 0       Processed: 91      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 169M    Last used: 13h
  * PID: 18329   Sessions: 0       Processed: 92      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 163M    Last used: 13h
  * PID: 18346   Sessions: 0       Processed: 80      Uptime: 8d 23h 32m 11s
    CPU: 0%      Memory  : 162M    Last used: 13h

We can see that, yeah, this app is low traffic; most of those workers don’t see a lot of use. The first worker, which has handled by far the most traffic, has a Private RSS of 340M. (The other workers, having handled fewer requests, are much slimmer.) Kind of overweight, not sure where all that RAM is going, but it is what it is. I could maybe hope to barely fit 3 workers on a heroku standard-2x (1024M) dyno, if these sizes were the same on Heroku.

This is after a week of production use — if I restart passenger on a staging server and manually access some of my largest, hungriest, most-allocating pages a few times, I see Private RSS use of only around 270MB.

However, on the heroku standard-1x, with one passenger worker, I used the heroku log-runtime-metrics feature to look at memory (RSS is, I believe, what should correspond to passenger’s report, and what heroku uses for memory capacity limiting)…

Immediately after restarting my app, it’s at sample#memory_total=184.57MB sample#memory_rss=126.04MB. After manually accessing a few of my “hungriest” actions, I see: sample#memory_total=511.36MB sample#memory_rss=453.24MB. That’s just a few manual requests, not a week of production traffic, and already 33% more RAM than on my legacy EC2 infrastructure after a week of production traffic — actually approaching the limit of what can fit in a standard-1x (512MB) dyno with just one worker.

Now, is heroku measuring memory differently than passenger-status does? Possibly. It would be nice to compare apples to apples, and passenger hypothetically has a service that would let you access passenger-status results from heroku… but unfortunately I have been unable to get it to work. (Ideas welcome.)
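
(In the meantime, one partial cross-check is to read RSS from inside the running app itself with the get_process_mem gem — say, logged from a temporary debug route — a sketch, assuming the gem is in the bundle:)

  # Gemfile: gem "get_process_mem"
  require "get_process_mem"

  # Report this process's resident set size, as read from /proc inside the
  # dyno. RSS seen from inside a container may still be accounted for
  # differently than what the platform's own metrics report.
  mem = GetProcessMem.new
  Rails.logger.info "worker pid=#{Process.pid} rss=#{mem.mb.round(1)}MB"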

Other variations tried on heroku

Trying the gaffneyc/jemalloc heroku buildpack with heroku config:set JEMALLOC_ENABLED=true (still with passenger, one worker instance) doesn’t seem to have made any significant difference — maybe 5% RAM savings, or maybe it’s just a fluke.

Switching to puma (puma 5 with the experimental, possibly memory-saving features turned on; just one worker with one thread) doesn’t make any difference in response time performance (none expected), but… maybe it does reduce RAM usage somehow? After a few sample requests to some of my hungriest pages, I see sample#memory_total=428.11MB sample#memory_rss=371.88MB — still more than my baseline, but not drastically so. (With or without the jemalloc buildpack seems to make no difference.) Odd.
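
(For the record, the puma setup for this test was roughly the sketch below — one worker, one thread; the nakayoshi_fork and wait_for_less_busy_worker lines illustrate the kind of Puma 5 experimental options in question, so treat the specifics as illustrative:)

  # config/puma.rb — sketch of the single-worker, single-thread test setup.
  workers 1
  threads 1, 1

  port        ENV.fetch("PORT", 3000)
  environment ENV.fetch("RACK_ENV", "production")
  preload_app!

  # Puma 5 experimental options: run GC (and compaction where available)
  # before forking workers for better copy-on-write sharing, and have busy
  # workers briefly defer to less-busy ones when accepting requests.
  nakayoshi_fork true
  wait_for_less_busy_worker 0.005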

So what should I conclude?

I know this app could use a fitness regime, but it performs acceptably on current infrastructure.

We are exploring heroku because of staffing capacity issues, hoping not to have to do so much ops. But if we trade ops for having to spend a lot of time on challenging performance optimization (not really suitable for a junior dev)… that’s not what we were hoping for!

But perhaps I don’t know what I’m doing, and this haphazard, anecdotal comparison is not actually data and I shouldn’t conclude much from it? Let me know, ideally with advice on how to do it better.

Or… are there reasons to expect different performance characteristics from heroku? Might it be running on underlying AWS infrastructure that has fewer resources than my t2.medium?

Or, starting to guess at hypotheses: maybe the fact that heroku standard tier does not run on “dedicated” compute resources means I should expect a lot more variance compared to my own t2.medium, and as a result when deploying on heroku you need to optimize more (so the worst case of the variance isn’t so bad) than when running on your own EC2? Maybe that’s just part of what you get with heroku: unless you pay for performance dynos, it is even more important to have a well-performing app? (Yeah, I know I could use more caching, but that of course brings its own complexities; I wasn’t expecting to have to add it as part of a heroku migration.)

Or… I find it odd that it seems like slower (or more allocating?) actions are the ones that are worse. Is there any reason that memory allocations would be even more expensive on a heroku standard dyno than on my own EC2 t2.medium?

And why would the app workers seem to use so much more RAM on heroku than on my own EC2 anyway?

Any feedback or ideas welcome!

10 thoughts on “Unexpected performance characteristics when exploring migrating a Rails app to Heroku”

  1. Hi

    I spent a year trying to tune my app to work better on the 2X dyno type.
    My conclusion was that the biggest problem is what you mentioned: “maybe the fact that Heroku standard tier does not run on “dedicated” compute resources means I should expect a lot more variance”.

    When you look at dyno types https://devcenter.heroku.com/articles/dyno-types you can see that the 1X dyno has 1x-4x variance in computing performance (CPU), and the 2X dyno type has 4x-8x compute. Even using the 2X dyno type, the app still performs worse than on a dedicated server in my experience.

    I was also experimenting with the performance-m dyno, but the best results I got were just from using performance-l, which is a powerful dedicated server. Looking back after a year of tuning my app https://knapsackpro.com, just using a powerful server made the biggest difference, and it might be worth it so you don’t spend a lot of time trying to optimize the app while it can behave randomly due to compute variance like 4x-8x. This article is also interesting, showing NewRelic request response times going down after a switch to performance dynos https://medium.com/swlh/running-a-high-traffic-rails-app-on-heroku-s-performance-dynos-d9e6833d34c4

    Another useful thing might be to look at dyno load to verify how saturated the CPU is:
    https://help.heroku.com/88G3XLA6/what-is-an-acceptable-amount-of-dyno-load

    Recently I found this very interesting thread with tips on how to tune the number of puma workers and what to look for when checking dyno load in the Heroku metrics tab.
    https://github.com/heroku/heroku-buildpack-ruby/pull/946

    I hope you find it useful.

  2. > I expected the performance characteristics moving to heroku “standard” dynos would be about the same as they are on our current infrastructure. But was surprised to see some degradation:

    As the other commenter mentioned: Heroku’s standard dynos are multitenant, which means (as you noticed) high variance. Short actions are still short, but long actions can take longer. I looked up your EC2 t2.medium and it has 2 vCPUs. To get the same setup you would need to use Perf-M dynos. But usually it’s more cost effective to switch to Perf-L dynos, which have 4x the vCPUs at only double the price. Though you also need to make sure you’re adjusting your WEB_CONCURRENCY to use all the cores; I have recommendations here https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server#recommended-default-puma-process-and-thread-configuration
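
    (Roughly, the recommended shape is a config/puma.rb that sizes workers and threads from env vars — a sketch along these lines:)

      # config/puma.rb — sketch: size processes/threads from env vars so the
      # same config can be tuned per dyno type without a code change.
      workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))
      max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
      threads max_threads, max_threads

      preload_app!

      port        ENV.fetch("PORT", 3000)
      environment ENV.fetch("RACK_ENV", "development")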

  3. Thanks schneems, this is super helpful, I appreciate it!

    The number of vCPUs isn’t relevant to the observations so far — I was testing only one worker with one thread at a time, just to see differences in response time when *not* under load, just single responses, so it was only using one vCPU at most.

    I hadn’t expected the CPU to be that much different from a t2.medium — but the real unexpected thing I still can’t totally explain is that, as you say, short actions are still short (little or no difference from my t2.medium), but longer actions get disproportionately slower and also have more variance, compared to shorter actions.

    I still can’t totally explain why that happens, but it sounds like that is a known phenomenon to you? That’s reassuring that at least what I’m seeing is in your experience typical! Perhaps it is just a natural mathematical outcome of the multi-tenancy somehow.

    Next, I do plan to think about the best way to do some load testing, to test various heroku setups as well as compare them to my existing EC2 setup.

    Heroku will definitely reduce the need for local ops expertise; heroku’s infrastructure is amazing. But if the performance pressures end up exaggerated on heroku, then the increased need for more expert time spent on performance tuning the app might balance some of that out (as you know, not a super easy or trivial thing to do, not necessarily for a junior dev, and requiring ongoing attention).

    But if performance dynos eliminate the extra performance pressure compared to our existing infrastructure AND fit within our budget (still significantly cheaper for our low-traffic app than a sysadmin/devops FTE), it could all work out.

    I write all this partially to leave a record, because this is not something I had seen discussed before, and was not something I was expecting.

  4. > I still can’t totally explain why that happens, but it sounds like that is a known phenomenon to you?

    Well, on a standard dyno you’re not guaranteed a vCPU. You’re always fighting someone else for it. Your memory is guaranteed, but all the containers on the instance share the cores. We tend not to say that performance dynos will buy you more speed; instead, what we say is they buy you more consistency (less variance), since at all times you’ve got the same amount of resources. It sounds like you need a full CPU core for the full 4-ish seconds that it takes to complete the slowest long-tail requests. If someone else on your dyno is on average using more than 1 CPU, say 1.1 or 1.2, then periodically the system will give them those resources when you are wanting them. Since there are many people on a shared instance, everyone might be doing this, and since you’re the only one that’s actually asking for a maximum of 1 core, your average will always be below 1, say 0.7 (just making numbers up).

    As you’re explaining the problem, it sounds like a variance issue, which is exactly what we advertise the performance dynos for. You might see some perf benefit from moving to standard-2x, as they’re packed less densely so they just have fewer neighbors, but I guarantee you’ll see a difference if you move to performance dynos (though again, from a cost basis, try to skip Perf-M dynos if you can).

    > I was testing only one worker with one thread at a time

    Also, due to the way we route requests, we send them to a random dyno. That means it’s very likely that one dyno will get two “slow” requests in a row, and since your setup is only able to handle one request per dyno, the second request might have to wait 3 seconds before the first request is done. If you have New Relic, you would see this under “request queuing” time. I HIGHLY recommend at least 2 workers minimum, or at least 1 worker and multiple threads. It will drastically decrease this queueing time and most efficiently use resources, even if you’re on standard dynos. Also, the addon “Rails Autoscale” will show you request queuing time.

    So to recap, for your case I recommend: having more than one worker, or switching to worker + threads, to drop request queuing time. Once there, I recommend trying different dynos (and referencing my recommended settings from the chart). From a cost-to-perf-benefit standpoint, I think running Perf-L dynos and maxing them out with 8 processes would be my first recommendation. My second recommendation would be to try standard-2x to see if fewer neighbors (and possibly an extra process) help significantly.

    Hope this helps

  5. Thanks, yeah. It may just be that, mathematically, “more variance” means you’re more likely to notice it on longer responses; that makes sense.

    I am aware that you always want more than one worker/thread in a real app, including for reasons of routing. (This also makes me nervous about a standard-2x dyno that can apparently only fit TWO puma workers, with maybe two threads each, for my app — also a potential routing performance issue; I think you really want as many workers/threads as possible to avoid that kind of routing backup.)

    That’s just not what I’m testing now. I am intentionally testing a staging app with only one concurrent request being made to it at a time — so request queueing is not relevant. This is just an artificial test scenario to get some baseline expectations — adding more workers affects performance under load, but it can’t speed up the time a non-busy worker takes to return a response to a single request, and that’s what I’m trying to look at to understand the baseline.

    As I continue to test things… it also looks like maybe large (many-byte) responses are weirdly slower / higher-variance. That one is even more inexplicable to me.

    I have a response which is 455K, but is not too too slow on my EC2 infrastructure — ~600ms. But on my heroku standard-1x, it’s still taking more like 1200ms. That is not very slow, but it is a LARGE response… not sure if different heroku buffering behavior could be involved.

    OR, it certainly could still just be more-or-less coincidental variance — it happens to be slower at the time I’m looking at it due to multi-tenancy. I really should, and will, switch to testing on performance dynos soon.

  6. Part of the issue is that a single performance-l is more than we were thinking we’d spend. We are a cash-strapped, low-budget non-profit, and this is a pretty low-traffic app.

    I had been estimating more like 4 standard-2x’s (~$200/month) to get to the RAM we currently have, not realizing the implications of multi-tenancy — or the routing implications of fewer workers per dyno.

    It does sound like a performance dyno is what’s going to be required for this app; we’ll just have to do more testing, and then decide if it fits in our budget. (performance-l is $500/month, or $3600/year more than I had been roughly estimating. Not actually THAT much money, but it’s real money — on top of the other resources involved in a heroku deployment of course, including db and bg workers, and we use solr.)

    > My second recommendation would be to try standard-2x to see if fewer neighbors (and possibly an extra process) help significantly.

    Huh, you get fewer neighbors on a standard-2x for some reason? I will try that out.

  7. OK, incidentally — when testing on a performance-m dyno — again NOT under load, intentionally just testing ONE request at a time serially, no concurrency, no routing queue issues — I get response times very, very similar to those on my manual EC2 t2.medium. Close enough that I’m going to call it just measurement variance and identical.

    So okay, that’s a thing! I think my conclusion is that if your app is very optimized/tuned and has only nice, well-behaved <500ms responses, you may not notice the impact of multi-tenancy on standard dynos. (Although again I'm not totally sure I wasn't noticing it on some fast but large responses etc.)

    But if your app is not so well behaved and has some slow responses, they are likely to get much slower on multi-tenant standard dynos, and you really are going to want/need `performance`, so you should plan and budget for that.

    Possibly different Private RSS RAM consumption is still confusing me a bit — but I'm less confident in my observations there, my observations may just be random variance and not truly a difference. (Or differences in methods of measuring?)

  8. > Huh, you get fewer neighbors on a standard-2x for some reason? I will try that out.

    Because they’re the same underlying machine as the 1x dynos, but each app has 2x the RAM, so you get half the neighbors. It’s not a thing we advertise; it’s just an artifact of how we implemented 2x dynos.

    > Possibly different Private RSS RAM consumption is still confusing me a bit — but I’m less confident in my observations there, my observations may just be random variance and not truly a difference. (Or differences in methods of measuring?)

    Measuring RAM inside of a container is REALLY hard. I’ve been trying to do it for going on 6+ years https://github.com/schneems/get_process_mem/issues/7. The short answer is that you can’t. You’ll get pretty close on a perf dyno, because you’ve effectively got the entire machine to yourself. Once you start running in a containerized environment with multiple other apps, then the OS’s and container’s accounting of how much memory you’re using can be very different from an RSS reading. You can see that thread and some of the links there to understand a bit about why. But the short version is that even though we think of memory as a single number, it’s translated into many smaller chunks of pages on physical RAM, and there are usually a few layers of indirection (such as the TLB) between your program, the OS, and the physical memory. To make matters worse/different, shared resources can be accounted for differently.

    For memory, I mostly only look at the graphs that Heroku provides. Basically, as long as you’re under the max value for your dyno, you’re not hitting a swapping penalty. (It’s normal to see a little swapping/paging periodically even when you’re under.) If you see consistent swapping on your app, then it’s a huge perf hit, and usually it’s worth it to invest in either upgrading the amount of RAM you have or working to decrease memory use https://devcenter.heroku.com/articles/ruby-memory-use

    If a perf-m dyno works for you then go for it. It’s got the same vcpu count you had before so it should be an apples-to-apples comparison. If you need to scale up and double your dyno count then it’s worth it to instead run fewer Perf-L dynos for the same cost (assuming you’re maxing out your processes on them).
