When should you scale a database vertically versus horizontally?

Last week, we talked about the situations when a SQL read replica makes things harder than they have to be. This week, I want to talk a little bit about the math behind read replicas and scaling for throughput, and then talk a little bit about the circumstances in which a read replica is actually very necessary.

When do you need to scale a database plan? When do you actually need to scale up a plan from, say, 4xlarge to 8xlarge?

Amazon actually provides an extremely helpful list of metrics to monitor when scaling a SQL database (quoted, my comments marked):

A DB instance has a number of different categories of metrics, and how to determine acceptable values depends on the metric.

CPU

   CPU Utilization – Percentage of computer processing capacity used. Nate: Generally, should be less than 80% most of the time, especially on smaller plans (4 vCPUs or less). On very large plans, with 30+ vCPUs, this metric may be fine at 95% or less.

Memory

   Freeable Memory – How much RAM is available on the DB instance, in megabytes. The red line in the Monitoring tab metrics is marked at 75% for CPU, Memory and Storage Metrics. If instance memory consumption frequently crosses that line, then this indicates that you should check your workload or upgrade your instance. Nate: agreed!

   Swap Usage – How much swap space is used by the DB instance, in megabytes. Nate: You can never fully eliminate swap usage, because it's up to the kernel. In general, swap usage is fine if freeable memory exceeds 10% of total memory.

Disk space

   Free Storage Space – How much disk space is not currently being used by the DB instance, in megabytes. Nate: Self-explanatory, although 90% of people will hit CPU and IO metrics first.

Input/output operations

   Read IOPS, Write IOPS – The average number of disk read or write operations per second. Nate: All database providers will throttle you if you exceed the amount provisioned for your plan. Be sure you know what that is.

   Read Latency, Write Latency – The average time for a read or write operation in milliseconds. Nate: This should generally be less than a few milliseconds. If it isn't, you are possibly being I/O throttled or are running out of CPU.

   Queue Depth – The number of I/O operations that are waiting to be written to or read from disk. Nate: Again, if this number is higher than 1 for any extended period of time, you are either CPU or I/O throttled.

Network traffic

   Network Receive Throughput, Network Transmit Throughput – The rate of network traffic to and from the DB instance in megabytes per second. Nate: Not usually what people run out of first.

Database connections

   DB Connections – The number of client sessions that are connected to the DB instance. Nate: AWS provides pretty high numbers of connections, but if you've got more than 500 database connections, you probably want to use a connection pooler like pgbouncer or whatever the equivalent is for your chosen database.

When any of these metrics exceeds acceptable numbers (usually CPU or IOPs), just scale up the plan. As discussed last week, you should do this until you run out of database plans offered, and even then, you should try asking your provider if they will give you a bigger one.

If you're not at the largest database plan and you've added a read replica, you've almost always wasted your time.

However, there are a few circumstances when you should not follow this advice, and should create a read replica *before* you reach maximum database plan size:

Analytics loads. Analytics loads tend to have extremely long-running queries (30+ seconds). These queries can take out shared locks which block writes on the affected rows (depending on your isolation level, but true by default for most databases). However, most database engines nowadays are quite good at avoiding needing shared locks, and most shared locks do not live until the end of the transaction. You can probably get away with light analytics loads, especially if queries take less than a second or two to run and are run infrequently.
Giving business intelligence units a "bombproof" database. Often, internal units, such as marketing, want to be able to query the database to get customer data and run reports. Do NOT let them do this against a production database, even if you think you've got the permissions right (read-only), as they will probably write crap, unoptimized SQL that thrashes your performance metrics.

Of course, once you reach max plan size, yes, then you should start thinking about read replicas. It's certainly a lot easier to set up than sharding, though it's also far from free, as I discussed last week.

I hope this has been useful - protect your time and code, scale vertically first and horizontally later!

You can share this email with this permalink: https://mailchi.mp/railsspeed/read-replicas-so-when-are-they-useful?e=[UNIQID]

Copyright © 2019 Nate Berkopec, All rights reserved.

Want to change how you receive these emails?
You can update your preferences or unsubscribe from this list.