Fixing Performance By Throwing Cloud At It

Cloud Computing
18 Comments

Your databases are hosted in the cloud – either in VMs or in a database-as-a-service – and you’re having a performance emergency that’s lasted for more than a day. Queries are slow, customers are getting frustrated, and you’re just not able to get a fix in quickly.

Just ask management if they want to throw hardware at it.

Here’s what your email should look like:

Right now, everyone’s unhappy with performance, and I’m in the middle of researching the issue. I’m not going to have a definitive answer – let alone a solution – for at least a few days.

In the meantime, do you want to spend $1,500 per week to temporarily increase performance?

If so, I can change our VM from an r5.4xlarge (16 cores, 128GB RAM) to an r5.24xlarge (96 cores, 768GB RAM.) Our VM costs will go from $293/week to $1,757/week. Those costs don’t include licensing because I’m not familiar with how we’re licensing this VM – that would be a separate discussion for Amy in Engineering, who manages the licensing.

Let’s buy stuff

The email is short and to the point because:

It ignores blame – I don’t care whose fault it is, whether it’s my own fault for being unable to fix it faster, or a growing customer base, or bad code, or maybe the database server should have been larger in the first place.

It doesn’t include an end date – maybe I’m going to be able to solve it in 3 days, but maybe it’ll take a month. This open-ended temporary solution buys us breathing room to do a really good job on the fix rather than duct taping something crappy.

It doesn’t quibble over VM sizes – because during an unforeseen, unsolved performance emergency, you simply don’t know which exact VM size is the best bang for the buck. Just go with the largest size possible, end of story. If someone wants to quibble over sizes, then they’re welcome to do the research to figure out what the right size is. You are not the person to do that research – you need to keep your head down solving the problem. If you don’t even know the root cause and the software fix yet, then you certainly can’t make an exact determination on which VM size will give you the right fix.

It translates technical debt into real dollars – because sometimes the business makes the conscious decision to ship less-than-optimal code in order to bring in revenue. Sometimes, they also want the ability to continue shipping that less-than-optimal code, and they’re willing to spend money to keep bringing in that revenue. That might frustrate the anal-retentive perfectionist in me that wants every line of code to be flawless, but the reality is that sometimes we just need to keep shipping and taking customer money.

It gives the business a choice – and choices help take heat off you. When a manager is screaming at me that we have to fix things faster, I love being able to say, “It sounds like this is really important to you, and you want to move quickly. Here’s a way you can make that happen right away. Oh, what’s that you say? You’re not really that interested? I see. Well, in that case, I’m not going to work 9AM-9PM 6 days per week to fix this, because I have a family at home.”

When your company chose to migrate to the cloud, I bet one of the reasons was to gain flexibility. Let that be your friend.

Previous Post
Free SQL Server Load Testing Tools
Next Post
BrentOzar.com is Turning 17 Years Old.

18 Comments. Leave new

  • You’re not wrong. 😉

    Reply
  • Bad situation. Because they are spoiled by not having to really care on prem. Seemingly no cost to them so they can say or do anything.

    Reply
  • Alan Cranfield
    April 23, 2019 10:07 am

    Good solution. The beauty of the elastic compute cloud. Upgrading the storage could help too.. e.g. AWS will allow you to modify a volume to increase the IOPS. Might be less $$ than an instance upgrade.

    Reply
  • The blunt response ahh! He looks at you like you are telling him you need a raise or more ram for your pc. You give your response and now it’s your boss problem to get it approved. If other depts disagree ( and they will then they can do the work of figuring out how much cloudpower is needed. Love it. …..

    Meanwhile, your head is down fixing most likely an application problem or some other3rd party issue, or why your third floor has a switch which is old and dusty and is plugged into a non-ups fridge outlet! 🙂 but the network guys say all is perfect, it’s not their fault! You look over your cube and Netflix is playing just fine so it can’t be the network!

    You can sit back and remember when you laughed at them when they told you to go to the cloud and it was going to save them so much money. Good old boys club in action.

    You can smirk when they call India for support as your backup. India does not care its already tommorrow there and this is yesterdays problems. They will transfer you at quitting time and its late for them when you are talking to them.

    You can tell your boss to take an ethernet cord to the roof and plug it into the cloud, as your speed and latency suck on a daily basis and it peaks around 7 to 5 a day. You can call Microsoft or Oracle and ask for the special internet pipe to the sky, just like in Eltons johns song.

    The women in the office hate you because it takes ten minutes to write an email now since Office 365.

    Everyone is meeting outside of your office and you quietly tune a query with 50 indexes on it that your app dept has F*** with. Just another day in the cube, now go home and soak the forearms until you get that ergonomic keyboard and bonus at the end of the year!

    Reply
    • Adrian Sanchez
      April 23, 2019 1:17 pm

      “India does not care its already tomorrow there and this is yesterdays problems.”….priceless ?

      Reply
  • ken j ambrose
    April 23, 2019 11:58 am

    Great points about adaptive resources – which I feel is one of the big potential advantages of cloud compute. BUT, depending on the nature of the problem, adding cpus and/or memory might not help much – if at all!
    So before you make that big pronouncement in the meeting with stakeholders, and state that adding resources for additional $$$ will alleviate the pain, you should probably be pretty sure it will!
    How can you test and be sure you proposal will actually alleviate the problem? Not easy I think.

    Reply
    • Ken – to the contrary, it’s trivially easy to test that stuff in the cloud. Just spin up a server with the resources you need, and compare the before/after on the queries there.

      Reply
  • ken j ambrose
    April 23, 2019 12:28 pm

    Hi Brent,
    When supporting close to 10 business critical 3rd party apps (where we rarely own the query code) that reference data in our sql servers across dozens of offices on nationwide vpn, and where performance problem root cause could be at any layer of the technology, identifying said root cause is non-trivial in my experience. I suppose if you have _already_ narrowed down the problem to particular sql server queries from a particular app and client, then testing with additional resources would be easy.

    Reply
  • Well written blog post and definitely a lot I can take for myself and get better at writing such emails (not that I am in need to write them on a regularly basis anyway).
    However…just thinking: If the problem would be blocking and the workload was IO intensive how would upsizing RAM and CPU help…OK you’d probably get a bigger buffer pool but I just would like to point out that this probably isn’t gonna be a remedy for everything.

    Reply
    • Martin – right, but if the problem is blocking, that doesn’t take me long to narrow down. I can usually solve those in just a few hours, max, with the help of sp_BlitzLock, sp_BlitzIndex, and sp_BlitzCache.

      Reply
  • Willem Leenen
    April 24, 2019 2:32 am

    My reality is different than what you are describing.

    As a contrast, let me describe the approach here. One architect here said that “there won’t be performance problems in the cloud, because you can always scale up”.
    And in case of any performance issue’s, a popular question is if we need to increase the resources. For example, I was ordered to increase from 2 CPU’s to 16 CPU’s.
    These vendor induced measures didn’t help, as there wasn’t any componentstress. The blocking just didn’t go away. (their next step was to increase diskspeed).
    In short: I explain via presentations the difference between performance issue’s, capacity issues and connected scalability. How to measure performance, how to get diagnostic info.
    I slowly see the meme “In the cloud you don’t need DBA’s” fade away. But to email the managers that a bigger box will solve the problems, I have never had the need to do that.

    Reply
    • Willem – OK, cool! Glad you’ve got a relatively unlimited amount of power at your disposal without even asking, that’s awesome. Most shops aren’t that lucky. Be thankful for that advantage – you won’t have that in every place where you work down the road.

      Reply
  • Jeroen Bosems
    April 27, 2019 1:40 pm

    Yeah I’m going to copy pasta this as my default answer. Good stuff!

    Reply
  • Mike Vessey
    May 13, 2019 1:28 am

    I’ve spent 23 years putting in missing indexes, using persisted computed columns to remove functions in where clauses, staring blindly at query plans that make no sense,, and hearing a developer say , cant you put better hardware in? ….. If a 64 Core machine with 256 GB of ram cant run your query then nothing can.

    I’d rather fix it at build time. help the devs to figure out third normal form and good database practices.

    Reply
    • Mike – you’ll learn a couple of things as your career progresses.

      First, you can’t always fix it at build time. Eventually, you’re going to get another job, and you weren’t around when the app was first built. 😉

      Second, there’s this thing called “concurrency.” The query might have worked fine by itself with only a few people running it, but when the business explodes in popularity, you can hit new challenges you’ve never seen before. Random example: query that needs a 20GB memory grant due to sorting on something. Works fine when only one person is running it, but when 20-30 of people start using the app simultaneously…yes, you can actually fix that temporarily with hardware above and beyond 256GB RAM.

      The best advice I can give you is to keep an open mind. Hope that helps!

      Reply
  • Who: Developer who was added as sysadmin by my backup dba

    Writes a query that does a Google search out to the internet to get the GPS coordinates for every address search. Was fine with one every manual call.

    He deployment night then realizes he’s not sysadmin in prod and tells me to enable something (can’t recall) for prod SQL. Prod runs thousands a minute.

    Let’s just say it wasn’t deployed and he and my backup dba had a chat about planning, communication and sharing and a reduction of powers and access

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.