Amplitude Engineering

A look into engineering at Amplitude

How Knowing the End User Helped Query Server Uptime

As an engineer, what are the important things to know about? I’ve heard many common answers: algorithms, parallelism, systems design, security, [insert favorite language/framework], etc. However, two things are often left out: the product you work on and the customers who use it.

After all, knowing the product and customers is for the product managers and designers to worry about. An engineer can just leave it to them and focus on writing good code.

I would argue this stance is wrong, and not just because I work at a product analytics company where “Customer Focus” is a guiding principle. Understanding how and why customers use your product is super valuable, and you’ve probably leveraged it already. Not only does this context help with prioritizing what to build with your limited time, but it can also be a rich source of insights for seemingly “unrelated” problems like improving query performance and stability.

Performance Problems in Nova

In previous blog posts, we’ve talked about Nova, our in-house distributed columnar query engine. Nova powers almost everything our users do. So, a key metric for our engineering team is dashboard uptime: the percentage of time that Nova responds within a reasonable period from inside the application.

Earlier this year, we saw a worrying trend in dashboard uptime. The culprit? Query load spikes in Nova.

Spikes in load (~8:55, top-left) would trigger garbage collection (bottom-left), reducing query throughput (top-right) and causing queries to pile up. This led to a bad state of sustained GC that was difficult to recover from without intervention.

An easy solution would’ve been to massively scale up Nova, but this wasn’t cost-effective, especially since average CPU utilization still had plenty of room to grow. With this in mind, we began thinking of other ways to improve uptime.

Using Customer Insights to Reduce Query Load

Spoilers, we did it!

Unsurprisingly (because I’m writing this post), we were successful and improved dashboard uptime. To achieve this, we made many improvements to reduce load, including some pretty neat query-specific memory and performance optimizations. But what had the biggest impact on query load were improvements to query throttling and query caching, inspired of course by product-related insights.

Improved Query Throttling

A classic solution for smoothing out load is throttling. Our initial throttler assigned costs to incoming queries based on the time range and properties of the query. The properties, such as the query type and number of segments, determined a base “per-month cost,” which was scaled by the time range of the query. We then kept track of the total cost of all actively running queries in the cluster. If the total cost ever exceeded a configurable cluster capacity, new queries started being placed in a FIFO queue of pending queries, which we would start running as capacity freed up.

An illustration of the throttling system in action. Here, the size of the block represents the cost of the query. (1) In high load situations, queries would run normally until the total cost of running queries reached the cluster capacity. (2) At this point, new queries begin accumulating in the pending query queue. (3) As active queries finish, capacity becomes available for the next query in the queue to run.

By specifying conservative limits on cluster capacity in this throttler, we were able to avoid overloading the system, while still ensuring that everything would run eventually.
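The admission logic described above can be sketched as follows. This is a minimal illustration, not Nova’s actual implementation; the class and method names (`QueryThrottler`, `submit`, `on_finish`) are hypothetical.

```python
import threading
from collections import deque

class QueryThrottler:
    """Sketch of cost-based admission control with a FIFO pending queue."""

    def __init__(self, cluster_capacity):
        self.capacity = cluster_capacity
        self.running_cost = 0.0
        self.pending = deque()  # FIFO queue of (query, cost) awaiting capacity
        self.lock = threading.Lock()

    def submit(self, query, cost):
        with self.lock:
            if self.running_cost + cost <= self.capacity:
                self.running_cost += cost
                self._run(query)
            else:
                # Over capacity: queue the query until load drops.
                self.pending.append((query, cost))

    def on_finish(self, cost):
        with self.lock:
            self.running_cost -= cost
            # Admit pending queries in FIFO order while capacity allows.
            while self.pending and self.running_cost + self.pending[0][1] <= self.capacity:
                query, c = self.pending.popleft()
                self.running_cost += c
                self._run(query)

    def _run(self, query):
        pass  # placeholder: dispatch the query to the cluster
```

The key property is that total in-flight cost never exceeds the configured capacity, while every queued query is guaranteed to run eventually.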

However, with these limits in place, we soon saw a degraded user experience: queries that normally returned in fractions of a second had very slow response times due to unnecessary throttling. So the results were better, but still lacking according to our uptime metric. Luckily, we were able to spot some easy wins after leveraging our knowledge of the product and customers.

Insight #0: Customer data comes in all shapes… but it’s always an event

This one was pretty obvious. Although the data tracked varies widely across customers, it’s always in the form of events. This made a customer’s event volume a simple and effective heuristic for the relative cost of an Amplitude query on their data. By factoring in a customer’s event volume over the query time range, we significantly improved the accuracy of our query costing and could be much more aggressive with the cluster capacity.
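One way to picture this refinement: scale the base per-month cost not just by the time range, but also by the customer’s event volume relative to some reference. The function below is a hypothetical sketch (the formula, parameter names, and reference volume are all illustrative, not Amplitude’s real costing model).

```python
def query_cost(base_per_month_cost, months, monthly_event_volume,
               reference_volume=1_000_000_000):
    """Estimate a query's relative cost.

    Scales the base cost by the queried time range (in months) and by the
    customer's event volume relative to a reference customer's volume.
    """
    volume_factor = monthly_event_volume / reference_volume
    return base_per_month_cost * months * volume_factor
```

Under a scheme like this, a three-month query for a customer with twice the reference event volume costs six times the base, while the same query for a tiny customer barely registers.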

Insight #1: Customer data also comes in all sizes

Again, the size of the block shows the relative cost of the query. In theory, a FIFO queue is a good, simple way to provide fairness and guarantee eventual execution. However, given the distribution of our users, it doesn’t make much sense to make tiny queries wait for capacity behind much larger queries ahead of them.

Our users range from single-person startups to large enterprise organizations, and their data volume spans the same range. As it turns out, this means the vast majority of queries are many orders of magnitude smaller than the ones that cause throttling, and each of these small queries has only a marginal impact on load. Armed with this knowledge, we added a mechanism for low-cost queries to skip ahead of significantly higher-cost queries. This greatly improved response times for most queries during high-load situations.
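A skip-ahead rule like this could look roughly as follows: pick the head of the queue by default, but if the head doesn’t fit in the free capacity, let a query that is both much cheaper and small enough to fit jump ahead. The function name and the 100x threshold are illustrative assumptions, not the production values.

```python
from collections import deque

# A query may skip ahead only of queries at least 100x its own cost (illustrative).
SKIP_RATIO = 100

def next_runnable(pending, free_capacity):
    """Pick the next query to run from the pending deque of (query, cost).

    FIFO by default; but when the head is too big to fit, a tiny query
    further back may jump ahead instead of waiting behind it.
    """
    if not pending:
        return None
    head_cost = pending[0][1]
    if head_cost <= free_capacity:
        return pending.popleft()
    # Head doesn't fit yet; look for a much smaller query that does.
    for i, (query, cost) in enumerate(pending):
        if cost <= free_capacity and cost * SKIP_RATIO <= head_cost:
            del pending[i]
            return (query, cost)
    return None  # nothing runnable; wait for capacity to free up
```

Bounding skip-ahead by a cost ratio preserves rough FIFO fairness among similarly sized queries, so a steady stream of small queries can’t starve a large one indefinitely on its own.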

Insight #2: Some types of queries can wait

Not all queries require the same sense of urgency. To achieve interactive analysis, the queries powering charts or dashboards need to finish within seconds. On the other hand, queries powering asynchronous updates and automated reports can be delayed for several minutes with less impact. Realizing this, we changed the throttler to use a priority queue, where the priority of a query was based on the issuing context. This let us prioritize user-initiated actions like loading charts over background tasks like updating preview images in our Slack integration.
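A context-based priority queue of this kind can be sketched with Python’s `heapq`. The context names and priority values here are hypothetical; the monotonic counter is the standard trick for keeping FIFO order among queries of equal priority.

```python
import heapq
import itertools

# Illustrative contexts: lower number = runs sooner.
PRIORITY = {"interactive": 0, "async_update": 1, "automated_report": 2}

class PriorityThrottleQueue:
    """Pending-query queue ordered by issuing context, FIFO within a context."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserving arrival order

    def push(self, query, context):
        heapq.heappush(self._heap, (PRIORITY[context], next(self._counter), query))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

With this in place, a chart load submitted after a batch of report refreshes still pops first when capacity frees up.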

Contextual Caching

What better way to reduce load than doing less work! That’s the logic behind caching. It turns out a lot of our users look at the same dashboards and charts throughout the day. We had already been caching results, but we always specified a maximum cache time of 5 minutes to Nova in order to provide the latest data to users. However, after considering the intention of our users, we realized we could do much better.

Insight #3: Freshness is all relative

Timescales for “freshness” are relative. For fruits, it’s a few days. For Spam, it’s more like months or years.

When examining hourly data or event counts for the current day, 5 minutes can matter a lot. But for a monthly chart, the impact of 5 minutes or even an hour is trivial. Similarly, when users examine a daily metric over the last 90 days, they are likely more interested in the long term trend than the exact value on the current day.

From this insight, we decided to abstract away the concept of specific cache times. Instead, clients just request “fresh” data from Nova. Nova can then use heuristics derived from product knowledge to generate an appropriate maximum cache time for the query.
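A freshness heuristic along these lines might key the maximum cache time off the queried time range. The thresholds and TTL values below are invented for illustration; the real heuristics would draw on more product signals than just the range.

```python
from datetime import timedelta

def max_cache_time(query_range):
    """Pick a maximum cache time based on the query's time range.

    The longer the range, the less a few stale minutes matter, so cached
    results stay acceptable for longer.
    """
    if query_range <= timedelta(days=1):   # intraday / hourly charts
        return timedelta(minutes=5)
    if query_range <= timedelta(days=30):  # monthly views
        return timedelta(hours=1)
    return timedelta(hours=6)              # long-term trends, e.g. 90 days
```

The client just asks for “fresh” data; the server owns the mapping from query shape to an acceptable staleness, so the heuristic can be tuned centrally without client changes.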

Of course, because no heuristic is perfect, users can also explicitly request the latest data and bust the cache. Even so, the added flexibility increased our cache hit rate by about 1.5x.

Closing Thoughts

Often, I’m tempted to focus on the details of the code — optimizing inner loops and trying to come up with algorithmic optimizations. When that happens, I always stop and remind myself to think about the customer, because context is important. Without understanding the context of a problem, you can’t make effective decisions and trade-offs, such as what data structure to use or how important speed is relative to consistency.

At the end of the day, an engineer isn’t just writing code for its own sake. They are writing code to solve a problem, and (usually) that problem is how to build a better product for some end users. So it definitely doesn’t hurt to get to know your product and those users better.
