Friday, November 6, 2020

Max row per group, sad answers only

Today I learned that frequently asked questions on StackOverflow get their own tag. The tag greatest-n-per-group is for answers to questions about writing SQL to find the max row per group. By max row I mean the aggregated columns, group by columns and other columns for the row that has the max or min value in a group. By sad answers only I mean there is a lot of confusion about this, StackOverflow has over 3000 posts, and the query is harder to write than it needs to be.

I am writing about SQL rather than SQL engines and I am not an expert on writing SQL queries, but that might be appropriate given that I am writing about something that confuses users and could be easier. My motivation for writing this was a slow query plan for MySQL 8.0.22 while implementing the Time Series Benchmark Suite (TSBS) for MySQL.

Reproduction SQL is here and here and output from MySQL 8.0.22 is here and here. I updated the output for the queries with SHOW STATUS printed after each query.

The easy way

The answer is easy if you only want the aggregated and group by columns:

SELECT MAX(agg_col), gb_col FROM t GROUP BY gb_col

But it isn't easy when you want other columns -- columns that are not used for group by or aggregation.  This would be easy had MySQL continued on the ANY_VALUE theme by adding FIRST_VALUE and LAST_VALUE as MongoDB does via the $first and $last aggregation accumulator operators. Well, MySQL has FIRST_VALUE and LAST_VALUE, but for window functions and they don't provide the desired semantics. If they existed with semantics similar to MongoDB then the following query would work and is easy to write:

SELECT MAX(agg_col), gb_col, FIRST_VALUE(other_col) FROM t GROUP BY gb_col

I am not an expert. Perhaps one day I will learn why there isn't an easy way to do this. MySQL docs have a useful page on this type of query. I have yet to try a variant that uses LATERAL.

The less easy ways

There are many ways to write this query. None of them are easy as the example I wrote above that isn't valid SQL. I tried: left join, correlated subquery, uncorrelated subquery and a rank() window function. The performant solutions were uncorrelated subquery and rank() window function. I was surprised that the rank() window function approach was performant because the explain output looked slow. But the runtime and slow log output were OK. By performant I mean a query plan that examines few rows, and loose index scan is an example of that.

This table shows the response time for each query type and the number of rows examined from the slow log. For the window function approach I am confused both by the low value for rows examined and the plan that shows a table scan. I am curious if there is a bug.

ApproachResponse time (secs)Rows examined
Uncorrelated 0.00 20
Window function 0.10 10
Left join 1.07 245750
Correlated 134.11 671170560

Loose index scan

Update - when I published this I claimed the index was on (j DESC, pk) but that was a mistake.

Before I show the queries, my goal is to get a plan that uses the loose index scan optimization. The test table is: create table tq(pk int primary key, j int, k int), there is an index on (j, pk DESC) and a NOT NULL constraint on j. This query uses a loose index scan, however it can't provide the value for the column k. The loose index scan is performant because it fetches one entry from the index per distinct value for (j,pk).

SELECT max(pk), j FROM tq GROUP BY j
EXPLAIN: -> Group aggregate (computed in earlier step): max(tq.pk)
    -> Index range scan on tq using index_for_group_by(x)  (cost=13.00 rows=10) 

Uncorrelated subquery

This plan is performant because t2 is materialized via a loose index scan and the result from that does one point query per distinct value in j.

SELECT t1.pk, t1.j, t1.k
FROM tq t1, (SELECT max(pk) as maxpk, j FROM tq GROUP BY j) t2
WHERE t2.maxpk = t1.pk

EXPLAIN: -> Nested loop inner join
    -> Filter: (t2.maxpk is not null)
        -> Table scan on t2  (cost=3.62 rows=10)
            -> Materialize
                -> Group aggregate (computed in earlier step): max(tq.pk)
                    -> Index range scan on tq using index_for_group_by(x)  (cost=13.00 rows=10)
    -> Single-row index lookup on t1 using PRIMARY (pk=t2.maxpk)  (cost=0.26 rows=1)

Rank window function

This query is slightly slower than the uncorrelated subquery above. I didn't expect that given the plan that has a table scan on tq. Adding hints to use the index on (j,pk) don't change the query plan. I wonder if this explain outpout is correct as the query doesn't do a full scan when run. Also the query is almost as fast as the uncorrelated approach.

WITH t1 AS (SELECT pk, j, k,
    RANK() OVER (PARTITION by j ORDER BY pk DESC) AS myrank FROM tq)
SELECT pk, j, k from t1 WHERE myrank=1

EXPLAIN: -> Index lookup on t1 using <auto_key0> (myrank=1)
    -> Materialize CTE t1
        -> Window aggregate: rank() OVER (PARTITION BY tq.j ORDER BY tq.pk desc )
            -> Sort: tq.j, tq.pk DESC  (cost=8273.25 rows=82170)
                -> Table scan on tq  (cost=8273.25 rows=82170)
Left join

I don't expect this query to be performant because there isn't an equality predicate on pk. This might be a useful approach when there isn't an index on (j,pk), but that is not the case here and this plan examines too many rows.

SELECT t1.pk, t1.j, t1.k FROM tq t1
LEFT JOIN tq t2 ON t1.j = t2.j AND t1.pk < t2.pk
WHERE t2.j IS NULL

EXPLAIN: -> Filter: (t2.j is null)  (cost=76209940.12 rows=750212100)
    -> Nested loop antijoin  (cost=76209940.12 rows=750212100)
        -> Table scan on t1  (cost=8273.25 rows=82170)
        -> Filter: (t1.pk < t2.pk)  (cost=14.38 rows=9130)
            -> Index lookup on t2 using x (j=t1.j)  (cost=14.38 rows=9130)

Correlated subquery

The correlated subquery isn't performant. It examines too many rows. That isn't a surprise.

SELECT pk, j, k FROM   tq t1
WHERE  pk=(SELECT MAX(t2.pk) FROM tq t2 WHERE t1.j = t2.j)

EXPLAIN: -> Filter: (t1.pk = (select #2))  (cost=8273.25 rows=82170)
    -> Table scan on t1  (cost=8273.25 rows=82170)
    -> Select #2 (subquery in condition; dependent)
        -> Aggregate: max(t2.pk)
            -> Index lookup on t2 using x (j=t1.j)  (cost=927.37 rows=9130)

9 comments:

  1. If you want to use window functions, you shouldn't use group by. MAX and FIRST_VALUE can both be used as window functions. You should partition OVER the group by and use FIRST_VALUE and MAX.

    I /think/ this does what you want: https://www.db-fiddle.com/f/mviyTpFou9FWbqNSwx9tk5/0

    Can you double check?

    ReplyDelete
    Replies
    1. I had a query that used FIRST_VALUE() while writing the post but didn't share it. I am a beginner WRT window functions, but they produce an output row per input row. The example I shared for rank() wraps the window function query with an outer SELECT that filters all but one row per group. I would have to do something similar for FIRST_VALUE.

      Delete
  2. Oops: https://www.db-fiddle.com/f/mviyTpFou9FWbqNSwx9tk5/1

    ReplyDelete
  3. It's worth to note that ClickHouse has support for:
    - argMax and argMin aggregate functions:
    https://clickhouse.tech/docs/en/sql-reference/aggregate-functions/reference/argmax/
    - LIMIT BY relational operator:
    https://clickhouse.tech/docs/en/sql-reference/statements/select/limit-by/
    - any and anyLast aggregate functions.

    ReplyDelete
    Replies
    1. My inspiration for writing this was adding MySQL support to TSBS. Clickhouse support is already there. I will look at the queries and schema to see what was used for it.

      Delete
  4. I was very surprised to read that Window Functions solution worked fast (I compared window function implementations with MariaDB and didn't see anything that would help to optimize this).

    When I try to reproduce this, I cannot: https://gist.github.com/spetrunia/3f15adbed37c175cf9a87a040bc69909

    It uses full scan and does sorting (because of mismatched ASC/DESC in the "PARTITION BY tq.j ORDER BY tq.pk desc" clause). If both were ASC or DESC the sorting would go away, but the full table scan would still be there, I believe.

    ReplyDelete
    Replies
    1. Thanks for reminding me that the SHOW STATUS counters are useful, and for spending time on this.

      My blog post had a typo, that wasn't repeated in the sample SQL. The index should be on (j, pk DESC)

      I modified your example to do the window function query in 3 cases: index on k, index on (j, pk), index on (j, pk DESC)

      And then in the last case I added the uncorrelated subquery. The plan for the uncorrelated subquery is better than the plan for the window function query.

      SQL: https://gist.github.com/mdcallag/2140797c81d4e97088a8555ea5041dab

      Output:
      https://gist.github.com/mdcallag/ce7575d9ea578390d5c1a87679039b77

      Delete
    2. Yes, the uncorrelated subquery is the only way currently. I am not aware of any other.

      For window functions, I'm not aware of any optimizer that would be able to note the "WHERE rownum=1" in the upper query and infer that "we only need the first row (and its peers) in each window function's partition"

      Btw,

      > I mean the aggregated columns, group by columns and other columns for the row that has the max or min value in a group

      If there are two "peer" rows - rows that both have the min value in the group but different other columns, would you need any of them or both?

      Delete
    3. I would accept either of the peer rows in that case

      Delete