How to Filter a SQL Nested Collection by a Value

I stumbled upon a very interesting question on Stack Overflow about how to use jOOQ’s MULTISET operator to nest a collection, and then filter the result by whether that nested collection contains a value.

The question is jOOQ specific, but imagine, you have a query that nests collections using JSON in PostgreSQL. Assuming, as always, the Sakila database. Now, PostgreSQL doesn’t support the SQL standard MULTISET operator, but we can use ARRAY, which works almost the same way.

SELECT
  f.title,
  ARRAY(
    SELECT ROW(
      a.actor_id,
      a.first_name,
      a.last_name
    )
    FROM actor AS a
    JOIN film_actor AS fa USING (actor_id)
    WHERE fa.film_id = f.film_id
    ORDER BY a.actor_id
  )
FROM film AS f
ORDER BY f.title

This produces all films and their actors as follows (I’ve truncated the arrays for readability purposes. You get the point):

title                      |array                                                                                 
---------------------------+--------------------------------------------------------------------------------------
ACADEMY DINOSAUR           |{"(1,PENELOPE,GUINESS)","(10,CHRISTIAN,GABLE)","(20,LUCILLE,TRACY)","(30,SANDRA,PECK)"
ACE GOLDFINGER             |{"(19,BOB,FAWCETT)","(85,MINNIE,ZELLWEGER)","(90,SEAN,GUINESS)","(160,CHRIS,DEPP)"}   
ADAPTATION HOLES           |{"(2,NICK,WAHLBERG)","(19,BOB,FAWCETT)","(24,CAMERON,STREEP)","(64,RAY,JOHANSSON)","(1
AFFAIR PREJUDICE           |{"(41,JODIE,DEGENERES)","(81,SCARLETT,DAMON)","(88,KENNETH,PESCI)","(147,FAY,WINSLET)"
AFRICAN EGG                |{"(51,GARY,PHOENIX)","(59,DUSTIN,TAUTOU)","(103,MATTHEW,LEIGH)","(181,MATTHEW,CARREY)"
AGENT TRUMAN               |{"(21,KIRSTEN,PALTROW)","(23,SANDRA,KILMER)","(62,JAYNE,NEESON)","(108,WARREN,NOLTE)",
AIRPLANE SIERRA            |{"(99,JIM,MOSTEL)","(133,RICHARD,PENN)","(162,OPRAH,KILMER)","(170,MENA,HOPPER)","(185
AIRPORT POLLOCK            |{"(55,FAY,KILMER)","(96,GENE,WILLIS)","(110,SUSAN,DAVIS)","(138,LUCILLE,DEE)"}        
ALABAMA DEVIL              |{"(10,CHRISTIAN,GABLE)","(22,ELVIS,MARX)","(26,RIP,CRAWFORD)","(53,MENA,TEMPLE)","(68,

Now, the question on Stack Overflow was, how to filter this result by whether the ARRAY (or MULTISET) contains a specific value.

Filtering the ARRAY

We can’t just add a WHERE clause to the query. Because of the logical order of operations in SQL, the WHERE clause “happens before” the SELECT clause, so the ARRAY is not yet available to WHERE. We could, however, wrap everything in a derived table and do this, instead:

SELECT *
FROM (
  SELECT
    f.title,
    ARRAY(
      SELECT ROW(
        a.actor_id,
        a.first_name,
        a.last_name
      )
      FROM actor AS a
      JOIN film_actor AS fa USING (actor_id)
      WHERE fa.film_id = f.film_id
      ORDER BY a.actor_id
    ) AS actors
  FROM film AS f
) AS f
WHERE actors @> ARRAY[(
  SELECT ROW(a.actor_id, a.first_name, a.last_name)
  FROM actor AS a 
  WHERE a.actor_id = 1
)]
ORDER BY f.title

Excuse the unwieldy ARRAY @> ARRAY operator. I’m not aware of a better approach here, because it’s hard to unnest a structurally typed RECORD[] array in PostgreSQL, if we don’t use a nominal type (CREATE TYPE ...). If you know a better way to filter, please let me know in the comments section. Here’s a better version:

SELECT *
FROM (
  SELECT
    f.title,
    ARRAY(
      SELECT ROW(
        a.actor_id,
        a.first_name,
        a.last_name
      )
      FROM actor AS a
      JOIN film_actor AS fa USING (actor_id)
      WHERE fa.film_id = f.film_id
      ORDER BY a.actor_id
    ) AS actors
  FROM film AS f
) AS f
WHERE EXISTS (
  SELECT 1 
  FROM unnest(actors) AS t (a bigint, b text, c text) 
  WHERE a = 1
)
ORDER BY f.title

Anyway, this produces the desired result:

title                |actors                                                                                           
---------------------+-------------------------------------------------------------------------------------------------
ACADEMY DINOSAUR     |{"(1,PENELOPE,GUINESS)","(10,CHRISTIAN,GABLE)","(20,LUCILLE,TRACY)","(30,SANDRA,PECK)","(40,JOHNN
ANACONDA CONFESSIONS |{"(1,PENELOPE,GUINESS)","(4,JENNIFER,DAVIS)","(22,ELVIS,MARX)","(150,JAYNE,NOLTE)","(164,HUMPHREY
ANGELS LIFE          |{"(1,PENELOPE,GUINESS)","(4,JENNIFER,DAVIS)","(7,GRACE,MOSTEL)","(47,JULIA,BARRYMORE)","(91,CHRIS
BULWORTH COMMANDMENTS|{"(1,PENELOPE,GUINESS)","(65,ANGELA,HUDSON)","(124,SCARLETT,BENING)","(173,ALAN,DREYFUSS)"}      
CHEAPER CLYDE        |{"(1,PENELOPE,GUINESS)","(20,LUCILLE,TRACY)"}                                                    
COLOR PHILADELPHIA   |{"(1,PENELOPE,GUINESS)","(106,GROUCHO,DUNST)","(122,SALMA,NOLTE)","(129,DARYL,CRAWFORD)","(163,CH
ELEPHANT TROJAN      |{"(1,PENELOPE,GUINESS)","(24,CAMERON,STREEP)","(37,VAL,BOLGER)","(107,GINA,DEGENERES)","(115,HARR
GLEAMING JAWBREAKER  |{"(1,PENELOPE,GUINESS)","(66,MARY,TANDY)","(125,ALBERT,NOLTE)","(143,RIVER,DEAN)","(155,IAN,TANDY

Now, all the results are guaranteed to be films in which 'PENELOPE GUINESS' was an ACTOR. But is there a better solution?

Using ARRAY_AGG instead

However, in native PostgreSQL, it would be better (in this case) to use ARRAY_AGG, I think:

SELECT
  f.title,
  ARRAY_AGG(ROW(
    a.actor_id,
    a.first_name,
    a.last_name
  ) ORDER BY a.actor_id) AS actors
FROM film AS f
JOIN film_actor AS fa USING (film_id)
JOIN actor AS a USING (actor_id)
GROUP BY f.title
HAVING bool_or(true) FILTER (WHERE a.actor_id = 1)
ORDER BY f.title

This produces the exact same result:

title                |actors                                                                                          
---------------------+------------------------------------------------------------------------------------------------
ACADEMY DINOSAUR     |{"(1,PENELOPE,GUINESS)","(10,CHRISTIAN,GABLE)","(20,LUCILLE,TRACY)","(30,SANDRA,PECK)","(40,JOHN
ANACONDA CONFESSIONS |{"(1,PENELOPE,GUINESS)","(4,JENNIFER,DAVIS)","(22,ELVIS,MARX)","(150,JAYNE,NOLTE)","(164,HUMPHRE
ANGELS LIFE          |{"(1,PENELOPE,GUINESS)","(4,JENNIFER,DAVIS)","(7,GRACE,MOSTEL)","(47,JULIA,BARRYMORE)","(91,CHRI
BULWORTH COMMANDMENTS|{"(1,PENELOPE,GUINESS)","(65,ANGELA,HUDSON)","(124,SCARLETT,BENING)","(173,ALAN,DREYFUSS)"}     
CHEAPER CLYDE        |{"(1,PENELOPE,GUINESS)","(20,LUCILLE,TRACY)"}                                                   
COLOR PHILADELPHIA   |{"(1,PENELOPE,GUINESS)","(106,GROUCHO,DUNST)","(122,SALMA,NOLTE)","(129,DARYL,CRAWFORD)","(163,C
ELEPHANT TROJAN      |{"(1,PENELOPE,GUINESS)","(24,CAMERON,STREEP)","(37,VAL,BOLGER)","(107,GINA,DEGENERES)","(115,HAR
GLEAMING JAWBREAKER  |{"(1,PENELOPE,GUINESS)","(66,MARY,TANDY)","(125,ALBERT,NOLTE)","(143,RIVER,DEAN)","(155,IAN,TAND

How does it work?

  • We’re grouping by FILM and aggregate the contents per film into a nested collection.
  • We can now use HAVING to filter on groups.
  • BOOL_OR(TRUE) is TRUE as soon as the GROUP is non-empty
  • FILTER (WHERE a.actor_id = 1) was that filter criteria, which we place in the group

So, the HAVING predicate is TRUE if there is at least one ACTOR_ID = 1, or NULL otherwise, which has the same effect as FALSE. If you’re a purist, wrap the predicate in COALESCE(BOOL_OR(...), FALSE)

Clever or neat, or a bit of both?

Doing this with jOOQ

Here’s the jOOQ version, that works on any RDBMS that supports MULTISET_AGG (ARRAY_AGG emulation is still pending):

ctx.select(
        FILM_ACTOR.film().TITLE,
        multisetAgg(
            FILM_ACTOR.actor().ACTOR_ID,
            FILM_ACTOR.actor().FIRST_NAME,
            FILM_ACTOR.actor().LAST_NAME))
   .from(FILM_ACTOR)
   .groupBy(FILM_ACTOR.film().TITLE)
   .having(boolOr(trueCondition())
       .filterWhere(FILM_ACTOR.actor().ACTOR_ID.eq(1)))
   .orderBy(FILM_ACTOR.film().TITLE)
   .fetch();

While the powerful MULTISET value constructor gets most of the fame with jOOQ users, let’s not forget there is also a slightly less powerful, but occasionally really useful MULTISET_AGG aggregate function, which can be used for aggregations or as a window function!

2 thoughts on “How to Filter a SQL Nested Collection by a Value

  1. There is another option on Postgres; to use JOIN LATERAL. Most if not all multiset computations can be performed already in FROM clause:

    SELECT
         f.title,
         a.actors
     FROM film AS f
     CROSS JOIN LATERAL (
     SELECT ARRAY(
                 SELECT ROW(
                            a.actor_id,
                            a.first_name,
                            a.last_name
                            )
                 FROM actor AS a
                          JOIN film_actor AS fa USING (actor_id)
                 WHERE fa.film_id = f.film_id
                 ORDER BY a.actor_id
             ) AS actors
     ) a
    WHERE actors @> ARRAY[(
        SELECT ROW(a.actor_id, a.first_name, a.last_name)
        FROM actor AS a
        WHERE a.actor_id = 1
    )]
    ORDER BY f.title;
    

    This produces exactly the same plan as your query with derived table.

    We can also propagate array_aggr there too:

    SELECT f.title,
           a.actors
    FROM film AS f
             cross JOIN LATERAL (
        SELECT ARRAY_AGG(ROW (
                             a.actor_id,
                             a.first_name,
                             a.last_name
                             ) ORDER BY a.actor_id) AS actors
        FROM film_actor AS fa
                 JOIN actor AS a USING (actor_id)
        WHERE f.film_id = fa.film_id
        HAVING bool_or(a.actor_id = 1)
        ) a
    ORDER BY f.title;
    

    This is equally slow as the previous query but has a more concise plan.

    Btw, the trick with `bool_or(true) FILTER (WHERE a.actor_id = 1)` can be shortened to just `bool_or(a.actor_id = 1)`.

Leave a Reply