MongoDB $lookup in 5 Minutes

Preface: New to the aggregation pipeline? Check out our 5 minute explanation before diving into these $lookup examples....

Also be sure to understand indexing and it's performance implications before reviewing these examples...

$lookup allows you to perform joins on collections in the same database. $lookup works by returning documents from a "joined" collection as a sub-array of the original collection.

$lookup supports both basic equality matches as well as uncorrelated sub-queries. We'll provide an example of each scenario.

The examples are based on this sample data...

post collection

{
    "title" : "my first post",
    "author" : "Jim",
    "likes" : 5
},
{
    "title" : "my second post",
    "author" : "Jim",
    "likes" : 2
},
{
    "title" : "hello world",
    "author" : "Joe",
    "likes" : 3
}

comment collection

{
    "postTitle" : "my first post",
    "comment" : "great read",
    "likes" : 3
},
{
    "postTitle" : "my second post",
    "comment" : "good info",
    "likes" : 0
},
{
    "postTitle" : "my second post",
    "comment" : "i liked this post",
    "likes" : 12
},
{
    "postTitle" : "hello world",
    "comment" : "not my favorite",
    "likes" : 8
},
{
    "postTitle" : "my last post",
    "comment" : null,
    "likes" : 0
}

Notice how we have two collections posts and comments. The postTitle field in the comments collection corresponds to the title field in the posts collection.

Both comments and posts have likes.

$lookup example: equality match

db.posts.aggregate([
    { $lookup:
        {
           from: "comments",
           localField: "title",
           foreignField: "postTitle",
           as: "comments"
        }
    }
])

Notice how $lookup takes a document with the following fields:

  • from: the collection we want to join with
  • localField: the field we want to join by in the local collection (the collection we are running the query on)
  • foreignField: the field we want to join by in the foreign collection (the collection we want to join with)
  • as: the name of the output array for the results

This query returns the following..

{
    "title" : "my first post",
    "author" : "Jim",
    "likes" : 5,
    "comments" : [
        {
            "postTitle" : "my first post",
            "comment" : "great read",
            "likes" : 3
        }
    ]
},
{
    "title" : "my second post",
    "author" : "Jim",
    "likes" : 2,
    "comments" : [
        {
            "postTitle" : "my second post",
            "comment" : "good info",
            "likes" : 0
        },
        {
            "postTitle" : "my second post",
            "comment" : "i liked this post",
            "likes" : 12
        }
    ]
},
{
    "title" : "hello world",
    "author" : "Joe",
    "likes" : 3,
    "comments" : [
        {
            "postTitle" : "hello world",
            "comment" : "not my favorite",
            "likes" : 8
        }
    ]
}

Notice how the original 3 documents from the post collection are returned with an additional field comments.

For each post, this new comments array has all of the documents from the comments collection whose postTitle field matches the post's title field.

$lookup example: pipeline example with condition

db.posts.aggregate([
 { $lookup:
     {
       from: "comments",
       let: { post_likes: "$likes", post_title: "$title"},
       pipeline: [
            { $match:
                { $expr:
                    { $and:
                        [
                           { $gt: [ "$likes", "$$post_likes"] },
                           { $eq: ["$$post_title", "$postTitle" ] }
                        ]
                    }
                }
            }
        ],
        as: "comments"
        }
 }
])

Notice how $lookup takes slightly different fields. Specifically the localField and foreignField fields have been replaced with:

  • let (optional): an expression defining variables to use in the pipeline stage. This is how you access fields from the input collection in the pipeline stage.
  • pipeline: an aggregation pipeline to execute on the collection to join

Notice how we define two variables in the let expression. We define post_likes and post_title so that we can reference the input documents in the pipeline stage.

This is the only way for us to make comparisons across the different collections...

{ $gt: [ "$likes", "$$post_likes"] },
{ $eq: ["$$post_title", "$postTitle" ] }

Notice how we use $$ to refer to the variables we defined in let. We reference fields in the foreign collection with a single $.

This query returns the following..

{
    "title" : "my first post",
    "author" : "Jim",
    "likes" : 5,
    "comments" : []
},
{
    "title" : "my second post",
    "author" : "Jim",
    "likes" : 2,
    "comments" : [
        {
            "postTitle" : "my second post",
            "comment" : "i liked this post",
            "likes" : 12
        }
    ]
},
{
    "title" : "hello world",
    "author" : "Joe",
    "likes" : 3,
    "comments" : [
        {
            "postTitle" : "hello world",
            "comment" : "not my favorite",
            "likes" : 8
        }
    ]
}

Like the first example, this query returns the comments whose postTitle matches the title field for each post. This query adds an additional condition for returning comments whose likes count is greater than the corresponding post's likes.

By running this query, we now know if any posts have comments with more likes than the post itself :).

Conclusion

You've now seen two examples of using $lookup in the aggregation pipeline. Remember that $lookup can be used for both equality checks (first example) as well as more complex sub-queries (second example).

Depending on which type of query you run, $lookup will take slightly different parameters (let, pipeline vs localField, foreignField).

For more on the aggregation pipeline, be sure to check out this 5 minute guide to the MongoDB aggregation framework.

Your thoughts?

|

While it's good that $lookup exists in MongoDb, it goes against every reason why you would use MongoDB vs another relational datastore (MySQL).

Remember that a $lookup operation comes with the cost of performing a JOIN operation on your collections. Remember that one of the key advantages of MongoDb is faster performance with unstructured data...the irony...

To further illustrate, let's say you set up a blog using Mongo. A typical blog comes with Users, Posts, Comments, Likes, Followers, etc. This is a lot of RELATIONSHIPS to manage. If you use MongoDb for this use case, you can quickly find yourself in trouble (especially if each entity is represented by it's own collection!)

Let's say you want to display a users most liked blog posts. This will require one UGLY Mongo aggregation where you aggregate likes, comments, etc for a given post. That's a lot of $lookups...and starts to beg the question "WHY NOT JUST USE SQL"...

In a nutshell ...ask yourself if you made the right architecture decisions when you find yourself frequently using $lookup...

|

BE CAREFUL WITH $lookup

Why? Because it goes against everything MongoDb. Remember that MongoDb is a non-relational data store. This means it excels with large amounts of unstructured data requiring quick lookups.

When you use $lookup you are performing a join on two different collections. This is STRUCTURED data. If you design a whole schema around a "schemaless" data store then you shouldn't be getting paid to do what you do :)

Relational data stores still have their place. While there is a lot of hype around NoSQL, iOT, and cloud (buzz words), good old MySQL and other RDBMS are still widely used in production ...and at SCALE.

|

very clear explanations. huge thanks!

|

so easy to understand now. wow.

|

it's crazy how you can run pipelines within pipelines. It sounds like Inception but when you realize that is all you are doing on these $lookup examples you can start to have some fun with it. What's also important to remember though is that there is a performance cost to all the joins. If you have too many joins in your Mongo operations then you should probably consider a relational model.

Remember that these data stores are only as good as your use case...

|

what are the measurable performance costs of a join in Mongo?