Optimizing $lookup Performance Using the Power of Indexing
Rate this tutorial
Have you ever had to join data from two collections and wanted to know how to improve the queries performance? Let's walk through how indexing can help power faster search results when using $lookup.
$lookup is a stage in an aggregation pipeline that performs a left outer join to a collection to filter in documents from the "joined" collection for processing. $lookup creates a new array in the pipeline, where each field is a returned document from the $lookup.
Indexes within a database are extremely powerful, as they can significantly reduce resource usage and query duration. Without them, the queries would have to scan every document in a collection in order to return the query result, which takes a lot of time and resources. It is therefore very important to have indexes in place when using $lookup to “join” collections.
For $lookup performance to be optimized, it is good practice to have indexes on both collections, the source collection and joining collection. That way, the main query on the source collection will benefit from an index as will the query on the joined collection.
To walk you through how we can optimize $lookup queries, we will use movie data as an example. Our query uses data from two collections,
genres
and movies
. We want to find all the movies that fall under the genres of comedy and drama.Documents from source and joining collections
Collection: genres
1 db.genres.insertMany([ 2 { "_id": 0, "genre": "Comedy" }, 3 { "_id": 1, "genre": "Drama" }, 4 { "_id": 2, "genre": "Action" }, 5 { "_id": 3, "genre": "Romance" }, 6 { "_id": 4, "genre": "Adventure" }, 7 { "_id": 5, "genre": "Family" }, 8 { "_id": 6, "genre": "Fantasy" }, 9 { "_id": 7, "genre": "Thriller" } 10 ])
Collection: movies
1 db.movies.insertMany([ 2 { "_id": 0, "title": "Billy Madison", "genre_codes": [0], "year": 1995, "directors": ["Tamra Davis"] }, 3 { "_id": 1, "title": "Happy Gilmore", "genre_codes": [0, 5], "year": 1996, "directors": ["Dennis Dugan"] }, 4 { "_id": 2, "title": "The Waterboy", "genre_codes": [0, 5], "year": 1998, "directors": ["Frank Coraci"] }, 5 { "_id": 3, "title": "Big Daddy", "genre_codes": [0, 5], "year": 1999, "directors": ["Dennis Dugan"] }, 6 { "_id": 4, "title": "Little Nicky", "genre_codes": [0], "year": 2000, "directors": ["Steven Brill"] }, 7 { "_id": 5, "title": "Mr. Deeds", "genre_codes": [0], "year": 2002, "directors": ["Dennis Dugan"] }, 8 { "_id": 6, "title": "50 First Dates", "genre_codes": [0, 3], "year": 2004, "directors": ["Peter Segal"] }, 9 { "_id": 7, "title": "The Longest Yard", "genre_codes": [0, 2], "year": 2005, "directors": ["Peter Segal"] }, 10 { "_id": 8, "title": "Grown Ups", "genre_codes": [0, 5], "year": 2010, "directors": ["Dennis Dugan"] }, 11 { "_id": 9, "title": "Just Go with It", "genre_codes": [0, 3], "year": 2011, "directors": ["Dennis Dugan"] }, 12 { "_id": 10, "title": "Blended", "genre_codes": [0, 5], "year": 2014, "directors": ["Frank Coraci"] }, 13 { "_id": 11, "title": "The Do-Over", "genre_codes": [0, 2], "year": 2016, "directors": ["Steven Brill"] }, 14 { "_id": 12, "title": "Murder Mystery", "genre_codes": [0, 7], "year": 2019, "directors": ["Kyle Newacheck"] }, 15 { "_id": 13, "title": "You Are So Not Invited to My Bat Mitzvah", "genre_codes": [0], "year": 2023, "directors": ["Sammi Cohen"] }, 16 { "_id": 14, "title": "Punch-Drunk Love", "genre_codes": [1], "year": 2002, "directors": ["Paul Thomas Anderson"] }, 17 { "_id": 15, "title": "Reign Over Me", "genre_codes": [1], "year": 2007, "directors": ["Mike Binder"] }, 18 { "_id": 16, "title": "Funny People", "genre_codes": [1], "year": 2009, "directors": ["Judd Apatow"] }, 19 { "_id": 17, "title": "Uncut Gems", "genre_codes": [1, 7], "year": 2019, "directors": ["Josh Safdie", "Benny Safdie"] }, 20 { "_id": 18, "title": "Bedtime Stories", "genre_codes": [5, 6], "year": 2008, "directors": ["Adam Shankman"] }, 21 { "_id": 19, "title": "Hotel Transylvania", "genre_codes": [5, 6], "year": 2012, "directors": ["Genndy Tartakovsky"] }, 22 { "_id": 20, "title": "Pixels", "genre_codes": [0, 2], "year": 2015, "directors": ["Chris Columbus"] } 23 ])
Desired Output:
1 { 2 "_id": 1, 3 "genre": "Drama", 4 "movies": [ 5 { "title": "Funny People", "year": 2009 }, 6 { "title": "Punch-Drunk Love", "year": 2002 }, 7 { "title": "Reign Over Me", "year": 2007 }, 8 { "title": "Uncut Gems", "year": 2019 } 9 ] 10 } 11 12 { 13 "_id": 0, 14 "genre": "Comedy", 15 "movies": [ 16 { "title": "50 First Dates", "year": 2004 }, 17 { "title": "Big Daddy", "year": 1999 }, 18 { "title": "Billy Madison", "year": 1995 }, 19 { "title": "Blended", "year": 2014 }, 20 { "title": "Grown Ups", "year": 2010 }, 21 { "title": "Happy Gilmore", "year": 1996 }, 22 { "title": "Just Go with It", "year": 2011 }, 23 { "title": "Little Nicky", "year": 2000 }, 24 { "title": "Mr. Deeds", "year": 2002 }, 25 { "title": "Murder Mystery", "year": 2019 }, 26 { "title": "Pixels", "year": 2015 }, 27 { "title": "The Do-Over", "year": 2016 }, 28 { "title": "The Longest Yard", "year": 2005 }, 29 { "title": "The Waterboy", "year": 1998 }, 30 { "title": "You Are So Not Invited to My Bat Mitzvah", "year": 2023 } 31 ] 32 }
We can achieve our task of finding all the comedy and drama movies by creating an aggregation pipeline on the
genres
collection and joining data from the movies
collection. It is possible to create the aggregation on the movies
collection, but for illustration purposes, we are going to assume we are joining from genres
to movies
.- We want to ensure that we are only going to return the documents that are under the genre of comedy and drama by using the $match stage.
- In the $lookup stage, we will perform a left join on the
movies
collection to retrieve the movie titles and years:
- The
from
field is our joining collection. In this case, it's themovies
collection. - In the
let
, we are assigning the “_id” field in thegenres
collection to the variable “genre_id”, which we will use in the pipeline. - In the
pipeline
, we will use$match
to take the documents from themovies
collection where thegenre_id
(from thegenres
collection) corresponds to the array “genre_codes” (from the movies collection). Then, we will use$project
to only return the title and year of each movie.
Let's run the aggregation and see how it is performing:
1 db.genres.aggregate([ 2 { "$match": { "genre": { "$in": [ "Comedy", "Drama" ] } } }, 3 { 4 "$lookup": { 5 "from": "movies", 6 "let": { "genre_id": "$_id" }, 7 "pipeline": [ 8 { "$match": { "$expr": { "$in": [ "$$genre_id", "$genre_codes" ] } } }, 9 { "$project": { _id: 0, "title": 1, "year":1 } } 10 ], 11 "as": "movies" 12 } 13 } 14 ]).explain("allPlansExecution")
Results:
1 executionStats: { 2 nReturned: 2, 3 totalKeysExamined: 0, 4 totalDocsExamined: 8, 5 executionStages: { 6 stage: 'COLLSCAN', 7 filter: { genre: { '$in': [ 'Comedy', 'Drama' ] } }, 8 nReturned: 2, 9 ... 10 "totalDocsExamined": 42, 11 "totalKeysExamined": 0, 12 "collectionScans": 2, 13 "indexesUsed": [], 14 "nReturned": 2
Now, we know that using indexes improves the performance of the program, so let's create them on their respective collections:
1 db.genres.createIndex({ "genre": 1 }) 2 db.movies.createIndex({ "genre_codes": 1 })
Let's run the aggregation and see how the performance improved by adding indexes:
1 db.genres.aggregate([ 2 { "$match": { "genre": { "$in": [ "Comedy", "Drama" ] } } }, 3 { 4 "$lookup": { 5 "from": "movies", 6 "let": { "genre_id": "$_id" }, 7 "pipeline": [ 8 { "$match": { "$expr": { "$in": [ "$$genre_id", "$genre_codes" ] } } }, 9 { "$project": { _id: 0, "title": 1, "year":1 } } 10 ], 11 "as": "movies" 12 } 13 } 14 ]).explain("allPlansExecution")
Results:
1 "executionStats": { 2 "nReturned": 2, 3 "totalKeysExamined": 3, 4 "totalDocsExamined": 2, 5 "stage": "IXSCAN", 6 "nReturned": 2, 7 "indexName": "genre_1", 8 ... 9 "totalDocsExamined": 42, 10 "totalKeysExamined": 0, 11 "collectionScans": 2, 12 "indexesUsed": [], 13 "nReturned": 2,
Uh oh! We can see through the
executionStats
output that this aggregation did not use the index on the movies
that we created. The totalDocsExamined
is 42, meanwhile there are only eight documents in the genres
collection and 21 in the movies
collection! The reason the number of documents examined can get so high is because for each of the two genres
, it scans the entire movies
collection. This leads to two collection scans, which is suboptimal and can lead to significant performance issues. When a $lookup takes place, each document of the source collection will scan the entire joining collection, causing the amount of documents scanned to be significantly high compared to the amount of documents returned.The reason that this method failed to use an index is that
$expr
cannot use an index when one of the operands is an array, like the $genre_codes
field in this example.To avoid using an array operand with
$expr
to evaluate if the genre codes are present, we can use localField
and foreignField
instead, which will provide the same functionality:localField
is the field withingenres
, the source collection that we want to perform an equality match withforeignField
, in this case_id
.foreignField
is the field withinmovies
, the joining collection where we want to perform an equality match withlocalField
, in this casegenre_codes
.- Since we already established which variables we are going to be using in
pipeline
through thelocalField
andforeignField
, we can leave thelet
empty, and remove the$match
stage from ourpipeline
.
Aggregate:
1 db.genres.aggregate([ 2 { $match: { genre: { $in: ["Comedy", "Drama"] } } }, 3 { 4 $lookup: { 5 from: "movies", 6 localField: "_id", 7 foreignField: "genre_codes", 8 let: {}, 9 pipeline: [{ $project: { _id: 0, title: 1, year: 1 } }], 10 as: "movies" 11 } 12 } 13 ]).explain("allPlansExecution")
By utilizing
localField
and foreignField
, we are able to perform an equality match with fields from different collections without using $expr
and $in
.1 "executionStats": { 2 "nReturned": 2, 3 "totalKeysExamined": 3, 4 "totalDocsExamined": 2, 5 "stage": "IXSCAN", 6 "nReturned": 2, 7 "indexName": "genre_1", 8 ... 9 "totalDocsExamined": 19, 10 "totalKeysExamined": 19, 11 "collectionScans": 0, 12 "indexesUsed": ["genre_codes_1"], 13 "nReturned": 2
We can see from the
.explain()
output that there was a 65% improvement in performance. We have successfully used the index genre_codes_1
on the movies
collection. This improves performance as we see only 19 documents were examined. However, we can improve performance even more by including the projection fields in the index.Now that we see that implementing
localField
and foreignField
will use the indexes we created, let's optimize the query by including the projection fields as part of the index to perform a covered query.1 db.movies.createIndex({ "genre_codes":1, "title":1, "year":1}) 2 3 "totalDocsExamined": 0, 4 "totalKeysExamined": 19, 5 "collectionScans": 0, 6 "indexesUsed": [ "genre_codes_1_title_1_year_1" ], 7 "nReturned": 2,
Yay! Our index was used and it performs a covered query (a query that only uses the index to retrieve results)! :) The explain output shows that totalDocsExamined is 0, which means it did not have to fetch any documents, whether for evaluating or projecting. It was all in the index already, ready for use.
Hopefully, this has been a thorough illustration of how you can optimize the performance of $lookup when one of the operands is an array. If you would like to test this in your local environment, please find instructions on creating the
movies
and genres
collections below.Thanks for reading!
Top Comments in Forums
There are no comments on this article yet.