May 2024

I am attempting to execute a query to count documents based on the ledgerName field, which is used to search for the required ledger name. Additionally, I need to obtain distinct counts of each status for the searched ledger name and store all data that matches each status into a list.

Below is the demo data for three collections involved in my query:

1. Settings (collection size: 30K documents):

```javascript
{
  _id:           // ObjectId of doc
  company_id:    // ObjectId of company
  ledgerName_id: // ID of ledger in String
}
```

2. History (collection size: 600K documents):

```javascript
{
  _id:           // ObjectId of doc
  company_id:    // ObjectId of company
  ledgerName_id: // ID of ledger in String
  createdAt:     // Date of creation of history
  status:        // Status 1 or 0
}
```

3. Ledger_Name (collection size: 50M documents):

```javascript
{
  _id:        // ObjectId of doc
  company_id: // ObjectId of company
  name_id:    // ID of ledger in String
  ledgerName: // Name of ledger in String
}
```

Here is the query. It uses an indexed scan on the Settings collection and leverages indexes in the other two collections during the $lookup stages:

Query:

```javascript
db.getCollection("Settings").aggregate([
  { $match: { "company_id": // Some Company ID
  } },
  {
    $lookup: {
      as: "SettingsData",
      from: "History",
      let: { company_id: "$company_id", ledgerName_id: "$ledgerName_id" },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$company_id", "$$company_id"] },
                { $eq: ["$ledgerName_id", "$$ledgerName_id"] }
              ]
            },
            createdAt: {
              $gte: ISODate("2024-04-01T00:00:00.000+0530"),
              $lt: ISODate("2025-04-01T00:00:00.000+0530")
            }
          }
        },
        { $sort: { createdAt: -1 } }, // sort by createdAt in descending order
        { $limit: 1 },
        { $project: { _id: 1, createdAt: 1, status: 1 } }
      ]
    }
  },
  { $unwind: { path: "$SettingsData", preserveNullAndEmptyArrays: true } },
  {
    $lookup: {
      as: "ledgersData",
      from: "Ledger_Name",
      let: { company_id: "$company_id", ledgerName_id: "$ledgerName_id" },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$company_id", "$$company_id"] },
                { $eq: ["$name_id", "$$ledgerName_id"] }
              ]
            }
          }
        },
        { $project: { _id: 0, ledgerName: 1 } } // search term
      ]
    }
  },
  { $unwind: { path: "$ledgersData", preserveNullAndEmptyArrays: true } },
  { $count: "count" }
])
```

(Note: the `let` variables must be referenced with `$$` inside the sub-pipelines, and the first $lookup reads from History, which is the collection that has `createdAt` and `status`.)

When I run this query without any aggregation stages like $count or $group, it completes within 500 milliseconds. However, adding these stages extends the execution time to 15-20 seconds. Could anyone help me optimize this query?

$group and $count are so-called “blocking” stages, which means they have to count (or group) ALL the results from the previous stage before they can emit anything. This is expected behaviour.

Are you sure the query is actually completing after 500ms, and not just returning the first cursor batch?

Please share the code you use without the $count stage.

Simply doing result = …aggregate( [ the pipeline without $count ] ) does not process all the documents of the result set, so in fact you just get the first cursor batch.

Calling result.toArray() should get you all the documents.

Since you $unwind just to count, you might want to simply $project the $size of $ledgersData, which is not blocking, and then $group to $sum the sizes. This would at least reduce the potential extra RAM usage of $unwind.

And on a second look at the code, you could try a $count inside the last $lookup pipeline. I am just not too sure about the empty-array case.

@steevej Apologies for the late reply.

Answer to your first question: it is the same query I provided above, just with the $count stage removed.

The additional pipelines will include a $group stage that computes sums of documents under different business logic. (I can’t share that part, which is why I created a simple sample query with a $count stage. If that query executes quickly, the others should too.)

So you confirm, as Peter_Hubbard hinted, that your query is NOT completing after 500ms, because by just removing the $count stage you only receive the first cursor batch.

To complete the query, rather than fetch a single cursor batch, you have to exhaust the cursor, for example by calling toArray().

Please share the explain plan.

Explain output for 1st Lookup stage:

```javascript
"totalDocsExamined" : NumberLong(42526),
"totalKeysExamined" : NumberLong(42526),
"collectionScans" : NumberLong(0),
"indexesUsed" : [ "company_id_1_ledgerName_id_1_createdAt_1" ],
"nReturned" : NumberLong(42544),
"executionTimeMillisEstimate" : NumberLong(639)
```

Explain output for 2nd Lookup stage:

```javascript
"totalDocsExamined" : NumberLong(42534),
"totalKeysExamined" : NumberLong(42534),
"collectionScans" : NumberLong(0),
"indexesUsed" : [ "company_id_1_ledgerName_id_1_name_1" ],
"nReturned" : NumberLong(42544),
"executionTimeMillisEstimate" : NumberLong(9909)
```