Most likely the $group stage is part of the problem. It blocks until all documents are processed and can use a lot of memory before the first resulting document comes out.

I had success moving the computation of last_event out of the $group and into the sub-pipeline of a $lookup added after the $group. Something like

{ "$group": {
    "stage_id" : "$stage_id" , "up_value" : "$up_value"
} } ,
{ "$lookup": {
    "from": "collection" ,
    "as": "last_event" ,
    "let": {
        "stage_id" : "$stage_id" ,
        "up_value" : "$up_value"
    } ,
    "pipeline": [
        { "$match" : { timestamp, stage_id, uq_value, ... } } ,
        { "$sort" : { timestamp : -1 } } ,
        { "$limit" : 1 }
    ]
} }

The theory is that the $group now uses a lot less memory.
You also use $$ROOT in your $group, so every document of a given group must be fetched. With the $lookup, only one document per group is fetched, provided an index covers the $match and $sort.
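As a sketch, assuming the field names from the snippet above and that the events live in collection, an index like this would let the sub-pipeline resolve each group from a single index entry:

// Hypothetical index; field names are assumed from the snippet above.
// It can serve the equality match on stage_id / up_value and the
// descending sort on timestamp, so the $limit: 1 reads a single entry.
db.collection.createIndex( { "stage_id": 1, "up_value": 1, "timestamp": -1 } )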

I think it could be improved even further by doing the $group in 2 steps. First, $group on stage_id only, then a $lookup that matches stage_id and the timestamp; inside that $lookup do another $group, this time on up_value. With this you need a second $lookup, much like the one I shared above.
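A rough sketch of that idea, with the same assumed field names (the second $lookup is omitted; it would look like the one above):

{ "$group": { "_id": "$stage_id" } },
{ "$lookup": {
    "from": "collection",
    "as": "up_values",
    "let": { "stage_id": "$_id" },
    "pipeline": [
        { "$match": { "$expr": { "$eq": [ "$stage_id", "$$stage_id" ] } } },
        // plus the timestamp condition from your original query
        { "$group": { "_id": "$up_value" } }
    ]
} },
{ "$unwind": "$up_values" },
// ... then a second $lookup matching on both stage_id and up_values._id,
// with the $sort on timestamp and $limit: 1, as in the first snippet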

Another thing I would try is to use $documents as the first stage rather than the $in. Then the first $group would be instantaneous.
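For example, if the stage_id / up_value pairs behind the $in are already known on the client side, something like this would skip the first pass over the collection entirely (the listed values are placeholders):

// Sketch only, MongoDB 5.1+. $documents must be the first stage of a
// database-level aggregation (db.aggregate, not a collection aggregate).
db.aggregate( [
    { "$documents": [
        { "stage_id": "A", "up_value": 1 },
        { "stage_id": "B", "up_value": 2 }
    ] },
    // then the same $lookup as in the first snippet, with
    // "let": { "stage_id": "$stage_id", "up_value": "$up_value" }
] )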

To give a more detailed answer I would need a playground with sample documents to work with. Time is scarce, so I try to avoid recreating documents that others already have.