Most likely the $group stage is part of the problem. It blocks until all documents are processed and can use a lot of memory before the first resulting document comes out.

I had success moving the computation of last_event out of the $group and into the sub-pipeline of a $lookup added after the $group. Something like

{ "$group": {
    "stage_id" : "$stage_id" , "up_value" : "$up_value"
} } ,
{ "$lookup": {
    "from": "collection" ,
    "as": "last_event" ,
    "let": {
        "stage_id" : "$stage_id" ,
        "up_value" : "$up_value"
    } ,
    "pipeline": [
        { "$match" : { timestamp, stage_id, uq_value, ... } } ,
        { "$sort" : { timestamp : -1 } } ,
        { "$limit" : 1 }
    ]
} }

The theory is that the $group now uses a lot less memory.
You also use $$ROOT in your $group, so every document of a given group must be fetched. With the $lookup, only one document per group is fetched, provided an index covers the $match and $sort.
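As a sketch, assuming the field names from the snippet above and that the events live in collection, an index like this would let the sub-pipeline resolve each group from a single index entry:

// Hypothetical index; field names are assumed from the snippet above.
// It can serve the equality match on stage_id / up_value and the
// descending sort on timestamp, so the $limit: 1 reads a single entry.
db.collection.createIndex( { "stage_id": 1, "up_value": 1, "timestamp": -1 } )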

I think it could be improved even further by doing the $group in 2 steps. First, $group on stage_id only, then a $lookup that matches stage_id and the timestamp; inside that $lookup do another $group, this time on up_value. With this you need a second $lookup, much like the one I shared above.
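A rough sketch of that idea, with the same assumed field names (the second $lookup is omitted; it would look like the one above):

{ "$group": { "_id": "$stage_id" } },
{ "$lookup": {
    "from": "collection",
    "as": "up_values",
    "let": { "stage_id": "$_id" },
    "pipeline": [
        { "$match": { "$expr": { "$eq": [ "$stage_id", "$$stage_id" ] } } },
        // plus the timestamp condition from your original query
        { "$group": { "_id": "$up_value" } }
    ]
} },
{ "$unwind": "$up_values" },
// ... then a second $lookup matching on both stage_id and up_values._id,
// with the $sort on timestamp and $limit: 1, as in the first snippet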

Another thing I would try is to use $documents as the first stage rather than the $in. Then the first $group would be instantaneous.
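For example, if the stage_id / up_value pairs behind the $in are already known on the client side, something like this would skip the first pass over the collection entirely (the listed values are placeholders):

// Sketch only, MongoDB 5.1+. $documents must be the first stage of a
// database-level aggregation (db.aggregate, not a collection aggregate).
db.aggregate( [
    { "$documents": [
        { "stage_id": "A", "up_value": 1 },
        { "stage_id": "B", "up_value": 2 }
    ] },
    // then the same $lookup as in the first snippet, with
    // "let": { "stage_id": "$stage_id", "up_value": "$up_value" }
] )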

To give a more detailed answer I would need a playground with sample documents to work with. Time is scarce, so I try to avoid recreating documents that others already have.