Aggregation Fails (hangs) Since Atlas upgrade from 6.0.6 to 6.0.8

I have a fairly simple merge aggregation that runs nightly to update some statistics in a users collection. The three collections are decent sized but not humongous. Since my Atlas cluster was updated to 6.0.8, the aggregation just runs forever (days) and I’ve been forced to kill it. I can reproduce the problem on two Atlas clusters running 6.0.8, and I've confirmed the aggregation runs fine on my local development machine on 6.0.6. I'm not even sure where to begin troubleshooting this. Has anyone else seen issues with 6.0.8?

[
			 {$match:{ "stats.dateLastLogin":{ $gt:now().add("d",-30) } }}
			 ,{$lookup:{from:"posts", let:{member:"$memberID"}, as:"posts", pipeline:[
				 {$match:{$expr:{$eq:["$member","$$member"]}}}
				,{$group:{_id:nullValue(), cnt:{$sum:1}, cntActive:{$sum:{$cond:[{$eq:["$status","ACT"]},1,0]}}}}
			]}}
			,{$unwind:{path:"$posts", preserveNullAndEmptyArrays:true}}
			,{$addFields:{"stats.numPosts":{$ifNull:["$posts.cnt",0]}, "stats.numPostsActive":{$ifNull:["$posts.cntActive",0]}}}
			,{$project:{"posts":0}}
			,{$lookup:{from:"views", let:{member:"$memberID"}, as:"views", pipeline:[
				  {$match:{post:{$exists:1}, $expr:{$eq:["$member","$$member"]}}}
				 ,{$group:{_id:nullValue(), cnt:{$sum:1}, last:{$max:{$toDate:"$_id"}}}}
			]}}
			,{$unwind:{path:"$views", preserveNullAndEmptyArrays:true}}
			,{$addFields:{"stats.numPostViews":{$ifNull:["$views.cnt",0]}, "stats.dateLastPostView":{$ifNull:["$views.last","$$REMOVE"]}, "stats.searchGeniusCandidate":{$cond:{if:{$lt:[{$ifNull:["$views.last",createDate(1972,9,4)]},createDate(2016,4,2)]},then:"$$REMOVE",else:true}}}}
			,{$project:{"views":0}}
			,{$project:{stats:1}}
			,{$merge:{into:"members", on:"_id"}}
];
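(Note for anyone trying to reproduce this in plain mongosh: now(), nullValue() and createDate() appear to be shell-extension helpers rather than built-in mongosh functions, so the pipeline won't paste in as-is. Assuming they behave the way their names suggest - createDate taking year, month, day with a 1-based month - rough equivalents would be:)

// Assumed plain-mongosh equivalents of the helper functions used above
// now().add("d",-30)   -> a Date 30 days in the past
var thirtyDaysAgo = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
// createDate(1972,9,4) -> new Date(1972, 8, 4)   (JavaScript months are 0-based)
// createDate(2016,4,2) -> new Date(2016, 3, 2)
// nullValue()          -> null, so the $group stages become {_id: null, ...}
// First stage then becomes:
// {$match:{ "stats.dateLastLogin":{ $gt: thirtyDaysAgo } }}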

Hi @Sean_Daniels - Welcome to the community.

It’s definitely interesting that this was able to be reproduced on two Atlas clusters running the same version. With that said, I am wondering if you could provide the following details to help me reproduce this on my end and troubleshoot the cause:

  1. The output of db.collection.explain("executionStats") from the 6.0.6 local instance
  2. The output of db.collection.explain("executionStats") from the 6.0.8 Atlas instance where you are experiencing the hanging aggregation
  3. The "size", "count" and "avgObjSize" from each of the above 2 environments. You can get these using collStats (see the example commands after this list).
  4. Sample document(s) from each of the collections involved in the aggregation.
  5. How long did the aggregation generally run for on the 6.0.6 instance (when it was working)?
  6. Have you tried running 6.0.8 locally on the same test environment you’ve mentioned above to see if it hangs there as well? I understand you’ve noted it hangs on the 2 Atlas instances on 6.0.8, but this would help narrow things down for troubleshooting purposes.
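The collection stats can be pulled with something along these lines (assuming mongosh and the collection names used in the pipeline):

// Each result includes "size", "count" and "avgObjSize" among other fields
db.runCommand({ collStats: "members" })
db.runCommand({ collStats: "posts" })
db.runCommand({ collStats: "views" })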

Please redact any sensitive or personal information before posting here.

Look forward to hearing from you.

Regards,
Jason


Hi Jason, thanks for the reply. Answers below:

6.0.6: size, 1199718063; count, 700185; avgObjSize, 1713
6.0.8: size, 1178152738; count, 693630; avgObjSize, 1698

Sample from “members”

Sample from “posts”

Sample from “views”

It took about 32 minutes. However, since I added the $match stage at the front of the pipeline (to try to get the aggregation working on a smaller dataset), it takes only 2 minutes or so on 6.0.6. Even with the $match stage it just hangs on 6.0.8.

I have not yet, only because I do not believe 6.0.8 has been released to Homebrew yet, which is how I install/update on my local machine:

nadja:dealstream sdaniels$ brew upgrade mongodb-community@6.0
Warning: mongodb/brew/mongodb-community 6.0.6 already installed

Thanks Sean - Going to take a look at the information provided and will update here if I notice anything that may be causing the hang.

That is fair. If possible, can you try launching it directly from the unpacked download?
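For example, something along these lines once the tarball is unpacked (the directory name, dbpath and port below are just placeholders):

cd mongodb-macos-x86_64-6.0.8/bin
./mongod --dbpath /path/to/test-dbpath --port 27018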

Best regards,
Jason

I inspected both of the explain outputs, but it seems they're using the default verbosity. Would you be able to provide the "executionStats" level output?

Additionally, regarding the 32 minutes you noted when it ran on 6.0.6 - was this the local or the Atlas instance?

Just to cover some extra bases, what Atlas tier is the aggregation hanging on? I’m curious to know whether you noticed any resource pressure when it runs.

Both.

It’s hanging on my dev instance and my production instance. Dev is M10 and Prod is M50.

I’m not sure how to achieve this for an aggregation. To get the output I posted earlier I used db.runCommand({aggregate:"members", pipeline:myPipeline, explain:true})

How would I modify the above to get the executionStats level output? Thanks.

OK, I think I figured out the executionStats thing. I used db.runCommand({explain:{aggregate:"members",pipeline:pipe, cursor:{}}, verbosity:"executionStats"})
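I believe the equivalent using the collection helper would be the following, in case that's easier for anyone following along:

// Same executionStats-level explain via the mongosh collection helper
db.members.explain("executionStats").aggregate(pipe)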

I have updated the gist for 6.0.6 above accordingly. I am still waiting for the results on 6.0.8 (it appears to be HUNG?)

Yeah, I tried the explain multiple times on 6.0.8 and it just hangs. :person_shrugging:

Thanks @Sean_Daniels - I’ve run some tests and got it working on 6.0.8, but I’m now going to expand my test collections. Do you know how many documents are inside the other 2 collections mentioned? I assume the members collection is the count below:

In the meantime, I’ll generate posts and views collections of similar sizes, but let me know the actual values for these 2 other collections.
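For reference, a rough sketch of how a similarly sized collection could be generated (the document shape is a guess based only on the fields the pipeline references - member and status - and the counts are placeholders):

// Rough sketch: bulk-insert synthetic "posts" documents in batches of 10k
var batch = [];
for (var i = 0; i < 520000; i++) {
  batch.push({
    member: "member" + (i % 700000),             // assumed to line up with members.memberID
    status: (i % 4 === 0) ? "ACT" : "INACTIVE"   // "ACT" is the only status the pipeline checks
  });
  if (batch.length === 10000) { db.posts.insertMany(batch); batch = []; }
}
if (batch.length > 0) { db.posts.insertMany(batch); }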

members: 700,635
posts: 520,801
views: 76,613,887


Hi @Sean_Daniels,

Thanks for your patience.

I wasn’t able to replicate this same behaviour on two local 6.0.6 and 6.0.8 instances. The pipelines ran (although they did take some time) and had the exact same execution stats (other than durationMillis, which did not vary by any significant amount between the two versions).

{$match:{ "stats.dateLastLogin":{ $gt:now().add("d",-30) } }}

I do have one test I am curious to see the results of: on the 6.0.8 environment, can you try changing this $match stage to match only a single document using an index? The easiest would probably be a $match on the _id value of a single document; then see whether the aggregation completes or still hangs.
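For example, the first stage would become something like the following (the _id value here is just a placeholder):

// Match a single member by _id so the stage can use the _id index
{$match:{ _id: ObjectId("64a0f0000000000000000000") }}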

Regards,
Jason

I wonder if this is related to the issue I’m getting here: Major performance hit for aggregation with lookup from 5.0.13 to 5.0.18


Difficult to say with the current information. One thing I can see in that thread is that the CPU is choked for minutes, but perhaps you can verify on the original post whether it ever completes, along with the other previously requested information (if it never completes, you may not be able to extract the execution stats output from the later version on that thread). However, I would continue on that thread for that particular topic, since it may be unrelated.
