Aggregation Fails (hangs) Since Atlas upgrade from 6.0.6 to 6.0.8

I have a fairly simple merge aggregation that runs nightly to update some statistics in a users collection. The three collections are decent sized but not humongous. Since my Atlas cluster was updated to 6.0.8, the aggregation just runs forever (days) and I’ve been forced to kill it. I can reproduce the problem on two Atlas clusters running 6.0.8, and I've confirmed the aggregation runs fine on my local development machine on 6.0.6. I'm not even sure where to begin troubleshooting this. Has anyone else seen issues with 6.0.8?

[
			 {$match:{ "stats.dateLastLogin":{ $gt:now().add("d",-30) } }}
			 ,{$lookup:{from:"posts", let:{member:"$memberID"}, as:"posts", pipeline:[
				 {$match:{$expr:{$eq:["$member","$$member"]}}}
				,{$group:{_id:nullValue(), cnt:{$sum:1}, cntActive:{$sum:{$cond:[{$eq:["$status","ACT"]},1,0]}}}}
			]}}
			,{$unwind:{path:"$posts", preserveNullAndEmptyArrays:true}}
			,{$addFields:{"stats.numPosts":{$ifNull:["$posts.cnt",0]}, "stats.numPostsActive":{$ifNull:["$posts.cntActive",0]}}}
			,{$project:{"posts":0}}
			,{$lookup:{from:"views", let:{member:"$memberID"}, as:"views", pipeline:[
				  {$match:{post:{$exists:1}, $expr:{$eq:["$member","$$member"]}}}
				 ,{$group:{_id:nullValue(), cnt:{$sum:1}, last:{$max:{$toDate:"$_id"}}}}
			]}}
			,{$unwind:{path:"$views", preserveNullAndEmptyArrays:true}}
			,{$addFields:{"stats.numPostViews":{$ifNull:["$views.cnt",0]}, "stats.dateLastPostView":{$ifNull:["$views.last","$$REMOVE"]}, "stats.searchGeniusCandidate":{$cond:{if:{$lt:[{$ifNull:["$views.last",createDate(1972,9,4)]},createDate(2016,4,2)]},then:"$$REMOVE",else:true}}}}
			,{$project:{"views":0}}
			,{$project:{stats:1}}
			,{$merge:{into:"members", on:"_id"}}
];
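(Note for anyone trying to reproduce this in plain mongosh: now(), nullValue() and createDate() appear to be shell-extension helpers rather than built-in mongosh functions, so the pipeline won't paste in as-is. Assuming they behave the way their names suggest - createDate taking year, month, day with a 1-based month - rough equivalents would be:)

// Assumed plain-mongosh equivalents of the helper functions used above
// now().add("d",-30)   -> a Date 30 days in the past
var thirtyDaysAgo = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
// createDate(1972,9,4) -> new Date(1972, 8, 4)   (JavaScript months are 0-based)
// createDate(2016,4,2) -> new Date(2016, 3, 2)
// nullValue()          -> null, so the $group stages become {_id: null, ...}
// First stage then becomes:
// {$match:{ "stats.dateLastLogin":{ $gt: thirtyDaysAgo } }}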

Hi @Sean_Daniels - Welcome to the community.

It’s definitely interesting that this was able to be reproduced on two Atlas clusters running the same version. With that said, I am wondering if you could provide the following details to help me reproduce this on my end and troubleshoot the cause:

  1. The output of db.collection.explain("executionStats") from the 6.0.6 local instance
  2. The output of db.collection.explain("executionStats") from the 6.0.8 Atlas instance where you are experiencing the hanging aggregation
  3. The "size", "count" and "avgObjSize" from each of the above 2 environments. You can get these using collStats (see the example commands after this list).
  4. Sample document(s) from each of the collections involved in the aggregation.
  5. How long did the aggregation generally run for on the 6.0.6 instance (when it was working)?
  6. Have you tried running 6.0.8 locally on the same test environment you’ve mentioned above to see if it hangs there as well? I understand you’ve noted it hangs on the 2 Atlas instances on 6.0.8, but this would help narrow things down for troubleshooting purposes.
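The collection stats can be pulled with something along these lines (assuming mongosh and the collection names used in the pipeline):

// Each result includes "size", "count" and "avgObjSize" among other fields
db.runCommand({ collStats: "members" })
db.runCommand({ collStats: "posts" })
db.runCommand({ collStats: "views" })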

Please redact any sensitive or personal information before posting here.

Look forward to hearing from you.

Regards,
Jason


Hi Jason, thanks for the reply. Answers below:

6.0.6: size, 1199718063; count, 700185; avgObjSize, 1713
6.0.8: size, 1178152738; count, 693630; avgObjSize, 1698

Sample from “members”

Sample from “posts”

Sample from “views”

It took about 32 minutes. However, since I added the $match stage at the front of the pipeline (to try to get the aggregation working on a smaller dataset), it takes only 2 minutes or so on 6.0.6. Even with the $match stage it just hangs on 6.0.8.

I have not yet, only because I do not believe 6.0.8 has been released to Homebrew yet, which is how I install/update on my local machine:

nadja:dealstream sdaniels$ brew upgrade mongodb-community@6.0
Warning: mongodb/brew/mongodb-community 6.0.6 already installed

Thanks Sean - Going to take a look at the information provided and will update here if I notice anything that may be causing the hang.

That is fair. If possible, can you try launching it directly from the unpacked download?
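For example, something along these lines once the tarball is unpacked (the directory name, dbpath and port below are just placeholders):

cd mongodb-macos-x86_64-6.0.8/bin
./mongod --dbpath /path/to/test-dbpath --port 27018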

Best regards,
Jason

I inspected both of the explain outputs, but it seems they're using the default verbosity. Would you be able to provide the "executionStats" level output?

Additionally, regarding the 32 minutes you noted when it ran on 6.0.6 - was this the local or the Atlas instance?

Just to cover some extra bases, what Atlas tier is the aggregation hanging on? I’m curious to know whether you noticed any resource pressure when it runs.

Both.

It’s hanging on my dev instance and my production instance. Dev is M10 and Prod is M50.

I’m not sure how to achieve this for an aggregation. To get the output I posted earlier I used db.runCommand({aggregate:"members", pipeline:myPipeline, explain:true})

How would I modify the above to get the executionStats level output? Thanks.

OK, I think I figured out the executionStats thing. I used db.runCommand({explain:{aggregate:"members",pipeline:pipe, cursor:{}}, verbosity:"executionStats"})
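I believe the equivalent using the collection helper would be the following, in case that's easier for anyone following along:

// Same executionStats-level explain via the mongosh collection helper
db.members.explain("executionStats").aggregate(pipe)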

I have updated the gist for 6.0.6 above accordingly. I am still waiting for the results on 6.0.8 (it appears to be HUNG?)

Yeah, I tried the explain multiple times on 6.0.8 and it just hangs. :person_shrugging:

Thanks @Sean_Daniels - I’ve run some tests and got it working on 6.0.8, but I’m now going to expand my test collections. Do you know how many documents are inside the other 2 collections mentioned? I assume the members collection is the count below:

In the meantime, I’ll generate posts and views collections of similar sizes, but let me know the actual values for these 2 other collections.
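For reference, a rough sketch of how a similarly sized collection could be generated (the document shape is a guess based only on the fields the pipeline references - member and status - and the counts are placeholders):

// Rough sketch: bulk-insert synthetic "posts" documents in batches of 10k
var batch = [];
for (var i = 0; i < 520000; i++) {
  batch.push({
    member: "member" + (i % 700000),             // assumed to line up with members.memberID
    status: (i % 4 === 0) ? "ACT" : "INACTIVE"   // "ACT" is the only status the pipeline checks
  });
  if (batch.length === 10000) { db.posts.insertMany(batch); batch = []; }
}
if (batch.length > 0) { db.posts.insertMany(batch); }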

members: 700,635
posts: 520,801
views: 76,613,887


Hi @Sean_Daniels,

Thanks for your patience.

I wasn’t able to replicate this same behaviour on two local 6.0.6 and 6.0.8 instances. The pipelines ran (although they did take some time) and had the exact same execution stats (other than durationMillis, which did not vary by any significant amount between the two versions).

{$match:{ "stats.dateLastLogin":{ $gt:now().add("d",-30) } }}

I do have one test I am curious to see the results of: on the 6.0.8 environment, can you try changing this $match stage to match only a single document using an index? The easiest would probably be a $match on the _id value of a single document; then see whether the aggregation completes or still hangs.
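For example, the first stage would become something like the following (the _id value here is just a placeholder):

// Match a single member by _id so the stage can use the _id index
{$match:{ _id: ObjectId("64a0f0000000000000000000") }}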

Regards,
Jason

I wonder if this is related to the issue I’m getting here: Major performance hit for aggregation with lookup from 5.0.13 to 5.0.18


Difficult to say with the current information. One thing I can see in that thread is that the CPU is choked for minutes, but perhaps you can verify on the original post whether it ever completes, along with the other previously requested information (if it never completes, you may not be able to extract the execution stats output from the later version on that thread). However, I would continue on that thread for that particular topic, since it may be unrelated.
