kevinadi (Kevin Adistambha)
Hi @Franz_van_Betteraey
I did a similar repro using similar document sizes, 100,000 of them, following the procedure you described, and came across these results:
| Version | _id index size (bytes) | vs. 4.2.22 |
| ------- | ---------------------- | ---------- |
| 4.2.22  | 2088960                | 100.00%    |
| 4.4.16  | 2805760                | 134.31%    |
| 5.0.12  | 2809856                | 134.51%    |
| 6.0.1   | 2650112                | 126.86%    |
| Version | Secondary index size (bytes) | vs. 4.2.22 |
| ------- | ---------------------------- | ---------- |
| 4.2.22  | 1277952                      | 100.00%    |
| 4.4.16  | 1417216                      | 110.90%    |
| 5.0.12  | 1417216                      | 110.90%    |
| 6.0.1   | 1417216                      | 110.90%    |
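In case it helps anyone reproduce this, here is a minimal sketch of the kind of setup I used. The collection name and document shape below are illustrative assumptions; the actual shape came from the procedure in your post:

```javascript
// Insert 100,000 small documents in batches (shape is illustrative).
let batch = [];
for (let i = 0; i < 100000; i++) {
  batch.push({ _id: i, value: "payload-" + i });
  if (batch.length === 10000) {
    db.test.insertMany(batch);
    batch = [];
  }
}
```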
Before taking the size of each collection, I executed db.adminCommand({fsync: 1}) to force WiredTiger to take a checkpoint. This makes the sizes consistent with what is written on disk. Without the fsync, you may find that the sizes keep fluctuating before they settle after about a minute (WiredTiger takes a checkpoint every minute).
In addition to the _id index, I also created a secondary index just to double-check.
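Roughly, the measurement itself looked like this (the collection and index names are illustrative):

```javascript
// Create a secondary index to compare alongside _id.
db.test.createIndex({ value: 1 });

// Force a WiredTiger checkpoint so the on-disk sizes are stable.
db.adminCommand({ fsync: 1 });

// Per-index sizes in bytes, as written on disk.
db.test.stats().indexSizes;
// e.g. { "_id_": 2088960, "value_1": 1277952 }
```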
What I found is that secondary index sizes are quite consistent from 4.4 to 6.0, with 4.2 being the odd one out. For the _id index, 4.4 to 6.0 come in at roughly 130% of the 4.2 size, i.e. about 30% larger.
I believe what you’re seeing is caused by the new-ish (from MongoDB 5.0) WiredTiger feature of Snapshot History Retention. Introducing this feature changed a lot of WiredTiger internals, and this is one of the side effects of that change. For completeness, this issue was known, and was mentioned in SERVER-47652, WT-6082, and WT-6251.
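If you want to see the retention window this feature uses on your own deployment, the server parameter behind it can be inspected; a small example, assuming a MongoDB 5.0+ deployment:

```javascript
// Read the snapshot history retention window, in seconds
// (minSnapshotHistoryWindowInSeconds is available from MongoDB 5.0).
db.adminCommand({ getParameter: 1, minSnapshotHistoryWindowInSeconds: 1 });
```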
Hope this explains what you’re seeing here.
Best regards
Kevin