MongoDB query becomes unexpectedly slow and broken

Tianzhu_Wang · January 15, 2025, 6:17pm

I am encountering performance issues with MongoDB running in a Docker container on a Debian 12 server, using the default ext4 filesystem. The server specifications include dual E5 2696v3 processors (72 CPUs), 320GB DDR4 2133MHz ECC DRAM, and a 16TB HDD configured in RAID5 with read/write speeds up to 600-700MB/s. I am using MongoDB Community Server version 8.0.
Here is the query I am working with (through pymongo):

comment_data = list(stock_database.find({‘ticker’: ticker}, {‘_id’: 0, ‘comment’:0}).sort(‘post_publish_time’, pymongo.ASCENDING))

This querys the comment data of a given stock and sort it by post_publish_time. I have created indexes for both post_publish_time and ticker. There are a total of 5368 different tickers. My purpose is to read the data into memory using Python and export them to csv. The first 5363 tickers were just fine. I was able to read data from MongoDB sucessfully. However, when it was down to the last 5 tickers, there was problem. The query suddenly became very very slow. What is strange is that the distribution of docs across tickers is sort of even. The last 5 tickers doesn’t come with significantly more documents than the previous ones.
I have tried to troubleshoot the problem in the following ways:

The most important one of all: After processing the first 5363 tickers, I attempted to re-query tickers that had already been successfully exported to csv. It turned out that None of the tickers works fine any more. I thought the problem was with the last 5 tickers, but it is not. It seems that MongoDB is broken at some point (after data of the first 5363 tickers are retrieved and processed).
I tried to export the query result from Compass

e684c4a2cd3c04a95e4c110b20165c92390×1458 127 KB

I don’t know much about how Compass exports data. But it seems that compass is handling documents in chunks of 1000. As the spikes in the network suggests, MongoDB doesn’t take much time to process the first several batches of documents. However, it is becoming more and more slow as the batches it has processed grow. Since my specific query matches 789140 documents in the database. It is likely that the query would take a long long time to complete, which appears to be stuck when working with pymongo.
I have examined my hardware and there is no failure. I was worring if the problem is related to disk IO. I tested IO speed via fio and everything is fine. Memory is also working fine according to pcm-memory.

What is the possible cause of this problem and how should I fix it?

Tianzhu_Wang · January 17, 2025, 2:54pm

I figured out a fix. But there is still something I don’t understand.

The problem is the query is taking a long time. This query I was working on was like:

{
    "find": "AListedAll",
    "filter": {
        "ticker": "600519"
    },
    "sort": {
        "post_publish_time": 1
    }
}

This query identifies documents with a specific ticker and sorts the results by post_publish_time. When I was reading data using pymongo and loading them as dataframe, the cursor is calling the next() method to iterate over the entire collection. My guess is, as the iteration process goes, it takes longer and longer time for MongoDB to get the desired documents. In order to test this theory, I took the steps below:

Ran the query with a limit of 25 documents. The response was swift.
Modified the query to limit to 25 documents while skipping the first 100. This was slightly slower, but still quick.
Increased the skip to 10,000 documents. Here, the query performance drastically deteriorated, failing to return results within 10 seconds.

There are about 780k documents that matched my query. If skipping 10000 documents takes more than 10 seconds, then getting the complete result will take a considerable amount of time.

In order to troubleshoot this, I also tried to drop and re-create the index for ticker and post_publish_time. But that doesn’t work either. A workaround I discovered involved omitting the sort operation, which markedly improved performance:

{
    "find": "AListedAll",
    "filter": {
        "ticker": "600519"
    }
}

I later sort the result in Python using Python code. There the problem is fixed.

But I still have questions. Since I have created index for both fields that are in my previous query, why would my previous query (that contains sort) be so slow?

I thought the problem is related to use sort and filter at the same time, so I tried to use both fields in filter. My query is

{ ticker: "600519", post_publish_time: {$gt: ISODate('2025-01-01')}}

It still takes a considerable amount of time. Now I am able to summarize the real problem:

MongoDB becomes very slow in query when the collection is very large and multiple fields are used

I wish to understand more about this issue and would appreciate any insights or suggestions for potential workarounds.

Shane · January 17, 2025, 7:14pm

I have created indexes for both post_publish_time and ticker.

Exactly what indexes are created on this collection? You can show the output of collection.index_information().

Tianzhu_Wang · January 18, 2025, 9:48am

Hello Shane, Thanks for reaching out! Here is the result of index_information()

{'_id_': {'v': 2, 'key': [('_id', 1)]},
 'post_id_1': {'v': 2, 'key': [('post_id', 1)]},
 'user_id_1': {'v': 2, 'key': [('user_id', 1)]},
 'comments.reply_id_1': {'v': 2, 'key': [('comments.reply_id', 1)]},
 'comments.reply_publish_time_1': {'v': 2,
  'key': [('comments.reply_publish_time', 1)]},
 'comments.reply_user.user_id_1': {'v': 2,
  'key': [('comments.reply_user.user_id', 1)]},
 'scrape_time_1': {'v': 2, 'key': [('scrape_time', 1)]},
 'post_content_1': {'v': 2, 'key': [('post_content', 1)]},
 'ticker_1': {'v': 2, 'key': [('ticker', 1)]},
 'post_publish_time_1': {'v': 2, 'key': [('post_publish_time', 1)]}}

The index types for both post_publish_time and ticker are 1(asc)

Shane · January 21, 2025, 11:08pm

You need a compound index on {"ticker": 1, "post_publish_time":1}

For example:

coll.create_index([("ticker", ASCENDING), ("post_publish_time", ASCENDING)])
# or 
coll.create_index({"ticker": ASCENDING, "post_publish_time": ASCENDING})

See https://mongodb.prakticum-team.ru/docs/manual/core/indexes/index-types/index-compound/