I figured out a fix. But there is still something I don’t understand.
The problem is that a query is taking a long time. The query I was working on looked like this:
{
  "find": "AListedAll",
  "filter": {
    "ticker": "600519"
  },
  "sort": {
    "post_publish_time": 1
  }
}
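In pymongo, I run this roughly as follows (a sketch; the connection URI, database name, and collection handle are placeholders for my real setup):

import pandas as pd
from pymongo import MongoClient

# Hypothetical connection details; adjust the URI and database name.
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["AListedAll"]

# Filter on ticker, sort ascending by post_publish_time, and load
# everything into a dataframe. The cursor fetches documents from the
# server in batches as it is iterated.
cursor = coll.find({"ticker": "600519"}).sort("post_publish_time", 1)
df = pd.DataFrame(cursor)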
This query finds documents with a specific ticker and sorts the results by post_publish_time. When I read the data with pymongo and loaded it into a dataframe, the cursor called next() under the hood to iterate over the entire result set. My guess was that, as the iteration progressed, it took MongoDB longer and longer to fetch each batch of documents. To test this theory, I took the steps below:
- Ran the query with a limit of 25 documents. The response was swift.
- Modified the query to limit to 25 documents while skipping the first 100. This was slightly slower, but still quick.
- Increased the skip to 10,000 documents. Here, the query performance drastically deteriorated, failing to return results within 10 seconds.
About 780k documents match my query. If skipping just 10,000 documents takes more than 10 seconds, retrieving the complete result set will take a very long time.
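Here is roughly the test harness I used for the three steps above (a sketch; the connection details are the same placeholders as before):

from time import perf_counter

from pymongo import MongoClient

# Hypothetical connection details, as in the earlier sketch.
coll = MongoClient("mongodb://localhost:27017")["mydb"]["AListedAll"]

def time_query(skip, limit):
    # Time how long it takes to exhaust a sorted, skipped, limited cursor.
    start = perf_counter()
    docs = list(
        coll.find({"ticker": "600519"})
        .sort("post_publish_time", 1)
        .skip(skip)
        .limit(limit)
    )
    print(f"skip={skip:>6} limit={limit}: {len(docs)} docs in {perf_counter() - start:.2f}s")

time_query(0, 25)       # swift
time_query(100, 25)     # slightly slower, still quick
time_query(10_000, 25)  # did not return within 10 seconds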
To troubleshoot this, I also tried dropping and re-creating the indexes on ticker and post_publish_time, but that didn't help either. A workaround I discovered was to omit the sort operation, which markedly improved performance:
{
  "find": "AListedAll",
  "filter": {
    "ticker": "600519"
  }
}
I then sort the results in Python instead, which fixes the problem for my use case.
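Concretely, the workaround looks like this (same placeholder connection details; the sort moves from the server to pandas):

import pandas as pd
from pymongo import MongoClient

# Hypothetical connection details, as in the earlier sketches.
coll = MongoClient("mongodb://localhost:27017")["mydb"]["AListedAll"]

# Fetch with the filter only, then sort client-side with pandas.
df = pd.DataFrame(coll.find({"ticker": "600519"}))
df = df.sort_values("post_publish_time").reset_index(drop=True)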
But I still have questions. Since I created indexes on both fields used in my earlier query, why is that query (the one with the sort) so slow?
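For concreteness, my index setup looks roughly like this (a simplified sketch; the exact index names and options are omitted, and I created one single-field index per field):

from pymongo import ASCENDING, MongoClient

# Hypothetical connection details, as above.
coll = MongoClient("mongodb://localhost:27017")["mydb"]["AListedAll"]

# One single-field index per field (a sketch of my setup).
coll.create_index([("ticker", ASCENDING)])
coll.create_index([("post_publish_time", ASCENDING)])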
I thought the problem was related to using sort and filter at the same time, so I tried putting both fields in the filter. My query was:
{ ticker: "600519", post_publish_time: { $gt: ISODate("2025-01-01") } }
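In pymongo, this filter translates to roughly the following (ISODate becomes a Python datetime; same placeholder connection details):

from datetime import datetime

from pymongo import MongoClient

# Hypothetical connection details, as in the earlier sketches.
coll = MongoClient("mongodb://localhost:27017")["mydb"]["AListedAll"]

# Both fields in the filter: exact ticker match plus a date range.
cursor = coll.find({
    "ticker": "600519",
    "post_publish_time": {"$gt": datetime(2025, 1, 1)},
})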
It still took a considerable amount of time. Now I can summarize the real problem:
MongoDB queries become very slow when the collection is very large and multiple fields are used.
I wish to understand more about this issue and would appreciate any insights or suggestions for potential workarounds.