Hi
I have a Spark connector that reads from my MongoDB database, with the following version information:
"clientMetadata": {
"driver": {
"name": "mongo-java-driver|legacy|mongo-spark",
"version": "3.12.3|2.4.1"
},
"os": {
"type": "Linux",
"name": "Linux",
"architecture": "amd64",
"version": "3.10.0-1160.99.1.el7.x86_64"
},
"platform": "Java/Red Hat, Inc./1.8.0_382-b05|Scala/2.11.12:Spark/2.4.8.7.1.9.0-387"
},
and for some reason, whose root cause I cannot find, every read query includes a filter:
{
    "$match": {
        "_id": {
            "$lt": "747877945yrhduwedu"
        }
    }
}
which I do not specify in the aggregation pipeline at all. This causes the query to scan the entire collection and results in slow queries; if I test the same aggregation pipeline with this $match removed, the query is lightning fast.
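For reference, a rough sketch of the kind of comparison I am describing, run directly against MongoDB with PyMongo (the connection string, collection name, and field names are placeholders, not my real ones):

import time
from pymongo import MongoClient

# Placeholder connection details and names
coll = MongoClient("mongodb://host:27017")["mydb"]["collection"]

# The filter the connector injects, and the pipeline I actually specify
injected_match = {"$match": {"_id": {"$lt": "747877945yrhduwedu"}}}
my_pipeline = [
    {"$match": {"date": 202311, "day": 30, "hour": 11, "array.field": "data"}},
    {"$project": {"array.field": 0}},
]

# Time the pipeline as written, then with the injected _id $match prepended
for pipeline in (my_pipeline, [injected_match] + my_pipeline):
    start = time.monotonic()
    docs = sum(1 for _ in coll.aggregate(pipeline))
    print(f"{len(pipeline)} stages: {docs} docs in {time.monotonic() - start:.2f}s")

Without the injected $match the pipeline returns quickly; with it, the same query takes far longer.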
Any assistance would be greatly appreciated
Kindest Regards
Gareth Furnell
dha24
(Dharmendra)
Can you share the complete pipeline that you are using to get data through Spark?
Spark settings:
{
    'pipeline': [
        {'$match': {'date': 202311,
                    'day': 30,
                    'hour': 11,
                    'array.field': 'data'}},
        {'$project': {'array.field': 0,
                      'array.field.headers': 0,
                      'array.field': 0}}
    ],
    'spark.mongodb.input.batchSize': '1000',
    'spark.mongodb.input.localThreshold': '15',
    'spark.mongodb.input.readPreference.name': 'secondary',
    'spark.mongodb.input.registerSQLHelperFunctions': False,
    'spark.mongodb.input.sampleSize': '1000',
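In case it helps, a rough sketch of how those settings are applied on the read side (the URI, database, and collection names are placeholders, and I am assuming the pipeline is handed to the connector as a JSON string via its pipeline read option):

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Trimmed version of the pipeline from the settings above (field names are placeholders)
pipeline = [
    {"$match": {"date": 202311, "day": 30, "hour": 11, "array.field": "data"}},
    {"$project": {"array.field": 0}},
]

df = (spark.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", "mongodb://host:27017/mydb.collection")  # placeholder URI
      .option("pipeline", json.dumps(pipeline))                # assumed pipeline option
      .option("batchSize", "1000")
      .option("localThreshold", "15")
      .option("readPreference.name", "secondary")
      .option("sampleSize", "1000")
      .load())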
When the query goes through, though, and I check it once it shows up as a slow query, it includes the following:
"command": {
"aggregate": "collection",
"pipeline": [
{
"$match": {
"_id": {
"$lt": "23487fhisjdkcn"
}
}
},
{
"$match": {
"date": 202311,
"day": 30,
"hour": 11,
"array.field": "field"
}
},
{
"$project": {
"array.field": 0,
"array.field.headers": 0,
"array.field": 0
}
}
],
"cursor": {
"batchSize": 1000
},