MongoDB with Spark connector and Java read issue

Gareth_Furnell · November 29, 2023, 1:55pm

Hi

I have a Spark job that reads from my MongoDB database through the Spark connector. The client metadata reports the following versions:

"clientMetadata": {
        "driver": {
            "name": "mongo-java-driver|legacy|mongo-spark",
            "version": "3.12.3|2.4.1"
        },
        "os": {
            "type": "Linux",
            "name": "Linux",
            "architecture": "amd64",
            "version": "3.10.0-1160.99.1.el7.x86_64"
        },
        "platform": "Java/Red Hat, Inc./1.8.0_382-b05|Scala/2.11.12:Spark/2.4.8.7.1.9.0-387"
    },

For some reason that I cannot find the root cause of, every read query is sent with an additional filter stage:

{
                "$match": {
                    "_id": {
                        "$lt": "747877945yrhduwedu"
                    }
                }
            }

I do not specify this stage anywhere in my aggregation pipeline. It causes the query to scan the entire collection, which makes the query slow. If I run the same aggregation pipeline without this $match, the query is lightning fast.
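For anyone who wants to repeat that comparison, here is a minimal sketch in plain Python that strips leading `_id`-only `$match` stages from a pipeline copied out of the slow query log, so the remainder can be run by hand. The helper name `strip_id_range_match` and the second stage (`status: example`) are hypothetical and only for illustration; the first stage is the one shown above.

```python
def strip_id_range_match(pipeline):
    """Return a copy of `pipeline` without leading $match stages that filter only on _id."""
    stages = list(pipeline)  # shallow copy, the input list is left untouched
    while stages:
        match = stages[0].get("$match")
        # Drop the stage only if it is a $match whose single key is _id.
        if isinstance(match, dict) and set(match) == {"_id"}:
            stages.pop(0)
        else:
            break
    return stages


# Pipeline as it might appear in the slow query log.
observed = [
    {"$match": {"_id": {"$lt": "747877945yrhduwedu"}}},  # the unexpected stage
    {"$match": {"status": "example"}},  # hypothetical user-defined stage
]

cleaned = strip_id_range_match(observed)
print(cleaned)
```

The cleaned list could then be passed to something like PyMongo's `collection.aggregate(cleaned)` and timed against the original, assuming a direct connection to the same collection is available.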

Any assistance would be greatly appreciated
Kindest Regards
Gareth Furnell

dha24 · November 30, 2023, 8:21am

Can you share the complete pipeline that you are using to read data through Spark?

Gareth_Furnell · November 30, 2023, 9:08am

Spark settings:
 
{   'pipeline': [   {   '$match': {   'date': 202311,
                                      'day': 30,
                                      'hour': 11,
                                      'array.field': 'data'}},
                    {   '$project': {   'array.field': 0,
                                        'array.field.headers': 0,
                                        'array.field': 0}}],
    'spark.mongodb.input.batchSize': '1000',
    'spark.mongodb.input.localThreshold': '15',
    'spark.mongodb.input.readPreference.name': 'secondary',
    'spark.mongodb.input.registerSQLHelperFunctions': False,
    'spark.mongodb.input.sampleSize': '1000',

This is what is in the Spark settings, but when I inspect the query once it has been logged as a slow query, the command also includes the extra $match on _id:

 "command": {
        "aggregate": "collection",
        "pipeline": [
            {
                "$match": {
                    "_id": {
                        "$lt": "23487fhisjdkcn"
                    }
                }
            },
            {
                "$match": {
                    "date": 202311,
                    "day": 30,
                    "hour": 11,
                    "array.field": "field"
                }
            },
            {
                "$project": {
                    "array.field": 0,
                    "array.field.headers": 0,
                    "array.field": 0
                }
            }
        ],
        "cursor": {
            "batchSize": 1000
        },