MongoDB with Spark connector and Java read issue

Hi

I have a Spark connector that reads from my MongoDB database. The client metadata reports the following version information:

"clientMetadata": {
        "driver": {
            "name": "mongo-java-driver|legacy|mongo-spark",
            "version": "3.12.3|2.4.1"
        },
        "os": {
            "type": "Linux",
            "name": "Linux",
            "architecture": "amd64",
            "version": "3.10.0-1160.99.1.el7.x86_64"
        },
        "platform": "Java/Red Hat, Inc./1.8.0_382-b05|Scala/2.11.12:Spark/2.4.8.7.1.9.0-387"
    },

For a reason I have not been able to identify, every read query is sent with an extra filter:

{
    "$match": {
        "_id": {
            "$lt": "747877945yrhduwedu"
        }
    }
}

I do not specify this $match anywhere in my aggregation pipeline. It causes the query to scan the entire collection, which makes it slow. If I run the same aggregation pipeline without this $match, the query is very fast.
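To make the symptom concrete, here is a minimal Python sketch of the shape of the pipeline the server appears to receive. It is only an illustration, not the connector's code: `prepend_id_bound` is a hypothetical helper, and the pipeline is a simplified version of mine. My guess, which I have not confirmed, is that the extra stage comes from how the connector splits the collection into partitions on _id.

```python
import json


def prepend_id_bound(user_pipeline, upper_bound):
    """Return a new pipeline with an `_id` upper-bound $match placed first.

    Hypothetical helper: it only reproduces the shape seen on the server.
    """
    bound_stage = {"$match": {"_id": {"$lt": upper_bound}}}
    # Build a new list so the caller's pipeline is left unchanged.
    return [bound_stage] + list(user_pipeline)


# Simplified version of the pipeline I pass in the Spark settings.
user_pipeline = [
    {"$match": {"date": 202311, "day": 30, "hour": 11, "array.field": "data"}},
    {"$project": {"array.field.headers": 0}},
]

# The bound value is the one from the filter shown above.
observed = prepend_id_bound(user_pipeline, "747877945yrhduwedu")
print(json.dumps(observed, indent=4))
```

The first stage of the printed pipeline is the _id filter I never wrote, and the remaining stages are my own pipeline, unchanged.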

Any assistance would be greatly appreciated.
Kindest Regards
Gareth Furnell

Can you share the complete pipeline that you are using to read data through Spark?

Spark settings:
 
{   'pipeline': [   {   '$match': {   'date': 202311,
                                      'day': 30,
                                      'hour': 11,
                                      'array.field': 'data'}},
                    {   '$project': {   'array.field': 0,
                                        'array.field.headers': 0,
                                        'array.field': 0}}],
    'spark.mongodb.input.batchSize': '1000',
    'spark.mongodb.input.localThreshold': '15',
    'spark.mongodb.input.readPreference.name': 'secondary',
    'spark.mongodb.input.registerSQLHelperFunctions': False,
    'spark.mongodb.input.sampleSize': '1000',

That is what is in the Spark settings, but when the query runs and I inspect it once it shows up as a slow query, the command includes the additional $match on _id:

 "command": {
        "aggregate": "collection",
        "pipeline": [
            {
                "$match": {
                    "_id": {
                        "$lt": "23487fhisjdkcn"
                    }
                }
            },
            {
                "$match": {
                    "date": 202311,
                    "day": 30,
                    "hour": 11,
                    "array.field": "field"
                }
            },
            {
                "$project": {
                    "array.field": 0,
                    "array.field.headers": 0,
                    "array.field": 0
                }
            }
        ],
        "cursor": {
            "batchSize": 1000
        },