Searching over a huge geojson collection

Matan_Shaked · November 3, 2022, 8:30am

Hello

My team has a collection of ~45 million documents, each containing a timestamp and a geojson field.
The main use case is querying across both the timestamp and the geojson.

The team had a few solutions:

Using a compound index for the two fields - performance wasn’t on par with product expectations - on some cases search time was over 10 seconds.
Using the bucket pattern (bucket per time range) for reducing search time, but because the query is across two fields we couldn’t properly use the geojson index.
Bucketing by collection, i.e. a collection for every time range. This enables the use of the geojson index while still using buckets. I feel like that’s not a standard solution and requires more implementation at the application layer. This solution however, has the best performance - sub 1 second search time

I would like to hear from you your opinions about the solutions and whether there are some modeling patterns that I am missing

Pavel_Duchovny · November 3, 2022, 9:22am

Hi @Matan_Shaked ,

Can you share some document samples, indexes and queries so we can get a better feel of what you are describing.

If you found a performant solution while still with manageable coding overhead than that is perfectly fine and shows how mongoDB allows you to do whats best for you.

Thanks
Pavel

Matan_Shaked · November 3, 2022, 10:19am

Here is a minimal representation of our documents:

{
    "_id" : ObjectId("63639134eff35d0dcb9b0aae"),
    "geoJson" : {
        "type" : "Feature",
        "properties" : {
            "timestamp" : 1667469522345.0,
            "sensorId" : "12345"
        },
        "geometry" : {
            "type" : "Point",
            "coordinates" : [ 
                9.1594001938029, 
                45.4863023702389
            ]
        }
    }
}

This is one of the indexes that we tried:

{
    "geoJson.properties.sensorId" : 1.0,
    "geoJson.properties.timestamp" : 1.0,
    "geoJson.geometry" : "2dsphere"
}

And here is an example query:

db.getCollection('telemetries').aggregate([
    {
        $match: {
            $and: [
                {
                    "geoJson.geometry": {
                        $geoWithin: {
                            $center: [
                                [
                                    9.159400193802895,
                                    45.48630237023892
                                ],
                                0.3
                            ]
                        }
                    }
                }, {
                    $and: [
                        { "geoJson.properties.timestamp": { $gt: 1647469522345 } },
                        { "geoJson.properties.timestamp": { $lt: 1687469522345 } }
                    ]
                }
            ]
        }
    }, {
        $group: {
            _id: "$geoJson.properties.sensorId"
        }
    }
]);

Pavel_Duchovny · November 3, 2022, 10:35am

Hi @Matan_Shaked ,

It looks as an index order of fields might work better if you index:

{
       "geoJson.geometry" : "2dsphere",
    "geoJson.properties.sensorId" : 1.0,
    "geoJson.properties.timestamp" : 1.0
}

Consider that since it follows better the ESR rule:

Thanks
Pavel