Scale MongoDB to support a collection with 150 million documents per day
We could use some advice on how to configure MongoDB sharding for a collection that needs to store around 150 million documents per day. Our near-term goal is to retain one year's worth of data (~54 billion documents); long term, we want to store five years' worth. We have deployed a 3-node MongoDB (v4.) cluster on Kubernetes as a test bed on AWS.
Use Case
Our primary use case involves two collections: one that stores tracks and another that stores position updates for those tracks.
The tracks collection will contain around 300K to 1M documents, each around 1.2 KB. Track documents look roughly like this:
{
  _id: "Int64",
  name: "string",
  details: {
    prop1: "string",
    prop2: "string",
    propN: "string"
  },
  lastUpdated: "DateTime",
  latestPositionTime: "DateTime"
}
The positions collection will be loaded with around 150 million position updates for tracks EACH DAY. This is the collection we are trying to configure for sharding. Position documents are roughly 1 KB each and look something like this:
{
  _id: number,
  trackId: number,
  originalTrackId: number,
  timeOfPosition: DateTime,
  # We tried to create a range-based shard key based on time:
  timestampRangeSharedKey: minuteOfPosition + (hourOfPosition * 60) + ((dayOfYearOfPosition - 1) * 24 * 60),
  sensor: string,
  positGeom: {
    geometry: {
      coordinates: [ longitude: number, latitude: number ]
    },
    properties: {
      speed: number,
      heading: number,
      positScore: number
    },
    type: number
  },
  details: {
    prop1: string,
    prop2: string,
    prop15: string
  }
}
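To make the timestampRangeSharedKey formula above concrete, here is a small sketch of the computation (the function name is ours, and we assume UTC timestamps):

```javascript
// Minute-of-year shard key from a position timestamp, per the formula above:
// minuteOfPosition + (hourOfPosition * 60) + ((dayOfYearOfPosition - 1) * 24 * 60)
function timestampRangeShardKey(ts) {
  const startOfYear = Date.UTC(ts.getUTCFullYear(), 0, 1);
  const dayOfYear = Math.floor((ts.getTime() - startOfYear) / 86400000) + 1;
  return ts.getUTCMinutes() + ts.getUTCHours() * 60 + (dayOfYear - 1) * 24 * 60;
}

// 2024-01-02T03:04:00Z: minute 4 + (hour 3 * 60) + ((day 2 - 1) * 1440) = 1624
console.log(timestampRangeShardKey(new Date(Date.UTC(2024, 0, 2, 3, 4, 0))));
```

Note that the value resets every year since it is based on day-of-year.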
We have set up indexes on _id, timeOfPosition, sensor, and a 2dsphere index on positGeom.geometry.
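For concreteness, the index setup looks roughly like this in mongosh (the collection name "positions" is assumed):

```javascript
// Indexes described above; _id gets a unique index automatically.
db.positions.createIndex({ timeOfPosition: 1 });
db.positions.createIndex({ sensor: 1 });
db.positions.createIndex({ "positGeom.geometry": "2dsphere" });
```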
Queries
There are four main types of queries that we need to support (NOTE: time is always included in the positions collection criteria):
- Find tracks that match some criteria, then find all the positions for those tracks within a given timespan (hopefully using MongoDB's graph capabilities).
- Find all the positions within a given polygon for a given timespan. Timespans typically range from a few hours up to 4 to 7 days.
- Find tracks that match some criteria, then find the LATEST position for each track within a given timespan.
- Find all the tracks within a given polygon for a given timespan and return the latest position for each track.
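For example, the third query type (latest position per matching track within a timespan) might be expressed as an aggregation along these lines; trackIds, start, and end are placeholders, and this is a sketch rather than our tested query:

```javascript
// Latest position per track within [start, end], for a set of matched tracks.
db.positions.aggregate([
  { $match: {
      trackId: { $in: trackIds },
      timeOfPosition: { $gte: start, $lte: end }
  } },
  { $sort: { trackId: 1, timeOfPosition: -1 } },
  { $group: { _id: "$trackId", latest: { $first: "$$ROOT" } } }
]);
```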
What we have tried
- Set the shard key to _id.
  - Result: Query performance was pretty slow because every query had to hit every chunk. Also, our _id values are NOT ideal shard keys, and this is not a recommended approach anyway.
- Tried creating a property named timestampRangeSharedKey, calculated how many minutes of data would fit into chunks under 1 GB, and wrote a script to create the zones within mongo (roughly 30K zones).
  - Result: Inconclusive. It took about a day to create all the zones (on an empty collection), and the cluster failed during testing, so we were never able to finish. We might try again, but it does NOT feel right. Hopefully the MongoDB community has a better approach.
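For reference, the zone-based approach we attempted looked roughly like this in mongosh; the database name, zone name, and one-day range width below are illustrative (our actual script created ~30K finer-grained zones):

```javascript
// Ranged shard key on the derived minute-of-year field.
sh.shardCollection("tracksdb.positions", { timestampRangeSharedKey: 1 });

// One zone covering day 1 of the year (minutes [0, 1440)).
sh.addShardToZone("shard0", "day-001");
sh.updateZoneKeyRange(
  "tracksdb.positions",
  { timestampRangeSharedKey: 0 },     // inclusive lower bound
  { timestampRangeSharedKey: 1440 },  // exclusive upper bound
  "day-001"
);
```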
- We did NOT try time series collections due to their limitations:
  - They do not support 2dsphere indexes, which are key for our queries.
  - NOTE: Otherwise this does look like a good match for our use case.
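For context, the time series option we considered would look something like this (time series collections require MongoDB 5.0+; the metaField and granularity values shown are exactly what we are unsure about):

```javascript
// Time series collection keyed on position time, bucketed per track.
db.createCollection("positions_ts", {
  timeseries: {
    timeField: "timeOfPosition",
    metaField: "trackId",
    granularity: "seconds"
  }
});
```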
Questions
Below are the key questions:
- Can MongoDB scale to support our use case?
- Any recommendations on how to configure sharding for the positions collection?
- Should we use a time series collection?
  - If yes, what do you recommend for timeseries.metaField (trackId?) and timeseries.granularity?
  - If yes, is there a way to support geospatial indexes and queries?
- Any other recommendations for how to configure MongoDB?
- How many shards should we configure?
Thanks in advance for any help!!