Scale MongoDB to support collection with 150 million documents per day

We could use some advice on how to configure MongoDB sharding for a collection that needs to store around 150 million documents per day. Our near-term goal is to retain one year's worth of data (~54 billion documents), and long term we want to store five years' worth. As a test bed, we have deployed a 3-node MongoDB (v4.) cluster on AWS using Kubernetes.

Use Case

Our primary use case involves two collections: one that stores tracks, and another that stores position updates for those tracks.

The tracks collection will contain around 300K to 1M documents, and each document is around 1.2 KB. Track documents roughly look like this:

  _id: "Int64"
  name: "string"
  prop1: "string"
  prop2: "string"
  propN: "string"
  lastUpdated: "DateTime"
  latestPositionTime: "DateTime"

The positions collection will be loaded with around 150 million position updates for tracks EACH DAY. This is the collection we are trying to configure for sharding. Position documents are roughly 1 KB each and look something like this:

  _id: number
  trackId: number
  originalTrackId: number
  timeOfPosition: DateTime
  # We tried to create a range-based shard key based on time
  timestampRangeSharedKey = minuteOfPosition + (hourOfPosition * 60) + ((dayOfYearOfPosition - 1) * 24 * 60)
  sensor: string
  coordinates: [ "longitude: number", "latitude: number" ]
  speed: number
  heading: number
  positScore: number
  type: number
  prop1: "string"
  prop2: "string"
  prop15: "string"
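For reference, the time-based shard key formula above can be expressed as a small helper. This is only a sketch of the arithmetic already shown in the document sketch; the function name is ours:

```python
from datetime import datetime, timezone

def timestamp_range_shard_key(ts: datetime) -> int:
    """Minutes elapsed since the start of the year, per the formula above:
    minuteOfPosition + (hourOfPosition * 60) + ((dayOfYear - 1) * 24 * 60)."""
    day_of_year = ts.timetuple().tm_yday
    return ts.minute + (ts.hour * 60) + ((day_of_year - 1) * 24 * 60)

# Example: Jan 2, 03:04 -> 1 full day (1440) + 3 hours (180) + 4 minutes = 1624
key = timestamp_range_shard_key(datetime(2023, 1, 2, 3, 4, tzinfo=timezone.utc))
```

Note that the key resets each January 1st, so values repeat year over year; that is a property of the formula as written.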

We have set up indexes on _id, timeOfPosition, sensor, and 2dsphere("positGeom.geometry").
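For concreteness, those indexes could be declared like this (a sketch assuming pymongo; "2dsphere" is the literal index-type string pymongo uses, and "positGeom.geometry" is the field from the 2dsphere index above):

```python
# The indexes listed above as pymongo-style specs.
# The _id index exists by default and does not need to be created.
index_specs = [
    [("timeOfPosition", 1)],
    [("sensor", 1)],
    [("positGeom.geometry", "2dsphere")],
]

# With pymongo this would be applied as, e.g.:
#   for spec in index_specs:
#       db.positions.create_index(spec)
```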


There are four main types of queries that we need to support (NOTE: time is always included in the positions collection criteria):

  1. Find tracks that match some criteria, then find all the positions for a given timespan (hopefully using the graph capabilities of MongoDB).
  2. Find all the positions within a given polygon for a given timespan. Timespans typically range from a few hours up to 4 to 7 days.
  3. Find tracks that match some criteria, then find the LATEST position for each track within a given timespan.
  4. Find all the tracks within a given polygon for a given timespan and return the latest position for each track.
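Query types 2 and 4 might take roughly these shapes, written here as plain Python dict/list structures that could be passed to find() and aggregate(). The polygon coordinates and timespan are hypothetical; the field names (timeOfPosition, positGeom.geometry, trackId) follow the document sketch above:

```python
from datetime import datetime

# Hypothetical GeoJSON polygon and a 4-day window.
polygon = {
    "type": "Polygon",
    "coordinates": [[[-1.0, 50.0], [1.0, 50.0], [1.0, 51.0], [-1.0, 51.0], [-1.0, 50.0]]],
}
timespan = {"$gte": datetime(2023, 1, 1), "$lt": datetime(2023, 1, 5)}

# Query type 2: all positions inside the polygon within the timespan.
positions_filter = {
    "positGeom.geometry": {"$geoWithin": {"$geometry": polygon}},
    "timeOfPosition": timespan,
}

# Query type 4: latest position per track, as an aggregation pipeline
# ($sort newest-first, then take the $first document per trackId).
latest_per_track = [
    {"$match": positions_filter},
    {"$sort": {"timeOfPosition": -1}},
    {"$group": {"_id": "$trackId", "latest": {"$first": "$$ROOT"}}},
]
```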

What we have tried

  • Setup the shard key using _id.
    • Result: Query performance was pretty slow due to having to hit every chunk. Also, our _id values are NOT ideal shard keys, and this is not a recommended approach.
  • Tried creating a property named timestampRangeSharedKey and calculated how many minutes of data would fit into chunks of less than 1 GB. We created a script to create the zones within mongo (roughly 30K zones).
    • Result: Inconclusive. It took about a day to create all the zones (on an empty collection), and the cluster failed during the test, so we were never able to finish it. We might try again, but it does NOT feel right. Hopefully the MongoDB community has a better approach.
  • We did NOT try time series collections due to their limitations:
    • They do not support 2dsphere indexes, which are key for our queries.
    • NOTE: Otherwise, this does look like a good match for our use case.
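The zone-creation step described above could be scripted along these lines. This is only a sketch: the minutes-per-zone value and zone names are hypothetical, and each generated range would still need to be registered against the shard key with sh.updateZoneKeyRange in mongosh:

```python
def zone_ranges(minutes_per_zone: int, minutes_per_year: int = 366 * 24 * 60):
    """Yield (lower, upper, name) boundaries covering the shard-key space,
    where the shard key counts minutes since the start of the year.
    minutes_per_zone is a tuning knob: how many minutes of data each zone
    holds, sized so a zone stays under the ~1 GB chunk target."""
    lower, i = 0, 0
    while lower < minutes_per_year:
        upper = min(lower + minutes_per_zone, minutes_per_year)
        yield lower, upper, f"zone-{i:05d}"
        lower = upper
        i += 1

# ~18 minutes per zone over a leap year -> 527040 / 18 = 29,280 zones,
# in the same ballpark as the ~30K zones mentioned above.
zones = list(zone_ranges(18))
```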


Below are the key questions:

  1. Can MongoDB scale to support our use case?
  2. Any recommendations on how to configure sharding for the positions collection?
    • Should we use a time series collection?
      • If yes, what do you recommend for timeseries.metaField (trackId?) and timeseries.granularity?
      • Is there a way to support geospatial indexes and queries?
  3. Any other recommendations for how to configure MongoDB?
    • How many shards should we configure?
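For reference, the time series options being asked about would look something like this. This is only a sketch, since the 2dsphere limitation noted above currently rules this option out; timeField, metaField, and granularity are the documented createCollection time series options, and the specific choices here are assumptions:

```python
# Hypothetical time series collection options for the positions collection.
timeseries_options = {
    "timeField": "timeOfPosition",
    "metaField": "trackId",      # groups measurements per track
    "granularity": "seconds",    # 150M docs/day is roughly 1,700+ inserts/sec overall
}
# With pymongo this would be passed as:
#   db.create_collection("positions", timeseries=timeseries_options)
```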

Thanks in advance for any help!!

Hi @Bryan_Golding and thank you for reaching out to the MongoDB Community Forum.

Firstly, to answer your questions point wise:

Scalability is one of the key features that makes MongoDB Atlas efficient.
Since you’re already deploying your workload on AWS, you might want to consider using Atlas to simplify your ops and allow you to scale quickly and easily if you find that you need more (or less) hardware to handle your workload.

The documentation on Atlas Tiers and Clusters would be a good reference point to analyse the cluster and tier need for your use case.

The following blog post on Best Practices to Shard your Cluster would be a good document to start with to know about what components make the shard cluster performance better.

Hence, for the positions collection, you should select a shard key field that distributes the data evenly between the shards.

Since your collection needs geospatial data in its documents, I would not recommend using a time series collection.

Depending on the cluster you choose, you can refer to the Cluster Limits documentation to understand the number of shards allowed on each of the clusters present.

As I understand it, you have 4 queries that you need to run all the time, and most of them involve geo queries. Is this correct? At this early design phase, it’s difficult to say what the problems would be, so I would encourage you to experiment with the schema design and the queries, become familiar with the query explain output to determine the efficiency of the queries, and use a simulated workload (inserts + queries) with semi-random data. Such data can be created using mgeneratejs, and this random dataset in combination with explain('executionStats') should be able to tell you how efficient your queries are.
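As a stdlib-only alternative to an mgeneratejs template, a semi-random position document matching the sketch above could be generated like this. The sensor names and the positGeom.geometry nesting are assumptions taken from the question's index list:

```python
import random
from datetime import datetime, timedelta, timezone

def random_position(track_count: int = 1_000_000) -> dict:
    """Generate one semi-random position document for load testing."""
    ts = datetime(2023, 1, 1, tzinfo=timezone.utc) + timedelta(
        seconds=random.randrange(365 * 24 * 3600))
    return {
        "trackId": random.randrange(track_count),
        "timeOfPosition": ts,
        "sensor": random.choice(["radar", "ais", "adsb"]),  # hypothetical names
        "positGeom": {"geometry": {
            "type": "Point",
            "coordinates": [random.uniform(-180, 180), random.uniform(-90, 90)],
        }},
        "speed": random.uniform(0, 40),
        "heading": random.uniform(0, 360),
    }

doc = random_position()
```

Inserting a few million of these and running the four query shapes with explain('executionStats') would give a rough picture of index usage before committing to a shard key.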

Lastly, to plan a deployment of this magnitude, I’d suggest seeking Professional Advice. There are numerous considerations, and an experienced consultant can provide better guidance based on a more comprehensive knowledge of your needs. Some scalability decisions (such as shard key selection) become more difficult to course correct after a large quantity of production data has been collected.

Let us know if you have any other concerns.

Best Regards