Storing ~20 Billion documents

Sason_Braha · October 13, 2020, 7:37pm

Hi,
We need to store around 20B documents in MongoDB, where each document is tied to a specific id.
Also, each document has an expiration and needs to be deleted after a configurable amount of time.

We came up with this model for storing the documents:
{
_id: uuid,
data: Array
}
The documents are small, and from our testing we can fit around 190,000 documents per _id, but we only expect around 5000.
The documents are queried by the _id and by a date on the nested documents. We use a simple $unwind aggregation to get the results.
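To make the query concrete, here is a minimal sketch of what such a pipeline might look like, written as a Python helper. The nested field name `date` and the helper name `build_pipeline` are assumptions for illustration; the actual field names are not given here.

```python
from datetime import datetime


def build_pipeline(doc_id, since):
    """Build an aggregation pipeline returning nested entries dated at or after `since`.

    Assumes every element of `data` carries a `date` field (hypothetical name).
    """
    return [
        {"$match": {"_id": doc_id}},                 # select the parent document
        {"$unwind": "$data"},                        # one output document per array element
        {"$match": {"data.date": {"$gte": since}}},  # keep only the entries in the date range
    ]


pipeline = build_pipeline("some-uuid", datetime(2020, 10, 1))
# With pymongo this would be passed as: collection.aggregate(pipeline)
```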

My only fear is the deletion time. Each nested document has a date, but from what I understand a TTL index can’t expire individual nested documents, so we needed to create a cron job for it.
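As a rough sketch of what that cron job might run, the helper below builds a filter and a `$pull` update that remove expired array elements. The nested field name `date` and the function name `build_expiry_update` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_expiry_update(max_age, now=None):
    """Return (query, update) removing nested entries older than `max_age`.

    Assumes every element of `data` carries a `date` field (hypothetical name).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    # Only touch parent documents that hold at least one expired entry.
    query = {"data.date": {"$lt": cutoff}}
    # $pull removes every array element that matches the condition.
    update = {"$pull": {"data": {"date": {"$lt": cutoff}}}}
    return query, update


query, update = build_expiry_update(timedelta(days=30))
# A scheduled job could then run: collection.update_many(query, update)
```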

Can you suggest a better solution?

Pavel_Duchovny · October 14, 2020, 4:57am

Hi @Sason_Braha,

I would say that having 5000 nested documents in one array is also very high. It may impact many operations, as MongoDB has to serialize and deserialize those arrays in many operations against the documents (CPU/memory overhead).

It also poses a high risk for the expiration mechanism, as constantly pulling and pushing elements in large arrays is inadvisable…

I would consider keeping each document as a separate one and indexing a field named uuid. If your hardware cannot cope with this design, consider sharding the environment on a hashed shard key for this uuid, provided you only have to query based on the uuid.

This way you can have a TTL index, as your createdDate will be at the top level of the document.

{
_id : ObjectID,
uuid: ... ,
createdDate: ...,
data: ...,
...
}
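A minimal sketch of how this model and its indexes could be set up from Python. The helper names `make_document` and `ensure_indexes` are hypothetical; `collection` is assumed to be a pymongo `Collection` (or anything exposing the same `create_index` method), and the compound lookup index is my assumption about the query pattern.

```python
import uuid
from datetime import datetime, timezone


def make_document(owner_uuid, data):
    """Build one flat document; createdDate sits at the top level so a TTL index can use it."""
    return {
        "uuid": owner_uuid,
        "createdDate": datetime.now(timezone.utc),  # TTL indexes need a date value
        "data": data,
    }


def ensure_indexes(collection, ttl_seconds):
    """Create the lookup index and the TTL index on a pymongo-style collection."""
    # Compound index to serve queries by uuid plus a date range.
    collection.create_index([("uuid", 1), ("createdDate", 1)])
    # TTL indexes must be single-field: documents expire ttl_seconds after createdDate.
    collection.create_index("createdDate", expireAfterSeconds=ttl_seconds)


doc = make_document(str(uuid.uuid4()), {"value": 1})
# With a live connection: ensure_indexes(db.my_collection, 30 * 24 * 3600)
```

The TTL monitor removes expired documents in the background, so deletion is not instantaneous at the expiry time.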

Best
Pavel

Sason_Braha · October 14, 2020, 7:09pm

Hi @Pavel_Duchovny, thank you for the response.
Currently sharding between clusters is not an option for us, but from what I understand from your comment, it would be better performance-wise to insert the documents into a single big collection and index the id and date?
I’ll note that the documents are never updated, only $push and $pull are used.

Pavel_Duchovny · October 15, 2020, 8:37am

Hi @Sason_Braha,

Well, $pull and $push on existing docs are updates.

If you can’t shard the environment, please consider splitting the data into partitioned collections. Having single documents with up to 5000 array elements which you constantly pull or push might mean trouble…

For example, you can partition by time range, creating daily or weekly collections with the date range in their names. This way you will have smaller collections, and your application will need to do some mapping to work out which collections to query in real time.

coll_20200101_20200107
coll_20200107_20200114
...
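The application-side mapping for time-based collections can be sketched as below. The anchor date, the 7-day window, and treating the end date in the name as exclusive are all assumptions for illustration, as are the function names.

```python
from datetime import date, timedelta

ANCHOR = date(2020, 1, 1)    # hypothetical start of the first weekly window
WINDOW = timedelta(days=7)


def weekly_collection_name(day):
    """Return the name of the weekly collection whose window contains `day`."""
    index = (day - ANCHOR).days // 7   # which 7-day window the day falls in
    start = ANCHOR + index * WINDOW
    end = start + WINDOW               # exclusive upper bound
    return f"coll_{start:%Y%m%d}_{end:%Y%m%d}"


def collections_between(first, last):
    """Return every weekly collection name overlapping the range first..last (inclusive)."""
    names = []
    day = first
    while day <= last:
        name = weekly_collection_name(day)
        if not names or names[-1] != name:
            names.append(name)
        day += timedelta(days=1)
    return names
```

A query for a date range would then be run against each collection returned by `collections_between` and the results merged in the application.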

Alternatively, you can partition on a hash of your UUIDs, where each collection is named after the 2 hash values (lower and upper limit) that bound its range:

coll_xxxxxx_yyyyyy
coll_yyyyy_zzzzz
...

However, your application will still need to keep a mapping that tells it which collection to query for a given uuid hash.

If you wish to keep it all in one collection, note that you will need to scale your hardware or shard as you grow.
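One way the hash-range mapping could look, as a sketch only. The bucket count, the use of SHA-256 truncated to 32 bits, and the function names are all assumptions; any stable hash would do.

```python
import hashlib

NUM_BUCKETS = 16        # hypothetical number of collections
HASH_SPACE = 2 ** 32    # only the first 4 bytes of the digest are used


def uuid_hash(uuid_str):
    """Stable 32-bit hash of the uuid (Python's built-in hash() varies between runs)."""
    digest = hashlib.sha256(uuid_str.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def hashed_collection_name(uuid_str):
    """Return the collection whose lower..upper hash range contains the uuid's hash."""
    width = HASH_SPACE // NUM_BUCKETS
    bucket = uuid_hash(uuid_str) // width
    lower = bucket * width
    upper = lower + width - 1   # inclusive upper limit
    return f"coll_{lower:08x}_{upper:08x}"


name = hashed_collection_name("123e4567-e89b-12d3-a456-426614174000")
```

Because the ranges in this sketch are fixed and equal-sized, the mapping is computed rather than stored; uneven or changing ranges would need a stored lookup table instead.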

Best regards,
Pavel

Sason_Braha · October 15, 2020, 7:45pm

Thank you.
We won’t use the nested document approach. We decided to follow your recommendation and create collections partitioned by week, with an expiring (TTL) index on the documents.
