Want to read a Billion records from a collection

Requirement:
We have a billion documents in a collection.
We want to read them efficiently every day.
We also have to perform an update query on each doc.

To reduce oplog size we are going with a $set approach rather than keeping the values in an array (see the sketch after the sample document below).

    {
      "_id": "5c55221cfa264abc900b31",
      "metadata": {},
      "spec": {
        "lastSquashTime": 123,
        "1634121096": {
          "values": {
            "1634121096": 10000001,
            "1634121097": 10000002,
            "1634121216": 10000003,
            "1634121276": 10000004
          },
          "daily": 0,
          "period": 10
        },
        "1634121900": {
          "values": {
            "1634121096": 10000001,
            "1634121156": 10000002,
            "1634121216": 10000003,
            "1634121276": 10000004
          },
          "period": 10
        },
        "1634122900": {
          "values": {
            "1634122900": 10000001,
            "1634121156": 10000002
          },
          "daily": 10000001,
          "period": 60
        },
        "1634132900": {
          "values": {
            "1634132900": 10000001
          },
          "daily": 10000001,
          "period": 1440
        },
        "latest": {
          "timestamp": "2021-10-13T16:30:36.375Z",
          "value": 10000360
        }
      }
    }
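
For reference, this is roughly how a new reading is written: only one nested field under the matching period bucket is set, so the oplog entry carries just that key instead of a growing array. A minimal sketch in Python/pymongo; the connection string, database and collection names are placeholders, and the field paths follow the sample document above.

    from pymongo import MongoClient

    # Placeholder connection and collection names; adjust to the real deployment.
    client = MongoClient("mongodb://localhost:27017")
    coll = client["mydb"]["metrics"]

    def record_value(doc_id, bucket, ts, value):
        """Set a single nested key under spec.<bucket>.values.<ts>.

        Setting one field keeps the oplog entry small, unlike pushing to
        (and replicating) an ever-growing array.
        """
        coll.update_one(
            {"_id": doc_id},
            {"$set": {f"spec.{bucket}.values.{ts}": value}},
        )

    # Example matching the sample document above (hypothetical values).
    record_value("5c55221cfa264abc900b31", "1634121096", "1634121300", 10000005)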

Hi @Shantanu_Bansal,

The number of documents is less significant than the total data size read and written, and whether your host's RAM and I/O capabilities can sustain that volume of data and compute.

What is the total size read and written, and what is the average document size?

What is the current bottleneck in the process? Do you see that your system cannot perform this task?

If every document in this large collection needs to be updated, you should consider bulk updates or rewriting the collection with $out.
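
For example, a batched pass over the collection could look roughly like this. A sketch in Python/pymongo under assumptions: a placeholder metrics collection and a trivial $set as the per-document update ($set as an aggregation stage requires MongoDB 4.2+); the $out variant at the end rewrites the transformed documents into a new collection instead.

    from pymongo import MongoClient, UpdateOne

    client = MongoClient("mongodb://localhost:27017")
    coll = client["mydb"]["metrics"]

    BATCH = 1000
    ops = []
    # Stream only _id to keep the cursor light, and batch the updates.
    for doc in coll.find({}, {"_id": 1}).batch_size(BATCH):
        ops.append(UpdateOne(
            {"_id": doc["_id"]},
            {"$set": {"spec.lastSquashTime": 0}},  # placeholder per-document update
        ))
        if len(ops) == BATCH:
            coll.bulk_write(ops, ordered=False)  # unordered lets the server parallelize
            ops = []
    if ops:
        coll.bulk_write(ops, ordered=False)

    # Alternative: transform every document in one aggregation and write the
    # result to a new collection with $out, then swap collections afterwards.
    coll.aggregate([
        {"$set": {"spec.lastSquashTime": 0}},  # placeholder transformation
        {"$out": "metrics_rewritten"},
    ])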

Thanks
Pavel
