Is a collection scan of a million documents feasible?

Big_Cat_Public_Safety_Act · December 22, 2022, 10:05pm

I am building an APP. Suppose that it will have

1 million active users.
everytime a user logs in, the backend with hit Atlas with a query that requires a collection scan of the 1 million document collection above.
The app must respond on average in 1 second.

This is assuming worst case scenario without any indexes.

Is this APP feasible?

santimir · December 23, 2022, 10:54am

It seems unfeasible without querying with indexes. By the way, a username is easily indexable, and you can probably run the query while the user is still typing the password.

I run this experiment in a mongo playground, against a free Atlas cluster:

Code

// MongoDB Playground
// Select the database to use.
use('mongodbVSCodePlaygroundDB');

// The drop() command destroys all data from a collection.
// Make sure you run it against the correct database and collection.
db.sales.drop();

const docs = (new Array(10**5)).fill(0)
docs.map((_, index) => {
  return {
      '_id': index,
      'username': 'user' + index,
      'item': 'abc',
    }
 })
// Insert documents into the sales collection.
db.sales.insertMany(docs);
db.sales.find({username:'user1000'}).limit(1).explain('executionStats');

This is just a hundred thousand docs, and I’d indeed expect it to scale at least a factor of ten in your case. Also mind that the query is very simple.

Result



{
//...
    "winningPlan": {
      "stage": "LIMIT",
      "limitAmount": 1,
      "inputStage": {
        "stage": "COLLSCAN",
        "filter": {
          "username": {
            "$eq": "user1000"
          }
        },
        "direction": "forward"
      }
    },
    "rejectedPlans": []
  },
  "executionStats": {
    "executionSuccess": true,
    "nReturned": 0,
    "executionTimeMillis": 988, // <---- i.e about 1 second
    "totalKeysExamined": 0,
    "totalDocsExamined": 100000,
    "executionStages": {
      "stage": "LIMIT",
      "nReturned": 0,
      "executionTimeMillisEstimate": 543,
      "isEOF": 1,
      "limitAmount": 1,
      "inputStage": {
        "stage": "COLLSCAN",
        "filter": {
          "username": {
            "$eq": "user1000"
          }
        },
        "nReturned": 0,
        "isEOF": 1,
        "docsExamined": 100000
      }
    }
  },
//...

So I think the answer is “no” in the most crappy cluster, but remember that upgrading hardware you may increase this several orders of magnitude.

I wonder whether it is possible to use _id field as your username field, I mean { _id: username }. It would make sense and you would still have an index to query.

Kobe_W · February 14, 2023, 6:33am

You will be screwed by disk use without any indexing.