Single or Compound index? Number of scanned documents

Hi, assuming I have a collection called files, in which I store all files of all users of my application; each file is identified by its _id property. I have the following document design (dummy example):

{
_id: ObjectId
userId: string
valid: boolean
metadata: Object
}

In my application, every time a user acts on a file identified by its _id property, I want:
(1) Check if the file belongs to that user
(2) Check if the file is valid or not (valid set to false means the file has been deleted, so the action requested cannot be performed)

If both checks are ok, I can perform the action requested by the user (it could be both read or write operation, such as retrieves or update metadata property as specified in the dummy example).
I want always perform both read and write operation with a single query!!.
In order to do that, each filter operation (both for read and write) is:

const userId: string = "..."
const fileId: ObjectId = "..."

const query = {
     _id: fileId,
     userId: userId,
     valid: true
}

// Here I can have "findOne" or "updateOne" operation
// in which I pass the aforementioned query to filter data

NOW THE QUESTION IS:
My goal is to scan as less documents as possible in the files collection when I perform the aforementioned filter operation. Which of these two indices allow me to achieve the required goal?

(1) db.files.createIndex({_id: 1})
(2) db.files.createIndex({_id : 1, userId: 1, valid: 1})

Any other solution to accomplish the required goal is welcome, thanks in advance!

If any of those fields is unique, and it’s always in the query, I would only index that field (as unique). For example, _id, or path.to._id

When the database is queried by non unique fields, then the more specific the index is, probably the faster the query is done.

If it is not unique, and we don’t know whether all of them are present in the query, then the outer index should include the most common field first. In this case

db.files.createIndex({userId: 1, valid: 1})

queries only containing userId will also use the index.

But maybe wait for other opinions…

1 Like

Thanks for your comment!!

I will also await other opinions in order to converge towards the optimal solution :grinning:

1 Like

Hi @Matteo_Tarantino,

The best way to check which index will be chosen and how many index keys or documents will be scanned for a given query is to set up a test comparison and review the allPlansExecution explain results.

An index cannot be more selective than identifying a document by unique _id, so a query with _id will bypass query planning and use the _id index to fetch a single document (as noted by an IDHACK stage in the explain output).

The _id index is both unique and always present, so the second index does not provide any value unless you are aiming to create a covered query. A covered query isn’t going to be a useful optimisation if you’re doing a findOne or updateOne that needs to fetch the matching document.

The secondary index suggested by @santimir could be helpful if your application has a common query to look up valid files by userId where _id isn’t known yet.

Regards,
Stennie

2 Likes

I want to add my 2 cents.

Since the use case you presented implies queries valid:true I would consider having the partial index userId:1 with value:true as the partialFilterExpression.

The index size would be smaller because only valid files will be included. This means better chance to have the working set in memory. I also suspect less comparisons since there is less entries in the index (specially if you keep valid:false files for a long retention period) and the server does not care anymore about valid:true or valid:false once the index is chosen.

The server will have to do more work when the valid state changes because the index will have to be updated.

2 Likes

Thanks to all guys, your comments have been extremely explanatory and helpful!

1 Like

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.