Improving my indexes

These are my indexes today:

db.files_v2.createIndex({"container": 1, "_id": 1});
db.files_v2.createIndex({"container": 1, "blockId": 1});
db.files_v2.createIndex({"container": 1, "path": 1, "_id": 1});
db.files_v2.createIndex({"container": 1, "path": 1, "fileNameLower": 1}, {unique: true});

A little explanation about the fields :slight_smile::

  • This is a files collection
  • container → an identifier for a user (there is not much variety, i.e. low cardinality)
  • _id - ObjectId → unique
  • blockId → it's a UUID, used for every internal operation → must be indexed
  • path → path of a single file
  • fileNameLower → unique string under the same path for the same user
  • size → size of each file (Long)

I think there is some duplication between the first and the third index, but one of my queries is:

db.files_v2.find({"container": "…"}).sort({"_id": 1})

So for that query, I must keep the first index (I think?).

Secondly, I have a new query:

db.getCollection('files_v2').aggregate(
   [
      { $match : {
         container : "bbbf087e-6d2d-4941-b812-f035407becba:storage",
         path : { "$ne" : "" }
      }},
      { $group : {
         _id : "",
         sum : { $sum : "$size" },
         count : { $sum : 1 }
      }}
   ]
)

If this collection scales up, the size value is not indexed and the query can end up reading over 1 million documents from disk.
Should I create a new index?

db.files_v2.createIndex({"container": 1, "path": 1, "size": 1});

Or should I add it to the existing index?

db.files_v2.createIndex({"container": 1, "path": 1, "_id": 1, "size": 1});

Should I even index the size field?

Thanks!!!

Hi @guy_more,

Question 1:

No, there is no duplication here. If you have a compound index {a:1, b:1, c:1}, it means you also have access to the indexes "for free": {a:1} and {a:1, b:1}, or even {a:-1} and {a:-1, b:-1} I think. But definitely not access to {b:1} or {a:1, c:1}.
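For example, here is a quick sketch of that prefix rule applied to your own fields (the filter values are just placeholders, not from your data):

// An index on three fields...
db.files_v2.createIndex({"container": 1, "path": 1, "_id": 1});

// ...can serve these queries, because they filter on a prefix of its keys:
db.files_v2.find({"container": "some-container"});                    // uses the {container: 1} prefix
db.files_v2.find({"container": "some-container", "path": "/docs"});   // uses the {container: 1, path: 1} prefix

// ...but not this one, because "path" alone is not a prefix of the index:
db.files_v2.find({"path": "/docs"});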

If you had {"container": 1, "_id": 1, "path": 1}, it would then be redundant to keep the {"container": 1, "_id": 1} index that is at the top of your list.

The order of the fields really depends on the query you are running. With an index, you are trying to avoid the full collection scan, which is really the worst evil of all here. But then you can also potentially avoid in-memory sorts and eventually the on-disk fetch, because you made a "covered query" - meaning you don't even need the on-disk document: everything you need to answer the query is already in the index.

To avoid collection scans, put some of the fields from your find query in your index - all of them if you want to avoid useless index entry scans.
To avoid an in-memory sort, you must reach the sort part of the index before any range query like $in or $gt, for example. The rule of thumb here is EQUALITY => SORT => RANGE. If the fields are in this order in your compound index, then you should not see a SORT stage in your explain plan - but potentially some extra index entry scans. That's usually an acceptable trade-off, as in-memory sorts are usually more costly in resources.
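For example, with a hypothetical query on your collection that combines an equality, a sort and a range (this exact query is made up, just to illustrate the order):

// Equality on container, sort on _id, range on size
db.files_v2.find({"container": "some-container", "size": {$gt: 1048576}}).sort({"_id": 1});

// An index following EQUALITY => SORT => RANGE for that query:
db.files_v2.createIndex({"container": 1, "_id": 1, "size": 1});
// container (equality) first, _id (sort) second, size (range) last,
// so the explain plan should not contain an in-memory SORT stage.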
To have covered queries, you need all the fields required to resolve the query in your index… Which brings me to…

Question 2:
An index on {"container": 1, "path": 1, "size": 1} would make this a covered query. The aggregation pipeline optimises the query automatically in the background and projects the documents to remove the unnecessary fields from the pipeline.
You are not using the "_id" field in this query, so {"container": 1, "path": 1, "_id": 1, "size": 1} would also be used, but only its first part: {"container": 1, "path": 1}. The rest of the index would be useless for this query and just take more space in RAM.

To sum up:
Try to consolidate your indexes if you can to avoid redundancy. If you can't afford 3 indexes in RAM like ABC, ABD and ABE because they are too big, maybe just create one of them and the 2 other queries will be able to use at least the AB part of that index, which will at least avoid the collection scan. Maybe that would already be acceptable for your target response time. If not, then you will need more RAM to create the 3 indexes.
Also keep in mind that indexes need to be updated each time you touch that collection. Inserts, deletes, updates, etc. All these operations will be a tiny bit slower each time you add a new index.
When in doubt between 2 indexes: in 99.999% of the cases, the MongoDB query optimizer will choose the right one for you - given that you are running this on a realistic data set which represents your prod data. So if in doubt, create both if you can afford it (RAM, time, etc.), run your query with an explain, and keep the one that is used by MongoDB.
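For example, something along these lines for your first query (the filter value is a placeholder, and the verbosity level is up to you):

// Run the query with explain and see which index the planner picks
db.files_v2.find({"container": "some-container"}).sort({"_id": 1}).explain("executionStats");
// In queryPlanner.winningPlan, check which indexName appears in the IXSCAN
// and that there is no SORT stage (the index should already provide the order).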

I hope this helps :smiley: !
Cheers,
Maxime.

Thanks for your quick response!! It was really helpful.
Regarding what you said about the new index {"container": 1, "path": 1, "_id": 1, "size": 1}:

Today, without the new index, the query execution time on a large collection is 6 seconds.

This is the queryPlanner:

/* 1 */
{
    "serverInfo" : {
        "host" : "TLVR-00018-D",
        "port" : 27017,
        "version" : "3.6.18",
        "gitVersion" : "2005f25eed7ed88fa698d9b800fe536bb0410ba4"
    },
    "stages" : [
        {
            "$cursor" : {
                "query" : {
                    "container" : "bbbf087e-6d2d-4941-b812-f035407becbe:storage",
                    "path" : {
                        "$ne" : "."
                    }
                },
                "fields" : {
                    "size" : 1,
                    "_id" : 0
                },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "storagedb.files_v2",
                    "indexFilterSet" : false,
                    "parsedQuery" : {
                        "$and" : [
                            {
                                "container" : {
                                    "$eq" : "bbbf087e-6d2d-4941-b812-f035407becbe:storage"
                                }
                            },
                            {
                                "$nor" : [
                                    {
                                        "path" : {
                                            "$eq" : "."
                                        }
                                    }
                                ]
                            }
                        ]
                    },
                    "winningPlan" : {
                        "stage" : "FETCH",
                        "inputStage" : {
                            "stage" : "IXSCAN",
                            "keyPattern" : {
                                "container" : 1.0,
                                "path" : 1.0,
                                "_id" : 1.0
                            },
                            "indexName" : "container_1_path_1__id_1",
                            "isMultiKey" : false,
                            "multiKeyPaths" : {
                                "container" : [],
                                "path" : [],
                                "_id" : []
                            },
                            "isUnique" : false,
                            "isSparse" : false,
                            "isPartial" : false,
                            "indexVersion" : 2,
                            "direction" : "forward",
                            "indexBounds" : {
                                "container" : [
                                    "[\"bbbf087e-6d2d-4941-b812-f035407becbe:storage\", \"bbbf087e-6d2d-4941-b812-f035407becbe:storage\"]"
                                ],
                                "path" : [
                                    "[MinKey, \".\")",
                                    "(\".\", MaxKey]"
                                ],
                                "_id" : [
                                    "[MinKey, MaxKey]"
                                ]
                            }
                        }
                    },
                    "rejectedPlans" : [
                        {
                            "stage" : "FETCH",
                            "filter" : {
                                "$nor" : [
                                    {
                                        "path" : {
                                            "$eq" : "."
                                        }
                                    }
                                ]
                            },
                            "inputStage" : {
                                "stage" : "IXSCAN",
                                "keyPattern" : {
                                    "container" : 1.0,
                                    "_id" : 1.0
                                },
                                "indexName" : "container_1__id_1",
                                "isMultiKey" : false,
                                "multiKeyPaths" : {
                                    "container" : [],
                                    "_id" : []
                                },
                                "isUnique" : false,
                                "isSparse" : false,
                                "isPartial" : false,
                                "indexVersion" : 2,
                                "direction" : "forward",
                                "indexBounds" : {
                                    "container" : [
                                        "[\"bbbf087e-6d2d-4941-b812-f035407becbe:storage\", \"bbbf087e-6d2d-4941-b812-f035407becbe:storage\"]"
                                    ],
                                    "_id" : [
                                        "[MinKey, MaxKey]"
                                    ]
                                }
                            }
                        },
                        {
                            "stage" : "FETCH",
                            "inputStage" : {
                                "stage" : "IXSCAN",
                                "keyPattern" : {
                                    "container" : 1,
                                    "path" : 1,
                                    "fileNameLower" : 1
                                },
                                "indexName" : "container_1_path_1_fileNameLower_1",
                                "isMultiKey" : false,
                                "multiKeyPaths" : {
                                    "container" : [],
                                    "path" : [],
                                    "fileNameLower" : []
                                },
                                "isUnique" : true,
                                "isSparse" : false,
                                "isPartial" : false,
                                "indexVersion" : 2,
                                "direction" : "forward",
                                "indexBounds" : {
                                    "container" : [
                                        "[\"bbbf087e-6d2d-4941-b812-f035407becbe:storage\", \"bbbf087e-6d2d-4941-b812-f035407becbe:storage\"]"
                                    ],
                                    "path" : [
                                        "[MinKey, \".\")",
                                        "(\".\", MaxKey]"
                                    ],
                                    "fileNameLower" : [
                                        "[MinKey, MaxKey]"
                                    ]
                                }
                            }
                        }
                    ]
                }
            }
        },
        {
            "$group" : {
                "_id" : {
                    "$const" : ""
                },
                "sum" : {
                    "$sum" : "$size"
                },
                "count" : {
                    "$sum" : {
                        "$const" : 1.0
                    }
                }
            }
        }
    ],
    "ok" : 1.0
}

However, with the new index the execution time is 2 seconds, and this is the queryPlanner:

/* 1 */
{
    "serverInfo" : {
        "host" : "TLVR-00018-D",
        "port" : 27017,
        "version" : "3.6.18",
        "gitVersion" : "2005f25eed7ed88fa698d9b800fe536bb0410ba4"
    },
    "stages" : [
        {
            "$cursor" : {
                "query" : {
                    "container" : "bbbf087e-6d2d-4941-b812-f035407becbe:storage",
                    "path" : {
                        "$ne" : "."
                    }
                },
                "fields" : {
                    "size" : 1,
                    "_id" : 0
                },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "storagedb.files_v2",
                    "indexFilterSet" : false,
                    "parsedQuery" : {
                        "$and" : [
                            {
                                "container" : {
                                    "$eq" : "bbbf087e-6d2d-4941-b812-f035407becbe:storage"
                                }
                            },
                            {
                                "$nor" : [
                                    {
                                        "path" : {
                                            "$eq" : "."
                                        }
                                    }
                                ]
                            }
                        ]
                    },
                    "winningPlan" : {
                        "stage" : "PROJECTION",
                        "transformBy" : {
                            "size" : 1,
                            "_id" : 0
                        },
                        "inputStage" : {
                            "stage" : "IXSCAN",
                            "keyPattern" : {
                                "container" : 1.0,
                                "path" : 1.0,
                                "_id" : 1.0,
                                "size" : 1.0
                            },
                            "indexName" : "container_1_path_1__id_1_size_1",
                            "isMultiKey" : false,
                            "multiKeyPaths" : {
                                "container" : [],
                                "path" : [],
                                "_id" : [],
                                "size" : []
                            },
                            "isUnique" : false,
                            "isSparse" : false,
                            "isPartial" : false,
                            "indexVersion" : 2,
                            "direction" : "forward",
                            "indexBounds" : {
                                "container" : [
                                    "[\"bbbf087e-6d2d-4941-b812-f035407becbe:storage\", \"bbbf087e-6d2d-4941-b812-f035407becbe:storage\"]"
                                ],
                                "path" : [
                                    "[MinKey, \".\")",
                                    "(\".\", MaxKey]"
                                ],
                                "_id" : [
                                    "[MinKey, MaxKey]"
                                ],
                                "size" : [
                                    "[MinKey, MaxKey]"
                                ]
                            }
                        }
                    },
                    "rejectedPlans" : []
                }
            }
        },
        {
            "$group" : {
                "_id" : {
                    "$const" : ""
                },
                "sum" : {
                    "$sum" : "$size"
                },
                "count" : {
                    "$sum" : {
                        "$const" : 1.0
                    }
                }
            }
        }
    ],
    "ok" : 1.0
}

Are you sure it doesn't use the "_id" field inside the index? From a runtime perspective, it looks like it is using the index. I'm using Mongo 3.6.

Your pipeline is filtering out the _id field. You are not outputting the _id nor using its value in this pipeline, so yes, I think it's useless to have it in this index if it's not used by another query.
Please create this index without the _id and run another explain to see which one is selected. If I'm reading this correctly, this index is still covering this pipeline, as I don't see a FETCH stage.
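Something like this, for example (an untested sketch reusing the pipeline from your first post):

// Index without the _id field (the one you proposed earlier)
db.files_v2.createIndex({"container": 1, "path": 1, "size": 1});

// Re-run the aggregation with the explain option and compare the winning plan
db.files_v2.aggregate(
   [
      { $match: { "container": "bbbf087e-6d2d-4941-b812-f035407becba:storage", "path": { $ne: "" } } },
      { $group: { "_id": "", "sum": { $sum: "$size" }, "count": { $sum: 1 } } }
   ],
   { explain: true }
);
// In the output, check stages[0].$cursor.queryPlanner.winningPlan: ideally it shows
// an IXSCAN on container_1_path_1_size_1 with no FETCH stage (still a covered query).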

Also, I guess your "container" value will change from one pipeline to the next, but I assume you are only interested in documents where "path" is not an empty string. If this index is only meant to be used by this query, you could consider using a partial index. Sadly you can't use $ne in a partialFilterExpression. However, you could run a script like db.coll.updateMany({path: ""}, {$unset: {path: 1}}) to remove the field "path" when it's an empty string, and instead use {"path": {$exists: true}} in both your aggregation pipeline and your partialFilterExpression.
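As a sketch (assuming that migration is acceptable for your other queries and for the unique index on {container, path, fileNameLower}):

// 1. One-off migration: remove "path" when it is an empty string
db.files_v2.updateMany({ "path": "" }, { $unset: { "path": 1 } });

// 2. Partial index that only contains documents where "path" exists
db.files_v2.createIndex(
   { "container": 1, "path": 1, "size": 1 },
   { partialFilterExpression: { "path": { $exists: true } } }
);

// 3. The $match must use the same condition so the planner can pick the partial index
db.files_v2.aggregate([
   { $match: { "container": "bbbf087e-6d2d-4941-b812-f035407becba:storage", "path": { $exists: true } } },
   { $group: { "_id": "", "sum": { $sum: "$size" }, "count": { $sum: 1 } } }
]);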

If you have many documents where "path" is an empty string, this would make the index smaller and save you some RAM.


Cheers,
Maxime.