$lookup (aggregation) and $project based on performance

Petar_Dragnev · October 21, 2022, 8:55am

Hey all i was doing some performance tests on large aggregation mongo scripts and I noticed when I move the project line from

         "$lookup":{
        "from": "metrics",
        "as":   "metrics",
        "let": {"userId": "$_id"},
        "pipeline": {
        {"$match":{"$expr": {"$in": {"$userId","$ownerIds"}}}},
  ----> {"$project": {"fieldName": 1, "ownerIds": 1, "auth0Cache": 1}},
},
},

to

      "$lookup":{
        "from": "metrics",
        "as":   "metrics",
        "let": {"userId": "$_id"},
        "pipeline": {
---->   {"$project": {"fieldName": 1, "ownerIds": 1, "auth0Cache": 1}},
        {"$match":{"$expr": {"$in": {"$userId","$ownerIds"}}}},
},
},

It made performance jump. It ran 2-3times faster. It kinda makes sense that when you limit the data since this collection is a big one it should run faster, but I didn’t find any documentation about it. Does anyone know why this is an important change and can explain to me why exactly is happening and provide some more information about it? Thanks in advance.

steevej · October 21, 2022, 1:18pm

Yes , limiting the quantity of data usually speed up things.

But, in your case, the $project is the same, so the original amount of data (read from disk/cache) and the output amount of data is the same. It should not have a big influence.

What could make a big difference are:

How did you timed your operations? Are your numbers coming from the explain plan or wall clock including the total round trip of sending the query and processing the result? So it might be network delays if client and server are running on different machines. Resource contention if both client and server are running on the same machine.
Did you run your tests multiple times? If not, it could be that the documents and indexes were not in memory for the first and slow test and were already in cache for the second and fast test.

In principal, $match-ing unmodified (before $project) documents directly from a collection is faster because indexes can be used. You do have an index on ownerIds, don’t you? So, in principal, your 2nd faster test should be slower since you $project before you $match. I italicized should because I think the query optimizer detects that your $match uses fields from the original documents and performs the same.

Since your $match is using $in for $userId and $userId is a single value you could use the following syntax:

Petar_Dragnev · October 24, 2022, 8:39am

Hi again,
On point 1 : I timed it both with Golang performance tests and also how much time took the mongodb to respond on the script. We are seeing this results both on local setups where client and server and running on the same machine and also on deployed ones.
On point 2: Ive run it a lot of times and timings are the same.
Here is the explain:

{
	"stages" : [
		{
			"$cursor" : {
				"query" : {
					"$and" : [
						{
							"$and" : [
								{
									"accountId" : {
										"$eq" : ObjectId("634d6e4e8b661fe80d614513")
									}
								},
								{
									"$nor" : [
										{
											"isDeleted" : {
												"$eq" : true
											}
										}
									]
								}
							]
						},
						{
							"$nor" : [
								{
									"isSystem" : {
										"$eq" : true
									}
								}
							]
						}
					]
				},
				"fields" : {
					"accountId" : 1,
					"auth0Cache.usermetadata.firstName" : 1,
					"auth0Cache.usermetadata.lastName" : 1,
					"dynamicMetricCount" : 1,
					"manualMetricCount" : 1,
					"metrics" : 1,
					"_id" : 1
				},
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "test_employees_v2.users",
					"indexFilterSet" : false,
					"parsedQuery" : {
						"$and" : [
							{
								"accountId" : {
									"$eq" : ObjectId("634d6e4e8b661fe80d614513")
								}
							},
							{
								"$nor" : [
									{
										"isDeleted" : {
											"$eq" : true
										}
									}
								]
							},
							{
								"$nor" : [
									{
										"isSystem" : {
											"$eq" : true
										}
									}
								]
							}
						]
					},
					"winningPlan" : {
						"stage" : "FETCH",
						"filter" : {
							"$and" : [
								{
									"$nor" : [
										{
											"isDeleted" : {
												"$eq" : true
											}
										}
									]
								},
								{
									"$nor" : [
										{
											"isSystem" : {
												"$eq" : true
											}
										}
									]
								}
							]
						},
						"inputStage" : {
							"stage" : "IXSCAN",
							"keyPattern" : {
								"accountId" : 1,
								"subscriptionType" : 1
							},
							"indexName" : "accountId_1_subscriptionType_1",
							"isMultiKey" : false,
							"multiKeyPaths" : {
								"accountId" : [ ],
								"subscriptionType" : [ ]
							},
							"isUnique" : false,
							"isSparse" : false,
							"isPartial" : false,
							"indexVersion" : 2,
							"direction" : "forward",
							"indexBounds" : {
								"accountId" : [ "[ObjectId('634d6e4e8b661fe80d614513'), ObjectId('634d6e4e8b661fe80d614513')]" ],
								"subscriptionType" : [ "[MinKey, MaxKey]" ]
							}
						}
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$lookup" : {
				"from" : "metrics",
				"as" : "metrics",
				"let" : {
					"userId" : "$_id"
				},
				"pipeline" : [
					{
						"$project" : {
							"fieldName" : 1,
							"accountId" : 1,
							"isDeleted" : 1,
							"ownerIds" : 1,
							"auth0Cache" : 1,
							"goalId" : 1
						}
					},
					{
						"$match" : {
							"$and" : [
								{
									"accountId" : ObjectId("634d6e4e8b661fe80d614513")
								},
								{
									"isSystem" : {
										"$ne" : true
									}
								}
							]
						}
					},
					{
						"$match" : {
							"$expr" : {
								"$in" : [ "$$userId", "$ownerIds" ]
							}
						}
					}
				]
			}
		},
		{
			"$addFields" : {
				"manualMetricCount" : {
					"$size" : [
						{
							"$filter" : {
								"input" : "$metrics",
								"as" : "metrics",
								"cond" : {
									"$eq" : [
										"$$metrics.fieldName",
										{
											"$const" : ""
										}
									]
								}
							}
						}
					]
				},
				"dynamicMetricCount" : {
					"$size" : [
						{
							"$filter" : {
								"input" : "$metrics",
								"as" : "metrics",
								"cond" : {
									"$ne" : [
										"$$metrics.fieldName",
										{
											"$const" : ""
										}
									]
								}
							}
						}
					]
				}
			}
		},
		{
			"$match" : {
				"dynamicMetricCount" : {
					"$gt" : 1
				}
			}
		},
		{
			"$facet" : {
				"items" : [
					{
						"$sort" : {
							"sortKey" : {
								"auth0Cache.usermetadata.firstName" : 1,
								"_id" : 1
							},
							"limit" : 100
						}
					},
					{
						"$project" : {
							"_id" : true,
							"dynamicMetricCount" : true,
							"accountId" : true,
							"manualMetricCount" : true,
							"auth0Cache" : {
								"usermetadata" : {
									"lastName" : true
								}
							}
						}
					}
				],
				"totalCount" : [
					{
						"$group" : {
							"_id" : {
								"$const" : null
							},
							"totalCount" : {
								"$sum" : {
									"$const" : 1
								}
							}
						}
					},
					{
						"$project" : {
							"_id" : false,
							"totalCount" : true
						}
					}
				]
			}
		},
		{
			"$project" : {
				"_id" : true,
				"items" : true,
				"totalCount" : {
					"$arrayElemAt" : [
						"$totalCount",
						{
							"$const" : 0
						}
					]
				}
			}
		},
		{
			"$project" : {
				"_id" : true,
				"items" : true,
				"totalCount" : "$totalCount.totalCount"
			}
		}
	],
	"ok" : 1,
	"operationTime" : Timestamp(1666600486, 1),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1666600486, 1),
		"signature" : {
			"hash" : BinData(0,"RTt0l2k9ECaNmPQz+Tt9eQHxIEA="),
			"keyId" : NumberLong("7132835261448192001")
		}
	}
}

For the lookup it seems to be no data for the explain, maybe i should provide some parameter to the explain. Here the project is the first thing in the lookup pipeline. Before adding it there there are no project and this script ran 2-3 times slower than now.

steevej · October 31, 2022, 1:54am

I see you do some $project before $match. It is always better to $match before $project. The query optimizer might not be able to use the indexes.

Can you test for isDeleted:false and isSystem:false rather than $nor:isDeleted:$eq:true?

You did not share you whole pipeline so it is hard to make sense of the explain plan.

system · December 30, 2022, 1:55am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.