Newest document by unique set of fields

Adam_Adam · February 2, 2022, 5:07pm

Hello,

I am struggling with an aggregation pipeline which is asking for allowDiskUse for sorting while I think it should not.

Here is the code for Mongo shell:

db.four_monthly.aggregate(
  [
    { $match: { } },
    { $project: { scatter_plot: 0, fit_plot: 0 } }, { $sort: { date_end: -1 } },
    { $group: { _id: { road: "$road", direction: "$direction", pk: "$pk" }, doc: { $first: "$$ROOT" } } }
  ],
  { allowDiskUse: true }
)

So basically what I am looking for in newest document (on date_end) for each unique set of road+direction+dk.
This query does not work without allowDiskUse, which is not what I want.

The following indexes are defined:

[
	{
		"v" : 2,
		"unique" : true,
		"key" : {
			"date_end" : -1,
			"road" : 1,
			"direction" : 1,
			"pk" : 1
		},
		"name" : "date_end_-1_road_1_direction_1_pk_1",
		"ns" : "flow-speed-calculator.four_monthly"
	},
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "flow-speed-calculator.four_monthly"
	},
	{
		"v" : 2,
		"key" : {
			"road" : 1,
			"direction" : 1,
			"pk" : 1,
			"date_end" : -1
		},
		"name" : "road_1_direction_1_pk_1_date_end_-1"
	}
]

What I guess is that it sort documents first and then go throught the sorted list to extract one document for each road+direction+pk document, that’s why it requires disk use. However, what I expected MongoDb to do is first request all unique set of road+direction+pk then sort on date_end and return newest for each.

Is that possible ? If so, how do I fix my aggregation ?

Thanks a lot in advance,

Best regards, Adam.

Asya_Kamsky · February 2, 2022, 6:34pm

Simple - remove the unnecessary $project stage before $sort and $group and see what happens (you can use explain to see exactly what that changes.

Asya

Adam_Adam · February 2, 2022, 6:57pm

Thanks a lot Asya, it actually seems to help yes.

Here is what explain has to say when running with $project:

	"stages" : [
		{
			"$cursor" : {
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "flow-speed-calculator.four_monthly",
					"indexFilterSet" : false,
					"parsedQuery" : {
						
					},
					"queryHash" : "8B3D4AB8",
					"planCacheKey" : "8B3D4AB8",
					"winningPlan" : {
						"stage" : "COLLSCAN",
						"direction" : "forward"
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$project" : {
				"fit_plot" : false,
				"scatter_plot" : false
			}
		},
		{
			"$sort" : {
				"sortKey" : {
					"date_end" : -1
				}
			}
		},
		{
			"$group" : {
				"_id" : {
					"road" : "$road",
					"direction" : "$direction",
					"pk" : "$pk"
				},
				"doc" : {
					"$first" : "$$ROOT"
				}
			}
		}
	],

And without:

	"stages" : [
		{
			"$cursor" : {
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "flow-speed-calculator.four_monthly",
					"indexFilterSet" : false,
					"parsedQuery" : {
						
					},
					"queryHash" : "F446870C",
					"planCacheKey" : "F446870C",
					"winningPlan" : {
						"stage" : "FETCH",
						"inputStage" : {
							"stage" : "IXSCAN",
							"keyPattern" : {
								"date_end" : -1,
								"road" : 1,
								"direction" : 1,
								"pk" : 1
							},
							"indexName" : "date_end_-1_road_1_direction_1_pk_1",
							"isMultiKey" : false,
							"multiKeyPaths" : {
								"date_end" : [ ],
								"road" : [ ],
								"direction" : [ ],
								"pk" : [ ]
							},
							"isUnique" : true,
							"isSparse" : false,
							"isPartial" : false,
							"indexVersion" : 2,
							"direction" : "forward",
							"indexBounds" : {
								"date_end" : [
									"[MaxKey, MinKey]"
								],
								"road" : [
									"[MinKey, MaxKey]"
								],
								"direction" : [
									"[MinKey, MaxKey]"
								],
								"pk" : [
									"[MinKey, MaxKey]"
								]
							}
						}
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$group" : {
				"_id" : {
					"road" : "$road",
					"direction" : "$direction",
					"pk" : "$pk"
				},
				"doc" : {
					"$first" : "$$ROOT"
				}
			}
		}
	],

So I’m not sure to understand why the $project option slow it down like this but at least this topic will have all informations in case someone else is interressted in. We can clearly see the second query is refering indexes, while the first one isn’t.

If you are wiling to explain me a bit why it goes like this, and how I can re-enable $project (but after grouping, to avoid killing performance) that would be awesome

steevej · February 3, 2022, 3:22am

After a $project/$addFields/$group, the documents are not the original documents from the collection anymore. This means that $sort cannot use the index of the collection.

A $project after the $group should help performance if the result of $project is smaller than the source.

Adam_Adam · February 3, 2022, 9:30am

Okay, that makes sense indeed, thanks !
Any way to re-enable projection after $group ? If I had $project after $group (in the array) it is no-op.

Regards

Asya_Kamsky · February 4, 2022, 8:56pm

The issue is that your $project tricks the pipeline into not using the very helpful index (see COLLSCAN in the first explain).

This is a limitation in pipeline analysis - it should know to either ignore the $project or swap it with $sort in order to improve its ability to push down sort to the query subsystem. I believe fixing SERVER-26442 would automatically “do the right thing” but meanwhile you can do it yourself by putting $sort before $project.

Asya

Asya_Kamsky · February 4, 2022, 8:59pm

You can do it before $group (but after $sort).

system · February 21, 2022, 8:51am

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.