MongoDB Aggregation - Does $unwind order documents the same way as the nested array order?

karaimin · July 28, 2020, 8:35am

I am wandering whether using $unwind operator in aggregation pipeline for document with nested array will return the deconstructed documents in the same order as the order of the items in the array.
Example:
Suppose I have the following documents

{ "_id" : 1, "item" : "foo", values: [ "foo", "foo2", "foo3"] }
{ "_id" : 2, "item" : "bar", values: [ "bar", "bar2", "bar3"] }
{ "_id" : 3, "item" : "baz", values: [ "baz", "baz2", "baz3"] }

I would like to use paging for all values in all documents in my application code. So, my idea is to use mongo aggregation framework to:

sort the documents by _id
use $unwind on values attribute to deconstruct the documents
use $skip and $limit to simulate paging

So the question using the example described above is:

Is it guaranteed that the following aggregation pipeline:

[
    {$sort: {"_id": 1}},
    {$unwind: "$values"}
]

will always result to the following documents with exactly the same order?:

{ "_id" : 1, "item" : "foo", values: "foo" }
{ "_id" : 1, "item" : "foo", values: "foo2" }
{ "_id" : 1, "item" : "foo", values: "foo3" }
{ "_id" : 2, "item" : "bar", values: "bar" }
{ "_id" : 2, "item" : "bar", values: "bar2" }
{ "_id" : 2, "item" : "bar", values: "bar3" }
{ "_id" : 3, "item" : "baz", values: "baz" }
{ "_id" : 3, "item" : "baz", values: "baz2" }
{ "_id" : 3, "item" : "baz", values: "baz3" }

michael_hoeller · July 28, 2020, 11:30am

Hello @karaimin welcome to the community!

There is a chance that you get the same results but no guarantee. Personally I always use the sort after the unwind, ideal with an $match before the unwind to try to get as few documents as possible.

[
    {$match: { <query> } }
    {$unwind: "$values"}
    {$sort: {"_id": 1}},
]

The above will provide for sure the wanted result. However, I hope that one oft the MDB internals can elaborate on

the performance aspects - is there an option to prevent an in memory sort after the $unwind? Not sure if a wildcardIndex can help here.

Cheers,
Michael

karaimin · July 28, 2020, 12:50pm

Thank you @michael_hoeller for you response.
However, maybe this is not an option for me since I expect the real sample to have a small number documents which I will $match by some query (this is why I am sorting them by _id) and large number of items in values array.
As I mentioned I want to use pagination for the items in the values attribute. So I need a consistent order of them. Sorting them is not an option since after $unwind I loose the DB Index and it will exceed the memory of 100MB available for the pipeline. I can use "allowDiskUse: true", but that is also not a good option because it will slow down queries a lot.

michael_hoeller · July 28, 2020, 2:02pm

Hello @karaimin

Yes, that should be avoid when you are acting in a realtime app.

Some thoughts

One thought on this, large arrays are almost always difficult to handle. Maybe you can change your schema and use embedding for $values? This would open completely different options.

You can check if a multikey Index on one or some fields in your array can help. In case you can find a good index you can walk along the index without the $unwind

I am not sure what you mean. When you deconstruct an array with $unwind you will get an document per field and element. You can add the option includeArrayIndex to $unwind. This will add an field to the new document which just numbers the fields. This could also provide a path to go.

{ "_id" : 1, "arrayIndex" : NumberLong(0), "item" : "foo", values: "foo" }
{ "_id" : 1, "arrayIndex" : NumberLong(1),  "item" : "foo", values: "foo2" }
{ "_id" : 1, "arrayIndex" : NumberLong(2),  "item" : "foo", values: "foo3" }
{ "_id" : 2, "arrayIndex" : NumberLong(1),  "item" : "bar", values: "bar" }
{ "_id" : 2, "arrayIndex" : NumberLong(2),  "item" : "bar", values: "bar2" }
{ "_id" : 2, "arrayIndex" : NumberLong(3),  "item" : "bar", values: "bar3" }

The common approch on pagination Using skip() and limit()

// Page 1
db.users.find().limit (10)
// Page 2
db.users.find().skip(10).limit(10)
// Page 3
db.users.find().skip(20).limit(10)

In general, to retrieve page ‘n’ the code looks like this:

db.users.find().skip(pagesize*(n-1)).limit(pagesize)

As the size of your data increases, this approach has performance problems. The reason is that every time the query is executed, the full result set is built up, then the server has to walk from the beginning of the collection to the specified offset. As your offset increases, this process gets slower. Also, this process does not make efficient use of the indexes. So typically the ‘skip()’ and ‘limit()’ approach is useful when you have small data sets,

Approach Using find() and limit()

The reason the previous approach does not scale well is the skip() command. Depending how you have build your _id you may can use the natural order in the stored data like a timestamp, or an index (pls s. above).

Retrieve the _id of the last document in the current page
Retrieve documents greater than this “_id” in the next page

// Page 1
db.users.find().limit(pageSize);
// Find the id of the last document in this page
last_id = ...

// Page 2
users = db.users.find({` `'_id'` `> last_id}). limit(10);
// Update the last id with the id of the last document in this page
last_id = ...

This approach leverages the inherent order that exists in the “_id” field. Also, since the “_id” field is indexed by default, the performance of the find operation is very good.

There are further options to do pagination:

using $slice or
using range queries

both I do not think will help here so I skip the examples.

Cheers,
Michael

karaimin · July 28, 2020, 7:59pm

Hi @michael_hoeller

I didn’t get your idea here:

If you mean changing my schema to looks similar to what $unwind does (reversing One-to-Many relation) and adding appropriate key index for each value will do the trick, but this schema also has a lot of concerns related to my other scenarios.

I mean that now I may have a single key index on values field (used for queries), but after $unwind it is useless. So, next stage (which is $sort) won’t benefit from the sorted structure of the indexes in the B-Tree just like regular $sort operation on indexed field.

My example is similar to the bucket pattern. The large document (with a too large array) is partitioned to some small documents (with smaller array size) in order to fit in the max document size.
Keeping too many documents will lead to heavy update operations (sometimes including massive amount of documents ) which I try to avoid.

qtax · August 5, 2020, 5:20pm

The same question:

Does $unwind keep the “unwound” documents in the same order as they were ordered in the source array?

Need to know this as we would like to avoid an unneeded $sort that would require allowDiskUse: true (when the aggregation otherwise doesn’t). Couldn’t find anything about this in the docs.

It seems potentially reasonable to assume that it would be in the same order, but can we be sure? @karaimin did you figure it out?

karaimin · August 6, 2020, 6:05am

No @qtax, I didn’t find appropriate solution. I didn’t find any information in the official documentation and according to @michael_hoeller the order is not guaranteed, so I assume that I don’t have any evidence to rely on the returned order.
I solve the problem with aggregation pipeline that sort the documents and give me the length of each nested array. After that with some application logic I am able to reduce the documents where my next page is. Then, I query only the filtered documents and this way the pagination is achieved. It is not so efficient like the assumption described in the first question but for now it is the best I have found.

michael_hoeller · August 6, 2020, 6:39am

Hello @karaimin and @qtax,

thanks for updating, I lost your message on my bookmarks.

This was my experience in an previous customer project. The “unwound” documents are identical to the input document except for the value of the array field which now holds a value from the original array element. It seems that the documents are created by the sequence they occur in the array. I assume that one unwind can be idempotent but within the the flow of a process you might get side effects (e.g. empty array).
In case you need to keep the sequence you can add includeArrayIndex which will add a field to the unwound document which holds the position of the field in the previous array.

To be absolutely sure what is correct I hope that @Asya_Kamsky can step in?

regards,
Michael

karaimin · August 6, 2020, 7:02am

Hi @michael_hoeller

I know the existence of includeArrayIndex option. However I would like to avoid application level sorting. So, saying in other words:

Is there any chance that this:

{ "_id" : 1, "item" : "foo", values: [ "foo", "foo2", "foo3"] }

may be destructed to this.

{ "_id" : 1, "item" : "foo", values: "foo2", "arrayIndex" : NumberLong(1 }
{ "_id" : 1, "item" : "foo", values: "foo", "arrayIndex" : NumberLong(0 }
{ "_id" : 1, "item" : "foo", values: "foo3", "arrayIndex" : NumberLong(2 }

Take a look at the order (1 0 2)

Having only the index won’t solve our problem. We need application level sorting if the above sequence is returned.
Best Regards
Aleydin

michael_hoeller · August 6, 2020, 7:12am

Hi @karaimin

I understood your question, I have seen this happen in a project some month back. I did tests just for me, but I have not been able to recreate the issue. To be on the save side on customer projects, I did add an extra(?) sort. This is a good question! With the last message I “pinged” @Asya_Kamsky she probably can add the full insight.
Just do not want to limt this to Asya she is often abroad, anyone else around who can add on this?

Regards,
Michael

karaimin · August 6, 2020, 7:22am

Ok, Thank you @michael_hoeller for all your support. I am looking forward of further updates on this topic. I will be glad to see official statement from the MongoDB staff.

Asya_Kamsky · August 8, 2020, 8:48pm

Hi there folks,

Yes, the order of documents will be the same as the order of elements in the array and the order pre-unwind will be preserved.

But using $skip and $limit for pagination is not really a good idea as it’s not very performant, especially once you move to sharding…

Asya

michael_hoeller · August 8, 2020, 9:04pm

Hello Asya,
thanks for clearing that, lucky to save an extra sort in the future. Concerning pagination I provided some suggestions fairly at the beginning of the post.
Michael

karaimin · August 9, 2020, 3:42pm

Thanks @Asya_Kamsky and @michael_hoeller for your time on revealing the mystery … and the suggestions about the pagination approach.

Since, $unwind will be used, I won’t be able to benefit from an indexed field and $skip and $limit is my only option for pagination (with this data model). I am going to test it if it can suits our needs. Maybe, combination from both “$skip/$limit” for the first few pages and the approach I described above can be appropriate for me.

Best Regards
Aleydin

Benjamin_Lowe · September 29, 2023, 3:35pm

Yes, the order of documents will be the same as the order of elements in the array and the order pre-unwind will be preserved.

Thanks for sharing this statement, and I gather you are a very reputable source, having worked on the aggregation framework itself.

It seems a colleague of yours contradicted this statement in a different forum thread a few months later sayng “MongoDB only gurantee order of documents based on a sort operation” which leaves me unsure whether it would be safe to avoid a sort operation (for performance reasons) when performing $unwind

Is there any up-to-date definititve documentation on what guarantees there are regarding array ordering during aggregation?

steevej · September 30, 2023, 11:22am

There is no contradiction.

The $unwind preserves the order of the documents it receives and the order of the array elements.

To get a predictable order of documents you must $sort.

That is if you $sort before $unwind, documents will be sorted after $unwind.

Benjamin_Lowe · September 30, 2023, 7:43pm

That was my understanding from this thread, but if someone asks “Does X operator preserve the order of an array?” and the response is that no operator other than sort guarantees order, it does somewhat suggest that $unravel might not preserve the intrinsitc order of the array.

I agree though that it seems likely to be exactly as you described though, and $unravel does preserve order (but maybe $in does not!)