MongoDB: imported documents do not keep the same order if were inserted through insertMany()

Manuel_Jordan · December 31, 2021, 2:17pm

If through mongo shell is used the following (for few documents 50 for example)

db.<collection>.insertMany(
[
  {
    "name": "some name 1",
    ...
  },
  {
   "name": "some name 2",
    ...
  },
  ...
  {
    "name": "some name N",
    ...
  }
])

I can confirm that through db.<collection_name>.find() the documents are shown in the exact same order how they were declared and inserted through db.<collection>.insertMany(). Until here it is expected and is fine for me.

I did realize the following, if the collection is empty, and if the same content shown above - keeping the same the order - is declared as follows within the import.js file for importation purposes:

[
  {
    "name": "some name 1",
    ...
  },
  {
   "name": "some name 2",
    ...
  },
  ...
  {
    "name": "some name N",
    ...
  }
]

Therefore, is used the same content and order but db.<collection>.insertMany() was removed/ignored.

So, then if is executed through the console/terminal the following command:

mongoimport --db some_db \
            --collection collection_name \
            --file /path/import.json \
            --jsonArray

I can confirm the importation process is accomplished but if db.<collection_name>.find() is executed then the documents are shown in a different order how they were declared and inserted through db.<collection>.insertMany() as the first approach.

How fix this? It is problematic for aggregations how $first and $last

steevej · December 31, 2021, 2:28pm

You must not depend on the insertion order.

If you want to retrieve documents in a specific order you must sort them.

santimir · December 31, 2021, 4:22pm

Hi,

Take a look at mongoimport --maintainInsertionOrder. By default the order is arbitrary.

On the other hand insertMany respects the order by default.

But I would assume the documents are not inserted in a particular order either.

steevej · January 3, 2022, 1:54pm

As you have seen some tools does not respect the JSON file order. They do that for efficiency reasons. A database server is usually more powerful than the client connection to it. The server can do multiple operations in parallel. Some tools will do multiple bulk writes in multiple threads to achieve better performance. So they cannot guaranty the order of operations. It is then better to rely on repeatable and predictable sort to get the document in a given order.

Stennie_X · January 4, 2022, 4:46am

Hi @Manuel_Jordan,

Without an explicit sort order documents are returned in natural order. Natural order is not guaranteed to match insertion order outside of the special case of capped collections (which have significant usage restrictions).

Natural order is not a stable sort order, it is “as documents are found”:

This ordering is an internal implementation feature, and you should not rely on any particular ordering of the documents.

If you want a predictable sort ordering for documents retrieved, you must include an explicit sort() with your query and have unique values for your sort key. Natural order is described as an internal implementation detail because the storage engine decides how to store data most efficiently and the query planner determines the optimal path to return results meeting the requested query filter and sort criteria.

For a few examples of reasons why natural order may vary, see: What is the default sort order when none is specified?.

Regards,
Stennie

santimir · January 5, 2022, 10:42am

@Manuel_Jordan I believe the actual answer to your question is @Stennie_X 's. I thought natural order followed the _id, which increases with every insertion (afaik).

I believe the are 2 options:

Insert in order and sort by _id.
Use a field like ‘creationDate’ in each document such that insertion order isn’t relevant and just sort by that field.

The advantage of _id is you don’t need to create an index, but you rely on inserting them in the right order.

Stennie_X · January 6, 2022, 11:02am

Hi @santimir,

Natural order is independent of _id and there are a few further considerations:

_id values can be user-defined rather than default ObjectIDs
ObjectIDs are approximately monotonically increasing, but not guaranteed to increase with every insertion (particularly in a deployment with more clients generating ObjectIDs)

Consider the components of an ObjectID:

The 12-byte ObjectId value consists of:

a 4-byte timestamp value , representing the ObjectId’s creation, measured in seconds since the Unix epoch

a 5-byte random value generated once per process. This random value is unique to the machine and process.

a 3-byte incrementing counter , initialized to a random value

By default ObjectIDs are generated on the client side when clients/drivers are about to insert a new document which does not have an _id field. The pseudorandom ObjectID format allows ids to be generated independently with a very low probability of collision, and generation on the client side allows drivers to annotate the original document with an _id without a server round-trip.

A unique index (such as the required unique index on the _id primary key) ensures collisions will not lead to duplicate entries. The timestamp prefix in the ObjectID is a 4-byte unixtime (seconds since the Unix epoch) followed by a random value unique to the machine and process.

The above attributes mean there are several ways to end up with ObjectIDs not strictly increasing:

ObjectIDs generated by multiple processes in the same machine are at best ordered by the timestamp value with a per-second granularity. The random per-process and counter values do not guarantee strict ordering.
A process may generate an ObjectID with some delay before the actual document is inserted by the MongoDB deployment.
Time drift across different servers affects the generated timestamp.

Sorting by _id is generally good enough for most use cases, but the embedded timestamp in an ObjectID is really capturing the time of ObjectID generation rather than time of document insertion.

For more precision on timestamps you could add a Date field which has millisecond precision rather than the seconds for a timestamp embedded in an ObjectID.

Regards,
Stennie

santimir · January 6, 2022, 11:21am

@Stennie

I’ve never really thought that the granularity is in the seconds range, and that the _id will more accurately point to the creation date rather than insertion order.
Thanks a lot for the insights.

system · January 11, 2022, 11:21am

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.