Adding a single auto-incrementing field to all documents is taking too long

I have a problem with the code below, which I run on a collection of 4M+ documents. What I want to do is create a new auto-incrementing field for all documents, but it takes too long. I can see it changing documents, but it has been running for around 30 minutes now. Can this code be optimized further?

var counter = 0;

db.posts.find().forEach(function (doc) {
  db.posts.updateOne(
    { _id: doc._id },
    { $set: { PostIncrementId: counter } }
  );
  counter++;
});

Hello, @Saylent_N_A ! :wave:

The main issue with your approach is that you update each document with a separate request. That means that with 4M+ documents your code makes 4M+ requests to your database server, which is a lot of time and a lot of network interaction!

Try to send the update commands in batches. I can suggest two ways this can be done:

Solution 1. Using bulkWrite()

  1. In your application, select documents that do not have 'postIncrementId' yet. The batch can be 10K documents or any other number your MongoDB driver can handle. Fetch only the document ids (use a projection) to reduce the impact on RAM.
  2. Assign a postIncrementId to each selected document.
  3. Build a bulkWrite array of update commands.
  4. Execute the bulkWrite command to update the selected documents.
  5. Repeat steps 1-4 in a loop, depending on the language you use, until all documents are processed (see the sketch right after this list).
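
A minimal mongosh sketch of those steps, just to illustrate the idea (the postIncrementId field name, the 10K batch size and the starting counter value are example choices; the same pattern works with any driver):

let counter = 0;
const batchSize = 10000;

while (true) {
  // 1. select only the _id of documents that do not have postIncrementId yet
  const batch = db.posts
    .find({ postIncrementId: null }, { _id: 1 })
    .limit(batchSize)
    .toArray();

  // 5. stop when every document has been updated
  if (batch.length === 0) break;

  // 2 + 3. assign postIncrementId and build the array of update commands
  const operations = batch.map(doc => ({
    updateOne: {
      filter: { _id: doc._id },
      update: { $set: { postIncrementId: ++counter } }
    }
  }));

  // 4. one bulkWrite request per batch instead of one request per document
  db.posts.bulkWrite(operations, { ordered: false });
}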

Solution 2. Aggregation pipeline with $merge
The steps are similar to the ones in the previous solution, but document selection and postIncrementId calculation are done on the database server's side.

Pseudo code (use it just as an example):

const totalDocuments = db.posts.countDocuments();

// documents to process in batch. 
const limit = 10000; 

// decide, how many batch operations we will need  
// to update all documents
const nIterations = Math.ceil(totalDocuments / limit);

for (let i = 0; i < nIterations; i++) {
  const offset = i * limit;
  // call function with aggregation;
  // make sure this aggregation is called sequentially,
  // not in parallel, to avoid conflicts
  batchIncrement(limit, offset);
}

function batchIncrement(limit, offset) {
  db.posts.aggregate([
    {
      // $sort is needed so the $limit stage gives
      // predictable results
      $sort: {
        _id: 1
      }
    },
    // match only documents that do not have postIncrementId yet;
    // documents updated by previous batches are excluded here,
    // so a $skip stage is not needed (the offset argument is only
    // used further below to continue the counter)
    {
      $match: {
        postIncrementId: null,
      }
    },
    // limit number of documents per batch
    {
      $limit: limit, // function variable is used here!
    },
    // leave only _id of each post, 
    // so each document could take less RAM 
    // and more documents could fit in $group stage
    {
      $project: {
        _id: true,
      }
    },
    {
      // collect all selected posts into array field 
      // for later use in $project + $reduce
      $group: {
        _id: null,
        posts: {
          $push: {
            _id: '$_id'
          }
        }
      }
    },
    {
      // calculate and assign postIncrementId for each post 
      $project: {
        posts: {
          $reduce: {
            input: '$posts',
            initialValue: {
              i: offset + 1,  // function variable is used here!
              incrementedPosts: [],
            },
            in: {
              i: {
                $add: ['$$value.i', 1],
              },
              incrementedPosts: {
                $concatArrays: ['$$value.incrementedPosts', [
                  {
                    _id: '$$this._id',
                    postIncrementId: '$$value.i'
                  }
                ]]
              }
            }
          }
        }
      }
    },
    // convert 'posts' array back to documents
    // with $unwind + $replaceWith stages
    {
      $unwind: '$posts.incrementedPosts',
    },
    {
      $replaceWith: '$posts.incrementedPosts'
    },
    // save documents into the collection;
    // this will only add the 'postIncrementId' field to each document,
    // other fields will not be affected
    {
      $merge: {
        into: 'posts',
        on: '_id',
        whenMatched: 'merge',
        whenNotMatched: 'discard'
      }
    }
  ]);
}

Tested this aggregation pipeline on these sample documents:

[
  { _id: 'P1', title: 'P1-title' },
  { _id: 'P2', title: 'P2-title' },
  { _id: 'P3', title: 'P3-title' },
  { _id: 'P4', title: 'P4-title' },
  { _id: 'P5', title: 'P5-title' },
  { _id: 'P6', title: 'P6-title' },
  { _id: 'P7', title: 'P7-title' },
  { _id: 'P8', title: 'P8-title' },
  { _id: 'P9', title: 'P9-title' }
]

Documents after batchIncrement() function execution with limit=10, offset=0:

[
  { _id: 'P1', title: 'P1-title', postIncrementId: 1 },
  { _id: 'P2', title: 'P2-title', postIncrementId: 2 },
  { _id: 'P3', title: 'P3-title', postIncrementId: 3 },
  { _id: 'P4', title: 'P4-title', postIncrementId: 4 },
  { _id: 'P5', title: 'P5-title', postIncrementId: 5 },
  { _id: 'P6', title: 'P6-title', postIncrementId: 6 },
  { _id: 'P7', title: 'P7-title', postIncrementId: 7 },
  { _id: 'P8', title: 'P8-title', postIncrementId: 8 },
  { _id: 'P9', title: 'P9-title', postIncrementId: 9 }
]

Note: you may want to create a unique index on the postIncrementId field, so that no two documents can end up with the same value for it.
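
For example, a unique index could look like this (the sparse option is my assumption here: it skips documents that do not have the field yet, so the index does not complain about duplicate missing values while the backfill is still in progress):

db.posts.createIndex(
  { postIncrementId: 1 },
  { unique: true, sparse: true } // sparse: documents without the field are not indexed
);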


I was able to fix it by running the command on the server. I didn't know that Compass was fetching and sending the data.
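
For reference, a sketch of that approach: save the loop to a file and run it with mongosh directly on (or near) the server host instead of through Compass' embedded shell. The connection string and file name below are placeholders:

// add-post-increment-id.js
// run with: mongosh "mongodb://<host>:27017/<db>" --file add-post-increment-id.js
let counter = 0;
db.posts.find().forEach(function (doc) {
  db.posts.updateOne(
    { _id: doc._id },
    { $set: { PostIncrementId: counter++ } }
  );
});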

