Number of retrieved documents < batch size

Hello,

I have been using the MongoDB Go driver v1.11.1 with Go 1.19.
I was doing a data type migration and ran into an issue.

In one environment, I had to change the data type of over 100,000 documents. Since everything runs in pods, I couldn't load 100,000 documents into memory, so I used batchSize = 1000 in the FindOptions struct.

I was calling Find in an infinite loop and breaking out when cursor.ID() == 0.
The thing is, I was doing this check before calling cursor.Next(), so for clients whose number of matching documents was < batchSize the cursor was considered dead and none of their documents were updated.
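
To illustrate, here is a simplified sketch of the pattern (coll, ctx, and filter stand in for my actual collection, context, and query; the full code is further down in this thread):

opts := options.Find().SetBatchSize(1000)
for {
	cursor, err := coll.Find(ctx, filter, opts)
	if err != nil {
		return err
	}

	// This check runs before cursor.Next() is ever called.
	if cursor.ID() == 0 {
		break
	}

	for cursor.Next(ctx) {
		// decode and update the document
	}
}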

Is it intended behavior to set the cursor ID to 0 when the number of documents is < batchSize?

FYI, when the number of documents is > batchSize, everything works fine!

Hey @Dylan_Dinh,

Thank you for reaching out to the MongoDB Community forums!

To better understand the problem, may I ask you the following questions:

  • What specific data type migration were you performing?
  • Could you provide more details about the environment in which you had to change the data type?
  • How are the documents distributed across the pods? Are they evenly distributed or specific to certain pods?
  • Did you encounter any error messages or warnings during the migration process?
  • Have you tried any alternative approaches to handling the data type migration?
  • Could you share the code snippet that you are executing?
  • Also, please share the version of the MongoDB server you are currently using and where it is deployed.

Looking forward to hearing back from you.

Regards,
Kushagra

Hi @Kushagra_Kesav,

  • I was migrating an object field from BinData to plain text.
  • It was a migration running at the start of our pod, using the versions mentioned above. What specific information do you need?
  • A single pod was in charge of the migration, so 100,000 documents was too much for the available memory and led to OOM. I fixed this using the batch size option.
  • No, but we discovered in production that when the number of documents was < batchSize, those documents weren't updated.

Code snippet:

batchSize := int32(1000)
opts := &options.FindOptions{
	BatchSize: &batchSize,
}
for {
	cursor, err := db.Collection(pushedNotificationCollection).Find(context.Background(), bson.M{"payload": bson.M{"$type": "binData"}}, opts)
	if err != nil {
		return err
	}

	// Checked before cursor.Next(): when all matching documents fit in the
	// first batch, ID() is already 0 here, so those documents are skipped.
	if cursor.ID() == 0 {
		break
	}

	for cursor.Next(context.Background()) {
		var ai pushednotification.AlarmInfo
		var pn oldUserPushedNotification

		if err = cursor.Decode(&pn); err != nil {
			return err
		}

		err = json.Unmarshal(pn.Payload, &ai)
		if err != nil {
			return err
		}

		newUpn := buildNewUserPushedNotifFromOld(pn, ai)

		_, err = db.Collection(coll).ReplaceOne(context.Background(), cursor.Current, newUpn)
		if err != nil {
			return err
		}
	}
}
return nil

Is this the right way to do it, and as fast as possible?
Doing this check before calling ReplaceOne, when the number of documents is < batchSize, will in fact break out of the for loop and you miss those documents:

if cursor.ID() == 0 {
	break
}

The question is: why is cursor.ID() == 0 when the number of documents is < batchSize? I feel like it should only be 0 when there are no documents at all.

Maybe I shouldn't use the outer for loop, so I could get rid of that break, but I have a memory limitation on my pod, so loading 100,000 documents at once is not possible.
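
For reference, here is roughly what I have in mind (a sketch only, reusing the same imports and types as the snippet above, and assuming that the driver fetches later batches lazily during cursor.Next() and that oldUserPushedNotification has an ID field mapped to _id):

batchSize := int32(1000)
opts := &options.FindOptions{
	BatchSize: &batchSize,
}

cursor, err := db.Collection(pushedNotificationCollection).Find(context.Background(),
	bson.M{"payload": bson.M{"$type": "binData"}}, opts)
if err != nil {
	return err
}
defer cursor.Close(context.Background())

// Next() pulls documents batch by batch, so only about one batch is held
// in memory at a time; no outer loop or cursor.ID() check needed.
for cursor.Next(context.Background()) {
	var pn oldUserPushedNotification
	if err := cursor.Decode(&pn); err != nil {
		return err
	}

	var ai pushednotification.AlarmInfo
	if err := json.Unmarshal(pn.Payload, &ai); err != nil {
		return err
	}

	newUpn := buildNewUserPushedNotifFromOld(pn, ai)

	// Replace by _id instead of matching on cursor.Current.
	if _, err := db.Collection(pushedNotificationCollection).ReplaceOne(
		context.Background(), bson.M{"_id": pn.ID}, newUpn); err != nil {
		return err
	}
}
return cursor.Err()

If that assumption about lazy batch fetching holds, the memory limit shouldn't be a problem even without the outer loop and the break.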

  • db.version() → 4.4.0

Where in the documentation does it say that a cursor ID of zero indicates that the cursor does not contain any documents?