Slow updateMany operation in MongoDB v4.4.25

Michael_Hal · January 23, 2024, 4:34pm

Hello everyone. I have a problem with the updateMany operation on MongoDB Community Server v4.4.25. I’m using a three-node cluster running on Google Cloud VMs of type e2-standard-4 (4 vCPUs, 16GB RAM, with half of it allocated to the WiredTiger cache). The collection has around 130 million documents, and I’m running an updateMany that touches around 350,000 of them. The documents are small, the query is supported by an index, and the write concern is w: 1.

Sometimes the operation completes within a few minutes, but other times it takes several hours. During the slow runs there is almost no disk read/write activity apart from a brief spike in the first few seconds, after which it drops to very low values.

Initially, I thought it might be a cache issue, but I increased the cache size fourfold (switching to e8-standard-64 instances), and from what I can see, the cache is not even 25% full. CPU utilization stays at a few percent, and the top command on the VM doesn’t indicate that the CPU is waiting on the disk. I’ve looked for other operations that might be interfering, but I couldn’t find anything in the logs.

I’m wondering whether there might be some limitation on the Google Cloud side. I also have a feeling that the problem occurs less frequently when the primary is in a specific zone, but that could be a coincidence.

This is a fragment of the currentOp output captured during a slow run of this operation:

      microsecs_running: Long("4605218131"),
      op: 'update',
      ns: 'myDb.myCollection',
      command: {
        q: { fieldId: Long("111") },
        u: { '$set': { fieldId: Long("222") } },
        multi: true,
        upsert: false
      },
      planSummary: 'IXSCAN { fieldId: 1 }',
      numYields: 95113,
      locks: {
        ParallelBatchWriterMode: 'r',
        FeatureCompatibilityVersion: 'w',
        ReplicationStateTransition: 'w',
        Global: 'w',
        Database: 'w',
        Collection: 'w'
      },
      waitingForLock: false,
      lockStats: {
        ParallelBatchWriterMode: { acquireCount: { r: Long("95114") } },
        FeatureCompatibilityVersion: { acquireCount: { w: Long("95114") } },
        ReplicationStateTransition: { acquireCount: { w: Long("95114") } },
        Global: { acquireCount: { w: Long("95114") } },
        Database: { acquireCount: { w: Long("95114") } },
        Collection: { acquireCount: { w: Long("95114") } },
        Mutex: { acquireCount: { r: Long("1") } }
      },
      waitingForFlowControl: false,
      flowControlStats: {
        acquireCount: Long("95114"),
        timeAcquiringMicros: Long("143036")

At first glance, the numYields value of 95113 concerns me, but it may be related to the first execution of this operation (cold cache). However, there is no consistent pattern: the operation can be slow both with a cold cache and on a warmed-up node. I’ve run out of ideas. Perhaps some of you have encountered something similar or can offer advice on where to look. I’d appreciate any suggestions.
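
To put the counters from the currentOp fragment above into more readable units, here is a minimal sketch in plain JavaScript (it should run in Node.js or mongosh). The variable names are made up for illustration; the three input values are copied from the output above. It only does arithmetic on those numbers and does not explain the slowness.

```javascript
// Values copied from the currentOp fragment above.
const microsecsRunning = 4605218131;  // microsecs_running
const numYields = 95113;              // numYields
const flowControlMicros = 143036;     // flowControlStats.timeAcquiringMicros

// Total running time so far, converted from microseconds to minutes.
const minutesRunning = microsecsRunning / 1e6 / 60;

// Average wall-clock time per yield, in milliseconds.
const msPerYield = microsecsRunning / numYields / 1000;

// Fraction of the running time spent acquiring flow-control tickets.
const flowControlShare = flowControlMicros / microsecsRunning;

console.log(minutesRunning.toFixed(1) + " min running");
console.log(msPerYield.toFixed(1) + " ms per yield on average");
console.log((flowControlShare * 100).toFixed(4) + " % of time in flow control");
```

With these inputs the operation had been running for roughly 77 minutes at the time of the snapshot, with about 48 ms elapsing per yield on average, and the time spent waiting on flow control is a negligible share of the total (well under 0.01%).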