findAndModify causes many write conflicts, which cause 100% CPU usage

Dev_Develop · March 22, 2022, 12:28pm

Hello!
I’m using MongoDB as storage for batch processing of documents in the background. The DB setup is 1 primary and 2 secondaries. Each document in the ‘processing’ collection has the following structure:

{
    _id: ObjectId("6239ba1f4f6a3f8e20d16243"),
    data: { ... },
    created_at: ISODate("2022-03-22T11:59:27.795Z"),
    processed_times: 1,
    status: 2, // status of processing (new, failed, in_progress)
    updated_at: ISODate("2022-03-22T11:59:27.795Z"),
    last_picked_at: ISODate("2022-03-22T11:59:32.164Z") // date of last "pick" for processing
  }

The current processing flow for documents in the collection looks like this:

Consumer (separate process) reads events from the event bus and writes them into the processing collection with status: new.
Processor (separate worker process), after a short timeout, “picks” 50 documents with status new from the processing collection. Basically, it runs a loop until 50 documents (or fewer, if no more are present in the collection) have been retrieved for processing. The query for “picking” each document is:

{
    findAndModify: 'processing',
    new: true,
    query: { status: 0, last_picked_at: { '$exists': false } },
    sort: { _id: 1 },
    update: {
        '$set': {
            status: 2,
            last_picked_at: current_date
        },
        '$inc': { processed_times: 1 }
    }
}
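
The picking loop described above can be sketched roughly as follows. This is only an illustration, not the service's actual code: `pickBatch` and `pickOne` are hypothetical names, and `pickOne` is assumed to be an async function that issues the findAndModify command and resolves to the picked document, or to null when nothing matches.

```javascript
// Hypothetical sketch of the "pick up to 50 documents" loop.
// pickOne is assumed to run the findAndModify command shown above and
// to resolve to the picked document, or null if no document matched.
async function pickBatch(pickOne, batchSize = 50) {
  const batch = [];
  while (batch.length < batchSize) {
    const doc = await pickOne();
    if (doc === null) {
      break; // no more 'new' documents in the collection
    }
    batch.push(doc);
  }
  return batch;
}
```

Note that every worker running this loop issues the same query with the same sort, so concurrent workers compete for the same oldest matching document.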

After fetching all the documents, each of them is processed, and the service then runs a transaction in which it stores all processed documents in a different collection and removes them from the processing collection.

All operations (including transactions) use the following read/write concerns:

write - {w: majority, j: false}
read - majority

When we first ran the service everything was fine, but on the 4th day we noticed a burst of write conflicts from the findAndModify operation that picks documents for processing. CPU usage on the primary was at 100%, and the execution time of these operations kept getting higher and higher. We switched over the primary node and the overload went away, but we experienced the same behaviour again after 5 days.
I’ve tried to understand what is wrong with the code and why there are such bursts of write conflicts after a long period of normal work with few conflicts, but I’ve run out of guesses.
What would you suggest to identify potential causes in such a case?

Jerome_LAFORGE · March 23, 2022, 9:30am

We use MongoDB as a job queue dispatcher (Queue) for microservices and we faced the same problem. To avoid the write conflicts, we use a semaphore to limit findAndModify to one call at a time.
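
A minimal sketch of this idea, assuming a Node.js worker. The names `Semaphore`, `pickLock`, `pickExclusively` and `pickOne` are illustrative and not taken from any real codebase; `pickOne` stands for an async function that issues the findAndModify command.

```javascript
// Minimal in-process counting semaphore (illustrative sketch).
class Semaphore {
  constructor(permits) {
    this.permits = permits;
    this.waiters = [];
  }

  async acquire() {
    if (this.permits > 0) {
      this.permits -= 1;
      return;
    }
    // No permit available: wait until release() hands one over.
    await new Promise((resolve) => this.waiters.push(resolve));
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      next(); // pass the permit directly to the next waiter
    } else {
      this.permits += 1;
    }
  }
}

// One permit: at most one findAndModify in flight per process.
const pickLock = new Semaphore(1);

// pickOne is a hypothetical async function that runs the findAndModify command.
async function pickExclusively(pickOne) {
  await pickLock.acquire();
  try {
    return await pickOne();
  } finally {
    pickLock.release();
  }
}
```

This only serializes calls inside a single process. Several worker processes can still collide with each other, so covering them as well would need some form of distributed lock.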

Dev_Develop · March 23, 2022, 8:53pm

Thanks for your reply. You’re right, synchronisation in the code layer is one way to avoid this. I thought about distributed locks, because the worker service runs as multiple separate processes. But for now I want to find the real cause of the problem.

kevinadi · March 24, 2022, 11:53pm

Hi @Dev_Develop welcome to the community!

Are you seeing the same pattern of behaviour every ~5 days? The cause of a write conflict is basically what it says on the tin: two or more threads of operation are trying to update the same document simultaneously. I’m curious whether using a semaphore like @Jerome_LAFORGE suggested can alleviate the issue. If yes, that would imply that throttling the workload going into the database is one answer, and that the issue was (probably) caused by how the application uses the database.

Is it possible for you to check the mongod logs during these times to see how many connections there are, how many slow queries there are and what they are, or any other signals that may be helpful? The output of db.serverStatus() may also provide a clue about the state of the server during these events.
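
As a rough illustration of what to look for in that output, the sketch below compares two db.serverStatus() snapshots taken a known number of seconds apart. The helper name `summarizeServerStatusDelta` is made up for this example. It reads `metrics.operation.writeConflicts` and `connections.current` from the serverStatus output, and assumes that counter values convert cleanly with `Number()`.

```javascript
// Hypothetical helper: summarize the change between two serverStatus snapshots.
// Assumes both snapshots contain metrics.operation.writeConflicts and
// connections.current, and that counters can be converted with Number().
function summarizeServerStatusDelta(before, after, seconds) {
  const conflictsBefore = Number(before.metrics.operation.writeConflicts);
  const conflictsAfter = Number(after.metrics.operation.writeConflicts);
  return {
    writeConflictsPerSec: (conflictsAfter - conflictsBefore) / seconds,
    currentConnections: Number(after.connections.current),
  };
}

// Assumed usage in mongosh:
//   const a = db.serverStatus();
//   sleep(10000); // wait 10 seconds
//   const b = db.serverStatus();
//   summarizeServerStatusDelta(a, b, 10);
```

A write conflict rate that jumps sharply during an incident, compared with a quiet period, would support the contention theory.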

Best regards
Kevin

Jerome_LAFORGE · March 25, 2022, 7:07am

Yes, this semaphore greatly helps in mitigating this problem even if it doesn’t solve it completely (because the worker is also distributed). But at least it prevents this problem from happening even with one worker (of course it’s worse when there are several workers).