Hello!
I’m using MongoDB as storage for batch processing of documents in the background. The DB setup is 1 primary and 2 secondaries. Each document in the ‘processing’ collection has the following structure:
{
  _id: ObjectId("6239ba1f4f6a3f8e20d16243"),
  data: { ... },
  created_at: ISODate("2022-03-22T11:59:27.795Z"),
  processed_times: 1,
  status: 2, // status of processing (new, failed, in_progress)
  updated_at: ISODate("2022-03-22T11:59:27.795Z"),
  last_picked_at: ISODate("2022-03-22T11:59:32.164Z") // date of last "pick" for processing
}
The current processing flow for documents in the collection looks like this:
- A consumer (separate process) reads events from the event bus and writes them into the processing collection with status: new
- A processor (separate worker process), with a short timeout, “picks” 50 documents with status new from the processing collection. Basically, it runs a loop until 50 documents (or fewer, if no more are present in the collection) have been retrieved for processing. The query for “picking” each document is:
  {
    findAndModify: 'processing',
    new: true,
    query: { status: 0, last_picked_at: { '$exists': false } }, // status 0 = new
    sort: { _id: 1 },
    update: {
      '$set': {
        status: 2, // in_progress
        last_picked_at: current_date
      },
      '$inc': { processed_times: 1 }
    }
  }
- After fetching all the documents, the service processes each of them and runs a transaction in which it stores all processed documents in a different collection and removes them from the processing collection.
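To make the picking step easier to reason about, here is a minimal sketch of the loop described above. It is not the actual service code: the names buildPickCommand and pickBatch are made up for illustration, and runCommand is assumed to be any function that sends a command to the database and returns the findAndModify reply (whose value field is null when nothing matched).

```javascript
// Builds the findAndModify command from the post for a given "pick" time.
function buildPickCommand(now) {
  return {
    findAndModify: 'processing',
    new: true,
    query: { status: 0, last_picked_at: { $exists: false } }, // status 0 = new
    sort: { _id: 1 },
    update: {
      $set: { status: 2, last_picked_at: now }, // status 2 = in_progress
      $inc: { processed_times: 1 },
    },
  };
}

// Picks documents one at a time until batchSize is reached or nothing is left.
// runCommand is injected (hypothetical), so the loop can be tried without a server.
function pickBatch(runCommand, batchSize = 50) {
  const picked = [];
  while (picked.length < batchSize) {
    const reply = runCommand(buildPickCommand(new Date()));
    if (!reply.value) break; // no more matching documents
    picked.push(reply.value);
  }
  return picked;
}

// In the mongo shell this could be called as:
//   const batch = pickBatch((cmd) => db.runCommand(cmd));
```

Note that with sort: { _id: 1 } every worker running this loop asks for the same oldest matching document, so each pick is a separate single-document update against the same head of the collection.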
All operations (including transactions) use the following read/write concerns:
- write: {w: majority, j: false}
- read: majority
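For reference, the transaction step with these concerns could be sketched roughly as below, using the mongo shell session helpers. This is only an illustration of the flow, not the real code: moveProcessed, the dbName parameter and the target collection name 'processed' are assumptions (the post only says "a different collection"), and retrying on transient transaction errors is left out.

```javascript
// Moves already processed documents out of 'processing' inside one transaction.
// session is expected to be a mongo shell session, e.g. db.getMongo().startSession().
function moveProcessed(session, dbName, processedDocs) {
  const database = session.getDatabase(dbName);
  const ids = processedDocs.map((doc) => doc._id);

  session.startTransaction({
    readConcern: { level: 'majority' },
    writeConcern: { w: 'majority', j: false },
  });
  try {
    // 'processed' is a hypothetical name for the target collection.
    database.getCollection('processed').insertMany(processedDocs);
    database.getCollection('processing').deleteMany({ _id: { $in: ids } });
    session.commitTransaction();
  } catch (err) {
    // Roll back both writes if either of them (or the commit) fails.
    session.abortTransaction();
    throw err;
  }
}
```

The delete inside the transaction touches the same documents that the picking query has just updated, so both code paths write to the processing collection.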
When we first ran the service everything was fine, but on the 4th day we noticed a burst of write conflicts from the findAndModify operation that picks documents for processing. CPU usage on the primary was at 100%, and the execution time of these operations kept getting higher and higher. We switched over the primary node and the overload went away, but we experienced the same behaviour again after 5 days.
I’ve tried to understand what is wrong with the code and why such bursts of write conflicts appear after a long period of normal operation with few conflicts, but I’ve run out of guesses.
What would you suggest for identifying the potential causes in a case like this?