Our production MongoDB deployment is a PSA (primary, secondary, arbiter) replica set running version 4.2 with read concern "majority" disabled. I have a service that reads the change stream for a specific collection. Today, with no apparent trigger, it started logging this error:
- Error occurred and seen within onError, error=[{}]
com.mongodb.MongoQueryException: Query failed with error code 136 and error message 'CollectionScan died due to position in capped collection being deleted. Last seen record id: {number} on server
After this, the stream tried to restart from the last persisted resume token, which also failed with an error:
- Command failed with error 280 (ChangeStreamFatalError): 'cannot resume stream; the resume token was not found. {_data: "{resume-token}"}' on server. The full response is {"errorLabels": ["NonResumableChangeStreamError"], "operationTime": {"$timestamp": {"t": 1590353747, "i": 3}}, "ok": 0.0, "errmsg": "cannot resume stream; the resume token was not found.
Note:
- The last persisted resume token was at most 5 minutes old.
- The oldest oplog entry for the collection I was watching is still 2 days old, so the token should be well within the oplog window.
I'm not sure how the resume token got invalidated. Would resuming from a timestamp be a better idea?
Is there a best practice the service should follow to handle this kind of error?
It would be a great help to get some background on these errors and some best practices.
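To make the question concrete, here is a minimal sketch of the fallback logic I'm considering. It is plain Java with no driver calls, and the class, enum, and method names are all hypothetical. The only values taken from the logs above are the error codes 136 and 280. My assumption (not verified) is that the three actions would map to `resumeAfter(...)`, `startAtOperationTime(...)`, and a full re-read of the collection respectively, and that the service persists an operation time alongside the resume token.

```java
// Hypothetical sketch: decide how to restart a change stream after a failure.
// Nothing here talks to MongoDB; it only encodes the decision I have in mind.
final class ChangeStreamResumePolicy {

    enum Action {
        RESUME_AFTER_TOKEN,       // retry with the last persisted resume token
        START_AT_OPERATION_TIME,  // fall back to the last persisted timestamp
        FULL_RESYNC               // nothing usable left: re-read the collection
    }

    // Error codes as they appear in the logs above.
    static final int CAPPED_POSITION_LOST = 136;      // "CollectionScan died due to position in capped collection being deleted"
    static final int CHANGE_STREAM_FATAL_ERROR = 280; // "the resume token was not found"

    private ChangeStreamResumePolicy() {
    }

    /**
     * @param errorCode        server error code from the failed stream
     * @param hasToken         true if a resume token was persisted
     * @param hasOperationTime true if a timestamp was persisted next to the token
     */
    static Action decide(int errorCode, boolean hasToken, boolean hasOperationTime) {
        // Assumption: once the server answers 280 for a token, retrying the
        // same token will not help, so it is treated as unusable.
        boolean tokenUsable = hasToken && errorCode != CHANGE_STREAM_FATAL_ERROR;
        if (tokenUsable) {
            return Action.RESUME_AFTER_TOKEN;
        }
        if (hasOperationTime) {
            return Action.START_AT_OPERATION_TIME;
        }
        return Action.FULL_RESYNC;
    }
}
```

With this, the first failure (136) would still retry with the token, and the follow-up 280 would switch to the timestamp, or to a full resync if no timestamp was stored. I don't know whether the timestamp path would actually succeed where the token failed, which is part of what I'm asking.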