Background:
We create nightly snapshots of EBS Volume of /data mount location of our primary replicaset.
Our goal is to be able to perform a point in time restore of the Mongo database from T-1 FileSysytem Snapshot copy. The database receives multiple updates/deletions/inserts for the records.
Steps we have done so far:
Create new EC2 instance with mongo configuration and attached the Volume created from T-1 day EBS snapshot and mounted on /data
Started the Mongo server as standalone on new EC2 instance.
Created oplog dump from existing running replicaset cluster using below command
I was able to resolve the issue and perform the PITR of mongo 3 node replicaset cluster.
Adding the steps in-case someone else stumbles across this post.
After restoring the T-1 EBS Snapshot of /data mount point in new EC2 instance.
…
Get the latest oplog entry in the new EC2 instance launched from T-1 Snapshot
>use local
>db.oplog.rs.find({op:"i"}).sort({$natural: -1}).limit(1);
# Sample Output for ts value
ts: Timestamp({ t: 1689748934, i: 3 }),
Take oplog dump from existing Running Mongo cluster.
Our oplLog contains entries of last 57 hours; so I do not know why we were having Duplicate key errors despite the fact the oplog should be idempotent. And replaying the oplog again and again should be fine.
We verified the restoration using one document that gets updated very frequently.
Two things that we observed:
a. The transactions were replayed but the oplog entry in the recovered instance was not updated. I am not sure if this is Mongo Server behaviour or not.
b. It took nearly 12 hours just to replay the 14GB oplog on t3.large EC2 instance with /data EBS Volume of gp2 type with 1200 IOPS with no active user connections. This is too long for production environment.
While searching for solutions; we stumbled across
pbm
tool as well. It looks promising, but we did not research much into that implementation.