Data retention on a read-only replica

We are currently maintaining one year of data in MongoDB, and the size of the DB is growing beyond 4GB. So we are planning to retain only 3 months of data on our master and replicas; however, we want one replica that will retain one year of data. How do we set up that one-off replica to retain one year of data while the other replicas and the master retain 3 months of data?

Are you using MongoDB Atlas, or self-hosted? It sounds like you’re trying to manage data retention differently on different replicas, which is atypical and can be challenging… MongoDB replication keeps all data in sync across every member of a replica set, so having one replica with a longer retention window requires a lil creativity.

Here’s one way to approach this, assuming you’re running MongoDB on-prem.

Use a Hidden Secondary

MongoDB lets you configure hidden secondaries. A hidden secondary holds the same data as the primary but can never become primary and is invisible to client applications, although you can connect to it directly and query it.

Here’s how you might set it up…

Create a Hidden Secondary

  1. Add a new secondary to the replica set, if you haven’t already.
rs.add({
  host: "<your-hidden-replica-host>",
  priority: 0,
  hidden: true
})

priority: 0 prevents the member from ever becoming primary, and hidden: true keeps it invisible to client applications.
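After the reconfig, you can sanity-check the member flags from mongosh. This is just a sketch; it assumes you’re connected to the primary:

```javascript
// Verify the new member is hidden and has priority 0 (cannot be elected):
rs.conf().members.forEach(m =>
  print(m.host, "hidden:", m.hidden === true, "priority:", m.priority)
);
// The hidden member should report: hidden: true, priority: 0
```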

2. Disable the TTL Monitor on the Hidden Secondary

A TTL index, like any index, is replicated to all nodes by default. To avoid TTL purging old data on your hidden secondary, you need to disable the TTL monitor process on this node.

  • Connect to the hidden secondary and run:
db.adminCommand({setParameter: 1, ttlMonitorEnabled: false})

This should ensure your hidden replica retains all its data and is not affected by the TTL deletions happening on the other nodes.
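One caveat: setting a parameter via db.adminCommand({setParameter: …}) only lasts until the mongod restarts; to make it permanent you’d also set it in that node’s config file under setParameter: ttlMonitorEnabled: false. You can confirm the runtime value like this:

```javascript
// mongosh, connected directly to the hidden secondary:
db.adminCommand({ getParameter: 1, ttlMonitorEnabled: 1 })
// Expected: { ttlMonitorEnabled: false, ok: 1 }
```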

3. Set Up TTL on the Primary

To automatically delete older documents from your primary (and other replicas except the hidden one), you can use a TTL index. Let’s assume you have a createdAt field in your documents that records when they were inserted.

Create a TTL index like this:

db.collection.createIndex({ "createdAt": 1 }, { expireAfterSeconds: 7889238 }) // Roughly 3 months in seconds
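As a quick sanity check on that constant, 7889238 is one quarter of an average Gregorian year (365.2425 days), i.e. roughly 3 months:

```javascript
// 7889238 seconds = one quarter of an average Gregorian year (365.2425 days)
const SECONDS_PER_DAY = 86400;
const quarterYear = Math.round((365.2425 * SECONDS_PER_DAY) / 4);
console.log(quarterYear); // 7889238
```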

4. Verify Your Setup

Once the TTL index is in place on your primary and the hidden secondary has TTL disabled, you’ll have:

  • The primary and its regular secondaries retaining 3 months of data (thanks to TTL).
  • The hidden secondary retaining the full year’s data (since TTL is disabled on it).

Another way to approach this is with MongoDB Atlas Online Archive…

Here’s how you can leverage MongoDB Atlas Online Archive for your scenario:

1. Use Online Archive to Offload Older Data

Instead of keeping older data on a separate replica, you can set up Atlas Online Archive to move data older than 3 months from your primary database into an archive. This archived data is still queryable; while it’ll be a bit slower, you won’t lose access to it.

Here’s how this might work:

  • The primary and replica nodes would retain the last 3 months of data.
  • Older data (older than 3 months but less than a year, in your case) would automatically move to the Online Archive, which is backed by cloud storage.
  • This reduces the size of your main database and keeps performance high without losing access to the old data.
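For reference, a rule like this can also be created through the Atlas Administration API; the request body would look roughly like the following. The dbName, collName, and field name are illustrative, and the criteria fields follow the API’s date-based rule format:

```json
{
  "dbName": "mydb",
  "collName": "events",
  "criteria": {
    "type": "DATE",
    "dateField": "createdAt",
    "dateFormat": "ISODATE",
    "expireAfterDays": 90
  }
}
```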

2. Query Archived Data Seamlessly

When querying, you can still access archived data. MongoDB Atlas will automatically merge results from your operational database and your archive, so you can query both data sets as if they’re still in the same collection.

This means:

  • You won’t have to worry about managing a separate hidden replica for one-year-old data.
  • Queries can fetch both current (3-month) and archived (older) data seamlessly.

3. Set Up Online Archive in Atlas

  1. Define the Archive Policy: You can create an archive rule based on a field like createdAt (or any other field that makes sense for your data). You’ll specify that any document older than 3 months should be archived.

In Atlas:

  • Go to Data Services -> Online Archive.
  • Choose the collection you want to archive data from.
  • Define the archive rule. For example, “archive any documents older than 3 months based on the createdAt field.”
  2. Set Storage Duration: You can configure how long the data is stored in the archive before it’s deleted entirely (if you want). In your case, you could keep the data for up to 1 year before purging it.

  3. Query Your Data: MongoDB handles querying archived data transparently. Atlas gives you a connection string that covers both stores; when you run a query through it, results are automatically pulled from both the active database and the archive, so you don’t need to change how you access the data.

4. Benefits of Using Online Archive

  • Cost-Effective: Archiving data to cloud object storage (like S3) is cheaper than keeping it in your high-performance operational DB.
  • Performance Boost: Your primary replica set stays lean and performs well with a smaller working data set (3 months).
  • Scalability: You don’t need to manage additional infrastructure like hidden secondaries.
  • Compliance & Retention: You can easily meet data retention policies without cluttering your primary database.

Hope this helps… let us know how you make out.


Too bad a reply has to be at least 20 words. I had only 2.

Wonderful, Thanks.


Hi Michael

I tried your suggestion with the hidden secondary. I have the TTL monitor disabled on the hidden secondary and created the TTL index on the primary to retain 3 months of data. But the parameter does not have any effect: it went ahead and deleted the entries from the hidden secondary as well.

Hidden Secondary:

ttlMonitorEnabled: false,
ttlMonitorSleepSecs: 60,

Active Secondary:

ttlMonitorEnabled: true,
ttlMonitorSleepSecs: 60,

Can you please let me know what is going wrong here?

Any update? I also want to know it.

Seems like TTL deletion is all controlled on the master: when replication happens, the deletes are synced as well, so the ttlMonitorEnabled: false flag on the secondary has no impact. So we are moving towards a standalone instance, with scripts to copy the deltas onto that standalone node.

I bumped into this problem years ago. Even though there is an option not to sync indexes, by using { buildIndexes: false } when adding a member to the replica set, TTL deletion always happens on the PRIMARY and is converted into delete operations in the oplog, which are then synced to the SECONDARYs. This way the whole replica set always has the same set of data, which is the idea behind the cluster approach.

You can actually test this on your replica set using the following code in mongosh:

db.getSiblingDB('test').getCollection('abcd').insertOne({ time: ISODate() });
db.getSiblingDB('test').getCollection('abcd').createIndex({ time: 1 }, { expireAfterSeconds: 30 });
db.getSiblingDB('test').getCollection('abcd').find();
// Returns 1 doc: [ { _id: ObjectId('671f6dada7485dfd07305aa8'), time: ISODate('2024-10-28T10:55:41.848Z') } ]

Wait 30 seconds

db.getSiblingDB('test').getCollection('abcd').find();
// Return 0 docs
db.getSiblingDB('local').getCollection('oplog.rs').find({ ns: 'test.abcd' });

You will get these 2 oplog entries:

[
  {
    ...
    op: 'i',
    ns: 'test.abcd',
    o: {
      _id: ObjectId('671f6dada7485dfd07305aa8'),
      time: ISODate('2024-10-28T10:55:41.848Z')
    },
    o2: { _id: ObjectId('671f6dada7485dfd07305aa8') },
    wall: ISODate('2024-10-28T10:55:41.872Z'),
    ...
  },
  {
    op: 'd',
    ns: 'test.abcd',
    o: { _id: ObjectId('671f6dada7485dfd07305aa8') },
    wall: ISODate('2024-10-28T10:56:12.707Z'),
    ...
  }
]

So sadly I chose to give up and wrote a little Node.js program to copy my data manually, stamping each copy with a custom __archiveTime field and creating a TTL index on __archiveTime, to ensure only the copied data is deleted.
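If it helps anyone, the stamping idea can be sketched like this. It’s only a sketch: the collection names, the one-year window, and the driver calls in the comments are illustrative, not the actual program:

```javascript
// Copy documents to an archive collection, stamping each copy with
// __archiveTime. A TTL index on __archiveTime (created once on the archive
// collection) then deletes only the copies; the source docs, which lack
// __archiveTime, are never touched by that index.
//
// With the official driver, the surrounding calls would look roughly like:
//   const docs = await source.find({ time: { $lt: cutoff } }).toArray();
//   await archive.insertMany(docs.map(d => stampArchiveTime(d)));
//   await archive.createIndex({ __archiveTime: 1 },
//                             { expireAfterSeconds: 365 * 86400 });

function stampArchiveTime(doc, now = new Date()) {
  // Leave the original document untouched; only the copy gets the flag.
  return { ...doc, __archiveTime: now };
}

// In-memory demo of the stamping step:
const docs = [{ _id: 1, time: new Date("2024-10-28T10:55:41Z") }];
const copies = docs.map(d =>
  stampArchiveTime(d, new Date("2025-01-01T00:00:00Z"))
);
console.log(copies[0].__archiveTime.toISOString()); // 2025-01-01T00:00:00.000Z
console.log("time" in copies[0]); // true — original fields are preserved
```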
