Is it possible to move WiredTiger files to different file system?

Wernfried_Domscheit · April 8, 2021, 12:36pm

I have a MongoDB Sharded cluster with a hybrid storage, i.e. some fast SSD and some slower and cheaper spinning rust.

For archiving I like to move some data to the slower disc. For legal reason we have to keep them, they are queried only occasionally.

In principle I would do it like this:

mongo --eval "sh.stopBalancer()" mongos-host:27017

# Repeat below on each shard host:
mongo --eval "db.fsyncLock()" localhost:27018

cp /mongodb/data/collection/3109--6926861682361166404.wt /slow-disc/mongodb/collection/3109--6926861682361166404.wt
ln --force --symbolic /mongodb/data/collection/3109--6926861682361166404.wt /slow-disc/mongodb/collection/3109--6926861682361166404.wt

mongo --eval "db.fsyncUnlock()" localhost:27018

# After all shards are done:
mongo --eval "sh.startBalancer()" mongos-host:27017

The indexes shall remain on the fast disc.

Would this be a reliable way to archive my data? What happens if the collection is read while move?

Another approach would be a file system like this:

/mongodb/data/collection
/mongodb/data/index
/mongodb/archive/collection -> /slow-disc/mongodb/collection 
/mongodb/archive/index

And then move the collection as this:

mongo --eval 'sh.shardCollection("archive.coll", shardKey)' mongos-host:27017
mongodump --uri "mongodb://mongos-host:27017" --db=data --collection=coll --archive=- | mongorestore --uri "mongodb://mongos-host:27017" --nsFrom="data.coll" --nsTo="archive.coll" --archive=-
mongo --eval 'db.getSiblingDB("data").getCollection("coll").drop()' mongos-host:27017

Main disadvantage: the balancer has to distribute the whole data across the shards. It creates additional load on my shared cluster.

Which approach would you recommend?

Joe_Drumgoole · April 13, 2021, 6:59pm

If Atlas is an option you could use Online Archive to archive you data automatically to S3. Much cheaper and yet still queryable.

Stennie_X · April 14, 2021, 7:50am

Hi @Wernfried_Domscheit,

The Online Archive option for Atlas is optimised for archival storage of data that you still may want to query occasionally.

Since you have a self-managed sharded cluster, I would look into using Zone Sharding to influence data locality.

The use case you’ve described is one of the example scenarios: Tiered Hardware for Varying SLA or SLO.

Zone sharding is part of the normal sharded architecture, so application access can continue concurrently with rebalancing activity. This approach allows a DBA to influence data locality and there is no downtime if an admin needs to adjust allocation between existing resource tiers or provision additional fast or slow hardware.

It is possible to use symlinks, but this may affect your backup strategy (for example, if you are using filesystem snapshots) because a single MongoDB dbPath will span multiple filesystems. I recommend using an agent-based backup approach (i.e. MongoDB Ops Manager or Cloud Manager) to avoid complications of backing up shards spanning multiple filesystems.

I would also stop the mongod process (not just fsyncLock) when changing symlinks. Removing or replacing files for a running process may lead to unexpected outcomes, and you’re already stopping all writes to make this administrative change.

If you are planning on doing this frequently (and decide to take the filesystem approach rather than zone sharding), I would consider using database-level mount points via storage options like directoryPerDB and directoryForIndexes. Managing symlinks at a database granularity is less disruptive and error prone than collection-level changes.

Note: In order to change storage level options that affect the physical arrangement of files, you will need to rebuild the data files for your shard replica set members via initial sync. You can do so in a rolling fashion: change the storage options on one secondary at a time, wait for initial sync to complete, and eventually step down and upgrade the primary.

Regards,
Stennie

Wernfried_Domscheit · April 14, 2021, 8:34am

Some more details:
Atlas is no option. Please don’t start a discussion about it, we analyzed it carefully together with MongoDB and there are several reasons against Atlas (e.g. mongoimport does not support Client-Side Field Level Encryption).

The application is rather big. It generates at peak 70’000 documents/second which gives about 100GB (storage size) data every day. For 2 days this data shall be hosted on fast disk, because is it frequently used and potentially modified. Data has to be kept for 6 Months, i.e. in total the DB has a storage size of 20TB - distributed over 6 Shards. The Shards are built as PSA-Replica Sets.

After 2 days I like to move the “old” data on daily basis to the slower archive storage, because then it is used only rare.

So, I don’t see any option for Zoned Sharding, because the same data needs to be first stored on fast hardware and then on slower (i.e. cheap) hardware.

Running a full initial sync of 20 TB every day might not be the best option.

Best Regards
Wernfried

Stennie_X · April 14, 2021, 9:44am

Hi Wernfried,

With zoned sharding you would update the zone ranges on a schedule (eg daily) so older data would end up migrating from recent to archive shards. You would not have to coordinate this change across every member of your sharded cluster (as you will for filesystem symlinks).

This approach does presume that you would want to query your recent & archived data as a single sharded collection, rather than querying across multiple collections.

The extra info you provided in your latest response is that the archived data only needs to be retained for 6 months and indicates that you are concerned about the daily and total volume of data.

If you have already modelled your data so you can archive based on a collection naming convention, your first approach (symlinks) sounds more appropriate for your use case than dumping & restoring data (which includes rebuilding indexes).

However, choice of an approach is up to you. I’m just sharing suggestions based on the information you have provided.

I expect you are already aware, but there are some consequences of arbiters that will have a performance impact if you are routinely taking data-bearing members down for maintenance issues like updating symlinks.

For more background, please see my comment on Replica set with 3 DB Nodes and 1 Arbiter - #8 by Stennie_X.

Definitely not! My mention of initial sync was in the context of a one-off operation if you wanted to change your storage options to use directoryPerDB and/or directoryForIndexes. Grouping of related files by database or type can be helpful if you want to tune different mount point options.

If you are fine maintaining symlinks at a file level, you can skip any notions of changing storage options.

Regards,
Stennie

Wernfried_Domscheit · April 14, 2021, 10:14am

Yes, I did not consider update the zone ranges. I will give it a try.

I create one collection per day. Deleting old data from one big collection takes far to much time. Dropping a daily collection after 6 Months takes only a few seconds.

MajorityReadConcern is disabled, of course. Having a PSS (1 Primary + 2 Secondary) would be nice but is a significant cost-driver due to storage requirement.

Thanks for your suggestions!

Wernfried

Wernfried_Domscheit · April 21, 2021, 7:23am

Just for information, the Zoned Sharding looks promising and would be the preferred method for me.
However, it fails for larger collections, see https://jira.mongodb.org/browse/SERVER-56116

Let’s hope we will get a fix for it.
Wernfried

Asya_Kamsky · May 4, 2021, 6:48pm

You don’t mention it but do you have any other parameters/settings on your shards?

For instance, maxSize or something else? Without seeing the full config database, it’s difficult to speculate about what might be going on.

Asya

Wernfried_Domscheit · May 5, 2021, 8:40pm

No, I am not aware of any. However, today I tried this procedure again and it was working. I am not able to reproduce the error.

Best Regards
Wernfried