Mongodump Query on Timeseries Collection

Hi all,

I am trying to schedule backups on a time series collection. On top of daily full backups, I would like to schedule a backup every hour covering the past hour of data (i.e. at 3pm, create a dump of the data from 2pm-3pm).

I set up my shell script as follows:

# Start of the one-hour window, in milliseconds since the epoch (GNU date)
EPOCH_DATE=$(date -d '1 hour ago' '+%s%3N')
for COLLECTION in "${COLLECTIONS[@]}"
do
  mongodump \
    --db=database \
    --collection="$COLLECTION" \
    --query="{ \"createdOn\": { \"\$gte\": { \"\$date\": $EPOCH_DATE } } }" \
    --out=/dir/backup/
done

But I am getting the following error:

Failed: cannot process query [{createdOn [{$gte 1670480228792}]}] for timeseries collection database.collection. mongodump only processes queries on metadata fields for timeseries collections.

Is there a better way of achieving the goal of doing scheduled backups? Or should I change the way I structure my script?

Appreciate any help on the matter!

Hi @Daryl_Ang and welcome to the MongoDB Community :muscle: !

I reproduced the problem here with an example and, indeed, I get the same error message, which is documented in the mongodump documentation.

I reached out to a couple of colleagues who might have an idea for you, but I don’t have a smart idea at the moment. :confused:

I’d love it if someone could find a workaround, though.

Cheers,
Maxime.

I spoke with @Tim_Fogarty, who worked on mongodump.

He explained to me why this constraint exists and, to be fair, it’s quite complex. It’s due to the low-level implementation of timeseries collections in MongoDB, and explaining all those details won’t help here. The conclusion is that there is currently no workaround using mongodump.

If you do not need a point-in-time snapshot using the oplog (which I think is the case here), you can use mongoexport in a script and achieve basically the same thing. It won’t be as fast as mongodump, but at least it should work properly.
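
For example, here is a rough, untested sketch of the hourly export, keeping the database/collection/createdOn placeholder names from your script (it assumes GNU date for the “1 hour ago” arithmetic, and mongoexport wants the --query filter in Extended JSON):

# Start of the one-hour window, in milliseconds since the epoch
EPOCH_DATE=$(date -d '1 hour ago' '+%s%3N')
mongoexport \
  --db=database \
  --collection=collection \
  --query="{ \"createdOn\": { \"\$gte\": { \"\$date\": { \"\$numberLong\": \"$EPOCH_DATE\" } } } }" \
  --out=/dir/backup/collection-$(date '+%Y%m%d%H').json

The resulting JSON file can be loaded back with mongoimport if you ever need to restore that hour.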

Otherwise, you can still write a script that uses find() with the appropriate filter to retrieve these docs, but it’s a bit more work.
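
A minimal sketch of that route with mongosh could look like this (untested again; the connection string and names are placeholders, and EJSON is the Extended JSON helper that mongosh exposes):

EPOCH_DATE=$(date -d '1 hour ago' '+%s%3N')
mongosh "mongodb://localhost:27017/database" --quiet --eval "
  db.collection.find({ createdOn: { \$gte: new Date($EPOCH_DATE) } })
    .forEach(doc => print(EJSON.stringify(doc)));
" > /dir/backup/collection-find.json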

I hope this helps.
Cheers,
Maxime.

Thanks for the assistance, Maxime!

Could I also use this opportunity to ask whether there are any intended approaches to backing up timeseries data?

The first trivial idea that comes to mind would be to mongodump the entire collection.
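
Without a --query the timeseries restriction doesn’t apply, so that one boils down to something like (placeholder names again):

mongodump --db=database --collection=collection --out=/dir/backup/full/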

It would probably be faster to use a disk snapshot, though, depending on your production environment.

A random idea that could be worth exploring, though, would be to add an extra field to the metadata, with a different value every hour, and use this field for your query. As it’s in the metadata this time, it would work with --query.

I guess you have more than one client, so you could come up with an algorithm that generates a new unique ID every hour (the first that comes to mind: day 1 goes from 1 to 24, day 2 from 25 to 48, etc.).
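
For instance, here is a sketch of the backup side of that idea. It assumes the clients stamp each document with an hour counter in the metaField (metadata.hourBucket is a made-up name), and it uses hours elapsed since the Unix epoch as the unique hourly ID, which is the same trick as the 1-24 / 25-48 numbering:

# Unique ID of the previous hour: hours elapsed since the Unix epoch (GNU date)
HOUR_BUCKET=$(( $(date -d '1 hour ago' '+%s') / 3600 ))
mongodump \
  --db=database \
  --collection=collection \
  --query="{ \"metadata.hourBucket\": $HOUR_BUCKET }" \
  --out=/dir/backup/

As the filter only touches the metaField, mongodump should accept this --query.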

Cheers,
Maxime.
