I am fixing some design flaws in quite a large database. The data structure looks similar to this (simplified):
{
"hex_data": "01234567890abcdef"
}
There is binary data, stored as a string in hex format. As the database grew, I realized that storing the data as a hex string is very inefficient, so I want to store it as BinData instead.
I want to create an aggregation pipeline that adds a new field containing the data in binary format and deletes the hex field afterwards. Within the pipeline I tried to add this stage:
{
$set: {
"bin_data": HexData(0, "$hex_data")
}
}
Unfortunately, using HexData(0, "$hex_data") is not accepted by the database; the stage fails with “Stage value is invalid”. Interestingly, loading data using base64 works, as shown below.
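For example, a constant literal like the following is accepted (the base64 string here is just an illustrative encoding of the sample bytes above). The catch is that it writes the same fixed value into every document, since the shell's BinData helper takes a base64 literal and cannot evaluate "$hex_data" as a field path:
{
    $set: {
        "bin_data": BinData(0, "ASNFZ4mrze8=")  // constant bytes, not a per-document conversion
    }
}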
First I would like to congratulate you on your solution, a very clever idea that is uncommon to see. I have also used this in some projects; it helps reduce the document size, but it won’t help much with index size, due to the data type used in the index structure.
Talking about your issue: the current version of MongoDB has no function capable of converting the data to binary inside an aggregation pipeline, but if you wait just a little longer, version 8.0 is about to be released and will provide this functionality. You can see it in BinData Conversion.
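As a sketch of what that could look like on 8.0 (assuming the documented $convert support for binData with a "hex" format; the collection name stuff is a placeholder), the whole migration becomes a single update with an aggregation pipeline:
db.getCollection('stuff').updateMany({ hex_data: { $exists: true } }, [
    {
        $set: {
            bin_data: {
                $convert: {
                    input: '$hex_data',
                    to: { type: 'binData', subtype: 0 },  // generic binary subtype
                    format: 'hex'                         // interpret the input string as hex digits
                }
            }
        }
    },
    { $unset: 'hex_data' }  // drop the old hex field in the same pass
])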
Thanks for your quick answer, and for the reference to the new version of MongoDB.
I’m afraid I cannot wait for the release of the upcoming version, so I will write a script instead which traverses and updates all documents. I had just hoped to find a faster solution.
If you don’t want to wait, you could do this as a bulk update, queuing 1000 updates at a time and then submitting the batch. You could also filter on the new field not existing yet, so you can run the script, kill it as needed, and resume processing the unprocessed documents.
(Very) Rough Pseudocode
// Client-side conversion: the shell's HexData helper builds a BinData value
// from a hex string, so it works here even though it fails inside a pipeline.
function FuncToConvert(hexString) {
    return HexData(0, hexString);
}

// Only fetch documents that have not been converted yet, so the script can be
// killed and resumed at any point.
const feedCursor = db.getCollection('stuff').find(
    { 'ConvertedField': { $exists: false } },
    { _id: 1, BinString: 1 }
);
var batchSize = 1000;
var updateCount = 0;
var bulkOp = db.getCollection('stuff').initializeUnorderedBulkOp();
while (feedCursor.hasNext()) {
    updateCount++;
    let currData = feedCursor.next();
    bulkOp.find({ _id: currData._id }).updateOne(
        { $set: { 'ConvertedField': FuncToConvert(currData.BinString) } }
    );
    // Submit a full batch, then start a fresh bulk operation.
    if (updateCount % batchSize == 0) {
        bulkOp.execute();
        bulkOp = db.getCollection('stuff').initializeUnorderedBulkOp();
    }
}
// Flush the final, partially filled batch, if any updates remain queued.
if (updateCount % batchSize > 0) {
    bulkOp.execute();
}
You could also split the work by something else, say the timestamp embedded in the ObjectId, or something similar, as sketched below.
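For instance (a sketch assuming mongosh, where ObjectId.createFromTime() builds an ObjectId from a Unix timestamp in seconds; the cutoff date is just an example), the find() in the script above could be limited to older documents:
// Restrict one run to documents whose _id was generated before the cutoff.
const cutoff = ObjectId.createFromTime(new Date('2024-01-01').getTime() / 1000);
const feedCursor = db.getCollection('stuff').find(
    { _id: { $lt: cutoff }, 'ConvertedField': { $exists: false } },
    { _id: 1, BinString: 1 }
);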