I am fixing some design flaws in quite a large database. The data structure looks similar to this (simplified):
{
"hex_data": "01234567890abcdef"
}
There is binary data, stored as a string in hex format. As the database grew, I realized that storing the data as a hex string is very inefficient, so I want to store it as BinData instead.
I want to create an aggregation pipeline that adds a new field containing the data in binary format and deletes the hex field afterwards. Within the pipeline I tried to add this stage:
{
$set: {
"bin_data": HexData(0, "$hex_data")
}
}
Unfortunately, using HexData(0, "$hex_data") is not accepted by the database; the stage fails with “Stage value is invalid”. Interestingly, loading data using base64 works, as shown below.
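For example, a constant literal like the following is accepted (the base64 string here is just an illustrative encoding of the sample bytes above). The catch is that it writes the same fixed value into every document, since the shell's BinData helper takes a base64 literal and cannot evaluate "$hex_data" as a field path:
{
    $set: {
        "bin_data": BinData(0, "ASNFZ4mrze8=")  // constant bytes, not a per-document conversion
    }
}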
First I would like to congratulate you on your solution, a very clever idea that is uncommon to see. I have also used this in some projects; it helps reduce the document size, but it won’t help much with index size, due to the data type used in the index structure.
Talking about your issue: the current version of MongoDB has no function capable of converting the data to binary inside an aggregation pipeline, but if you wait just a little longer, version 8.0 is about to be released and will provide this functionality. You can see it in BinData Conversion.
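As a sketch of what that could look like on 8.0 (assuming the documented $convert support for binData with a "hex" format; the collection name stuff is a placeholder), the whole migration becomes a single update with an aggregation pipeline:
db.getCollection('stuff').updateMany({ hex_data: { $exists: true } }, [
    {
        $set: {
            bin_data: {
                $convert: {
                    input: '$hex_data',
                    to: { type: 'binData', subtype: 0 },  // generic binary subtype
                    format: 'hex'                         // interpret the input string as hex digits
                }
            }
        }
    },
    { $unset: 'hex_data' }  // drop the old hex field in the same pass
])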
Thanks for your quick answer, and for the reference to the new version of MongoDB.
I’m afraid I cannot wait for the release of the upcoming version, so I will write a script instead which traverses and updates all documents. I had just hoped to find a faster solution.
If you don’t want to wait, you could do this as a bulk update, queuing 1000 updates at a time and then submitting the batch. You could also filter on the new field not existing yet, so you can run the script, kill it as needed, and resume processing the unprocessed documents.
(Very) Rough Pseudocode
// Client-side conversion: the shell's HexData helper builds a BinData value
// from a hex string, so it works here even though it fails inside a pipeline.
function FuncToConvert(hexString) {
    return HexData(0, hexString);
}

// Only fetch documents that have not been converted yet, so the script can be
// killed and resumed at any point.
const feedCursor = db.getCollection('stuff').find(
    { 'ConvertedField': { $exists: false } },
    { _id: 1, BinString: 1 }
);
var batchSize = 1000;
var updateCount = 0;
var bulkOp = db.getCollection('stuff').initializeUnorderedBulkOp();
while (feedCursor.hasNext()) {
    updateCount++;
    let currData = feedCursor.next();
    bulkOp.find({ _id: currData._id }).updateOne(
        { $set: { 'ConvertedField': FuncToConvert(currData.BinString) } }
    );
    // Submit a full batch, then start a fresh bulk operation.
    if (updateCount % batchSize == 0) {
        bulkOp.execute();
        bulkOp = db.getCollection('stuff').initializeUnorderedBulkOp();
    }
}
// Flush the final, partially filled batch, if any updates remain queued.
if (updateCount % batchSize > 0) {
    bulkOp.execute();
}
You could also split the work by something else, say the timestamp embedded in the ObjectId, or something similar, as sketched below.
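For instance (a sketch assuming mongosh, where ObjectId.createFromTime() builds an ObjectId from a Unix timestamp in seconds; the cutoff date is just an example), the find() in the script above could be limited to older documents:
// Restrict one run to documents whose _id was generated before the cutoff.
const cutoff = ObjectId.createFromTime(new Date('2024-01-01').getTime() / 1000);
const feedCursor = db.getCollection('stuff').find(
    { _id: { $lt: cutoff }, 'ConvertedField': { $exists: false } },
    { _id: 1, BinString: 1 }
);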