Import UUID from CSV using "mongoimport"

We would like to import UUIDs from CSV using “mongoimport”. According to the documentation it is possible to “type” the columns to import. Unfortunately there is no “type” available for UUIDs. Luckily there is a “binary” type present which enables us to encode the UUID as “base64”.

We were able to encode UUIDs as “base64” based on the following code.

So far so good…

We are able to use “mongoimport” the “–columnsHaveTypes” flag and the fields defined as " uuid.binary(base64)". The problem/challenge is to change the “binary subtype”. Right now the default subtype “0” is being used which results in something like this: “BinData(0,“7tW06cUkR2+lOWIRssit5A==”)”. Is there a way to change the “binary subtype” to “4”?

Thank you.

Hi Thomas,

Welcome to the online MongoDB Developer Community Forums!

In order to get a better understanding of your question, I have two questions in return:

  1. What kind of queries will you run on this data after loading? And how do these queries depend on (the type of) this UUID field?

  2. Why do you prefer to store this field as a UUID “type”? Or now with MongoDB as binary? What influence does it have on the way you can work with the data?

One question I saw on the forum that might be relevant for you is Problem inserting UUID field with binary subType via Atlas web UI

Regards,

Emil Zegers

1 Like

Thank you for getting back to me Emil.

Currently we are storing spatial features in a number of (manually) sharded PostgreSQL databases. All the features are being identified by a globally unique id (uuid). In order to keep track of the features per shard, we need to maintain an index.

We plan to start using MongoDB as a “feature id index” toward the PostgreSQL databases. The collection schema will contain a mapping between the UUID of the feature (key) and a PostgreSQL database (value).

  1. The user of our API will request information about a specific UUID (get feature by id). The UUID will be the main entry into our documents of the collection. The query operator will be most likely “$eq” for a single UUID and possibly “$in” for multiple UUIDs.

  2. The index will contain billions of entries. I assume (I might be wrong) that storing a “UUID” type instead of a plain “string” might have a significant effect on the storage used. I also wonder if there is a performance impact execution the queries mentioned above.

Best regards,
Thomas.

Unfortunately there’s no way to cast values in a CSV to a UUID type with mongoimport. I have opened a Jira ticket to add this functionality. You can see the ticket here: TOOLS-2788. You can track that ticket for updates, but it’s unlikely to be something we’ll work on soon.

It is possible to import UUIDs from a JSON file by using the extended JSON format. JSON files can support all BSON types.

You are correct that using a binary type instead of a plain string will vastly improve the storage used. The documentation says the following:

Index keys that are of the BinData type are more efficiently stored in the index if:

  • the binary subtype value is in the range of 0-7 or 128-135, and
  • the length of the byte array is: 0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, or 32.

The documentation makes no differentiation between binary subtypes 0 and 4, so it might be easier to stick with the 0 subtype. But I’m not an expert on how indexes are implemented, so to double check there’s no difference, you may want to ask a question about that in the Working with Data category.

1 Like