Feedback on CV database schema: embedded data versus references

Thibaut_Voirand · April 24, 2024, 7:48am

Hi,

I am setting up a database to manage computer vision training data, and I would appreciate your feedback on my current schema and approach.

Current setup

The data consists of images (not stored in the database, I only store their paths as strings), annotations (equivalent to a labelling operation performed by a human operator on one image), and labels (their geometries are stored directly in the database).
Initially, I used separate collections for images, annotations, and labels, as shown below:

# Images collection:
{
    "_id": ObjectId('10'),
    "image_name": "DSC04799",
    "path": "/path/to/DSC04799.tif"
},
...

# Annotations collection:
{
    "_id": ObjectId('11'),
    "class": "Cat",
    "version": 1,
    "image_id": ObjectId('10'),
},
...

# Labels collection:
{
    "_id": ObjectId('12'),
    "geometry": [1, 2, 3, ...],
    "annotation_id": ObjectId('11'),
},
...

After reading the MongoDB documentation on embedding vs. referencing, I decided to switch to a single “images” collection with embedded annotations and labels:

{
    "_id": ObjectId('10'),
    "image_name": "DSC04799",
    "path": "/path/to/DSC04799.tif",
    "annotations":[
        {
            "_id": ObjectId('11'),
            "class": "Cat",
            "version": 1,
            "labels": [
                {
                    "_id": ObjectId('12'),
                    "geometry": [1, 2, 3, ...],
                },
                ...
            ]
        },
        ...
    ]
},
...

Code snippets

With the previous schema, I could easily check whether an annotation already existed in the database before inserting it (with pymongo):

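from bson import ObjectId
# `class` is a reserved word in Python, so the variable is named class_name
annotation_doc = annotations_collection.find_one(
    {"image_id": image_id, "class": class_name, "version": version}
)
if annotation_doc is None:
    # insert_one returns an InsertOneResult; the new _id is in .inserted_id
    annotation_id = annotations_collection.insert_one(
        {"_id": ObjectId(), "image_id": image_id, "class": class_name, "version": version, "labels": labels}
    ).inserted_id
else:
    annotation_id = annotation_doc["_id"]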

After the schema change, the equivalent operation became more complex:

# Look for annotation in DB, and insert it if not already present
# (`class` is a reserved word in Python, hence class_name)
pipeline = [
    {"$match": {"_id": image_id}},
    {"$unwind": "$annotations"},
    {"$match": {"annotations.class": class_name, "annotations.version": version}},
    {"$project": {"_id": "$annotations._id"}},
]
cursor = images_collection.aggregate(pipeline)
try:
    annotation_id = next(cursor)["_id"]
except StopIteration:
    # update_one returns an UpdateResult, not the new _id, so generate the _id first
    annotation_id = ObjectId()
    images_collection.update_one(
        {"_id": image_id},
        {"$push": {"annotations": {"_id": annotation_id, "class": class_name, "version": version, "labels": labels}}},
    )

Concerns and questions

Since switching to the new schema, I find the queries more complex, and I’m wondering if I’m on the right track. Are these complex queries typical of MongoDB, or am I facing a learning curve?

I’m also curious about the pros and cons of using embedding in this scenario. Does the updated schema seem appropriate for my use case? Should I consider going back to separate collections?

Thank you for your help and guidance!

Craig_Crevola · April 29, 2024, 6:31am

Hi Thibaut

From my experience, I generally use a couple of heuristics to decide when to embed and when to extract to a separate collection. If the data to be embedded is unbounded, then you must extract; otherwise you run the risk of breaching the 16 MB document size limit. From there, I look at how the data is used: is the system read-biased or write-biased? If it’s read-heavy, I’d opt for embedding, which avoids the lookups otherwise needed to assemble a full document. The more write-heavy it is, the more I lean towards separate collections (for me, anyway). You can use findOneAndUpdate to bring your update down to one command; see db.collection.findOneAndUpdate() in the MongoDB Manual v7.0.
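
As a rough illustration of the single-command idea, here is a minimal pymongo sketch built on find_one_and_update. The helper names are hypothetical, and having the filter itself check for an existing annotation with $not and $elemMatch is an assumption about what fits your data, not something taken from your code. `images` stands for your images collection, and `annotation` is a dict that already carries its own "_id".

```python
def absent_annotation_filter(image_id, class_name, version):
    """Match the image only if it has no annotation with this class and version."""
    return {
        "_id": image_id,
        "annotations": {
            "$not": {"$elemMatch": {"class": class_name, "version": version}}
        },
    }


def push_annotation_if_absent(images, image_id, annotation):
    """Append `annotation` to the image unless an equivalent one is already embedded.

    Returns True if the annotation was pushed, and False if no document matched
    (the annotation is already present, or the image does not exist).
    """
    matched = images.find_one_and_update(
        absent_annotation_filter(image_id, annotation["class"], annotation["version"]),
        {"$push": {"annotations": annotation}},
        projection={"_id": 1},  # only the _id of the matched image is needed
    )
    return matched is not None
```

Because the check and the $push happen in one operation on one document, two writers racing on the same image should not both insert the same class/version pair. A False return does not distinguish "already present" from "image not found", so a separate lookup is still needed if that matters to you.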

I use a combination of embedding and separate collections in my projects, depending on the system’s functional and non-functional requirements.

Again, it really depends on your data use cases, the amount of data you expect to store, and how complex the code is to maintain. You also need to factor in the cost of maintaining and running the system.
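
On the code-maintenance point, the existence check against the embedded schema may not need an aggregation pipeline at all. The sketch below uses a plain find_one with $elemMatch in the query and the positional `$` operator in the projection. The function names are hypothetical, and `images` is assumed to be your images collection.

```python
def annotation_lookup(image_id, class_name, version):
    """Build the query and projection that select one embedded annotation."""
    query = {
        "_id": image_id,
        "annotations": {"$elemMatch": {"class": class_name, "version": version}},
    }
    # The positional operator keeps only the first array element matched by the query.
    projection = {"annotations.$": 1}
    return query, projection


def find_annotation_id(images, image_id, class_name, version):
    """Return the embedded annotation's _id, or None if it is absent."""
    query, projection = annotation_lookup(image_id, class_name, version)
    doc = images.find_one(query, projection)
    return doc["annotations"][0]["_id"] if doc else None
```

This returns at most one annotation per image, which matches the "one annotation per class and version" assumption in your snippets. If several could match, you would still want $unwind or $filter.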

Hope that helps.

Craig.

Thibaut_Voirand · May 2, 2024, 9:56am

Hi Craig,

Thanks a lot for your feedback. It confirms what I read in the documentation. It also confirms my impression that there isn’t necessarily a right or wrong answer, and that the decision depends on where we strike the balance, notably between read and write workloads.

I am going to stick with the embedded documents schema, mainly because I expect the workload to be more read-heavy.