Hi,
I am setting up a database to manage computer vision training data, and I would appreciate your feedback on my current schema and approach.
Current setup
The data consists of images (not stored in the database, I only store their paths as strings), annotations (equivalent to a labelling operation performed by a human operator on one image), and labels (their geometries are stored directly in the database).
Initially, I used separate collections for images, annotations, and labels, as shown below:
# Images collection:
{
"_id": ObjectId('10'),
"image_name": "DSC04799",
"path": "/path/to/DSC04799.tif"
},
...
# Annotations collection:
{
"_id": ObjectId('11'),
"class": "Cat",
"version": 1,
"image_id": ObjectId('10'),
},
...
# Labels collection:
{
"_id": ObjectId('12'),
"geometry": [1, 2, 3, ...],
"annotation_id": ObjectId('11'),
},
...
After reading the MongoDB documentation on embedding vs. referencing, I decided to switch to a single “images” collection with embedded annotations and labels:
{
"_id": ObjectId('10'),
"image_name": "DSC04799",
"path": "/path/to/DSC04799.tif",
"annotations": [
{
"_id": ObjectId('11'),
"class": "Cat",
"version": 1,
"labels": [
{
"_id": ObjectId('12'),
"geometry": [1, 2, 3, ...],
},
...
]
},
...
]
},
...
Code snippets
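When using the previous schema, I could easily check whether an annotation already existed in the database before inserting it (with pymongo):
from bson import ObjectId

# class_name holds the class string ("class" is a reserved word in Python)
annotation_doc = annotations_collection.find_one(
    {"image_id": image_id, "class": class_name, "version": version}
)
if annotation_doc is None:
    # insert_one returns an InsertOneResult; the new id is in .inserted_id
    result = annotations_collection.insert_one(
        {"_id": ObjectId(), "image_id": image_id, "class": class_name, "version": version, "labels": labels}
    )
    annotation_id = result.inserted_id
else:
    annotation_id = annotation_doc["_id"]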
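As a possible alternative for the separate-collections schema, the check and the insert can probably be merged into a single upsert. The sketch below only builds the arguments; the helper name `build_annotation_upsert` and the placeholder string id are hypothetical, and a real call would pass an ObjectId for the image:

```python
def build_annotation_upsert(image_id, class_name, version, labels):
    """Return (filter, update) for update_one(filter, update, upsert=True)."""
    # The filter identifies an annotation by image, class and version
    query = {"image_id": image_id, "class": class_name, "version": version}
    # $setOnInsert is applied only when the upsert creates a new document
    update = {"$setOnInsert": {"labels": labels}}
    return query, update


# "img-10" stands in for a real ObjectId
query, update = build_annotation_upsert("img-10", "Cat", 1, [])

# With pymongo this would be used roughly as:
#   result = annotations_collection.update_one(query, update, upsert=True)
#   result.upserted_id is None when the annotation already existed
# A unique compound index guards against duplicates from concurrent writers:
#   annotations_collection.create_index(
#       [("image_id", 1), ("class", 1), ("version", 1)], unique=True)
```

When the annotation already exists, its `_id` would still need to be fetched with a separate `find_one`.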
After the schema change, the equivalent operation became more complex:
# Look for the annotation in the DB, and insert it if not already present
# class_name holds the class string ("class" is a reserved word in Python)
pipeline = [
    {"$match": {"_id": image_id}},
    {"$unwind": "$annotations"},
    {"$match": {"annotations.class": class_name, "annotations.version": version}},
    {"$project": {"_id": "$annotations._id"}},
]
cursor = images_collection.aggregate(pipeline)
try:
    annotation_id = next(cursor)["_id"]
except StopIteration:
    # update_one returns an UpdateResult, not the new _id, so generate the id first
    annotation_id = ObjectId()
    images_collection.update_one(
        {"_id": image_id},
        {"$push": {"annotations": {"_id": annotation_id, "class": class_name, "version": version, "labels": labels}}},
    )
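For the embedded schema, the lookup and the conditional insert might not need an aggregation pipeline at all. The following is a minimal sketch, assuming the `$elemMatch` query operator and a positional projection fit this case; the helper names and the placeholder string ids are hypothetical:

```python
def build_find_annotation(image_id, class_name, version):
    """Return (filter, projection) for find_one on the images collection."""
    match = {"class": class_name, "version": version}
    # $elemMatch requires a single array element to satisfy both conditions
    query = {"_id": image_id, "annotations": {"$elemMatch": match}}
    # The positional projection keeps only the first matching array element
    projection = {"annotations.$": 1}
    return query, projection


def build_push_if_missing(image_id, annotation):
    """Return (filter, update) that pushes `annotation` only if no element
    with the same class and version is already embedded in the image."""
    match = {"class": annotation["class"], "version": annotation["version"]}
    query = {"_id": image_id, "annotations": {"$not": {"$elemMatch": match}}}
    update = {"$push": {"annotations": annotation}}
    return query, update


# "img-10" and "ann-11" stand in for real ObjectId values
annotation = {"_id": "ann-11", "class": "Cat", "version": 1, "labels": []}
find_query, find_projection = build_find_annotation("img-10", "Cat", 1)
push_query, push_update = build_push_if_missing("img-10", annotation)
```

With pymongo these would be passed as `images_collection.find_one(find_query, find_projection)` and `images_collection.update_one(push_query, push_update)`. A `modified_count` of 0 on the update result would mean that the annotation was already present or that the image was not found.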
Concerns and questions
Since switching to the new schema, I find the queries more complex, and I’m wondering if I’m on the right track. Are these complex queries typical of MongoDB, or am I facing a learning curve?
I’m also curious about the pros and cons of using embedding in this scenario. Does the updated schema seem appropriate for my use case? Should I consider going back to separate collections?
Thank you for your help and guidance!