We’re working on a MongoDB database with a substantial size in our application, running into performance issues during both data retrieval and scaling Atlas text search aggregation. Our data model incorporates two fundamental entities: documents identified by doc_id, each linked to an organization denoted by an org_id; and a separate collection where a doc_id may have multiple associated content items.
There are two collections in this process,
one where we store all the document metadata along with its organization id, lets call it doc_metadata.
one where we store multiple contents that are in the document, lets call it doc_content
Our process comprises of two significant steps:
Retrieve Document IDs from doc_metadata: Here, we execute a MongoDB find operation to gather all doc_ids belonging to a specific org_id. As the data volume of doc_metadata grows, this query slows down considerably. It looks something like this:
db.doc_metadata.find({"org_id": specific_org_id})
This retrieval step has become a major performance bottleneck. What are best practices or optimization strategies to improve the retrieval speed from a large MongoDB collection like doc_metadata?
Atlas Text Search Aggregation on doc_content Collection: Once we have the list of doc_ids, we perform Atlas text search aggregation on the doc_content collection that houses associated content items for each doc_id. Our issue is primarily with the $match stage where we match doc_id with a list of doc_ids obtained in the previous step and the Atlas $search stage. Here’s a simplified version of how our aggregation pipeline looks:
[
{"$search": {"text": {"query": search_text, "path": content_path }}},
{"$match": {"doc_id": {"$in": doc_ids_list}}},
{"$group": ...},
{"$limit": ...}
]
The $match stage of this pipeline is a tough spot due to the expanding list of doc_ids and increasing volume of the doc_content collection. What are some effective strategies to optimize Atlas text search, especially when handling large sets of document identifiers?
Confronted with these challenges, we are exploring ways to optimize our MongoDB operations to manage these vast data volumes and to minimize query times in Python. What MongoDB features, Python techniques, or general data management practices could provide us with solutions?
We yern for any guidance or insights on the matter. Response is much appreciated.