What's the best way to store vector embeddings in chunks for one document in Mongo to use Triggers efficiently?

Hi,

I am evaluating MongoDB Atlas for a vector embedding use case in a Gen AI application, and I am using the LangChain framework for the implementation. My problem statement: I have a PDF document that gets split into smaller chunks (using the LangChain document loader and splitter) to create a set of sub-documents, which I am trying to store in Mongo. I can't store one big PDF document's vector embeddings in a single Mongo document, as that wouldn't be efficient for language models.
My requirement is to use a Mongo database trigger to call AWS EventBridge. Since one big document produces many chunks, each stored as a separate document in Mongo, attaching a trigger means it will fire once for every chunk. What's the best way to organize the chunk documents in Mongo so that the trigger can be used efficiently?

Hi @Vivek_Daramwal and welcome to the MongoDB community forums!

Storing multiple vector embeddings in a single MongoDB document is certainly not an issue, as long as the following conditions hold:

  1. You are working with a limited, constant number of values that can conveniently be stored in distinct fields. For example, you can use fields like "title_vectorized", "body_vectorized", and "summary_vectorized" (see the sketch after this list).
  2. There is no requirement to query multiple vector fields simultaneously.
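
For illustration, a document with a small, fixed set of vector fields might look like the sketch below. The field names and values are hypothetical, and real embeddings would have hundreds of dimensions:

```python
# Hypothetical document shape: one embedding per logical section of the source,
# stored in a fixed set of distinct fields. Vectors are truncated for brevity.
doc = {
    "_id": "report-2023-q4",
    "title": "Q4 Report",
    "title_vectorized": [0.12, -0.03, 0.88],
    "body_vectorized": [0.45, 0.18, -0.61],
    "summary_vectorized": [0.07, -0.22, 0.34],
}
```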

Schema design in MongoDB plays an important role in query performance, so the recommendation is to choose a modelling technique that suits your use case.
One possible approach is the extended reference pattern for storing the chunks in the collection, sketched below.
You can also make use of Model Tree Structures to store the chunks.
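
As a rough sketch of the extended reference pattern (the collection names, field names, and the `embed()` helper below are assumptions for illustration, not a prescribed schema), each chunk becomes its own document that carries the parent PDF's id plus the parent fields you query most often:

```python
from pymongo import MongoClient

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model call.
    return [0.0] * 1536

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
db = client["genai"]  # hypothetical database name

# Parent document holds the PDF-level metadata once.
parent = {"_id": "contract-123.pdf", "title": "Contract 123", "source": "s3://docs/contract-123.pdf"}
db["pdf_documents"].insert_one(parent)

# Each chunk references the parent and duplicates a frequently queried field,
# so chunk queries rarely need to join back to the parent collection.
chunks = [
    {
        "parent_id": parent["_id"],
        "parent_title": parent["title"],   # extended reference: copied from parent
        "chunk_index": i,
        "text": text,
        "embedding": embed(text),
    }
    for i, text in enumerate(["first chunk of text...", "second chunk of text..."])
]
db["pdf_chunks"].insert_many(chunks)
```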

Finally, could you confirm whether you are using the PDF load/split from LangChain PDF | 🦜️🔗 Langchain ?

If you have further questions, please feel free to share additional information about your use case so that others can help you better.

Warm regards
Aasawari

Yes, I am using the PDF load/split from LangChain to generate a smaller set of documents out of a big PDF document and storing them in the MongoDB collection. Because one PDF can produce multiple chunks (say 20), they get stored as separate documents in Mongo. If I use a database trigger on insert, it will fire 20 times for the same source document. I want the trigger to fire just once. How can I make that possible?

And I can't use multiple vector fields in a single document, because I don't know how many chunks a document will produce, and it would also make querying across all the documents problematic.

Hi Vivek,

Thank you for the question! Ideally one PDF document would yield one trigger, which executes a function that splits the PDF into an arbitrary number of chunks that then get added individually to your collection.

You can do this by setting up an Atlas Trigger for when a PDF is added, which then calls AWS EventBridge, as you've described. This would then schedule an AWS Lambda function that can leverage LangChain (in either JavaScript or Python), in the way you've used it, to parse the document into sub-documents, which can then be added back to MongoDB. The reason we suggest using Lambda instead of triggers for calling LangChain is that there is greater support for JavaScript dependencies in Lambda at the moment.
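
As a rough sketch of what that Lambda handler could look like in Python (the event shape, collection names, and embedding model are assumptions, and the import paths depend on your LangChain version):

```python
import os
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from pymongo import MongoClient

def handler(event, context):
    # In practice you would resolve the PDF location from the EventBridge event
    # (e.g. an S3 bucket/key) and download it to /tmp before loading it.
    pdf_path = event["pdf_path"]

    # Load the PDF and split it into smaller chunks with LangChain.
    pages = PyPDFLoader(pdf_path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(pages)

    # Embed every chunk; OpenAIEmbeddings is just one example model.
    vectors = OpenAIEmbeddings().embed_documents([c.page_content for c in chunks])

    # Write one MongoDB document per chunk, all tagged with the parent PDF.
    coll = MongoClient(os.environ["MONGODB_URI"])["genai"]["pdf_chunks"]
    coll.insert_many(
        [
            {
                "parent_id": pdf_path,
                "chunk_index": i,
                "text": chunk.page_content,
                "embedding": vector,
            }
            for i, (chunk, vector) in enumerate(zip(chunks, vectors))
        ]
    )
    return {"chunks_written": len(chunks)}
```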

I found a few resources detailing how to do this, but this is definitely something that has come up before so we may put out our own dedicated tutorial on this.

I hope this helps answer your question, but do reach out if you have any additional ones.

Cheers,
Henry


Hi Henry,

We don't want to push PDF documents to MongoDB as a whole. We already have a repository of documents and want to build a workflow where we pick a PDF document from our store, split it into chunks via LangChain, and push the smaller documents (generated by the chunk split) to MongoDB. Since multiple such documents end up in the Mongo collection, it won't be possible to trigger just once for all of them. Is there a better way to implement this?

Hi Vivek,

Apologies for the vague language: when I said “a PDF is added”, I meant that a PDF is updated in the source system where it lives, the metadata of which could be tracked in MongoDB (this is a common pattern).

In this way you can watch for the “last_updated_time” to change, and trigger the workflow I mentioned to split a single PDF document into many MongoDB documents containing chunks and embeddings, along with any additional relevant metadata.
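
For example (the collection and field names here are only illustrative), the ingestion job could insert all chunk documents first and then touch a single metadata document per PDF. If the database trigger is configured on the metadata collection only, it fires once per PDF rather than once per chunk:

```python
import datetime
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
db = client["genai"]

def ingest_pdf(pdf_id: str, chunk_docs: list[dict]) -> None:
    # 1. Insert all chunk documents; no trigger watches this collection,
    #    so these inserts do not call EventBridge at all.
    db["pdf_chunks"].insert_many(chunk_docs)

    # 2. Upsert one metadata document per PDF and bump last_updated_time.
    #    The Atlas database trigger is configured on "pdf_metadata" only,
    #    so EventBridge is invoked exactly once per PDF.
    db["pdf_metadata"].update_one(
        {"_id": pdf_id},
        {"$set": {
            "last_updated_time": datetime.datetime.utcnow(),
            "chunk_count": len(chunk_docs),
        }},
        upsert=True,
    )
```

If other fields on the metadata document can change for unrelated reasons, the trigger's match expression can be narrowed to updates of `last_updated_time` (for example via `updateDescription.updatedFields` in the change event) so those edits don't invoke EventBridge.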