Hi everyone!
I’ve been exploring building a RAG pipeline using LangChain with Atlas Vector Search & OpenAI, on my own data (not the sample dataset from Mongo and not the one provided by LangChain).
I successfully created an “embedding service” using SageMaker and generated embeddings for all 24K of my documents, stored in a dedicated embedding field. The input text for each embedding is a combination of 6 other fields within the same document.
I followed this guide for the creation of embeddings: Amazon SageMaker and MongoDB Vector Search - Part 1 (all three parts).
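In rough outline, the backfill loop looked something like this (heavily simplified sketch; the endpoint name, field list, and response parsing are placeholders, not my exact setup):

```python
import json

import boto3
from pymongo import MongoClient

# Placeholder names; the real endpoint, connection string, and field list differ.
FIELDS = ["name", "description", "family", "severity", "cwe", "hash"]
runtime = boto3.client("sagemaker-runtime")
collection = MongoClient("<MONGODB_CONNECTION_STRING>")["<DB>"]["<COLLECTION>"]

for doc in collection.find({"embedding": {"$exists": False}}):
    # Concatenate the source fields into the text that gets embedded
    text = " ".join(str(doc.get(f, "")) for f in FIELDS)
    resp = runtime.invoke_endpoint(
        EndpointName="<EMBEDDING_ENDPOINT>",
        ContentType="application/json",
        Body=json.dumps({"text_inputs": text}),
    )
    # The exact response shape depends on the model container
    vector = json.loads(resp["Body"].read())["embedding"][0]
    collection.update_one({"_id": doc["_id"]}, {"$set": {"embedding": vector}})
```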
Now, I’m stuck at the stage of integrating it with RAG using this example code that I adapted to my collection: langchain/templates/rag-mongo at master · langchain-ai/langchain · GitHub.
However, I ran into three problems with this approach that I haven’t been able to solve:
- Context size: my collection is too large to fit in the LLM’s context window, which means I can’t run the search without modifying the retriever.
- Response type: I managed to work around the first problem by limiting the search:

```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 1,
        "post_filter_pipeline": [{"$limit": 5}],
    },
)
```
But now I’m faced with a new issue: TypeError: Type is not JSON serializable: ObjectId (a possible workaround is sketched after this list).
- text_key: creating a MongoDBAtlasVectorSearch requires a text_key field, but I don’t have one. So I pointed text_key at one of my 6 data fields (strings) that was part of the embedding input. Is this an appropriate solution, or should I handle it differently?
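For reference, here is the workaround I’m considering for the ObjectId error (untested sketch; I’m assuming the error comes from the _id field ending up in the returned documents’ metadata):

```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 1,
        # Project out _id so the results can be JSON-serialized downstream
        "post_filter_pipeline": [{"$project": {"_id": 0}}],
    },
)
```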
Thanks in advance!
Hey @Rotem_Kama , thanks for posting and welcome to the forums!
Can you share your index definition and a sample document? (fake or dummy data is fine)
Also, by setting k = 1 you should only be getting 1 result back; I think the post_filter_pipeline with $limit: 5 is redundant in this case?
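i.e. if the goal was to get up to 5 results, a plain k should be enough on its own (just a sketch):

```python
# k alone controls how many documents the retriever returns
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)
```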
Hi @Rotem_Kama
LangChain typically expects certain data formats by default for the source it operates on and creates embeddings from.
I would highly recommend going through this notebook: GenAI-Showcase/notebooks/rag/mongodb-langchain-cache-memory.ipynb at main · mongodb-developer/GenAI-Showcase · GitHub
The LangChain Templates you are using are for a different use case; I would advise turning to them at a later stage, once you have your prototype working in a notebook.
Since you already have documents with embeddings created, you will have to do a little bit of maneuvering:
Initialize your MongoDBAtlasVectorSearch instance using this:
```python
class MongoDBAtlasVectorSearch(VectorStore):
    """`MongoDB Atlas Vector Search` vector store.

    To use, you should have both:
    - the ``pymongo`` python package installed
    - a connection string associated with a MongoDB Atlas Cluster having deployed an
        Atlas Search index

    Example:
        vectorstore = MongoDBAtlasVectorSearch(collection, embeddings)
    """

    def __init__(
        self,
        collection: Collection[MongoDBDocumentType],
        embedding: Embeddings,
        *,
        index_name: str = "default",
        text_key: str = "text",
        embedding_key: str = "embedding",
        relevance_score_fn: str = "cosine",
    ):
```
This is where you will pass in your text_key, embedding_key, and embedding model.
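For example, something along these lines (a minimal sketch; the angle-bracket names are placeholders, and the embedding model must be the same one that produced your stored vectors):

```python
from pymongo import MongoClient
from langchain_community.vectorstores import MongoDBAtlasVectorSearch

client = MongoClient("<MONGODB_CONNECTION_STRING>")
collection = client["<DATABASE_NAME>"]["<COLLECTION_NAME>"]

# e.g. a SagemakerEndpointEmbeddings instance configured for your endpoint
embedding_model = ...

vectorstore = MongoDBAtlasVectorSearch(
    collection,
    embedding=embedding_model,
    index_name="<VECTOR_SEARCH_INDEX_NAME>",
    text_key="<YOUR_TEXT_FIELD>",  # field whose value becomes page_content
    embedding_key="embedding",     # field holding your precomputed vectors
)
```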
Once you have that figured out, you will want to create your Vector Search index and run your retriever queries.
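Once the index is in place, a retriever query is just this (sketch, continuing from the snippet above):

```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
docs = retriever.get_relevant_documents("your natural-language question")
for doc in docs:
    print(doc.page_content, doc.metadata)
```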
Other Resources:
basic: Introduction to LangChain and MongoDB Atlas Vector Search | MongoDB
intermediate: RAG with Atlas Vector Search, LangChain, and OpenAI | MongoDB
advanced: Adding Semantic Caching and Memory to Your RAG Application Using MongoDB and LangChain | MongoDB
Hi Prakul!
First of all, thanks for the help and for attaching relevant resources. I’m sure the use case of pre-existing documents with embeddings is a relevant one, and I hope we manage to solve it and maybe put it all into a dedicated tutorial.
I will definitely go over all your resources and validate my settings and inputs, but meanwhile here is my MongoDBAtlasVectorSearch definition:
```python
vectorstore = MongoDBAtlasVectorSearch.from_connection_string(
    MONGODB_CONNECTION_STRING,
    DATABASE_NAME + "." + COLLECTION_NAME,
    sagemaker_endpoint_embedding_model,
    index_name=VECTOR_SEARCH_INDEX_NAME,
    text_key="name",  # I took the name key because I don't have a dedicated text field
    # I didn't pass embedding_key because my embedding field in the collection is
    # called "embedding", matching the default, and the same goes for relevance_score_fn.
)
```
So this part is probably aligned, except for the text_key field, which I need to readjust somehow.
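One option I’m considering is backfilling a dedicated text field holding the same concatenation of the 6 fields that was fed to the embedding model, and pointing text_key at it (untested sketch; the field list is illustrative, not my real schema):

```python
# collection is the pymongo Collection behind the vector store;
# FIELDS is illustrative, not my actual 6 source fields.
FIELDS = ["name", "description", "family", "severity", "cwe", "hash"]

for doc in collection.find({}, FIELDS):
    combined = " ".join(str(doc.get(f, "")) for f in FIELDS)
    collection.update_one({"_id": doc["_id"]}, {"$set": {"text": combined}})
```

and then pass text_key="text" when constructing the vector store.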
Thanks once again!
Sure!
Here is the vector search index definition:
```json
{
  "fields": [
    {
      "numDimensions": 384,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
About the document: here is a fake doc (the real one is too big to paste here):
```js
{
  "_id": {
    "$oid": "658d300abe272991561c6fe9"
  },
  "genericHash": "67a62bd1786dac5ca7fc9168913f9eef029955be0cdc989445e13bffefbbd58c",
  "cve": [
    "6550b905af32b47db941149e"
  ],
  "cwe": [
    "CWE-787"
  ],
  "description": "description that is usually built from ~250 words",
  "family": "CWE-787",
  "hash": "40091e8c7f67fb9cf27a5cb3a43d6429eba90352b28dcdfaa37094f110f92c0e",
  "updateTime": {
    "$date": "2023-12-28T09:02:56.000Z"
  },
  "name": "CVE-2022-43108",
  "instances": [
    // array of objects (14 key-value pairs each; can also be a big array)
  ],
  "severity": "Critical",
  "embedding": [
    -0.13004250824451447,
    0.017079371958971024,
    -0.08573945611715317,
    -0.09606916457414627,
    ...
  ]
  // a few more simple key-value pairs
}
```
About the k setting and the post-filter limit: you are probably right, it’s a leftover from various attempts to get a smaller context (the k was the key to achieving that; the post filter is indeed redundant).