Hi everyone!
I’ve been exploring building a RAG pipeline using LangChain with Atlas Vector Search & OpenAI, on my own data (not the sample dataset from Mongo and not the one provided by LangChain).
I successfully created an “embedding service” using SageMaker and generated embeddings for all 24K of my documents, stored in a dedicated embedding field. The input text for each embedding is a combination of 6 other fields within the same document.
I followed this guide for the creation of embeddings: Amazon SageMaker and MongoDB Vector Search - Part 1 (all three parts).
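In rough outline, the backfill loop looked something like this (heavily simplified sketch; the endpoint name, field list, and response parsing are placeholders, not my exact setup):

```python
import json

import boto3
from pymongo import MongoClient

# Placeholder names; the real endpoint, connection string, and field list differ.
FIELDS = ["name", "description", "family", "severity", "cwe", "hash"]
runtime = boto3.client("sagemaker-runtime")
collection = MongoClient("<MONGODB_CONNECTION_STRING>")["<DB>"]["<COLLECTION>"]

for doc in collection.find({"embedding": {"$exists": False}}):
    # Concatenate the source fields into the text that gets embedded
    text = " ".join(str(doc.get(f, "")) for f in FIELDS)
    resp = runtime.invoke_endpoint(
        EndpointName="<EMBEDDING_ENDPOINT>",
        ContentType="application/json",
        Body=json.dumps({"text_inputs": text}),
    )
    # The exact response shape depends on the model container
    vector = json.loads(resp["Body"].read())["embedding"][0]
    collection.update_one({"_id": doc["_id"]}, {"$set": {"embedding": vector}})
```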
Now, I’m stuck at the stage of integrating it with RAG using this example code that I adapted to my collection: langchain/templates/rag-mongo at master · langchain-ai/langchain · GitHub.
However, I ran into three problems with this approach that I haven’t been able to solve:
- Context size: my collection is too large to fit in the LLM’s context window, which means I can’t run the search without modifying the retriever.
- Response type: I managed to work around the first problem by limiting the search:

```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 1,
        "post_filter_pipeline": [{"$limit": 5}],
    },
)
```
But now I’m faced with a new issue: TypeError: Type is not JSON serializable: ObjectId (a possible workaround is sketched after this list).
- text_key: creating a MongoDBAtlasVectorSearch requires a text_key field, but I don’t have one. So I pointed text_key at one of my 6 data fields (strings) that was part of the embedding input. Is this an appropriate solution, or should I handle it differently?
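For reference, here is the workaround I’m considering for the ObjectId error (untested sketch; I’m assuming the error comes from the _id field ending up in the returned documents’ metadata):

```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 1,
        # Project out _id so the results can be JSON-serialized downstream
        "post_filter_pipeline": [{"$project": {"_id": 0}}],
    },
)
```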
Thanks in advance!
Hey @Rotem_Kama , thanks for posting and welcome to the forums!
Can you share your index definition and a sample document? (fake or dummy data is fine)
Also, by setting k = 1 you should only be getting 1 result back; I think the post_filter_pipeline with $limit: 5 is redundant in this case?
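i.e. if the goal was to get up to 5 results, a plain k should be enough on its own (just a sketch):

```python
# k alone controls how many documents the retriever returns
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5},
)
```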
Hi @Rotem_Kama
LangChain typically expects certain data formats by default for the source it operates on and creates embeddings from.
I would highly recommend going through this notebook: GenAI-Showcase/notebooks/rag/mongodb-langchain-cache-memory.ipynb at main · mongodb-developer/GenAI-Showcase · GitHub
The LangChain Templates you are using are for a different use case; I would advise turning to them at a later stage, once you have your prototype working in a notebook.
Since you already have documents with embeddings created, you will have to do a little bit of maneuvering:
Initialize your MongoDBAtlasVectorSearch instance using this:
```python
class MongoDBAtlasVectorSearch(VectorStore):
    """`MongoDB Atlas Vector Search` vector store.

    To use, you should have both:
    - the ``pymongo`` python package installed
    - a connection string associated with a MongoDB Atlas Cluster having deployed an
        Atlas Search index

    Example:
        vectorstore = MongoDBAtlasVectorSearch(collection, embeddings)
    """

    def __init__(
        self,
        collection: Collection[MongoDBDocumentType],
        embedding: Embeddings,
        *,
        index_name: str = "default",
        text_key: str = "text",
        embedding_key: str = "embedding",
        relevance_score_fn: str = "cosine",
    ):
```
This is where you will pass in your text_key, embedding_key, and embedding model.
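For example, something along these lines (a minimal sketch; the angle-bracket names are placeholders, and the embedding model must be the same one that produced your stored vectors):

```python
from pymongo import MongoClient
from langchain_community.vectorstores import MongoDBAtlasVectorSearch

client = MongoClient("<MONGODB_CONNECTION_STRING>")
collection = client["<DATABASE_NAME>"]["<COLLECTION_NAME>"]

# e.g. a SagemakerEndpointEmbeddings instance configured for your endpoint
embedding_model = ...

vectorstore = MongoDBAtlasVectorSearch(
    collection,
    embedding=embedding_model,
    index_name="<VECTOR_SEARCH_INDEX_NAME>",
    text_key="<YOUR_TEXT_FIELD>",  # field whose value becomes page_content
    embedding_key="embedding",     # field holding your precomputed vectors
)
```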
Once you have that figured out, you will want to create your Vector Search index and run your retriever queries.
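Once the index is in place, a retriever query is just this (sketch, continuing from the snippet above):

```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
docs = retriever.get_relevant_documents("your natural-language question")
for doc in docs:
    print(doc.page_content, doc.metadata)
```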
Other Resources:
basic: Introduction to LangChain and MongoDB Atlas Vector Search | MongoDB
intermediate: RAG with Atlas Vector Search, LangChain, and OpenAI | MongoDB
advanced: Adding Semantic Caching and Memory to Your RAG Application Using MongoDB and LangChain | MongoDB
Hi Prakul!
First of all, thanks for the help and for attaching relevant resources. I’m sure the use case of pre-existing documents with embeddings is a relevant one, and I hope we manage to solve it and maybe put it all into a dedicated tutorial.
I will definitely go over all your resources and validate my settings and inputs, but meanwhile here is my MongoDBAtlasVectorSearch definition:
```python
vectorstore = MongoDBAtlasVectorSearch.from_connection_string(
    MONGODB_CONNECTION_STRING,
    DATABASE_NAME + "." + COLLECTION_NAME,
    sagemaker_endpoint_embedding_model,
    index_name=VECTOR_SEARCH_INDEX_NAME,
    text_key="name",  # I took the name key because I don't have a dedicated text field
    # I didn't pass embedding_key because my embedding field in the collection is
    # called "embedding", matching the default, and the same goes for relevance_score_fn.
)
```
So this part is probably aligned, except for the text_key field, which I need to readjust somehow.
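One option I’m considering is backfilling a dedicated text field holding the same concatenation of the 6 fields that was fed to the embedding model, and pointing text_key at it (untested sketch; the field list is illustrative, not my real schema):

```python
# collection is the pymongo Collection behind the vector store;
# FIELDS is illustrative, not my actual 6 source fields.
FIELDS = ["name", "description", "family", "severity", "cwe", "hash"]

for doc in collection.find({}, FIELDS):
    combined = " ".join(str(doc.get(f, "")) for f in FIELDS)
    collection.update_one({"_id": doc["_id"]}, {"$set": {"text": combined}})
```

and then pass text_key="text" when constructing the vector store.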
Thanks once again!
Sure!
Here is the vector search index definition:
```json
{
  "fields": [
    {
      "numDimensions": 384,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    }
  ]
}
```
About the document: here is a fake doc (the real one is too big to paste here):
```js
{
  "_id": {
    "$oid": "658d300abe272991561c6fe9"
  },
  "genericHash": "67a62bd1786dac5ca7fc9168913f9eef029955be0cdc989445e13bffefbbd58c",
  "cve": [
    "6550b905af32b47db941149e"
  ],
  "cwe": [
    "CWE-787"
  ],
  "description": "description that is usually built from ~250 words",
  "family": "CWE-787",
  "hash": "40091e8c7f67fb9cf27a5cb3a43d6429eba90352b28dcdfaa37094f110f92c0e",
  "updateTime": {
    "$date": "2023-12-28T09:02:56.000Z"
  },
  "name": "CVE-2022-43108",
  "instances": [
    // array of objects (14 key-value pairs each; can also be a big array)
  ],
  "severity": "Critical",
  "embedding": [
    -0.13004250824451447,
    0.017079371958971024,
    -0.08573945611715317,
    -0.09606916457414627,
    ...
  ]
  // a few more simple key-value pairs
}
```
About the k setting and the post-filter limit: you are probably right, it’s a leftover from various attempts to get a smaller context (the k was the key to achieving that; the post filter is indeed redundant).