LangChain RAG issues with VectorStore & OpenAI LLM

Hi everyone!

I’ve been exploring the topic of RAG creation using LangChain with Atlas Vector Store & OpenAI, utilizing my own data (not the sample one from Mongo and not the one provided by LangChain).

I successfully created an “embedding service” using SageMaker and generated embeddings for all of my 24K documents in a dedicated embedding field. The text data that served as the input for this embedding field is a combination of 6 other fields within this single document.

I followed this guide for the creation of embeddings: Amazon SageMaker and MongoDB Vector Search - Part 1 (all three parts).
Now, I’m stuck at the stage of integrating it with RAG using this example code that I adapted to my collection: langchain/templates/rag-mongo at master · langchain-ai/langchain · GitHub.

However, I encountered three problems in this approach that I haven’t been able to solve correctly:

  1. Context size - The context retrieved from my collection is too large to fit in the LLM's context window, which means I can't run the chain without modifying the retriever.

  2. Response type - I managed to bypass the first problem by setting a search limitation:

retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 1,
        "post_filter_pipeline": [{"$limit": 5}],
    }
)
But now, I’m faced with the issue: TypeError: Type is not JSON serializable: ObjectId.

  3. The creation of MongoDBAtlasVectorSearch requires a text_key field, but I don't have one. Therefore, I pointed text_key at one of my 6 data fields (strings) that was part of the embeddings input. Is this an appropriate solution, or should I handle it differently?
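For reference on problem 2, here is a generic workaround sketch I'm considering (my own helper name, untested against my setup): stringify any non-JSON-serializable metadata values, such as BSON ObjectId, before they reach the template's serializer.

```python
import json

def sanitize_metadata(metadata: dict) -> dict:
    """Return a copy of metadata with non-JSON-serializable values stringified."""
    clean = {}
    for key, value in metadata.items():
        try:
            json.dumps(value)      # keep values that serialize cleanly
            clean[key] = value
        except TypeError:
            clean[key] = str(value)  # e.g. ObjectId -> its hex string
    return clean
```

The idea would be to run each retrieved document's metadata through this before handing results to the chain.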

Thanks in advance!

Hey @Rotem_Kama , thanks for posting and welcome to the forums!

Can you share your index definition and a sample document? (fake or dummy data is fine)

Also, by setting k = 1 you should only be getting one result back; I think the post_filter_pipeline of limit = 5 is redundant in this case?
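A tiny illustration of what I mean (my own sketch, not LangChain internals): the vector search returns at most k documents first, and a $limit post-filter stage can only shrink that set further, so limit = 5 after k = 1 changes nothing.

```python
# The vector search stage caps results at k before any post-filter runs.
k_results = ["top match"]      # what k=1 hands to the post-filter pipeline
post_filtered = k_results[:5]  # the effect of [{"$limit": 5}]
assert post_filtered == k_results  # the limit was a no-op
```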

Hi @Rotem_Kama

By default, LangChain expects certain data formats for the source it operates on and creates embeddings from.

I would highly recommend going through this notebook: GenAI-Showcase/notebooks/rag/mongodb-langchain-cache-memory.ipynb at main · mongodb-developer/GenAI-Showcase · GitHub

The LangChain Templates that you are using are for a different use case, and I would advise using them at a later stage once you have your prototype working in a notebook.

Since you already have your documents with embeddings created, you will have to do a little bit of maneuvering:

Initialize your MongoDBAtlasVectorSearch instance using this:

class MongoDBAtlasVectorSearch(VectorStore):
    """`MongoDB Atlas Vector Search` vector store.

    Example:
        vectorstore = MongoDBAtlasVectorSearch(collection, embeddings)
    """

    def __init__(
        self,
        collection: Collection[MongoDBDocumentType],
        embedding: Embeddings,
        *,
        index_name: str = "default",
        text_key: str = "text",
        embedding_key: str = "embedding",
        relevance_score_fn: str = "cosine",
    ):
        ...

where you will pass in your text_key, embedding_key, and embedding model.

Once you have that figured out, you will want to create your Vector Search index and run your retriever queries.

Other Resources:
basic: Introduction to LangChain and MongoDB Atlas Vector Search | MongoDB
intermediate: RAG with Atlas Vector Search, LangChain, and OpenAI | MongoDB
advanced: Adding Semantic Caching and Memory to Your RAG Application Using MongoDB and LangChain | MongoDB

Hi Prakul!
First of all, thanks for the help and for attaching relevant resources. I'm sure the use case of "pre-existing" documents with embeddings is a relevant one, and I hope we'll manage to solve it and maybe turn it all into a dedicated tutorial.

I will definitely go over all your resources and validate my settings and inputs, but meanwhile here is my MongoDBAtlasVectorSearch definition:

vectorstore = MongoDBAtlasVectorSearch.from_connection_string(
    connection_string,  # my Atlas connection string
    namespace,          # "<db>.<collection>"
    embedding_model,
    text_key='name'  # I took the name key because I don't have a dedicated text field in my document
)

(I didn't pass an embedding_key value because my embedding field in the collection is also called embedding, matching the default, and the same goes for relevance_score_fn.)

So this part is probably aligned, except for the text_key field that I need to readjust somehow.
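One idea I'm playing with: since the embedding input was a concatenation of 6 fields, I could materialize that same concatenation into a dedicated text field and point text_key at it, instead of reusing a single field like name. A sketch (the field list is a guess based on my fake document below):

```python
# Assumed field names; adjust to the actual 6 fields used for the embeddings.
TEXT_FIELDS = ["name", "description", "family", "severity", "cve", "cwe"]

def build_text(doc: dict) -> str:
    """Concatenate the embedding-input fields into one text blob."""
    parts = []
    for field in TEXT_FIELDS:
        value = doc.get(field)
        if isinstance(value, list):
            value = ", ".join(map(str, value))
        if value:  # skip missing or empty fields
            parts.append(f"{field}: {value}")
    return "\n".join(parts)
```

Then a bulk update could store build_text(doc) as doc["text"] for each document, and text_key="text" would match the collection.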

Thanks once again!

Here is the vector search index definition:

  "fields": [
      "numDimensions": 384,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"

About the document, here is a fake one (the real source is too big to paste here):

  "_id": {
    "$oid": "658d300abe272991561c6fe9"
  "genericHash": "67a62bd1786dac5ca7fc9168913f9eef029955be0cdc989445e13bffefbbd58c",
  "cve": [
  "cwe": [
  "description": "description that usually builded from 250 words",
  "family": "CWE-787",
  "hash": "40091e8c7f67fb9cf27a5cb3a43d6429eba90352b28dcdfaa37094f110f92c0e",
  "updateTime": {
    "$date": "2023-12-28T09:02:56.000Z"
  "name": "CVE-2022-43108",
  "instances": [
    array of objects (key value pairs, build from 14 pairs, can be big array also)
  "severity": "Critical",
  "embedding": [
(few more simple key value pairs)

About the k instruction and the post-filter limit - you are probably right; it's a leftover from various tries to get a smaller context (the k was the key to reach that, and the post filter is indeed redundant).
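For the context-size problem itself, here is a crude sketch of what I mean by trimming the retrieved context, using a rough 4-characters-per-token heuristic (my own helper; a proper tokenizer like tiktoken would be more accurate):

```python
def trim_context(docs: list[str], max_tokens: int = 3000) -> str:
    """Concatenate docs, cutting off once a rough token budget is exhausted."""
    budget = max_tokens * 4  # crude chars-per-token estimate
    kept, used = [], 0
    for doc in docs:
        if used + len(doc) > budget:
            remaining = budget - used
            if remaining > 0:
                kept.append(doc[:remaining])  # keep a truncated tail piece
            break
        kept.append(doc)
        used += len(doc) + 2  # +2 for the joining newlines
    return "\n\n".join(kept)
```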