preFilter configuration and indexing for Langchain retriever

Adrien_Le_Clair · November 21, 2023, 4:14pm

My index looks like this:

{
  "mappings": {
    "fields": {
      "field1": {
        "type": "string"
      },
      "embedding": [
        {
          "dimensions": 1536,
          "similarity": "cosine",
          "type": "knnVector"
        }
      ],
      "field2": {
        "type": "string"
      }
    }
  }
}

and my retriever looks like this:

const retriever = await vectorStore.asRetriever({
  searchType: "mmr",
  searchKwargs: {
    fetchK: 20,
    lambda: 0.1,
  },
  filter: {
    preFilter: {
      text: {
        path: "field1",
        query: value,
      },
    },
  },
});

I also want to filter by field2 (which is a string). How do I filter on multiple fields in preFilter? I haven’t managed to get it working, please help.

Kushagra_Kesav · November 30, 2023, 5:33am

Hi @Adrien_Le_Clair,

Welcome to the MongoDB Community forums

You can refer to the following tutorial: Leveraging OpenAI and MongoDB Atlas for Improved Search Functionality | MongoDB and the Atlas Vector Search Pre-Filter documentation to learn more about it.

In the meantime, could you please provide additional details about your specific use case, the expected output, and any workarounds you’ve attempted? This information will help the community better understand the issue and provide more effective assistance.

Best regards,
Kushagra

Prakul_Agarwal · December 2, 2023, 6:02pm

Hi @Adrien_Le_Clair, which version of Langchain JS are you using? In v0.0.165 we released support for the $vectorSearch syntax in Langchain -
Release notes, PR

If you are using Langchain JS >= v0.0.165, the code you posted needs two changes.

First, in the Atlas Vector Search index definition, use the following:

{
  "mappings": {
    "fields": {
      "field1": {
        "type": "token",
        "normalizer": "lowercase"
      },
      "embedding": [
        {
          "dimensions": 1536,
          "similarity": "cosine",
          "type": "knnVector"
        }
      ],
      "field2": {
        "type": "token",
        "normalizer": "lowercase"
      }
    }
  }
}

Next, you can put together the preFilter definition using $and, along the following lines:

const retriever = await vectorStore.asRetriever({
  searchType: "mmr",
  searchKwargs: {
    fetchK: 20,
    lambda: 0.1,
  },
  filter: {
    // Both conditions must hold for a document to be considered.
    preFilter: {
      $and: [
        { field1: { $eq: "x" } },
        { field2: { $eq: "y" } },
      ],
    },
  },
});
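
If the set of fields to filter on varies at runtime, the same $and structure can be generated from a plain object. The following is only a minimal sketch: buildPreFilter is a hypothetical helper name, not part of Langchain or the MongoDB driver, and it assumes every condition is an exact match ($eq) on a field that is indexed for filtering.

```javascript
// Hypothetical helper (not a Langchain API): turns { field: value } pairs
// into a pre-filter that requires every field to match exactly.
function buildPreFilter(fields) {
  const clauses = Object.entries(fields).map(([path, value]) => ({
    [path]: { $eq: value },
  }));
  if (clauses.length === 0) {
    throw new Error("buildPreFilter needs at least one field");
  }
  // A single condition does not need the $and wrapper.
  return clauses.length === 1 ? clauses[0] : { $and: clauses };
}

const preFilter = buildPreFilter({ field1: "x", field2: "y" });
console.log(JSON.stringify(preFilter, null, 2));
```

The resulting object can then be passed as filter: { preFilter } in the asRetriever call.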

Further reading: Pre filter in Atlas Vector Search
Tutorial for Semantic Search queries

Adrien_Le_Clair · December 5, 2023, 8:31am

Thanks for your help. I am stuck on Langchain 0.0.164, so I will try your solution as soon as I can upgrade (I am waiting to upgrade MongoDB Atlas from 4 to 7).

Iqbal_Ali · May 5, 2024, 7:22pm

I’m having a similar problem, but with Python. The index mentioned here helped.

Relevant parts of my model example:

class Question(Document):
    content = StringField(required=True)
    ...

class Theme(Document):
    question = ReferenceField(Question, required=True)
    text = StringField(required=True) # AKA Category title
    embedding = ListField(FloatField())

So, basically, Theme holds a reference to Question.

I want to run a semantic search on the theme collection, but pre-filter by question.

Here is my index for theme:

{
  "mappings": {
    "fields": {
      "embedding": [
        {
          "dimensions": 1536,
          "similarity": "cosine",
          "type": "knnVector"
        }
      ],
      "question": {
        "type": "token"
      }
    }
  }
}

And here’s my simplified Python code:

from langchain.vectorstores import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings
from pymongo import MongoClient
from bson import ObjectId
import os, json

# Set environment variables
os.environ['OPENAI_API_KEY'] = ""
os.environ["MONGODB_HOST"] = ""

# Connect to the MongoDB database
mongo_client = MongoClient(os.environ["MONGODB_HOST"])['text-mining-langchain']
embeddings = OpenAIEmbeddings()

# Get the collection
collection = mongo_client['theme']

# Filter the documents based on the 'question' field
question_id = ObjectId('663252400674de6854bf6594')
pre_filter_dict = {"question": str(question_id)}

vectorstore = MongoDBAtlasVectorSearch(collection, embeddings, text_key="text", 
                                       embedding_key="embedding", index_name="default")

# Perform similarity search on the filtered documents
query = 'Ease of Use and Accuracy'
docs = vectorstore.similarity_search_with_score(query, k=10, pre_filter=pre_filter_dict)

docs

When I run it with an empty pre_filter_dict, I get results. When I apply the pre-filter, I get zero results, even though some documents should match.

Can anyone help with this?