Filtering within the retriever by metadata does not work

Paul_Cristina · September 4, 2024, 12:02am

I am trying to build a simple RAG pipeline using LangChain that answers basic questions about a list of authors and their books. The answers should be restricted by author_gender: if I select “female”, I should get answers only from books written by female authors, and likewise for “male”.

Data looks something like this:

{"text":"{"title": "Pride and Prejudice", "author": "Jane Austen"}",
"author_gender":"female",
"publication_year":{"1813"}}

Below is the code I am using. It is a simple filter within the retriever, but it appears to do nothing: the results are the same with or without the filter.

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vector_search.as_retriever(filter={"author_gender": "female"}))
question = "List novels written by female authors"
result = qa({"query": question})
print(result["result"])

Based on the provided context, the novels written by male authors are:

- "The Great Gatsby" by F. Scott Fitzgerald
- "To Kill a Mockingbird" by Harper Lee

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vector_search.as_retriever())
question = "List novels written by male authors"
result = qa({"query": question})
print(result["result"])

Based on the provided context, the novels written by male authors are:

- "The Great Gatsby" by F. Scott Fitzgerald
- "To Kill a Mockingbird" by Harper Lee

Is there a reliable way to filter the results based on the available metadata?

Library used:
langchain==0.2.6
pymongo==4.8.0
System version: 3.10

Jib_Adegunloye · September 4, 2024, 5:21pm

Hey, thanks for sharing this issue.

Based on the response, it looks like the model is not fully interpreting your question, since the answer still does not distinguish between male and female authors. However, you are right that adding the filter should narrow the available options regardless.

Could you confirm you are using the langchain-mongodb package? If so, you should be able to resolve the issue by passing the filter as a pre_filter argument within search_kwargs. Here’s an example of how to add kwargs when calling as_retriever.

In your case, the code would be revised to look like this:

_search_kwargs = {"pre_filter": {"author_gender": "female"}}
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_search.as_retriever(search_kwargs=_search_kwargs),
)
question = "List novels written by female authors"
result = qa({"query": question})
print(result["result"])
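
For context, here is a rough sketch of what a pre-filter amounts to on the Atlas side, assuming the retriever issues a $vectorSearch aggregation stage. The index name, embedding path, dimension count, and the build_vector_search_stage helper are all hypothetical and only for illustration. They are not the internals of langchain-mongodb. The one point worth checking in your own setup is that the field you filter on (author_gender here) is declared as a "filter" field in the vector search index definition.

```python
from typing import Any, Optional

# Hypothetical names for illustration only; substitute your own values.
INDEX_NAME = "vector_index"
EMBEDDING_PATH = "embedding"

# Sketch of an Atlas Vector Search index definition. The field used in
# pre_filter is declared with type "filter" next to the vector field.
index_definition: dict[str, Any] = {
    "fields": [
        {
            "type": "vector",
            "path": EMBEDDING_PATH,
            "numDimensions": 1536,  # assumption: depends on your embedding model
            "similarity": "cosine",
        },
        {"type": "filter", "path": "author_gender"},
    ]
}


def build_vector_search_stage(
    query_vector: list[float],
    pre_filter: Optional[dict[str, Any]] = None,
    k: int = 4,
) -> dict[str, Any]:
    """Build a $vectorSearch stage; the filter is applied before ranking."""
    stage: dict[str, Any] = {
        "index": INDEX_NAME,
        "path": EMBEDDING_PATH,
        "queryVector": query_vector,
        "numCandidates": k * 10,  # assumption: an arbitrary oversampling factor
        "limit": k,
    }
    if pre_filter:
        # Only documents matching this expression are considered as candidates.
        stage["filter"] = pre_filter
    return {"$vectorSearch": stage}


stage = build_vector_search_stage([0.1, 0.2, 0.3], {"author_gender": "female"})
print(stage["$vectorSearch"]["filter"])
```

Because the filter restricts the candidate set before the similarity ranking, the LLM only ever sees documents that match, regardless of how the question is worded.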

Paul_Cristina · September 5, 2024, 11:33am

I confirm that I am using the langchain-mongodb package:

langchain-mongodb==0.1.8

The pre_filter works like a charm and filters the results as expected:

_search_kwargs = {"pre_filter": {"author_gender": "male"}}
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs=_search_kwargs),
)
question = "List novels with authors name"
result = qa({"query": question})
print(result["result"])

Here are the novels with their respective authors' names:

1. "The Great Gatsby" by F. Scott Fitzgerald
2. "Moby-Dick" by Herman Melville
3. "The Catcher in the Rye" by J.D. Salinger

Thank you very much for your help.
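
Since the LLM's wording is not a dependable signal of whether the filter was applied, one option is to check the retrieved documents directly. Below is a minimal sketch of that check. FakeDoc and filter_violations are hypothetical helpers written for this example, and it assumes the retrieved documents expose their metadata fields through a .metadata dictionary.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only .metadata is needed here."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_violations(docs: Iterable[Any], key: str, expected: Any) -> list[Any]:
    """Return the documents whose metadata[key] differs from the expected value."""
    return [d for d in docs if d.metadata.get(key) != expected]


# Toy data standing in for retriever output.
docs = [
    FakeDoc("The Great Gatsby", {"author_gender": "male"}),
    FakeDoc("Pride and Prejudice", {"author_gender": "female"}),
]

bad = filter_violations(docs, "author_gender", "male")
print([d.page_content for d in bad])  # ['Pride and Prejudice']
```

With a real retriever, the documents could come from something like vector_store.as_retriever(search_kwargs=_search_kwargs).invoke(question). An empty list from filter_violations would then suggest the pre_filter is being applied.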