Smart Filtering: A Guide to Generating Pre-filters for Semantic Search
Vipul Bhardwaj, Fabian Valle • 20 min read • Published Sep 03, 2024 • Updated Sep 03, 2024
Ever searched for "old black and white comedies" only to be bombarded with a mix of modern action flicks? Frustrating, right? That’s the challenge with traditional search engines — they often struggle to understand the nuances of our queries, leaving us wading through irrelevant results.
This is where smart filtering comes in. It's a game-changer that uses metadata and vector search to deliver search results that truly match your intent. Imagine finding exactly the classic comedies you crave, without the hassle.
In this blog, we'll dive into what smart filtering is, how it works, and why it's essential for building better search experiences. Let's uncover the magic behind this technology and explore how it can revolutionize the way you search.
Vector search is a powerful tool that helps computers understand the meaning behind data, not just the words themselves. Instead of matching keywords, it focuses on the underlying concepts and relationships. Imagine searching for "dog" and getting results that include "puppy," "canine," and even images of dogs. That's the magic of vector search!
How does it work? Well, it transforms data into mathematical representations called vectors. These vectors are like coordinates on a map, and similar data points are closer together in this vector space. When you search for something, the system finds the vectors closest to your query, giving you results that are semantically similar.
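To make "closest vectors" concrete, here is a minimal sketch that ranks items by cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings, and the function names are illustrative rather than part of any library.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real models produce far more dimensions.
toy_vectors = {
    "puppy": [0.9, 0.8, 0.1],
    "canine": [0.8, 0.9, 0.2],
    "spreadsheet": [0.1, 0.0, 0.9],
}
query_vector = [1.0, 0.9, 0.0]  # stands in for the embedding of "dog"

def nearest(query, vectors, k=2):
    """Rank stored items by similarity to the query and keep the top k."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]

print(nearest(query_vector, toy_vectors))
```

The unrelated item ("spreadsheet") points in a different direction in the vector space, so it falls out of the top results.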
While vector search is fantastic at understanding context, it sometimes falls short when it comes to simple filtering tasks. For instance, finding all movies released before 2000 requires precise filtering, not just semantic understanding. This is where smart filtering comes in to complement vector search.
While vector search brings us closer to understanding the true meaning of queries, there's still a gap between what users want and what search engines deliver. Complex search queries like "earliest comedy movies before 2000" can still be a challenge. Semantic search might understand the concepts of "comedy" and "movies," but it might struggle with the specifics of "earliest" and "before 2000."
This is where the results start to get messy. We might get a mix of old and new comedies, or even dramas that were mistakenly included. To truly satisfy users, we need a way to refine these search results and make them more precise. That's where pre-filters come into play.
Smart filtering is the solution to this challenge. It's a technique that uses a dataset's metadata to create specific filters, refining search results and making them more accurate and efficient. By analyzing the information about your data, like its structure, content, and attributes, smart filtering can identify relevant criteria to filter your search.
Imagine searching for "comedy movies released before 2000." Smart filtering would use metadata like genre, release date, and potentially even plot keywords to create a filter that only includes movies matching those criteria. This way, you get a list of exactly what you want, without the irrelevant noise.
Smart filtering is a multi-step process that involves extracting information from your data, analyzing it, and creating specific filters based on your needs. Let's break it down:
- Metadata extraction: The first step is to gather relevant information about your data. This includes details like:
- Data structure: How is the data organized (e.g., tables, documents)?
- Attributes: What kind of information is included (e.g., title, description, release date)?
- Data types: What format is the data in (e.g., text, numbers, dates)?
- Pre-filter generation: Once you have the metadata, you can start creating pre-filters. These are specific conditions that data must meet to be included in the search results. For example, if you're searching for comedy movies released before 2000, you might create pre-filters for:
- Genre: comedy
- Release date: before 2000
- Integration with vector search: The final step is to combine these pre-filters with your vector search. This ensures that the vector search only considers data points that match your pre-defined criteria.
By following these steps, smart filtering significantly improves the accuracy and efficiency of your search results.
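The three steps can be sketched on in-memory data. This is only an illustration under stated assumptions: the movie records, the two-dimensional embeddings, and the helper names are all invented, and a plain dot product stands in for the similarity score a vector database would compute.

```python
from datetime import datetime

# Hypothetical records: metadata plus a tiny made-up embedding per movie.
movies = [
    {"title": "Movie A", "genre": ["comedy"], "release_date": datetime(1995, 6, 1), "embedding": [0.9, 0.1]},
    {"title": "Movie B", "genre": ["comedy"], "release_date": datetime(2010, 3, 15), "embedding": [0.95, 0.05]},
    {"title": "Movie C", "genre": ["drama"], "release_date": datetime(1990, 1, 20), "embedding": [0.2, 0.8]},
]

def passes_pre_filter(movie, genre, released_before):
    """Pre-filter: exact conditions derived from the metadata."""
    return genre in movie["genre"] and movie["release_date"] < released_before

def dot(a, b):
    """Dot product used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query_vector, records, genre, released_before, k=3):
    """Rank by similarity only the records that pass the pre-filter."""
    candidates = [m for m in records if passes_pre_filter(m, genre, released_before)]
    candidates.sort(key=lambda m: dot(query_vector, m["embedding"]), reverse=True)
    return [m["title"] for m in candidates[:k]]

print(filtered_search([1.0, 0.0], movies, "comedy", datetime(2000, 1, 1)))
```

Movie B is the closest match by similarity alone, but it never reaches the ranking step because it fails the release-date condition.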
To be successful with this tutorial, you will need:
- The IDE of your choosing. This tutorial uses a Jupyter notebook. Please feel free to run your commands directly from a notebook.
- An OpenAI API key. We will use OpenAI models to embed our data and generate filters. You will need access to the text-embedding-ada-002 embedding model and to gpt-4o for text generation.
- Python <4.0, >=3.8.1.
The following instructions are for running in a notebook but can be adapted to run in your IDE. There will just be some differences.
Install the required dependencies.
For the purpose of this tutorial, we will use some sample movie data. We will define a list of LangChain documents. Each document represents one movie and has a page_content field holding the movie description, plus some metadata. In the sample data, the metadata has release_date, rating, and genre.
For this tutorial, we are using OpenAI's text-embedding-ada-002 embedding model.
We will use the MongoDBAtlasVectorSearch vector store from LangChain to embed and ingest our data. It will accept a list of docs, the embedding object we created earlier, and the MongoDB client. This step will initialize our MongoDB collection with the movie data and its embeddings.
After initialization, the inserted data will look like the following:
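As a rough illustration of that shape, the sketch below builds one such record as a Python dictionary. The values are invented, the field types for release_date and genre are assumptions, and the automatically generated _id field is omitted.

```python
from datetime import datetime

# Illustrative shape only: the values are invented, and a real embedding from
# text-embedding-ada-002 has 1536 floats rather than the 3 shown here.
stored_movie = {
    "text": "A placeholder movie description taken from page_content.",
    "embedding": [0.0123, -0.0456, 0.0789],
    "release_date": datetime(1995, 11, 22),  # assumed to be stored as a date
    "rating": 8.3,
    "genre": ["Animation"],  # assumed to be stored as an array of strings
}

for field, value in stored_movie.items():
    print(f"{field}: {type(value).__name__}")
```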
We have the page_content in the text field, which is the default for the MongoDBAtlasVectorSearch vector store. The embeddings are stored as an array of floats under embedding. The metadata fields are present as release_date, rating, and genre.
Before we can perform a search on our data, we need to create a search index. Follow these steps to create a search index.
- In the Atlas UI, go to your collection smart_filtering.movies.
- Click on Search Indexes.
- Click on Create Index.
- Under Atlas Search, select JSON Editor.
- Name your index default and copy/paste the below index definition:
The index will take a couple of seconds to build. After the index builds successfully, we will be ready to query our data.
Let’s say a user wants to find the latest movie released before some date in the animation genre with this query: "I want to watch a movie released before year 2000 in the animation genre with the latest release date."
We will try semantic search for this query and see what results we get.
Output: We received four movies in the output, three of which are not relevant to the user’s query. This is the problem with semantic search that we will solve using smart filtering.
Let us analyze the filtering requirements from the user query: "I want to watch a movie released before year 2000 in the animation genre with the latest release date."
There are two types of filter requirements in the user query, and we will solve them in two stages.
We are performing the filtering in two stages because these are two different tasks and we should perform one task at a time with LLMs to get better results.
Stage 1 — metadata filter: A pre-filter can be generated based on the query and metadata.
- Release before year 2000 can be a potential filter.
- Animation genre can be a potential filter.
Stage 2 — time-based filter: We will also need to account for the latest release date. We will need to query the data to find the latest release date movie.
We will be using LangChain’s load_query_constructor_runnable to generate our filter query and then we will be using MongoDBAtlasTranslator to convert the query to a valid MongoDB query. We will need the below to pass to load_query_constructor_runnable:
- A description of the content of our data that will be passed in document_content
- Metadata attributes of our data passed in attribute_info
- Prompt with some examples that the LLM will use to generate the query
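To make the translation step concrete, here is a simplified sketch of what converting a structured filter into a MongoDB query amounts to. It is a hypothetical stand-in, not LangChain's implementation: the tuple-based representation and the function name are assumptions chosen for illustration.

```python
# Comparators and logical operators mapped to their MongoDB equivalents.
COMPARATORS = {
    "eq": "$eq", "ne": "$ne", "gt": "$gt", "gte": "$gte",
    "lt": "$lt", "lte": "$lte", "in": "$in",
}
OPERATORS = {"and": "$and", "or": "$or"}

def to_mongo_filter(node):
    """Recursively convert a (name, ...) tuple tree into a MongoDB filter dict."""
    name = node[0]
    if name in OPERATORS:
        # Logical node: ("and", [child, child, ...])
        return {OPERATORS[name]: [to_mongo_filter(child) for child in node[1]]}
    if name in COMPARATORS:
        # Comparison node: ("lt", "field", value)
        _, field, value = node
        return {field: {COMPARATORS[name]: value}}
    raise ValueError(f"Unsupported filter node: {name!r}")

structured = ("and", [
    ("lt", "release_date", "2000-01-01"),
    ("in", "genre", ["Animation"]),
])
print(to_mongo_filter(structured))
```

Keeping the structured form separate from the database syntax means the LLM only has to produce a small, constrained grammar, and the conversion to MongoDB operators stays deterministic.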
For the purpose of this tutorial, we will define the metadata that we will be using for the filtering purpose. We will need to define the content and provide a document_content_description, along with a name, description, and type for each field. This requires some basic understanding of the data.
Description of the content of our data:
Metadata attributes of our data:
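A minimal sketch of these two definitions follows. The description strings and type labels are assumed wording, and a plain dataclass (the hypothetical FieldInfo) stands in for LangChain's AttributeInfo so that the snippet runs without extra packages.

```python
from dataclasses import dataclass

@dataclass
class FieldInfo:
    """Hypothetical stand-in holding the name/description/type of one field."""
    name: str
    description: str
    type: str

# Assumed wording for the content description.
document_content_description = "Brief description of a movie"

# Field names come from the sample data; descriptions and types are assumptions.
attribute_info = [
    FieldInfo("release_date", "The date the movie was released", "date"),
    FieldInfo("rating", "The rating of the movie", "float"),
    FieldInfo("genre", "List of genres the movie belongs to", "array of strings"),
]

for field in attribute_info:
    print(f"{field.name} ({field.type}): {field.description}")
```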
Our goal is to extract meaningful information from the user query that we can use for the metadata filtering. We will pass the metadata of our data in the context such that the LLM gets an idea of the information that can be used as a filter. We will use a few-shot prompting technique to generate our results.
Few-shot prompting is a technique used with large language models, where the model is given a few examples of a task within the prompt to help guide it to produce the desired output.
Note: Please update the prompt as per your use case.
The below prompt will be passed as schema_prompt in LangChain’s load_query_constructor_runnable and will be used to generate the query.
The prompt instructs the LLM on how to generate the query. We will start from the prompt defined in LangChain’s query constructor and change it as per our use case.
Let’s break down the prompt and understand it:
- In the beginning, we instruct the LLM to output the result in JSON format with the rewritten query and filter as keys.
- We instruct the LLM to not include any information in the new query that is already accounted for in the filter.
- The variables wrapped in {} in the prompt will be filled in later.
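The points above can be illustrated with a much-simplified template. The wording of this template and its variable names are assumptions for illustration, not the tutorial's actual prompt; it only shows how literal JSON braces and fill-in variables coexist, and how a reply in the requested format can be parsed.

```python
import json

# Simplified, hypothetical template. Doubled braces are literal braces;
# single-brace names are variables filled in later by str.format.
SCHEMA_PROMPT = """Structure the user's query as JSON with two keys:
{{"query": <rewritten query>, "filter": <logical condition or "NO_FILTER">}}
Do not repeat in "query" anything already captured by "filter".
Allowed comparators: {allowed_comparators}
Allowed logical operators: {allowed_operators}
"""

prompt = SCHEMA_PROMPT.format(
    allowed_comparators=" | ".join(["eq", "lt", "gt", "in"]),
    allowed_operators=" | ".join(["and", "or"]),
)
print(prompt)

# A made-up reply in the requested format, built here instead of calling an LLM.
reply = json.dumps({
    "query": "movie",
    "filter": 'and(lt("release_date", "2000-01-01"), in("genre", ["Animation"]))',
})
parsed = json.loads(reply)
print(parsed["query"], "|", parsed["filter"])
```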
Now, we will define some examples that we will use in our prompt. The examples will help the LLM to generate better results. We will need to define metadata for our example, a user query, and an expected answer.
Let’s define three data sources:
- A songs data source that has a content description and attributes definition.
- A movies data source. This is similar to our sample data that we are trying to solve. Adding it in the few-shot examples can improve our results.
- A generic keyword data source. The LLM was struggling with generating correct query format for the keywords/array filtering so we added this to improve our results.
Note: Please add some examples as per your use case to enhance the results.
Now, let’s define some example user queries and expected answers:
Putting the above together, we can define our examples with the data source definition, user query, and expected answer:
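As a sketch of what one assembled example could look like, the snippet below pairs a data source definition with a user query and an expected answer, then renders it as text for the prompt. The field descriptions, the query, the section labels, and the render_example helper are all hypothetical.

```python
import json

# Hypothetical data source definition for the few-shot prompt.
movies_data_source = {
    "content": "Brief description of a movie",
    "attributes": {
        "release_date": {"type": "date", "description": "The date the movie was released"},
        "genre": {"type": "array of strings", "description": "Genres of the movie"},
    },
}

# One example: data source definition, user query, and expected answer.
examples = [
    {
        "data_source": movies_data_source,
        "user_query": "Comedy movies released before 2000",
        "expected_answer": {
            "query": "movies",
            "filter": 'and(in("genre", ["Comedy"]), lt("release_date", "2000-01-01"))',
        },
    },
]

def render_example(number, example):
    """Format one example as the text block that goes into the prompt."""
    return (
        f"<< Example {number}. >>\n"
        f"Data Source:\n{json.dumps(example['data_source'], indent=2)}\n"
        f"User Query:\n{example['user_query']}\n"
        f"Structured Request:\n{json.dumps(example['expected_answer'], indent=2)}\n"
    )

print(render_example(1, examples[0]))
```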
Let us define a utility function that we will use to process our filters before returning.
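The body of that utility is not reproduced here. As one plausible example of such post-processing (an assumption, not the tutorial's actual function), a helper could walk the generated filter and turn ISO date strings into datetime objects so that they compare correctly against date fields.

```python
from datetime import datetime

def convert_dates(value):
    """Recursively replace 'YYYY-MM-DD' strings with datetime objects."""
    if isinstance(value, dict):
        return {key: convert_dates(item) for key, item in value.items()}
    if isinstance(value, list):
        return [convert_dates(item) for item in value]
    if isinstance(value, str):
        try:
            return datetime.strptime(value, "%Y-%m-%d")
        except ValueError:
            return value  # not a date string; leave it unchanged
    return value

raw_filter = {"$and": [
    {"release_date": {"$lt": "2000-01-01"}},
    {"genre": {"$in": ["Animation"]}},
]}
print(convert_dates(raw_filter))
```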
Now, let's go ahead and define our generate_metadata_filter function that we will be using to generate our metadata filters.
Let us define our LLM object for text generation. We will use LangChain’s ChatOpenAI for this purpose, with the gpt-4o model for our filter generation.
Now that we have everything we need to generate the metadata filter, let’s give it a try.
Output:
Generated filter and new query:
As you can see in the generated filter, we were able to extract the information from the user’s query and the metadata, such as “release before year 2000” and “animation genre,” that can be used to pre-filter the data before running the semantic search.
Note that we are also returning a new query after removing the filters that we generated. This will be helpful in the next stage of filter generation.
Output docs:
We have received two documents in the output which is a better result than before. But we are still not able to get the “latest release date” movie.
To find the latest movie, we will need to query our data, so we move on to Stage 2 of filter generation.
The purpose of this stage is to generate the filters that can be used to find the movies with the latest release date.
Note: We will use the filter generated in Stage 1 in this stage because we want to find the movie with the latest release date among movies released before 2000 and in the animation category.
We will need to query our MongoDB collection via the LLM. For this purpose, we will be defining some tools that the LLM can use to query our data.
Let us define the tools to allow LLM to query our MongoDB collection.
In the second stage, we are only accounting for use cases where the user wants "latest," "recent," "first," or "last" type of queries. We will instruct our LLM to only generate an aggregation pipeline to generate filters for these types of queries.
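For a "latest" query, the aggregation pipeline the LLM is asked to produce might resemble the sketch below. The helper names and the mocked result date are hypothetical; with pymongo the pipeline would be executed through collection.aggregate(pipeline), which is replaced here by a hard-coded result so the snippet runs standalone.

```python
from datetime import datetime

def latest_release_pipeline(pre_filter):
    """Pipeline that returns the newest release_date among documents matching pre_filter."""
    return [
        {"$match": pre_filter},                      # reuse the Stage 1 filter
        {"$sort": {"release_date": -1}},             # newest first
        {"$limit": 1},                               # keep only the newest
        {"$project": {"_id": 0, "release_date": 1}},
    ]

def time_filter_from_result(result_docs):
    """Turn the pipeline's single result into a filter, or None if nothing matched."""
    if not result_docs:
        return None
    return {"release_date": {"$eq": result_docs[0]["release_date"]}}

stage_one_filter = {"$and": [
    {"release_date": {"$lt": datetime(2000, 1, 1)}},
    {"genre": {"$in": ["Animation"]}},
]}
pipeline = latest_release_pipeline(stage_one_filter)

# Mocked aggregation result with an invented date.
mocked_result = [{"release_date": datetime(1999, 11, 24)}]
print(time_filter_from_result(mocked_result))
```

For "first" or "earliest" queries, the same sketch would sort ascending (1 instead of -1).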
Now, let's go ahead and define our generate_time_based_filter function that we will be using to generate our time-based filters.
We have handled the cases where only one stage of filter generation is required or no filter generation is required.
Let’s run with both the filters now and check the result.
Output:
Now that we have generated both stages’ filters, we can combine them using the $and operator to generate our final filter.
Output:
Now, let’s run a semantic search using the final_pre_filter.
Output:
With smart filtering, we used both metadata and time-based filtering stages and were able to generate a filter that can be used to pre-filter the data before running a semantic search. We have received only the required documents in the end.
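The combination step and its use in a vector query can be sketched as follows. Both helper functions are hypothetical; the index name default and the embedding path come from the setup earlier in this tutorial, while the numCandidates heuristic is an assumption. The stage follows the MongoDB $vectorSearch aggregation syntax, which requires the filtered fields to be indexed for filtering.

```python
def combine_filters(metadata_filter, time_filter):
    """Join whichever stage filters exist; return None when neither was generated."""
    present = [f for f in (metadata_filter, time_filter) if f]
    if not present:
        return None
    if len(present) == 1:
        return present[0]
    return {"$and": present}

def vector_search_stage(query_vector, pre_filter, index="default", path="embedding", limit=4):
    """Build a $vectorSearch stage, attaching the pre-filter only when there is one."""
    stage = {
        "index": index,
        "path": path,
        "queryVector": query_vector,
        "numCandidates": limit * 10,  # assumed heuristic: oversample, then keep `limit`
        "limit": limit,
    }
    if pre_filter:
        stage["filter"] = pre_filter
    return {"$vectorSearch": stage}

metadata_filter = {"genre": {"$in": ["Animation"]}}
time_filter = {"rating": {"$gte": 8}}  # placeholder second filter for illustration
final_pre_filter = combine_filters(metadata_filter, time_filter)
print(vector_search_stage([0.1, 0.2, 0.3], final_pre_filter))
```

Returning None when no filter was generated lets the same search call fall back to plain semantic search.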
Smart filtering brings a host of advantages to the table, making it a valuable tool for enhancing search experiences:
- Improved search accuracy: By precisely targeting the data that matches your query, smart filtering dramatically increases the likelihood of finding relevant results. No more wading through irrelevant information.
- Faster search results: Since smart filtering narrows down the search scope, the system can process information more efficiently, leading to quicker results.
- Enhanced user experience: When users find what they're looking for quickly and easily, it leads to higher satisfaction and a better overall experience.
- Versatility: Smart filtering can be applied to various domains, from e-commerce product searches to content recommendations, making it a versatile tool.
By leveraging metadata and creating targeted pre-filters, smart filtering empowers you to deliver search results that truly meet user expectations.
Smart filtering is a powerful tool that transforms search experiences by bridging the gap between user intent and search results. By harnessing the power of metadata and vector search, it delivers more accurate, relevant, and efficient search outcomes.
Whether you're building an e-commerce platform, a content recommendation system, or any application that relies on effective search, incorporating smart filtering can significantly enhance user satisfaction and drive better results.
By understanding the fundamentals of smart filtering, you're equipped to explore its potential and implement it in your projects. So why wait? Start leveraging the power of smart filtering today and revolutionize your search game!
Check out additional resources: Unlock the Power of Semantic Search With MongoDB Atlas Vector Search and Interactive RAG With MongoDB Atlas + Function Calling API. If you have any questions or want to show us what you are building, join us in the MongoDB Community Forums.