Simplify Semantic Search With LangChain and MongoDB

Brian Leonard • 3 min read • Published Aug 29, 2024 • Updated Aug 29, 2024
Enabling semantic search on user-specific data is a multi-step process that includes loading, transforming, embedding, and storing data before it can be queried.
[Figure: LangChain retrieval flow]
That graphic is from the team over at LangChain, whose goal is to provide a set of utilities to greatly simplify this process.
In this tutorial, we'll walk through each of these steps, using MongoDB Atlas as our vector store. Specifically, we'll use the AT&T Wikipedia page as our data source. We'll then use libraries from LangChain to load, transform, embed, and store the data:
[Figure: LangChain storage flow example]
Once the source is stored in MongoDB, we can retrieve the data that interests us:
[Figure: LangChain retrieval flow]

Prerequisites
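
To follow along, you'll need:

  • A MongoDB Atlas cluster (the free tier works) and its connection string
  • An OpenAI API key
  • Python 3, along with pip to install the requirements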

Quick start steps

  1. Get the code:
  2. Update params.py with your MongoDB connection string and OpenAI API key.
  3. Create a new Python environment
  4. Activate the new Python environment
  5. Install the requirements
  6. Load, transform, embed, and store
  7. Retrieve

The details

Load -> Transform -> Embed -> Store

Step 1: Load

There's no lack of data sources: Slack, YouTube, Git, Excel, Reddit, Twitter, and so on. LangChain provides a growing list of document loader integrations that covers these and many more.
For this exercise, we're going to use the WebBaseLoader to load the Wikipedia page for AT&T.
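
Here's a minimal sketch of what that looks like. The import path and loader configuration are what I'd expect with a recent LangChain release; check the code in the repo (vectorize.py) for the exact version:

```python
# Load the AT&T Wikipedia page into LangChain Document objects.
# Import path may differ slightly depending on your LangChain version.
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://en.wikipedia.org/wiki/AT%26T")
data = loader.load()  # a list of Documents with page_content and metadata
```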

Step 2: Transform (Split)

Now that we have a bunch of text loaded, it needs to be split into smaller chunks so we can tease out the relevant portions based on our search query. For this example, we'll use the recommended RecursiveCharacterTextSplitter. As I have it configured, it attempts to split on paragraphs ("\n\n"), then sentences ("(?<=\. )"), and then words (" "), using a chunk size of 1,000 characters. If a paragraph doesn't fit within 1,000 characters, the splitter falls back to sentence and then word boundaries to keep each chunk under the limit. You can tune chunk_size to your liking: smaller values produce more (and smaller) documents, and vice versa. A sketch of this configuration follows.
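
A minimal sketch of that splitter configuration, assuming the separators described above (the exact keyword arguments in the repo's vectorize.py may differ):

```python
# Split the loaded documents into ~1,000-character chunks.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # max characters per chunk
    chunk_overlap=0,
    separators=["\n\n", r"(?<=\. )", " "],  # paragraphs, then sentences, then words
    is_separator_regex=True,  # treat the sentence lookbehind as a regex
)
docs = text_splitter.split_documents(data)
```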

Step 3: Embed

Embedding is where you run your text through an embedding model to create a vector representation of that text. There are many providers to choose from, such as OpenAI and Hugging Face, and LangChain provides a standard interface for interacting with all of them.
For this exercise, we're going to use the popular OpenAI embedding. Before proceeding, you'll need an API key for the OpenAI platform, which you will set in params.py.
We're simply going to load the embedder in this step. The real power comes when we store the embeddings in Step 4.
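
Loading the embedder is a one-liner. In this sketch, the params.openai_api_key attribute name is an assumption about how params.py is laid out:

```python
# Create the OpenAI embedding model that will vectorize our chunks.
from langchain_openai import OpenAIEmbeddings

import params  # assumed to expose openai_api_key (and MongoDB settings)

embeddings = OpenAIEmbeddings(openai_api_key=params.openai_api_key)
```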

Step 4: Store

You'll need a vector database to store the embeddings, and lucky for you, MongoDB fits that bill. Even luckier for you, the folks at LangChain have a MongoDB Atlas module that will do all the heavy lifting for you! Don't forget to add your MongoDB Atlas connection string to params.py.
You'll find the complete script in vectorize.py, which needs to be run once per data source (and you could easily modify the code to iterate over multiple data sources).
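
The heart of that script looks roughly like this. The params field names and the import path are assumptions (the vector store also ships in langchain_community.vectorstores, depending on your version):

```python
# Embed each chunk and write the text + vector into an Atlas collection.
from pymongo import MongoClient
from langchain_mongodb import MongoDBAtlasVectorSearch

client = MongoClient(params.mongodb_conn_string)
collection = client[params.db_name][params.collection_name]

vector_store = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=embeddings,
    collection=collection,
    index_name="vsearch_index",  # must match the Atlas search index we create next
)
```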

Step 5: Index the vector embeddings

The final step before we can query the data is to create a search index on the stored embeddings.
In the Atlas console, using the JSON editor, create a Search Index named vsearch_index with the following definition:
[Screenshot: Create a Search Index - Configuration Method]
[Screenshot: Create a Search Index - JSON Editor]
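
For reference, a typical definition looks like the following. The field name (embedding) matches LangChain's default embedding key, and 1536 dimensions matches OpenAI's text-embedding-ada-002 model; adjust both if your setup differs:

```json
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "type": "knnVector",
        "dimensions": 1536,
        "similarity": "cosine"
      }
    }
  }
}
```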

Retrieve

We can now run a search, using methods like similarity_search or max_marginal_relevance_search. That would return the relevant slice of data, which in our case would be an entire paragraph. However, we can continue to harness the power of the LLM to contextually compress the response so that it more directly tries to answer our question.
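
Here's a sketch of what the retrieval side might look like. The sample question, the params field names, and the contextual-compression wiring are illustrative; see the query script in the repo for the exact code:

```python
# Query the stored embeddings, then compress the results with an LLM.
from pymongo import MongoClient
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

import params

collection = MongoClient(params.mongodb_conn_string)[params.db_name][params.collection_name]
embeddings = OpenAIEmbeddings(openai_api_key=params.openai_api_key)
vector_store = MongoDBAtlasVectorSearch(collection, embeddings, index_name="vsearch_index")

query = "Who started AT&T?"  # an illustrative question

# Plain semantic search: returns the most similar chunks (whole paragraphs here)
docs = vector_store.similarity_search(query)

# Contextual compression: an LLM extracts only the parts of each chunk
# that actually answer the question
llm = ChatOpenAI(openai_api_key=params.openai_api_key, temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(),
)

for doc in retriever.get_relevant_documents(query):
    print(doc.page_content)
```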

Resources
