Introduction to LangChain and MongoDB Atlas Vector Search

Anaiya Raisinghani, Prakul Agarwal • 5 min read • Published Dec 08, 2023 • Updated Jan 12, 2024
In this tutorial, we will leverage the power of LangChain, MongoDB, and OpenAI to ingest and process data that was published after the training cutoff of ChatGPT (GPT-3.5). Follow along to create your own chatbot that can read lengthy documents and provide insightful answers to complex queries!

What is LangChain?

LangChain is a versatile Python library that enables developers to build applications powered by large language models (LLMs). It makes it easier to integrate various LLMs (OpenAI’s GPT models, models hosted on Hugging Face, etc.) into other applications and to give them access to recent information they were not trained on. As the name suggests, LangChain chains together different components, called links, to create a workflow. Each link performs a different task in the process, such as accessing a data source, calling a language model, or processing output. Since the links can be reordered to create different workflows, LangChain is very flexible and can be used to build a wide variety of applications.
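To make the idea of chained links concrete, here is a minimal sketch in plain Python. It does not use LangChain itself: make_chain and the three stand-in links are hypothetical names chosen for illustration, and the "model" is just a function that upper-cases its input.

```python
from functools import reduce
from typing import Callable, Sequence

# A link is any function that takes text and returns text (an assumption
# made to keep this sketch small; real LangChain components are richer).
Link = Callable[[str], str]


def make_chain(links: Sequence[Link]) -> Link:
    """Compose links left to right: each link's output feeds the next."""
    def run(value: str) -> str:
        return reduce(lambda acc, link: link(acc), links, value)
    return run


def load_source(question: str) -> str:
    # Stand-in for a link that fetches context from a data source
    return f"context: MongoDB stores documents | question: {question}"


def fake_model(prompt: str) -> str:
    # Stand-in for a link that calls a language model
    return prompt.upper()


def trim_output(text: str) -> str:
    # Stand-in for a link that post-processes model output
    return text.strip()


chain = make_chain([load_source, fake_model, trim_output])
print(chain("what is stored?"))
```

Reordering the list passed to make_chain produces a different workflow from the same links, which is the flexibility described above.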

LangChain and MongoDB

MongoDB integrates nicely with LangChain because of the semantic search capabilities provided by MongoDB Atlas’s vector search engine. This combination lets users query based on meaning rather than specific words! Apart from the MongoDB LangChain Python integration and the MongoDB LangChain JavaScript integration, MongoDB recently partnered with LangChain on the LangChain templates release to make it easier for developers to build AI-powered apps.

Prerequisites for success

Diving into the tutorial

Our first step is to ensure we’re downloading all the crucial packages we need to be successful in this tutorial. In Google Colab, please run the following command:
!pip install langchain pypdf pymongo openai python-dotenv tiktoken
Here, we’re installing six packages with one command: langchain (the framework we are using to integrate language model capabilities), pypdf (a library for working with PDF documents in Python), pymongo (the official MongoDB driver for Python, so we can interact with our database from our application), openai (so we can use OpenAI’s language models), python-dotenv (a library used to read key-value pairs from a .env file), and tiktoken (a package for token handling).

Environment configuration

Once this command has been run and our packages have been successfully downloaded, let’s configure our environment. Prior to doing this step, please ensure you have saved your OpenAI API key and your connection string from your MongoDB Atlas cluster in a .env file at the root of your project. Help on finding your MongoDB Atlas connection string can be found in the docs.
import os
from dotenv import load_dotenv
from pymongo import MongoClient

# Load variables from a .env file in the notebook root directory.
# The file should define OPENAI_API_KEY="xxx" and MONGO_URI="xxx".
load_dotenv(override=True)

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
MONGO_URI = os.environ["MONGO_URI"]
DB_NAME = "langchain-test-2"
COLLECTION_NAME = "test"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "default"
EMBEDDING_FIELD_NAME = "embedding"

# Handles to the database and collection used throughout the tutorial
client = MongoClient(MONGO_URI)
db = client[DB_NAME]
collection = db[COLLECTION_NAME]
Please feel free to name your database, collection, and even your vector search index anything you like. Just continue to use the same names throughout the tutorial. Note that this code block only creates client-side handles: MongoDB creates the database and collection in your cluster the first time data is inserted, which happens in the next step.
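Because os.environ["..."] raises a bare KeyError when a variable is missing, it can help to fail with a clearer message before connecting. The following is a minimal sketch: require_env is a hypothetical helper, not part of python-dotenv or of this tutorial's code, and the example mapping stands in for real environment variables so that the snippet runs without any credentials.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of a required variable, or raise a readable error."""
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is missing or empty; check your .env file")
    return value


# Stand-in values for demonstration only (not real credentials)
example_env = {
    "MONGO_URI": "mongodb+srv://user:password@cluster.example.net",
    "OPENAI_API_KEY": "",
}

print(require_env("MONGO_URI", example_env))
try:
    require_env("OPENAI_API_KEY", example_env)
except RuntimeError as err:
    print(err)
```

In the notebook, calling require_env("MONGO_URI") with no second argument would read from the real environment after load_dotenv has run.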

Loading in our data

We are going to be loading in the GPT-4 Technical Report PDF. As mentioned above, this report came out after the training cutoff date of the model behind ChatGPT, so the model was not trained on it and cannot answer questions about the information in this 100-page document on its own.
The LangChain package will help us answer any questions we have about this PDF. Let’s load in our data:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import MongoDBAtlasVectorSearch

# Download the PDF and load it as one document per page
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()

# Split the pages into chunks of up to 500 characters with 50 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(data)

# Embed the chunks and insert them into the MongoDB Atlas collection defined earlier
vector_store = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=collection,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
In this code block, we are loading in our PDF, using a command to split up the data into various chunks, and then we are inserting the documents into our collection so we can use our search index on the inserted data.
To test and make sure our data is properly loaded in, run a test:
docs[0]
Your output should look like this:
[Image: output from our docs[0] command, showing whether our data is loaded correctly]
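To build intuition for what chunk_size and chunk_overlap control, here is a deliberately simplified sketch of fixed-size splitting with overlap. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter first tries to break on separators such as paragraph breaks, newlines, and spaces); split_with_overlap is a hypothetical helper written only for illustration.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij"
for chunk in split_with_overlap(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk repeats the last character of the previous one, which is the role the 50-character overlap plays in the tutorial: text that straddles a boundary still appears intact in at least one chunk.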

Creating our search index

Let’s head over to our MongoDB Atlas user interface to create our Vector Search Index. First, click on the “Search” tab and then on “Create Search Index.” On the page that opens, click on “JSON Editor.”
[Image: the “JSON Editor” option on screen]
Please make sure the correct database and collection are selected, and that the index name matches the one defined above. Then, paste in the search index definition we are using for this tutorial:
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "source"
    }
  ]
}
The two entries under fields describe the document fields to index. The first indexes the embedding field as a vector: we specify that the embedding model produces vectors with 1536 dimensions and that the similarity function used to find the k nearest neighbors is cosine. The second indexes the source field so that it can be used as a filter. It’s crucial that the number of dimensions in our search index matches that of the model we are using to embed our data.
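For readers who want to see what the cosine similarity function computes, here is a small sketch in plain Python. It uses three-dimensional stand-in vectors instead of 1536-dimensional embeddings, and it illustrates the formula only, not how Atlas implements or scores the search internally. The length check mirrors why the index dimensions must match the embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Stand-in vectors: one pointing the same way as the query, one at a right angle
query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query_vec, same_direction))  # close to 1: same direction
print(cosine_similarity(query_vec, orthogonal))      # 0: unrelated directions
```

Because only the angle matters, the vector with twice the magnitude still counts as a near-perfect match for the query.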
Check out our Vector Search documentation for more information on the index configuration settings.
Once set up, it’ll look like this:
[Image: proper configuration of our vector search index]
Create the search index and let it load.
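If you changed the field names or swapped the embedding model, one way to avoid typos is to generate the index definition from the same values your notebook uses. This is a sketch under assumptions: build_index_definition is a hypothetical helper, its defaults simply mirror the JSON in this tutorial, and the printed output is meant to be pasted into the JSON Editor by hand.

```python
import json
from typing import Any, Dict, List, Sequence

# Similarity functions accepted by Atlas Vector Search index definitions
ALLOWED_SIMILARITIES = {"cosine", "euclidean", "dotProduct"}


def build_index_definition(
    embedding_path: str = "embedding",
    num_dimensions: int = 1536,
    similarity: str = "cosine",
    filter_paths: Sequence[str] = ("source",),
) -> Dict[str, Any]:
    """Build a vector search index definition as a plain dictionary."""
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity}")
    if num_dimensions <= 0:
        raise ValueError("num_dimensions must be positive")

    fields: List[Dict[str, Any]] = [
        {
            "type": "vector",
            "path": embedding_path,
            "numDimensions": num_dimensions,
            "similarity": similarity,
        }
    ]
    # One filter entry per field we want to filter on at query time
    fields.extend({"type": "filter", "path": path} for path in filter_paths)
    return {"fields": fields}


print(json.dumps(build_index_definition(), indent=2))
```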

Querying our data

Now, we’re ready to query our data! We are going to show a few ways of querying it in this tutorial, using Vector Search along with filters to shape our results. Please ensure you are connected to your cluster before attempting to query, or the queries will fail. Let’s get started.

Semantic search in LangChain

To get started, let’s first see an example using LangChain to perform a semantic search:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import MongoDBAtlasVectorSearch

vector_search = MongoDBAtlasVectorSearch.from_connection_string(
    MONGO_URI,
    DB_NAME + "." + COLLECTION_NAME,
    OpenAIEmbeddings(disallowed_special=()),
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)

query = "gpt-4"
results = vector_search.similarity_search(
    query=query,
    k=20,
)

for result in results:
    print(result)
This gives the output:
[Image: output of LangChain semantic search]
These are the results that semantically match the intent behind the query. Now, let’s see what happens when we ask a question using LangChain.
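The index defined earlier also declares source as a filter field. As a rough sketch of what a filtered vector query looks like at the database level, the helper below builds a $vectorSearch aggregation pipeline. Several things here are assumptions for illustration: build_vector_search_pipeline is a hypothetical name, the three-value query vector stands in for a real 1536-value embedding, and the projected text and source fields assume the documents were stored with those field names. This is not necessarily the exact pipeline LangChain issues.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    index_name: str = "default",
    path: str = "embedding",
    limit: int = 20,
    num_candidates: int = 200,
    source: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that runs a (optionally filtered) vector search."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be at least as large as limit")

    stage: Dict[str, Any] = {
        "index": index_name,
        "path": path,
        "queryVector": list(query_vector),
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if source is not None:
        # Only consider chunks that came from this source document
        stage["filter"] = {"source": {"$eq": source}}

    return [
        {"$vectorSearch": stage},
        # Field names below are assumptions about how the chunks were stored
        {"$project": {"_id": 0, "text": 1, "source": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]


pipeline = build_vector_search_pipeline(
    query_vector=[0.1, 0.2, 0.3],  # stand-in for a real embedding
    source="https://arxiv.org/pdf/2303.08774.pdf",
)
print(pipeline)
```

With a live connection, a pipeline like this would be passed to collection.aggregate(...); with a single PDF loaded, the filter matches every chunk, so it becomes useful once several sources share the collection.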

Question and answering in LangChain

Run this code block to ask a question about the document and see the answer along with its source documents:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

# Retrieve up to 200 similar chunks, then keep only the first 25
qa_retriever = vector_search.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 200,
        "post_filter_pipeline": [{"$limit": 25}],
    },
)

prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

# The "stuff" chain type places all retrieved chunks into a single prompt
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=qa_retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT},
)

response = qa({"query": "gpt-4 compute requirements"})

print(response["result"])
print(response["source_documents"])
After this is run, we get the result:
GPT-4 requires a large amount of compute for training, it took 45 petaflops-days of compute to train the model. [Document(page_content='gpt3.5Figure 4. GPT performance on academic and professional exams. In each case, we simulate
This provides a succinct answer to our question, based on the data source provided.
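To see how the retrieved chunks and the question end up in one prompt, here is a plain-Python sketch of the "stuff" approach: join the chunk texts and substitute them into the template. It uses str.format rather than LangChain's PromptTemplate, build_stuffed_prompt is a hypothetical helper, and the two chunk strings are placeholders rather than real retrieval results.

```python
from typing import Sequence

PROMPT_TEMPLATE = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
"""


def build_stuffed_prompt(chunks: Sequence[str], question: str) -> str:
    """Place every retrieved chunk into the context slot of a single prompt."""
    context = "\n\n".join(chunks)  # chunks separated by a blank line
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Placeholder chunk texts standing in for retrieved documents
example_chunks = ["first retrieved chunk", "second retrieved chunk"]
prompt = build_stuffed_prompt(example_chunks, "gpt-4 compute requirements")
print(prompt)
```

Since every chunk is sent in one request, the number of chunks kept by the retriever directly affects prompt length, which is one reason to cap the results before they reach the model.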

Conclusion

Congratulations! You have successfully loaded in external data and queried it using LangChain and MongoDB. For more information on MongoDB Vector Search, please visit our documentation.
