Caching LLMs Response With MongoDB Atlas and Vector Search
Large language models (LLMs) have become the go-to solution across business domains in 2024. With an estimated 750 million applications expected to integrate LLMs by 2025, and training and serving LLMs consuming significant monetary resources, platforms such as OpenAI’s GPT REST API reflect those costs in their pricing. The question is how to reduce the operational cost of AI applications in production. The obvious answer is to call the API less often, but that raises another problem: maintaining the quality of the responses delivered to users.
Caching has been a fundamental solution in software engineering for many years. The application creates a cache by extracting a key from the request and storing the corresponding result. The next time the same key is requested, the server can respond immediately without extra computation.
However, an LLM query is not a fixed key but free-form text, where the meaning of the human’s question matters more than its exact wording. Consequently, a traditional cache that matches on fixed keys is not efficient enough to handle LLM queries.
Unlike a traditional cache, a semantic cache stores meaning-based representations of the data. The process of producing such a representation is called embedding: in LLM systems, a model converts text into numerical vectors that capture its semantic meaning.
We will store the embeddings in the cache system. When a new request comes in, the system extracts its semantic representation by creating an embedding. It then searches for similarities between this new embedding and the stored embeddings in the cache system. If a high similarity match is found, the corresponding cached response will be returned. This process allows for semantic-based retrieval of previously computed responses, potentially reducing the need for repeated API calls to the LLM service.
• Python (3.12.3 or newer)
• FastAPI (0.11 or newer)
• PyMongo (4.7.2 or newer)
• uvicorn (0.29.0 or newer)
We need to install the dependencies mentioned above. You can use pip as the package manager; the necessary dependencies are listed in requirements.txt. After cloning the project and entering the project’s directory, you can run the below command to install them.
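Assuming a standard requirements.txt at the project root, the install command would look like:

```shell
pip install -r requirements.txt
```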
In case you are creating an isolated project, you can enable a Python virtualenv for this specific environment.
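A typical way to create and activate a virtualenv (the directory name venv is just a convention):

```shell
python -m venv venv
source venv/bin/activate
```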
To simulate the caching server, requests will come in over HTTP. We thus set up a web server in Python with FastAPI.
1.2.1) Create app.py in the root directory.
1.2.2) Import FastAPI and initiate the / and /ask routes.
app.py
Next, run the application to test our routes. (--reload enables hot reload when the application code is edited.)
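With uvicorn, the run command (assuming the file is app.py and the FastAPI instance is named app) would be:

```shell
uvicorn app:app --reload
```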
Your server should now be running at http://127.0.0.1:8000. Now, we can test our ask route using the command below.
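A curl call against the /ask route might look like this (the query parameter name is an assumption):

```shell
curl "http://127.0.0.1:8000/ask?query=hello"
```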
The server must respond as below:
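With an echo-style route, the response would be shaped roughly like the following (the answer field name is an assumption):

```json
{"answer": "hello"}
```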
We previously set up a basic FastAPI server and an /ask route. Next, the LLM functionality will be integrated.
1.3.1) Create llm.py in the same directory as app.py.
1.3.2) Set up OpenAI as an LLM service.
llm.py
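A sketch of llm.py using the OpenAI Python SDK (v1.x). The model name and function name are assumptions; the client reads the OPENAI_API_KEY environment variable:

```python
# llm.py -- wrapper around the OpenAI chat completions API
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment


def ask_llm(query: str) -> str:
    # Send the user's query to the chat model and return the text answer
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": query}],
    )
    return completion.choices[0].message.content
```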
We have to modify app.py with a few lines of code.
app.py
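The modified app.py would delegate /ask to the LLM wrapper (the helper name ask_llm from llm.py is an assumption):

```python
# app.py -- the /ask route now calls the LLM
from fastapi import FastAPI
from llm import ask_llm

app = FastAPI()


@app.get("/")
def root():
    return {"message": "Server is running"}


@app.get("/ask")
def ask(query: str):
    # Forward the query to OpenAI and return its answer
    return {"answer": ask_llm(query)}
```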
Then, we can invoke the ask route with a new query.
Now, we can receive a response from the OpenAI LLM. However, the system still relies on the OpenAI service for every request. Our goal is to shift load from the AI service to the cache system.
To cache the LLM response, we must transform our text (or any type of data) into vector data.
A vector can be thought of as an N-dimensional (depending on the embedding model) array in which each number represents the meaning of the original data.
Example:
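A toy illustration of the idea (these 4-dimensional vectors are made up, not real model embeddings; production embeddings have hundreds or thousands of dimensions). Semantically similar texts map to nearby vectors, which we can measure with cosine similarity:

```python
# Cosine similarity: 1.0 means identical direction, ~0 means unrelated
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)


# Hypothetical tiny "embeddings" for three sentences
how_are_things = [0.8, 0.1, 0.4, 0.2]      # "How are things with you?"
how_are_you = [0.7, 0.2, 0.5, 0.1]         # "How are you today?"
capital_of_france = [0.1, 0.9, 0.0, 0.6]   # "What is the capital of France?"

print(cosine_similarity(how_are_things, how_are_you))        # high: similar meaning
print(cosine_similarity(how_are_things, capital_of_france))  # low: unrelated meaning
```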
We can embed our data using a language model. In our case, we utilize OpenAI's text-embedding model. Therefore, we modify llm.py with a few lines.
llm.py
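The extended llm.py adds an embedding helper next to the chat wrapper (the model name text-embedding-ada-002 and helper names are assumptions):

```python
# llm.py -- chat completion plus an embedding helper
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment


def ask_llm(query: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": query}],
    )
    return completion.choices[0].message.content


def create_embedding(text: str) -> list:
    # Convert text into a numerical vector representing its meaning
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response.data[0].embedding
```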
So, we shall modify app.py to use the new functionality of llm.py.
app.py
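In the updated app.py, the /ask route embeds the query before answering, and prints the vector so we can inspect it in the server shell (helper names are assumptions):

```python
# app.py -- embed each incoming query before calling the LLM
from fastapi import FastAPI
from llm import ask_llm, create_embedding

app = FastAPI()


@app.get("/")
def root():
    return {"message": "Server is running"}


@app.get("/ask")
def ask(query: str):
    embedding = create_embedding(query)
    print(embedding)  # inspect the query's vector in the server shell
    return {"answer": ask_llm(query)}
```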
If we run the curl command to invoke ask again, the server shell should print data similar to the output below.
We now have vector data representing the query’s semantics. Let’s see how to store it for our cache system.
MongoDB Atlas Vector Search is a key feature that lets us enable AI-powered semantic search over vector data. To do so, we must store documents in the MongoDB database first.
First, register for a MongoDB Atlas account. Existing users can sign in to MongoDB Atlas. Follow the instructions and select the Atlas UI as the procedure to deploy your first cluster.
4.1) Connect MongoDB with Python.
4.1.1) Create db.py in the same directory as app.py.
4.1.2) Implement document saving in MongoDB.
db.py
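A sketch of db.py with PyMongo. The MONGODB_URI environment variable, the database/collection names cache_db and responses, and the helper name save_cache are all assumptions:

```python
# db.py -- persistence layer for the semantic cache
import os

from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URI"])
collection = client["cache_db"]["responses"]


def save_cache(query: str, response: str, embedding: list) -> None:
    """Store the query, the LLM response, and the query's embedding."""
    collection.insert_one(
        {"query": query, "response": response, "embedding": embedding}
    )
```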
4.1.3) Save response from AI to database.
Modify app.py to save the AI response and its vector information in the database.
app.py
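The modified app.py now persists each answer together with its embedding (helper names from llm.py and db.py are assumptions):

```python
# app.py -- save every LLM answer and its vector to MongoDB
from fastapi import FastAPI
from llm import ask_llm, create_embedding
from db import save_cache

app = FastAPI()


@app.get("/")
def root():
    return {"message": "Server is running"}


@app.get("/ask")
def ask(query: str):
    embedding = create_embedding(query)
    answer = ask_llm(query)
    save_cache(query, answer, embedding)  # cache for future similar queries
    return {"answer": answer}
```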
4.2) Create a vector search index in MongoDB Atlas
MongoDB’s Vector Search enables AI-powered experiences by performing semantic search over unstructured data through embeddings created with machine learning models. We have to create a vector search index for our collection. You can go to the database in Atlas -> Atlas Search -> CREATE SEARCH INDEX.
Below is the JSON editor version of Atlas index.
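A definition in the Atlas vector search index format might look like the following. The field path embedding, the similarity function, and numDimensions (1536 matches OpenAI's text-embedding-ada-002) are assumptions that must match your documents and embedding model:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    }
  ]
}
```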
Logically, when we receive a new request from the client, we’ll embed the search query and perform a vector search to find the documents that contain embeddings that are semantically similar to the query embedding.
Vector search is one of the stages of aggregation pipelines. The pipeline is constructed as shown below.
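A sketch of that pipeline as a Python helper. The index name vector_index and the field path embedding are assumptions that must match the Atlas index you created:

```python
# Build an aggregation pipeline with a $vectorSearch stage
def build_pipeline(query_embedding, index_name="vector_index", limit=1):
    return [
        {
            "$vectorSearch": {
                "index": index_name,          # Atlas vector search index name
                "path": "embedding",          # document field holding the vector
                "queryVector": query_embedding,
                "numCandidates": 100,         # candidates considered before ranking
                "limit": limit,               # number of matches to return
            }
        },
        {
            "$project": {
                "response": 1,
                "score": {"$meta": "vectorSearchScore"},  # similarity score
            }
        },
    ]
```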
Modify db.py and app.py to implement the PyMongo aggregation pipeline for vector search.
db.py
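The extended db.py adds a vector_search helper alongside the save function (connection string, index name, and collection names remain assumptions):

```python
# db.py -- persistence plus vector search for the semantic cache
import os

from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URI"])
collection = client["cache_db"]["responses"]


def save_cache(query: str, response: str, embedding: list) -> None:
    collection.insert_one(
        {"query": query, "response": response, "embedding": embedding}
    )


def vector_search(query_embedding: list):
    """Return the best-matching cached document with its score, or None."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": 100,
                "limit": 1,
            }
        },
        {"$project": {"response": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
    results = list(collection.aggregate(pipeline))
    return results[0] if results else None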
app.py
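The final app.py checks the cache first and only falls back to the LLM on a miss. The 0.8 similarity threshold and the cached flag in the response are assumptions worth tuning for your data:

```python
# app.py -- semantic cache lookup before calling the LLM
from fastapi import FastAPI
from llm import ask_llm, create_embedding
from db import save_cache, vector_search

app = FastAPI()
SIMILARITY_THRESHOLD = 0.8  # assumed cutoff for treating a match as a cache hit


@app.get("/")
def root():
    return {"message": "Server is running"}


@app.get("/ask")
def ask(query: str):
    embedding = create_embedding(query)
    cached = vector_search(embedding)
    if cached and cached["score"] >= SIMILARITY_THRESHOLD:
        # Similar question answered before: serve the cached response
        print(f"Cache hit, score={cached['score']}")
        return {"answer": cached["response"], "cached": True}
    # Cache miss: call the LLM and store the result for next time
    answer = ask_llm(query)
    save_cache(query, answer, embedding)
    return {"answer": answer, "cached": False}
```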
We can try to send the request to our system. Let’s ask the system, “How are things with you?”
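The request could be sent with curl like this (query parameter name assumed):

```shell
curl "http://127.0.0.1:8000/ask?query=How+are+things+with+you%3F"
```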
For the first request, the system retrieves the response from the AI service.
Let’s try to ask the system a new question (but with a similar meaning): How are you today?
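The second, semantically similar request:

```shell
curl "http://127.0.0.1:8000/ask?query=How+are+you+today%3F"
```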
Now, the system will return cache data from MongoDB Atlas.
If you go to shell/terminal, you will see a log like below.
It seems that the query “How are you today?” is 80% similar to “How are things with you?” That is what we expect.
This article outlines the implementation of a semantic caching system for LLM responses using MongoDB Atlas and Vector Search. The solution covered in this article aims to reduce costs and latency associated with frequent LLM API calls by caching responses based on query semantics rather than exact matches.
The solution integrates FastAPI, OpenAI, and MongoDB Atlas to create a workflow where incoming queries are embedded into vectors and compared against cached entries. Matching queries retrieve stored responses, while new queries are processed by the LLM and then cached.
Key benefits include reduced LLM service load, lower costs, faster response times for similar queries, and scalability. The system demonstrates how combining vector search capabilities with LLMs can optimize natural language processing applications, offering a balance between efficiency and response quality.