Building a Knowledge Base and Visualization Graphs for RAG With MongoDB
Prasad Pillalamarri, Shounak Acharya • 12 min read • Published Sep 02, 2024 • Updated Sep 02, 2024
Several solutions provide rich data to improve the performance of RAG systems. Each of these alternatives offers different strengths, and the choice depends on the specific requirements of the RAG system, such as the type of data being used, the complexity of queries, and the desired quality of the generated text. In practice, combining several of these methods often yields the best results, leveraging their respective advantages to enhance both retrieval and generation processes.
MongoDB provides support for implementing vector search for fast retrieval, pre-filters, hybrid search, and knowledge bases. All of the above options can be implemented out of the box. In this article, Shounak and I would like to highlight how MongoDB — and more importantly, the JSON-based document model — can easily be used to construct a knowledge base and store the relationships between entities and nodes in a RAG architecture. We will also extend it further and use the JSON document as the base to construct hierarchical network graphs or MongoDB Charts-based visualizations.
Document databases store and index documents, allowing for fast retrieval based on complex queries. They are well-suited for storing large collections of text documents, web pages, or articles. The retriever can query these databases to fetch relevant documents based on keywords or semantic similarity, which the generator then uses to produce coherent text.
Filtering your data is useful for narrowing the scope of your semantic search and ensuring that not all vectors are considered for comparison. The $vectorSearch filter option matches only BSON boolean, date, objectId, string, and numeric values.
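As a rough illustration of pre-filtering, the sketch below builds a `$vectorSearch` aggregation stage as a plain Python dictionary. The index name `vector_index`, the fields `embedding`, `category`, and `year`, and the toy query vector are assumptions made for illustration and are not taken from this article. Fields used in `filter` would also need to be declared as filter fields in the vector search index definition.

```python
def build_vector_search_stage(query_vector, limit=5):
    """Return a $vectorSearch stage that pre-filters on scalar fields."""
    return {
        "$vectorSearch": {
            "index": "vector_index",        # hypothetical index name
            "path": "embedding",            # hypothetical vector field
            "queryVector": query_vector,
            "numCandidates": limit * 20,    # candidates considered before ranking
            "limit": limit,
            # Only documents matching this filter are compared semantically.
            "filter": {
                "$and": [
                    {"category": {"$eq": "programming"}},  # string value
                    {"year": {"$gte": 1990}},              # numeric value
                ]
            },
        }
    }


# Toy three-dimensional vector; a real one comes from an embedding model.
pipeline = [build_vector_search_stage([0.12, -0.03, 0.41])]
# With pymongo, this could be run as: collection.aggregate(pipeline)
```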
Hybrid search combines full-text search with vector embeddings so that each retrieval method compensates for the other's weaknesses. For instance, the system can first select candidates by semantic similarity over embeddings and then narrow them down with a full-text search before passing the results to the generator.
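One common way to merge the two result lists is reciprocal rank fusion (RRF). This is a swapped-in technique chosen for illustration, not necessarily how the hybrid flow described here is implemented. The sketch below assumes each retriever returns a ranked list of document ids; the ids and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked id lists; a higher fused score ranks earlier."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties alphabetically for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_b", "doc_a", "doc_c"]   # hypothetical vector search ranking
text_hits = ["doc_a", "doc_d", "doc_b"]     # hypothetical full-text ranking
fused = reciprocal_rank_fusion([vector_hits, text_hits])
```

Documents that appear high in both lists, such as `doc_a` here, end up at the top of the fused ranking.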
Knowledge bases are large repositories of structured information extracted from various sources. They often include a wide range of entities and relationships. The retriever can query these knowledge bases to fetch relevant facts and relationships, enhancing the context and detail in the generated text.
MongoDB Charts is a data visualization tool specifically designed for MongoDB Atlas, offering a fast, intuitive, and robust way to visualize your data. It supports a wide range of use cases, whether you're working with a dedicated cluster, a serverless instance, leveraging Atlas Data Federation to uncover valuable insights from combined Atlas and S3 data, or visualizing archived data in Online Archive.
In this article, we will create a dependency graph on the entities from a free text paragraph using the LLMGraphTransformer class in LangChain and OpenAI as the LLM. We will pass in a text on the history of the Python programming language and ask it to return the entities and their relationships as shown in the code snippet below.
This generates an output as below, capturing various nodes and their relationships:
As can be seen from the output above, the LLMGraphTransformer captured various entities like Python, Guido van Rossum, Netherlands, etc. and also assigned a type. For example, Python is a programming language, Guido van Rossum is a person, and Netherlands is a country.
The LLMGraphTransformer not only identifies nodes but also generates relationships between them. For instance, the output above establishes that Guido van Rossum created Python, a programming language. This connection is represented by a relationship object, which consists of a source (Python), a target (Guido van Rossum), and a relationship type (CREATED_BY). The output demonstrates multiple such relationships being captured between the identified node entities.
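To make the shape of these objects concrete, here is a minimal sketch that uses stand-in dataclasses rather than the LangChain classes themselves. The exact `type` strings, the short id `ABC`, and the relationship name `INSPIRED_BY` are assumptions for illustration; only the CREATED_BY and SUCCESSOR_OF relationship types are named in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Stand-in for a graph node: an identifier plus an entity type."""
    id: str
    type: str


@dataclass(frozen=True)
class Relationship:
    """Stand-in for a directed, typed edge between two nodes."""
    source: Node
    target: Node
    type: str


python = Node("Python", "Programming language")
abc = Node("ABC", "Programming language")
setl = Node("Setl", "Programming language")
guido = Node("Guido van Rossum", "Person")

nodes = [python, abc, setl, guido]
relationships = [
    Relationship(python, guido, "CREATED_BY"),
    Relationship(python, abc, "SUCCESSOR_OF"),
    Relationship(abc, setl, "INSPIRED_BY"),   # assumed relationship name
]

# Render each edge as a readable line of text.
edge_lines = [f"{r.source.id} -[{r.type}]-> {r.target.id}" for r in relationships]
```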
Now, using these node and relationship data structures, we can create MongoDB collections to capture the relationship graph inside MongoDB. In this example, we create a collection for each of the node types — for example, Programming_language, Country, Operating_system, etc. as shown below in the code snippet:
This creates one collection each based upon the node types as shown in the output below:
We can apply other design patterns, like polymorphic design patterns, to create a single collection with multiple object types. However, in those cases, the code needs to be modified based on the domain knowledge in the graph. In our example, we have kept the pattern more generic so that the same pattern can be utilized for generating collections and corresponding relationships without much code modification for any knowledge base.
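A generic mapping from node types to collection names could look like the following sketch. The naming rule (spaces replaced by underscores, first letter capitalized) is inferred from names such as Programming_language and is an assumption, as is the plain-dictionary node format.

```python
from collections import defaultdict


def collection_name(node_type):
    """Turn a node type such as 'Programming language' into a collection name."""
    return node_type.strip().replace(" ", "_").capitalize()


def group_nodes_by_collection(nodes):
    """Map each derived collection name to the ids of the nodes it would hold."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[collection_name(node["type"])].append(node["id"])
    return dict(grouped)


nodes = [
    {"id": "Python", "type": "Programming language"},
    {"id": "ABC", "type": "Programming language"},
    {"id": "Guido van Rossum", "type": "Person"},
    {"id": "Netherlands", "type": "Country"},
]
collections = group_nodes_by_collection(nodes)
# With pymongo, each name could be created via db.create_collection(name),
# or left to be created implicitly on the first insert.
```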
Now, in order to capture relationships between the documents across collections, we will use linking. In our case, we iterate through the relationship lists and do the following:
- For the source of the relationship, we create an array attribute on the document.
- The value of the array attribute is the target of the relationship type.
- We create these array attributes on the source for each of the relationships where the current object is the source.
For example, in the Programming_language collection, we will have Python as one of the documents. Now, in the Python document, we will have array attributes for DEVELOPED_IN, CREATED_BY, DEVELOPED_AT, SUCCESSOR_OF, and INTERFACE_WITH, as shown in the screenshot below:
Similarly, the ABC programming language, a predecessor of Python, was inspired by Setl, as shown in the following screenshot. Note that both documents are in the same Programming_language collection:
On closer inspection, however, Setl has no links of its own, because the LLMGraphTransformer output defined no relationships with Setl as the source.
The following code snippets show how to arrive at the above collections.
The above code creates a dictionary of all unique relationship types per source from the LLMGraphTransformer relationship list and gives the output as below:
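A minimal sketch of that step, assuming relationships have been flattened into `(source, type, target)` tuples, is shown here. The tuple format and the `INSPIRED_BY` name are illustrative assumptions.

```python
def relationship_types_by_source(relationships):
    """Collect the unique relationship types that start from each source node."""
    types = {}
    for source, rel_type, _target in relationships:
        types.setdefault(source, set()).add(rel_type)
    # Sort for a stable, readable result.
    return {source: sorted(rel_types) for source, rel_types in types.items()}


relationships = [
    ("Python", "CREATED_BY", "Guido van Rossum"),
    ("Python", "SUCCESSOR_OF", "ABC"),
    ("ABC", "INSPIRED_BY", "Setl"),   # assumed relationship name
]
rel_types = relationship_types_by_source(relationships)
```

Nodes that never appear as a source, such as Setl in this toy data, get no entry at all.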
Once we know of all the relationships, we create the documents for each type of collection, linking other collections on the way, and insert into MongoDB as shown in the following snippet:
The above snippet first generates documents to be inserted into the corresponding collections with all the details of linking to other documents, as shown in the output below:
Finally, we add these documents to MongoDB. We figure out the collection on which to insert by looking at the “type” field that we have inserted in the documents in the step above:
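The two steps can be sketched together as follows. This version stores target ids as plain strings in the array attributes and derives the collection name from the “type” field; both choices, along with the sample data, are assumptions for illustration.

```python
def build_documents(nodes, relationships):
    """Create one document per node, with an array attribute per relationship type."""
    docs = {n["id"]: {"id": n["id"], "type": n["type"]} for n in nodes}
    for source, rel_type, target in relationships:
        targets = docs[source].setdefault(rel_type, [])
        if target not in targets:   # avoid duplicate links
            targets.append(target)
    return list(docs.values())


def group_by_collection(docs):
    """Route each document to a collection named after its 'type' field."""
    grouped = {}
    for doc in docs:
        name = doc["type"].replace(" ", "_").capitalize()
        grouped.setdefault(name, []).append(doc)
    return grouped


nodes = [
    {"id": "Python", "type": "Programming language"},
    {"id": "ABC", "type": "Programming language"},
    {"id": "Setl", "type": "Programming language"},
    {"id": "Guido van Rossum", "type": "Person"},
]
relationships = [
    ("Python", "CREATED_BY", "Guido van Rossum"),
    ("Python", "SUCCESSOR_OF", "ABC"),
    ("ABC", "INSPIRED_BY", "Setl"),   # assumed relationship name
]
grouped = group_by_collection(build_documents(nodes, relationships))
# With pymongo, each group could then be written with
# db[name].insert_many(docs) for name, docs in grouped.items().
```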
When you have write-once, read-many use cases (1W/99R), or write-multiple-times, read-often use cases (20W/80R), it is recommended to pre-compute the rendering schema expected by your rendering engine (in our case, d3.js) and save it along with your MongoDB documents. The following output shows the schema for the Python document in the Programming_language collection from the last section, which now stores all the target nodes and the edges that start from this node:
This can be done while creating the MongoDB documents from the graph, as shown in the following snippet. Please note that the code for creating collections would still be the same as mentioned in the previous section.
The code above captures and stores one-level relations. The same concept can be used to store N-level relations per document, depending on your use case, following the subset pattern for MongoDB data design.
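As a hedged sketch of the one-level case, the function below adds pre-computed `nodes` and `edges` arrays to a document that already has its relationship arrays. The field names `nodes`, `edges`, and `label` are assumptions and may differ from the schema used in this article.

```python
def embed_render_schema(doc, reserved=("_id", "id", "type", "nodes", "edges")):
    """Return a copy of doc with render-ready 'nodes' and 'edges' arrays embedded."""
    nodes = [{"id": doc["id"], "type": doc["type"]}]
    edges = []
    seen = {doc["id"]}
    for rel_type, targets in doc.items():
        # Only list-valued, non-reserved attributes are relationship arrays.
        if rel_type in reserved or not isinstance(targets, list):
            continue
        for target in targets:
            if target not in seen:
                seen.add(target)
                nodes.append({"id": target})
            edges.append({"source": doc["id"], "target": target, "label": rel_type})
    return {**doc, "nodes": nodes, "edges": edges}


python_doc = {
    "id": "Python",
    "type": "Programming language",
    "CREATED_BY": ["Guido van Rossum"],
    "SUCCESSOR_OF": ["ABC"],
}
enriched = embed_render_schema(python_doc)
```

Extending this to N levels would mean following each target's own relationship arrays before writing the document, at the cost of larger documents.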
You can leverage all graph types. Hierarchical data can be rendered as a tree structure, a disjoint force-directed graph, or hierarchical arcs. The JSON documents below can supply the data sets for these graphs.
Graph type 1:
Graph type 2:
Graph type 3:
In our use case, we use the d3.js force-directed graph to create a visualization example for the Programming_language collection. d3.js expects two arrays, namely nodes and links, in a JSON object, where each array holds JSON objects capturing the properties of nodes and relationships, respectively. The structure looks something like below:
Each of the JSON objects within the arrays has some mandatory fields and some optional fields. For example, nodes should have “id” as a mandatory field. Similarly, relationship objects should have “source” and “target” as the mandatory fields.
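A small checker for these mandatory fields might look like the sketch below. It additionally assumes that links refer to nodes by their `id` values, which holds only if the d3.js code is configured that way; the sample graph is illustrative.

```python
def find_graph_problems(graph):
    """Return a list of human-readable problems; an empty list means the graph is usable."""
    problems = []
    node_ids = set()
    for i, node in enumerate(graph.get("nodes", [])):
        if "id" not in node:
            problems.append(f"nodes[{i}] is missing 'id'")
        else:
            node_ids.add(node["id"])
    for i, link in enumerate(graph.get("links", [])):
        for field in ("source", "target"):
            if field not in link:
                problems.append(f"links[{i}] is missing '{field}'")
            elif link[field] not in node_ids:
                problems.append(f"links[{i}].{field} refers to unknown node {link[field]!r}")
    return problems


graph = {
    "nodes": [{"id": "Python"}, {"id": "ABC"}],
    "links": [{"source": "Python", "target": "ABC", "label": "SUCCESSOR_OF"}],
}
problems = find_graph_problems(graph)
```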
In order to create this structure from our Programming_language collection, we use a graph lookup that recursively follows the relationship between Python and its predecessor, the ABC programming language. The traversal then continues to Setl, the language that inspired ABC.
The MongoDB $graphLookup stage performs a recursive search on a collection, with options for restricting the search by recursion depth and query filter.
The $graphLookup process works as follows:
- Input documents are processed in the $graphLookup stage of an aggregation pipeline.
- The search is directed to the collection specified by the “from” parameter.
- For each input document, the search starts with the value specified by startWith.
- $graphLookup compares this startWith value to the field indicated by connectToField in other documents within the “from” collection.
- When a match is found, $graphLookup retrieves the value from connectFromField and checks other documents in the “from” collection for corresponding connectToField values. Matching documents are then added to an array specified by the as parameter.
- This recursive process continues until no further matches are found or the maximum recursion depth, defined by maxDepth, is reached.
- Finally, $graphLookup appends the array to the original input document and completes its search for all input documents.
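The steps above can be sketched as an aggregation pipeline plus a small in-memory emulation of the traversal. The single link field `depends_on`, the output field `lineage`, and the sample documents are hypothetical; the collections in this article use one array per relationship type instead. The emulation also assumes that `connectToField` values are unique and that the link field is always an array.

```python
# Pipeline as it could be passed to pymongo's collection.aggregate().
pipeline = [
    {"$match": {"id": "Python"}},
    {"$graphLookup": {
        "from": "Programming_language",
        "startWith": "$depends_on",        # hypothetical link field
        "connectFromField": "depends_on",
        "connectToField": "id",
        "as": "lineage",
        "maxDepth": 2,
        "depthField": "depth",
    }},
]


def graph_lookup(docs, start_values, connect_from, connect_to, max_depth=None):
    """Emulate the traversal: match connect_to, then follow connect_from."""
    found, seen = [], set()
    frontier, depth = set(start_values), 0
    while frontier and (max_depth is None or depth <= max_depth):
        next_frontier = set()
        for doc in docs:
            key = doc[connect_to]
            if key in frontier and key not in seen:
                seen.add(key)                       # never revisit a document
                found.append({**doc, "depth": depth})
                next_frontier.update(doc.get(connect_from, []))
        frontier, depth = next_frontier, depth + 1
    return found


docs = [
    {"id": "Python", "depends_on": ["ABC"]},
    {"id": "ABC", "depends_on": ["Setl"]},
    {"id": "Setl"},
]
# Start from the link values of the input document (Python).
lineage = graph_lookup(docs, docs[0]["depends_on"], "depends_on", "id")
```

Here the lookup reaches ABC at depth 0 and Setl at depth 1, then stops because Setl has no further links.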
Finally, we create the nodes and relationship arrays and save them to a JSON file. Please note that we also add link labels, which show the relationship types between the nodes:
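A minimal sketch of this conversion, assuming documents shaped like the linked documents described earlier (an `id`, a `type`, and one array per relationship type), could look like this. The `label` field name and the sample data are assumptions for illustration.

```python
import json


def to_d3_graph(docs, reserved=("_id", "id", "type")):
    """Build the nodes and links arrays, labelling each link with its relationship type."""
    nodes, links = {}, []
    for doc in docs:
        nodes[doc["id"]] = {"id": doc["id"], "label": doc["id"]}
    for doc in docs:
        for rel_type, targets in doc.items():
            if rel_type in reserved or not isinstance(targets, list):
                continue
            for target in targets:
                # Targets stored in other collections still need a node entry.
                nodes.setdefault(target, {"id": target, "label": target})
                links.append({"source": doc["id"], "target": target, "label": rel_type})
    return {"nodes": list(nodes.values()), "links": links}


docs = [
    {"id": "Python", "type": "Programming language",
     "CREATED_BY": ["Guido van Rossum"], "SUCCESSOR_OF": ["ABC"]},
    {"id": "ABC", "type": "Programming language", "INSPIRED_BY": ["Setl"]},
    {"id": "Setl", "type": "Programming language"},
]
graph = to_d3_graph(docs)
graph_json = json.dumps(graph, indent=2)
# graph_json can then be written to a file for the d3.js page to load.
```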
This creates the python-dependecies.json file, whose contents are shown below with labels for both nodes and links:
We then use this JSON file to create the nodes and links arrays in the d3.js code. We have reused some of the code from the GitHub URL and updated the link labels there. Finally, we run a local Node.js HTTP server and render the HTML. The output looks as below:
As we can see, we are able to display the relationship graph that was captured by our MongoDB collections using d3.js.
If we take the embedded relations approach, the rendering becomes even easier and more generic. The following code reads from documents that carry the embedded target nodes and edges themselves. Simply doing a find on all documents of a collection gives us all the required nodes and edges to form the graph, as shown in the code snippet below:
This generates the “python-dependencies_embedded_2.json” file, which, when fed to the d3.js HTML code, results in the following graph, which is exactly like the one shown above:
As you can see, this is much simpler and more usable code, especially when we need to visualize nodes and relationships.
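To illustrate why the embedded approach needs so little code, here is a sketch that merges the embedded arrays from a list of documents. The field names `nodes`, `edges`, and `label` and the sample documents are assumptions; with pymongo, the list would come from a find on the collection.

```python
def merge_embedded_graphs(docs):
    """Combine per-document 'nodes' and 'edges' into one deduplicated graph."""
    nodes, links, seen_links = {}, [], set()
    for doc in docs:
        for node in doc.get("nodes", []):
            nodes.setdefault(node["id"], node)      # first occurrence wins
        for edge in doc.get("edges", []):
            key = (edge["source"], edge["target"], edge.get("label"))
            if key not in seen_links:
                seen_links.add(key)
                links.append(edge)
    return {"nodes": list(nodes.values()), "links": links}


# Stand-in for the result of a find on the Programming_language collection.
docs = [
    {"id": "Python",
     "nodes": [{"id": "Python"}, {"id": "ABC"}],
     "edges": [{"source": "Python", "target": "ABC", "label": "SUCCESSOR_OF"}]},
    {"id": "ABC",
     "nodes": [{"id": "ABC"}, {"id": "Setl"}],
     "edges": [{"source": "ABC", "target": "Setl", "label": "INSPIRED_BY"}]},
    {"id": "Setl", "nodes": [{"id": "Setl"}], "edges": []},
]
graph = merge_embedded_graphs(docs)
```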
This article provides all the integration points and code snippets that help developers leverage a knowledge base with MongoDB for RAG architectures. It demonstrates that MongoDB’s JSON document is the base for generating graph models and visualizations powered by MongoDB Atlas Charts. Additionally, please note that most of the code is out of the box, focusing on MongoDB’s key value proposition around developer productivity.
Interested in a role on MongoDB’s Partner Presales team? We have several open roles on our teams across the globe and would love for you to transform your career with us!