BlogAtlas Vector Search voted most loved vector database in 2024 Retool State of AI reportLearn more >>
MongoDB Developer
Sign in to follow topics
MongoDB Developer Centerchevron-right
Developer Topicschevron-right

Beyond Vectors: Augment LLM Capabilities With MongoDB Aggregation Framework

Fabian Valle16 min read • Published Jun 20, 2024 • Updated Jun 20, 2024
Facebook Icontwitter iconlinkedin icon
Rate this tutorial
In the field of investment management, having transaction data that are updated in real time is your most powerful ally. One bad decision can negatively impact your entire portfolio. If you know how to leverage transactional data, you can use it to discover actionable insights and make more strategic investment decisions. This article will explore how MongoDB's aggregation framework and GenAI work together to transform your data analysis workflow.
Large language models (LLMs) have significantly changed the way we interact with computers, providing capabilities such as drafting emails, writing poetry, and even engaging in human-like conversations. However, when it comes to dealing with complex data processing and mathematical calculations, LLMs have their limitations.
While LLMs excel at language, they can’t understand and manipulate numbers or symbols in the same way. That's where MongoDB's aggregation framework shines. It allows you to process entire collections of data, passing it through a multi-stage pipeline. Within these stages, you can perform calculations and transformations on entire collections. This allows you to bypass the limitations of LLMs, providing a reliable method for data analysis.
In this article, we’ll use the MongoDB aggregation framework and GenAI to overcome the limitations of classic RAG. We'll explore the MongoDB Atlas sample dataset — specifically, the sample_analytics database and the transactions collection. The sample_analytics database contains three collections for a typical financial services application: customers, accounts, and transactions.
For this example, we'll focus on transaction data, which offers a realistic dataset that allows users to hone their skills in data analysis, querying, and aggregation, particularly in the context of finance.
The source code is available at GitHub - mdb-agg-crewai
Before we start
To follow along, you'll need:
  1. An LLM resource: CrewAI supports various LLM connections, including local models (Ollama), APIs like Azure, and all LangChain LLM components for customizable AI solutions. Learn more about CrewAI LLM support.
Note: The source code in the example uses Azure OpenAI. To follow along, you’ll need a valid Azure OpenAI deployment.
The sample_analytics database contains three collections (customers, accounts, transactions) for a typical financial services application. The transactions collection contains transaction details for users. Each document contains an account ID, a count of how many transactions are in this set, the start and end dates for transactions covered by this document, and a list of sub-documents. Each sub-document represents a single transaction and the related information for that transaction.
  • transaction_id: This is a unique identifier that distinctly marks each transaction.
  • account_id: This field establishes a connection between the transaction and its corresponding account.
  • date: This represents the precise date and time at which the transaction took place.
  • transaction_code: This indicates the nature of the transaction, such as a deposit, withdrawal, buy, or sell.
  • symbol: This field denotes the symbol of the stock or investment involved in the transaction.
  • amount: This reflects the value of the transaction.
  • total: This captures the comprehensive transacted amount, inclusive of quantities, fees, and any additional charges associated with the transaction.
Transaction document schema

The task: uncover hidden opportunities

Picture this: You're running a company with a standard financial services application. Your objective? Spot hidden opportunities in the market by scrutinizing all transaction data and identifying the top three stocks based on net gain or loss. We can then research current events and market trends to uncover potential opportunities in the stocks that have historically shown the best net gain, according to our transaction data.
Net gain provides a clear picture of the profitability of an investment over a certain period. It's the difference between the total amount received from selling an investment (like stocks) and the total amount spent buying it.
There are several reasons why net gain matters. It allows investors to determine the profitability of their investments (positive or negative). Also, net gain helps investors compare how different investments are performing, whether they’re succeeding or falling behind. Third, net gain can be used to evaluate the effectiveness of an investment strategy. Investments that consistently result in a negative net gain might be too risky and may need to be sold off. And lastly, net gain can influence future investment decisions and mitigate risk — for instance, by identifying which stocks have historically shown the best net gain.
In a traditional SQL environment, calculating the net gain on transactional data would require multiple subqueries, temporary tables, and joins — a complex and potentially inefficient process, especially when dealing with large datasets. It could also be a resource-intensive task, demanding significant computational power and time.
The process can be more efficient by harnessing the power of MongoDB's aggregation framework, combined with the intelligent capabilities of AI technologies like CrewAI and LLMs. This not only streamlines the process but also offers deeper insights.

The solution: MongoDB's aggregation framework

The aggregation pipeline we will build calculates the total buy and sell values for each stock. Then, it calculates the net gain or loss by subtracting the total buy value from the total sell value. Next, the stocks are sorted by net gain or loss in descending order, with the highest net gains at the top.
If you’re new to MongoDB, I suggest you build this aggregation pipeline using the aggregation builder in Compass and then export it to Python. The aggregation pipeline builder in MongoDB Compass helps you create aggregation pipelines to process documents from a collection or view and return computed results.

Supercharge investment analysis with MongoDB and CrewAI

CrewAI Visualization
The MongoDB aggregation pipeline gives us the data we need to analyze. When you can extract meaningful insights from raw data faster, you can make better investment decisions. CrewAI, combined with MongoDB Atlas, provides a unique approach that goes beyond basic number-crunching to deliver actionable insights.
For this example, we will create an Investment Researcher agent. This agent finds valuable data using tools like search engines. It's designed to identify financial trends, company news, and analyst insights. Learn more about creating agents using CrewAI.

Unlocking the power of AI collaboration: agents, tasks, and tools

Artificial intelligence (AI) is rapidly evolving, transforming how we work in the data-driven world we live in. CrewAI introduces a framework for collaborative AI that empowers teams to achieve more by leveraging specialized AI units and streamlined workflows.
At the core of CrewAI lie agents. These are not your typical AI assistants. Instead, they function as intelligent team members, each with a distinct role (e.g., researcher, writer, editor) and a well-defined goal. They can perform tasks, make decisions, and communicate with other agents.
But what truly sets CrewAI apart is the seamless collaboration between these agents. This is achieved through a system of tasks. Tasks act as the building blocks of CrewAI workflows, allowing you to define a sequence of actions that leverage the strengths of different agents.
CrewAI also provides a comprehensive arsenal of tools that empower these agents. These tools include web scraping, data analysis, and content generation. By equipping agents with the right tools, you can ensure they have everything they need to perform their tasks effectively.
In essence, CrewAI's powerful combination of agents, tasks, and tools empowers you to:
  • Automate repetitive tasks.
  • Streamline workflows.
  • Unlock the true potential of AI.

The code

In this section, we'll walk through the Python code used to perform financial analysis based on transaction data stored in MongoDB, using GenAI for data analysis. The Python version used during development was 3.10.10.
Here are the required packages to run the code. Make sure they are installed properly before continuing.
You can install all the packages by running pip install -r requirements.txt.

MongoDB setup

First, we set up a connection to MongoDB using PyMongo. This is where our transaction data is stored.
Important: While we're including the connection string directly in the code for demonstration purposes, it's not recommended for real-world applications. A more secure approach is to retrieve the connection string from your MongoDB Atlas cluster.
Here's how to access your connection string from Atlas:
  • Log in to your MongoDB Atlas account and navigate to your cluster.
  • Click on "Connect" in the left-hand navigation menu.
  • Choose the driver you'll be using (e.g., Python) and its version.
  • You'll see a connection string provided. Copy this string for use in your application.
Once you have your connection string, you are ready to start.

Azure OpenAI setup

Next, we set up our Azure OpenAI LLM resource. The code in the example uses Azure OpenAI. To follow along, you’ll need a valid Azure OpenAI deployment.

Web search API setup

For this example, we will be using the the DuckDuckGo Search LangChain integration. The DuckDuckGo Search is a component that allows users to search the web using DuckDuckGo.
DuckDuckGo was chosen for this example because it:
  • Requires no API key.
  • Is easy to use.
  • Provides snippets.
DuckDuckGo Search Tool

CrewAI setup

We'll be using CrewAI to manage our agents and tasks. In this case, we have one agent — a researcher who is tasked with analyzing the data and providing insights. ​​In CrewAI, tasks are the individual steps that make up a larger workflow.

Agents and tasks: working together as a crew

In CrewAI, a crew represents a collaborative group of agents working together to achieve a set of tasks. While our example is a single-agent crew for simplicity, you can create multi-agent crews for more complex workflows.
  • Tasks: These are the individual steps that make up your investment research workflow. Each task represents a specific action the agent needs to take to achieve the overall goal.
  • Agents: Think of these as the workers who execute the tasks. We'll have a dedicated Investment Researcher agent equipped with the necessary tools and knowledge to complete the assigned tasks.

Fine-tuning your Investment Researcher

CrewAI allows you to customize your agent's behavior through various parameters:
  • Role and goal (AGENT_ROLE & AGENT_GOAL): These define the agent's purpose. Here, we set the role to "Investment Researcher" with a goal of "identifying investment opportunities." This guides the agent toward relevant data sources and analysis methods (e.g., market trends, company news, analyst reports).
  • Backstory: Craft a backstory like "Expert stock researcher with decades of experience" to add context and potentially influence the agent's communication style and interpretation of information.
  • Tools: Equip your agent with tools (functions or classes) to complete its tasks. This could include a search tool for gathering information or an analysis tool for processing data.
  • Large language model (LLM): This is the AI engine powering the agent's tasks, like text processing and generation. Choosing a different LLM can significantly impact the agent's output based on the underlying LLM’s strengths and weaknesses.
  • Verbose (verbose): Setting verbose=True provides a more detailed log of the agent's thought process for debugging purposes.
By adjusting these parameters, you can tailor your investment research agent to focus on specific market sectors, prioritize information sources, and even influence its risk tolerance or investment style (through the backstory).

MongoDB aggregation pipeline

Next, we define our MongoDB aggregation pipeline. This pipeline is used to process our transaction data and calculate the net gain for each stock symbol.
Here's a breakdown of what the MongoDB pipeline does:
  1. Unwinding transactions: Each document contains information about multiple stock purchases and sales. The pipeline uses the $unwind operator to unpack an array field named "transactions" within each document. Unwinding separates these transactions into individual documents, simplifying subsequent calculations.
  2. Grouping by symbol: Next, the $group operator groups the unwound documents based on the value in the "transactions.symbol" field. This essentially combines all transactions for a specific stock (represented by the symbol) into a single group.
  3. Calculating buy and sell values: Within each symbol group, the pipeline calculates two crucial values:
    • buyValue: This uses the $sum accumulator along with a conditional statement ($cond). The $cond checks if the "transaction_code" within the "transactions" object is "buy." If it is, it converts the "total" field (the transaction amount) to a double using $toDouble and adds it to the running total for buyValue. If it's not a buy transaction, it contributes nothing (0) to the sum. This effectively calculates the total amount spent buying shares of that specific symbol.
    • sellValue: Similar to buyValue, this calculates the total amount received by selling shares of the same symbol. It uses the same logic but checks for "transaction_code" equal to "sell" and sums those "total" values.
  4. Projecting results: Now, the $project operator defines the final output format. It discards the automatically generated grouping identifier (_id) by setting it to 0. It then renames the grouping field (_id which held the "transactions.symbol") to a clearer name, "symbol." Finally, it calculates the net gain or loss for each symbol using the $subtract operator. This subtracts the buyValue from the sellValue to determine the net gain or loss for that symbol.
  5. Sorting by net gain: The $sort operator organizes the results. It sorts the documents based on the "netGain" field in descending order (-1) so that the symbols with the highest net gain (most profitable) will appear first in the final output.
  6. Limiting results: Lastly, the $limit operator limits the number of documents passed to the next stage in the pipeline. In this case, it's set to 3, meaning only the top three documents (stocks with the highest net gain) will be included in the final output.

Preliminary check: ensuring error-free execution

Before we initiate our automated agent workflow, we must ensure that the code executed so far is error-free.
The expected output should resemble the following:

Initiating the agent task execution

We can now kick off our task execution. The researcher agent will utilize the data derived from our MongoDB aggregation, along with any other tools at its disposal, to analyze the data and offer insight.
MongoDB Aggregation Pipeline Results Screenshot

Complete Source Code


Example OUTPUT

Limitations and considerations

MongoDB's aggregation framework and GenAI are powerful tools for analyzing data, but we must recognize a few potential limitations.
First, there’s a bigger dependence on historical data. The past performance of an investment isn’t necessarily indicative of future results. This is especially the case in unpredictable markets.
Second, there’s a dependence on search result snippets. The snippets provided by DuckDuckGo may not always provide enough information. You would perhaps want to consider scraping the search result URL using something like Firecrawl, which can crawl and convert any website into clean markdown or structured data.
Next, there’s always going to be uncertainty in predictions, despite how savvy these tools can be.
And finally, we must consider that LLMs have their own limitations. They’re always evolving and continually improving. However, biases in training data or limitations in the model's architecture could lead to inaccurate or misleading insights.
It’s important to be aware of these limitations so you can ensure a more responsible and well-rounded approach to investment analysis.


In this article, we explored how MongoDB's aggregation framework, large language models, and CrewAI can be leveraged to transform investment analysis. The key to making smarter investment decisions is harnessing the power of your transaction data. MongoDB's aggregation framework provides the tools to efficiently calculate essential metrics like net gain, right within the data platform, with no additional code required at the application layer.
When combined with CrewAI's ability to automate research workflows, you gain a deeper understanding of the market, identify new opportunities, make smarter decisions, and boost your investment success.

The future: AI-powered investment analysis

The future of investment analysis belongs to those who embrace data and AI. By combining MongoDB's robust data platform with the insight-generating capabilities of AI tools like CrewAI, you gain the ability to:
  • Analyze trends faster than those relying on traditional methods.
  • Identify profitable patterns that others miss.
  • Make informed decisions backed by both raw data and contextual insights.
  • Automate tedious analysis, giving you more time for strategic thinking.
Don't just analyze the market — shape it. Explore MongoDB and AI today, and transform your investment decision-making process.
The source code is available at GitHub - mdb-agg-crewai.
Questions? Comments? Join us in the MongoDB Developer Community to continue the conversation.

Facebook Icontwitter iconlinkedin icon
Rate this tutorial

Creating, Reading, Updating, and Deleting MongoDB Documents with PHP

Aug 24, 2023 | 8 min read

Easy Migration: From Relational Database to MongoDB with MongoDB Relational Migrator

Jan 04, 2024 | 6 min read
News & Announcements

Improved Error Messages for Schema Validation in MongoDB 5.0

Jun 14, 2023 | 10 min read

Leveraging Atlas Vector Search With HashiCorp Terraform: Empowering Semantic Search in Modern Applications

May 02, 2024 | 4 min read
Table of Contents