Discover how you can implement video search capabilities in your media applications using MongoDB Atlas and Voyage AI’s multimodal embedding models.
Use cases: Gen AI
Industries: Media
Products: MongoDB Atlas, MongoDB Atlas Vector Search, Voyage AI Multimodal Embeddings
Partners: OpenAI for speech-to-text conversion
Solution Overview
The media and entertainment industry is embracing digital transformation to drive growth. According to a PwC study, industry revenues are expected to grow at a 3.7% compound annual growth rate from 2025 to 2029, rising from US$2.9 trillion to US$3.5 trillion. For media companies, a driving force behind this growth is video entertainment delivered through social video platforms, on-demand streaming, and news channels.
In the US alone, consumers spend an average of six hours per day with media and entertainment content, more than half of which is devoted to video. Media companies must capitalize on this video data to offer premium video experiences for their customers and streamline operations. Leveraging video data enables the development of innovative applications, such as semantic video search.
Semantic video search enables users to find specific content in videos based on its contextual meaning. This technique uses embeddings and vector search to transform video content into numerical representations that algorithms can process. For example, a user can submit a query such as "police cars on the road", and the video search application locates the corresponding scene in the video.
This solution shows how you can implement a semantic video search service for a media application. In this service, MongoDB Atlas supports data storage and vector search capabilities, while Voyage AI provides multimodal embeddings. This functionality provides the following benefits:
Enables better user experiences with enhanced content discovery.
Reduces time spent searching for information in lengthy videos, improving efficiency.
Drives revenue by attracting new customer groups and increasing loyalty among existing ones.
You can extend the concepts of this solution to other industries, such as insurance, telecommunications, or retail.
Reference Architectures
This framework uses MongoDB Atlas for data storage and semantic search, Voyage AI for embeddings, and OpenAI to convert speech to text. The implementation pre-processes the video and audio, and then uses a semantic search component. Figure 1 illustrates this pre-processing stage.
Figure 1. Video processing framework
The workflow operates as follows:
1. The moviepy Python library transforms the MP4 movie file into image frames and an MP3 audio file.
2. The pydub library converts the audio into chunks.
3. A speech-to-text provider converts the audio chunks into text.
4. Voyage AI transforms pairs of text and images into embeddings with a multimodal embedding model. The embeddings encode both modalities in a single transformer, creating a unified vector representation that captures the meaning of visuals and text together.
5. MongoDB Atlas stores the vectors and their metadata as documents, with timestamps that identify individual documents.
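The following sketch illustrates the first three steps, assuming the moviepy, pydub, and openai Python packages are available. The file names, the one-frame-per-second sampling rate, and the 30-second chunk length are illustrative choices rather than values prescribed by the repository:

import os
from moviepy.editor import VideoFileClip  # MoviePy 2.x: from moviepy import VideoFileClip
from openai import OpenAI
from pydub import AudioSegment

VIDEO_FILE = "mymovie.mp4"      # illustrative file name
FRAME_INTERVAL_S = 1            # sample one frame per second (assumption)
CHUNK_LENGTH_MS = 30_000        # 30-second audio chunks (assumption)

# Step 1: extract image frames and the audio track from the MP4 file.
clip = VideoFileClip(VIDEO_FILE)
for t in range(0, int(clip.duration), FRAME_INTERVAL_S):
    clip.save_frame(f"frame_{t:05d}.png", t=t)
clip.audio.write_audiofile("mymovie.mp3")

# Step 2: split the audio into fixed-length chunks with pydub.
audio = AudioSegment.from_mp3("mymovie.mp3")
chunk_files = []
for i, start in enumerate(range(0, len(audio), CHUNK_LENGTH_MS)):
    chunk_file = f"chunk_{i:03d}.mp3"
    audio[start:start + CHUNK_LENGTH_MS].export(chunk_file, format="mp3")
    chunk_files.append(chunk_file)

# Step 3: transcribe each chunk with a speech-to-text provider (OpenAI here).
stt = OpenAI()  # reads OPENAI_API_KEY from the environment
transcripts = []
for chunk_file in chunk_files:
    with open(chunk_file, "rb") as f:
        transcripts.append(stt.audio.transcriptions.create(model="whisper-1", file=f).text)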
After pre-processing, you can create your Vector Search indexes and perform semantic search in your application. The image below represents this process:
Figure 2. Video search process with MongoDB
In this workflow, Vector Search finds the metadata of the best matching video and its timestamp. With this information, the application displays the results at the appropriate video offset.
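A minimal sketch of this step with PyMongo and the Voyage AI client follows. The index name, the database and collection names, and the 1024-dimension setting for voyage-multimodal-3 embeddings are assumptions to adapt to your own deployment:

import os
import voyageai
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

coll = MongoClient(os.environ["MONGODB_IST_MEDIA"])["ist_media"]["video_embeddings"]  # hypothetical names

# Create the Vector Search index on the embedding field (run once).
coll.create_search_index(
    SearchIndexModel(
        name="vector_index",
        type="vectorSearch",
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "embedding",
                    "numDimensions": 1024,  # assumed voyage-multimodal-3 output size
                    "similarity": "cosine",
                }
            ]
        },
    )
)

# Embed the text query with the same multimodal model used during pre-processing.
vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
query_vector = vo.multimodal_embed(
    inputs=[["police cars on the road"]], model="voyage-multimodal-3"
).embeddings[0]

# Retrieve the best-matching documents, including the video name and offsets.
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 3,
        }
    },
    {"$project": {"_id": 0, "movie": 1, "offset": 1, "text_offset": 1,
                  "score": {"$meta": "vectorSearchScore"}}},
]
for hit in coll.aggregate(pipeline):
    print(hit)  # e.g. play hit["movie"] starting at hit["offset"] seconds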
You can now search for content shown in the video, such as "basketball" or "car play ultra". The application selects one of the two available videos and plays it from the appropriate offset.
Data Model Approach
Vector embeddings convert text, voice, and other unstructured data into numerical values that represent their meaning. Building on this concept, multimodal embedding models vectorize interleaved text and images into a single vector space with the same dimensionality.
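As an illustration of this joint vectorization, the sketch below embeds a transcript snippet together with a video frame using Voyage AI's Python client; the frame file name and the text are hypothetical:

import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

frame = Image.open("frame_00010.png")       # hypothetical frame extracted earlier
text = "Police cars drive down the road."   # transcript snippet for the same scene

# Text and image are passed as one interleaved input and embedded together.
result = vo.multimodal_embed(inputs=[[text, frame]], model="voyage-multimodal-3")
embedding = result.embeddings[0]            # single vector capturing both modalities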
You can use the flexibility of the document model to store multimodal embeddings along with their metadata in a single document. The following code shows a sample document:
{ "movie": "mymovie" , "offset": 0, "text_offset": 0, "embedding": [<list of floats>] }
The embedding field contains the joint information from the embedded images and text. The metadata includes the video name, the image offset, and the voice offset. You can adapt this structure to your specific requirements.
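A minimal sketch of persisting such a document with PyMongo follows; the database and collection names are assumptions, and embedding is the joint vector produced in the earlier embedding step:

import os
from pymongo import MongoClient

coll = MongoClient(os.environ["MONGODB_IST_MEDIA"])["ist_media"]["video_embeddings"]  # hypothetical names

coll.insert_one(
    {
        "movie": "mymovie",      # video name
        "offset": 10,            # frame offset in seconds
        "text_offset": 0,        # offset of the transcribed audio chunk
        "embedding": embedding,  # joint text-and-image vector from the embedding step
    }
)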
Build the Solution
Follow these steps to replicate the video search solution using the MongoDB ist.media GitHub repository. You can use this framework as inspiration to build your own customized solution.
Set up your environment variables
Set your environment variables for the different components of this solution by running the following commands:
MongoDB Atlas cluster:
export MONGODB_IST_MEDIA=<your_token>
Voyage AI embeddings:
export VOYAGE_API_KEY=<your_token>
OpenAI token:
export OPENAI_API_KEY=<your_token>
Use your own videos
The video folder in the GitHub repository controls the video search service. Go to the README and follow the instructions for the helper scripts to adapt the solution to your needs.
Key Learnings
Store metadata and embeddings together: Store your embeddings and their metadata in a single document with MongoDB’s flexible document model. This structure powers AI-driven applications with advanced capabilities such as semantic video search.
Use multimodal embedding models: Transform unstructured data from multiple modalities, such as images and text, into a shared vector space with multimodal embedding models. You can use Voyage AI’s voyage-multimodal-3 model to directly vectorize inputs containing interleaved text and images.
Enable semantic search capabilities: Use Vector Search to index and query your vector data. Vector Search enables you to query data based on its semantic meaning, retrieving the most relevant results for your video search application.
Authors
Benjamin Lorenz, MongoDB
Diego Canales, MongoDB