Discover how you can implement video search capabilities in your media applications using MongoDB Atlas and Voyage AI’s multimodal embedding models.
Use cases: Gen AI
Industries: Media
Products: MongoDB Atlas, MongoDB Atlas Vector Search, Voyage AI Multimodal Embeddings
Partners: OpenAI for speech-to-text conversion
Solution Overview
The media and entertainment industry is embracing digital transformation to drive growth. According to a PwC study, industry revenues are expected to grow at a 3.7% compound annual growth rate from 2025 to 2029, rising from US$2.9 trillion to US$3.5 trillion. For media companies, a driving force behind this growth is video entertainment delivered through social video platforms, on-demand streaming, and news channels.
In the US alone, consumers spend an average of six hours per day with media and entertainment content, more than half of which is devoted to video. Media companies must capitalize on this video data to offer premium video experiences for their customers and streamline operations. Leveraging video data enables the development of innovative applications, such as semantic video search.
Semantic video search enables users to find specific content in videos based on its contextual meaning. This technique uses embeddings and vector search to transform video content into numerical representations that algorithms can process. For example, a user can submit a query such as "police cars on the road", and the video search application locates the corresponding scene in the video.
This solution shows how you can implement a semantic video search service for a media application. In this service, MongoDB Atlas supports data storage and vector search capabilities, while Voyage AI provides multimodal embeddings. This functionality provides the following benefits:
Enables better user experiences with enhanced content discovery.
Reduces time spent searching for information in lengthy videos, improving efficiency.
Drives revenue by attracting new customer groups and increasing loyalty among existing ones.
You can extend the concepts of this solution to other industries, such as insurance, telecommunications, or retail.
Reference Architectures
This framework uses MongoDB Atlas for data storage and semantic search, Voyage AI for embeddings, and OpenAI to convert speech to text. The implementation pre-processes the video and audio, and then uses a semantic search component. Figure 1 illustrates this pre-processing stage.
Figure 1. Video processing framework
The workflow operates as follows:
1. The moviepy Python library transforms the MP4 movie file into image frames and an MP3 audio file.
2. The pydub library converts the audio into chunks.
3. A speech-to-text provider converts the audio chunks into text.
4. Voyage AI transforms pairs of text and images into embeddings with a multimodal embedding model. The embeddings encode both modalities in a single transformer, creating a unified vector representation that captures the meaning of visuals and text together.
5. MongoDB Atlas stores the vectors and their metadata as documents, with timestamps that identify individual documents.
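The following sketch illustrates the first three steps, assuming the moviepy, pydub, and openai Python packages are available. The file names, the one-frame-per-second sampling rate, and the 30-second chunk length are illustrative choices rather than values prescribed by the repository:

import os
from moviepy.editor import VideoFileClip  # MoviePy 2.x: from moviepy import VideoFileClip
from openai import OpenAI
from pydub import AudioSegment

VIDEO_FILE = "mymovie.mp4"      # illustrative file name
FRAME_INTERVAL_S = 1            # sample one frame per second (assumption)
CHUNK_LENGTH_MS = 30_000        # 30-second audio chunks (assumption)

# Step 1: extract image frames and the audio track from the MP4 file.
clip = VideoFileClip(VIDEO_FILE)
for t in range(0, int(clip.duration), FRAME_INTERVAL_S):
    clip.save_frame(f"frame_{t:05d}.png", t=t)
clip.audio.write_audiofile("mymovie.mp3")

# Step 2: split the audio into fixed-length chunks with pydub.
audio = AudioSegment.from_mp3("mymovie.mp3")
chunk_files = []
for i, start in enumerate(range(0, len(audio), CHUNK_LENGTH_MS)):
    chunk_file = f"chunk_{i:03d}.mp3"
    audio[start:start + CHUNK_LENGTH_MS].export(chunk_file, format="mp3")
    chunk_files.append(chunk_file)

# Step 3: transcribe each chunk with a speech-to-text provider (OpenAI here).
stt = OpenAI()  # reads OPENAI_API_KEY from the environment
transcripts = []
for chunk_file in chunk_files:
    with open(chunk_file, "rb") as f:
        transcripts.append(stt.audio.transcriptions.create(model="whisper-1", file=f).text)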
After pre-processing, you can create your Vector Search indexes and perform semantic search in your application. The image below represents this process:
Figure 2. Video search process with MongoDB
In this workflow, Vector Search finds the metadata of the best matching video and its timestamp. With this information, the application displays the results at the appropriate video offset.
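A minimal sketch of this step with PyMongo and the Voyage AI client follows. The index name, the database and collection names, and the 1024-dimension setting for voyage-multimodal-3 embeddings are assumptions to adapt to your own deployment:

import os
import voyageai
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

coll = MongoClient(os.environ["MONGODB_IST_MEDIA"])["ist_media"]["video_embeddings"]  # hypothetical names

# Create the Vector Search index on the embedding field (run once).
coll.create_search_index(
    SearchIndexModel(
        name="vector_index",
        type="vectorSearch",
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "embedding",
                    "numDimensions": 1024,  # assumed voyage-multimodal-3 output size
                    "similarity": "cosine",
                }
            ]
        },
    )
)

# Embed the text query with the same multimodal model used during pre-processing.
vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
query_vector = vo.multimodal_embed(
    inputs=[["police cars on the road"]], model="voyage-multimodal-3"
).embeddings[0]

# Retrieve the best-matching documents, including the video name and offsets.
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "path": "embedding",
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 3,
        }
    },
    {"$project": {"_id": 0, "movie": 1, "offset": 1, "text_offset": 1,
                  "score": {"$meta": "vectorSearchScore"}}},
]
for hit in coll.aggregate(pipeline):
    print(hit)  # e.g. play hit["movie"] starting at hit["offset"] seconds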
You can now search for content shown in the video, such as "basketball" or "car play ultra". The application selects one of the two available videos and plays it from the appropriate offset.
Data Model Approach
Vector embeddings convert text, voice, and other unstructured data into numerical values that represent their meaning. Building on this concept, multimodal embedding models vectorize interleaved text and images into a single vector space with the same dimensionality.
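As an illustration of this joint vectorization, the sketch below embeds a transcript snippet together with a video frame using Voyage AI's Python client; the frame file name and the text are hypothetical:

import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

frame = Image.open("frame_00010.png")       # hypothetical frame extracted earlier
text = "Police cars drive down the road."   # transcript snippet for the same scene

# Text and image are passed as one interleaved input and embedded together.
result = vo.multimodal_embed(inputs=[[text, frame]], model="voyage-multimodal-3")
embedding = result.embeddings[0]            # single vector capturing both modalities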
You can use the flexibility of the document model to store multimodal embeddings along with their metadata in a single document. The following code shows a sample document:
{ "movie": "mymovie" , "offset": 0, "text_offset": 0, "embedding": [<list of floats>] }
The embedding field contains the joint information from the embedded images and text. The metadata includes the video name, the image offset, and the voice offset. You can adapt this structure to your specific requirements.
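A minimal sketch of persisting such a document with PyMongo follows; the database and collection names are assumptions, and embedding is the joint vector produced in the earlier embedding step:

import os
from pymongo import MongoClient

coll = MongoClient(os.environ["MONGODB_IST_MEDIA"])["ist_media"]["video_embeddings"]  # hypothetical names

coll.insert_one(
    {
        "movie": "mymovie",      # video name
        "offset": 10,            # frame offset in seconds
        "text_offset": 0,        # offset of the transcribed audio chunk
        "embedding": embedding,  # joint text-and-image vector from the embedding step
    }
)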
Build the Solution
Follow these steps to replicate the video search solution using the MongoDB ist.media GitHub repository. You can use this framework as inspiration to build your own customized solution.
Set up your environment variables
Set your environment variables for the different components of this solution by running the following commands:
MongoDB Atlas cluster:
export MONGODB_IST_MEDIA=<your_token>
Voyage AI embeddings:
export VOYAGE_API_KEY=<your_token>
OpenAI token:
export OPENAI_API_KEY=<your_token>
Use your own videos
The video folder in the GitHub repository controls the video search service. Go to the README and follow the instructions for the helper scripts to adapt the solution to your needs.
Key Learnings
Store metadata and embeddings together: Store your embeddings and their metadata in a single document with MongoDB’s flexible document model. This structure powers AI-driven applications with advanced capabilities such as semantic video search.
Use multimodal embedding models: Transform unstructured data from multiple modalities, such as images and text, into a shared vector space with multimodal embedding models. You can use Voyage AI’s voyage-multimodal-3 model to directly vectorize inputs containing interleaved text and images.
Enable semantic search capabilities: Use Vector Search to index and query your vector data. Vector Search enables you to query data based on its semantic meaning, retrieving the most relevant results for your video search application.
Authors
Benjamin Lorenz, MongoDB
Diego Canales, MongoDB