Audio Find - Atlas Vector Search for Audio

Ran Shir, Pavel Duchovny • 11 min read • Published Sep 09, 2024 • Updated Sep 09, 2024
AI • Django • AWS • Triggers • JavaScript • Python • Atlas

Introduction

As we venture deeper into the realm of digital audio, the frontiers of music discovery are expanding. The pursuit of a more personalized audio experience has led us to develop a state-of-the-art music catalog system. This system doesn't just archive music; it understands it. By utilizing advanced sound embeddings and leveraging the power of MongoDB Atlas Vector Search, we've crafted an innovative platform that recommends songs not by genre or artist, but by the intrinsic qualities of the music itself.
This article was written together with co-author Ran Shir, music composer and founder of Cues Assets, a production music group. Together, we researched and developed the following architecture to allow businesses to take advantage of their audio materials for search.

Demo video for the main flow

System architecture overview

At the heart of this music catalog is a Python service, implemented in our Django-based views.py. This service is the workhorse for generating sound embeddings, using the Panns-inference model to analyze and distill the unique signatures of audio files uploaded by users. Here's how the system operates:
Audio file upload and storage:
A user begins by uploading an MP3 file through the application's front end. The file is then securely transferred to Amazon S3, ensuring that the user's audio is stored safely in the cloud.
Sound embedding generation:
When an audio file lands in our cloud storage, the Django service jumps into action. It downloads the file from S3, using the Python requests library, into temporary storage on the server to avoid any data loss during processing.
Normalization and embedding processing:
The downloaded audio file is then processed to extract its features. Using librosa, a Python library for audio analysis, the service loads the audio file and passes it to our Panns-inference model. The model, running on a GPU for accelerated computation, computes a raw 4096-dimensional embedding vector which captures the essence of the audio.
Embedding normalization:
The raw embedding is then normalized to ensure consistent comparison scales when performing similarity searches. This normalization step is crucial for the efficacy of vector search, enabling fair and accurate retrieval of similar songs.
MongoDB Atlas Vector Search integration:
The normalized embedding is then ingested into MongoDB Atlas, where it is indexed alongside the audio file's metadata in the "embeddings" field. This index is what powers the vector search, allowing the application to perform a K-nearest neighbor (KNN) search to find and suggest the songs most similar to the one uploaded by the user.
User interaction and feedback:
Back on the front end, the application communicates with the user, providing status updates during the upload process and eventually serving the results of the similarity search, all in a user-friendly and interactive manner.
Sound Catalog Similarity Architecture
This architecture encapsulates a blend of cloud technology, machine learning, and database management to deliver a unique music discovery experience that's as intuitive as it is revolutionary.

Uploading and storing MP3 files

The journey of an MP3 file through our system begins the moment a user selects a track for upload. The frontend of the application, built with user interaction in mind, takes the first file from the dropped files and prepares it for upload. This process is initiated with an asynchronous call to an endpoint that generates a signed URL from AWS S3. This signed URL is a token of sorts, granting temporary permission to upload the file directly to our S3 bucket without compromising security or exposing sensitive credentials.

Frontend code for file upload

The frontend code, typically written in JavaScript for a web application, makes use of the axios library to handle HTTP requests. When the user selects a file, the code sends a request to our back end to retrieve a signed URL. With this URL, the file can be uploaded to S3. The application handles the upload status, providing real-time feedback to the user, such as "Uploading..." and then "Searching based on audio..." upon successful upload. This interactive feedback loop is crucial for user satisfaction and engagement.
async uploadFiles(files) {
  const file = files[0]; // Get the first file from the dropped files
  if (file) {
    try {
      this.audioStatus = "Uploading...";
      // Post a request to the backend to get a signed URL for uploading the file
      const response = await axios.post('https://[backend-endpoint]/getSignedURL', {
        fileName: file.name,
        fileType: file.type
      });
      const { url } = response.data;
      // Upload the file to the signed URL
      const resUpload = await axios.put(url, file, {
        headers: {
          'Content-Type': file.type
        }
      });
      console.log('File uploaded successfully');
      console.log(resUpload.data);

      this.audioStatus = "Searching based on audio...";
      // Post a request to trigger the audio description generation
      const describeResponse = await axios.post('https://[backend-endpoint]/labelsToDescribe', {
        fileName: file.name
      });

      const prompt = describeResponse.data;
      this.searchQuery = prompt;
      this.$refs.dropArea.classList.remove('drag-over');
      if (prompt === "I'm sorry, I can't provide assistance with that request.") {
        this.audioStatus = "I'm sorry, I can't provide assistance with that request.";
        throw new Error("I'm sorry, I can't provide assistance with that request.");
      }
      this.fetchListings();
      // If the request is successful, show a success message
      this.showSuccessPopup = true;
      this.audioStatus = "Drag and drop an audio file here";

      // Auto-hide the success message after 3 seconds
      setTimeout(() => {
        this.showSuccessPopup = false;
      }, 3000);
    } catch (error) {
      console.error('File upload failed:', error);
      // In case of an error, reset the UI and show an error message
      this.$refs.dropArea.classList.remove('drag-over');
      this.showErrorPopup = true;

      // Auto-hide the error message after 3 seconds
      setTimeout(() => {
        this.showErrorPopup = false;
      }, 3000);

      // Reset the status message after 6 seconds
      setTimeout(() => {
        this.audioStatus = "Drag and drop an audio file here";
      }, 6000);
    }
  }
}

Backend code for generating signed URLs

On the back end, a serverless function interacts with the AWS SDK. It uses stored AWS credentials to access S3 and create a signed URL, which it then sends back to the front end. This URL contains all the necessary information for the file upload, including the file name, content type, and access control settings.
// Serverless function to generate a signed URL for file uploads to AWS S3
exports = async function({ query, headers, body }, response) {

  // Import the AWS SDK
  const AWS = require('aws-sdk');

  // Update the AWS configuration with your access keys and region
  AWS.config.update({
    accessKeyId: context.values.get('YOUR_AWS_ACCESS_KEY'), // Replace with your actual AWS access key
    secretAccessKey: context.values.get('YOUR_AWS_SECRET_KEY'), // Replace with your actual AWS secret key
    region: 'eu-central-1' // The AWS region where your S3 bucket is hosted
  });

  // Create a new instance of the S3 service
  const s3 = new AWS.S3();
  // Parse the file name and file type from the request body
  const { fileName, fileType } = JSON.parse(body.text());

  // Define the parameters for the signed URL
  const params = {
    Bucket: 'YOUR_S3_BUCKET_NAME', // Replace with your actual S3 bucket name
    Key: fileName, // The name of the file to be uploaded
    ContentType: fileType, // The content type of the file to be uploaded
    ACL: 'public-read' // Access control list setting to allow public read access
  };

  // Generate the signed URL for the 'putObject' operation
  // (getSignedUrlPromise is the promise-returning variant in AWS SDK v2)
  const url = await s3.getSignedUrlPromise('putObject', params);

  // Return the signed URL in the response
  return { 'url': url };
};

Sound embedding with Panns-inference model

Once an MP3 file is securely uploaded to S3, a Python service, which interfaces with our Django back end, takes over. This service is where the audio file is transformed into something more — a compact representation of its sonic characteristics known as a sound embedding. Using the librosa library, the service reads the audio file, standardizing the sample rate to ensure consistency across all files. The Panns-inference model then takes a slice of the audio waveform and infers its embedding.
import tempfile
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from panns_inference import AudioTagging
import librosa
import numpy as np
import os
import json
import requests

# Function to normalize a vector
def normalize(v):
    norm = np.linalg.norm(v)
    return v / norm if norm != 0 else v

# Function to generate sound embeddings from an audio file
def get_embedding(audio_file):
    # Initialize the AudioTagging model on the GPU ('cuda'; the library falls back to CPU if unavailable)
    model = AudioTagging(checkpoint_path=None, device='cuda')
    # Load the audio file with librosa, standardizing the sample rate to 44100 Hz
    a, _ = librosa.load(audio_file, sr=44100)
    # Add a batch dimension to fit the model's input requirements
    query_audio = a[None, :]
    # Perform inference to get the embedding
    _, emb = model.inference(query_audio)
    # Normalize the embedding before returning
    return normalize(emb[0])

# Django view to handle the POST request for downloading and embedding
@csrf_exempt
def download_and_embed(request):
    if request.method == 'POST':
        try:
            # Parse the request body to get the file name
            body_data = json.loads(request.body.decode('utf-8'))
            file_name = body_data.get('file_name')

            # If the file name is not provided, return an error
            if not file_name:
                return JsonResponse({'error': 'Missing file_name in the request body'}, status=400)

            # Construct the file URL (placeholder) and send a request to get the file
            file_url = f"https://[s3-bucket-url].amazonaws.com/{file_name}"
            response = requests.get(file_url)

            # If the file is successfully retrieved
            if response.status_code == 200:
                # Create a temporary file to store the downloaded content
                with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as temp_audio_file:
                    temp_audio_file.write(response.content)
                    temp_audio_file.flush()
                    # Log the temporary file's name and size for debugging
                    print(f"Temp file: {temp_audio_file.name}, size: {os.path.getsize(temp_audio_file.name)}")

                    # Generate the embedding for the downloaded file
                    embedding = get_embedding(temp_audio_file.name)

                # Clean up the temporary file now that the embedding has been computed
                os.remove(temp_audio_file.name)
                # Return the embedding as a JSON response
                return JsonResponse({'embedding': embedding.tolist()})
            else:
                # If the file could not be downloaded, return an error
                return JsonResponse({'error': 'Failed to download the file'}, status=400)
        except json.JSONDecodeError:
            # If there is an error in the JSON data, return an error
            return JsonResponse({'error': 'Invalid JSON data in the request body'}, status=400)

    # If the request method is not POST, return an error
    return JsonResponse({'error': 'Invalid request'}, status=400)

Role of Panns-inference model

The Panns-inference model is a deep learning model trained to understand and capture the nuances of audio content. It generates a vector for each audio file, which is a numerical representation of the file's most defining features. This process turns a complex audio file into a simplified, quantifiable form that can be easily compared against others.
For more information on setting up this model, see the panns_inference GitHub example.
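As a minimal setup sketch (assuming `pip install panns-inference librosa`; the file name here is illustrative, and the pretrained checkpoint is downloaded on first use):

import librosa
from panns_inference import AudioTagging

# Load a local file; librosa resamples to the requested rate and returns mono audio by default
audio, _ = librosa.load("example.mp3", sr=44100)

# AudioTagging downloads a pretrained checkpoint on first run; use 'cuda' if a GPU is available
model = AudioTagging(checkpoint_path=None, device='cpu')

# The model expects a batch dimension: (batch, samples)
_, embedding = model.inference(audio[None, :])

# The embedding length must match the Atlas index "dimensions" (4096 in this article's setup)
print(embedding.shape)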

Vector search with MongoDB Atlas

Storing and indexing embeddings in MongoDB Atlas
MongoDB Atlas is where the magic of searchability comes to life. The embeddings generated by our Python service are stored in a MongoDB Atlas collection. Atlas, with its robust indexing capabilities, allows us to index these embeddings efficiently, enabling rapid and accurate vector searches. This is the index definition used on the “songs” collection:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "embeddings": {
        "dimensions": 4096,
        "similarity": "dotProduct",
        "type": "knnVector"
      },
      "file": {
        "normalizer": "none",
        "type": "token"
      }
    }
  }
}
The "file" field is indexed with a "token" type for file name filtering logic, explained later in the article.
Songs collection sample document:
{
  _id: ObjectId("6534dd09164a19b0ac1f7311"),
  file: "Glorious Outcame Full Mix.mp3",
  embeddings: [Array (4096)]
}
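For context, a hedged sketch of how the embedding service might write such a document once the embedding is computed (reusing the `songs` collection and the `embedding` array from the sketches above):

# Store the normalized embedding alongside the original file name,
# matching the document shape shown above
songs.insert_one({
    "file": file_name,                 # e.g. "Glorious Outcame Full Mix.mp3"
    "embeddings": embedding.tolist(),  # 4096 floats from the Panns-inference model
})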

Vector search functionality

Vector search in MongoDB Atlas employs a K-nearest neighbor (KNN) algorithm to find the closest embeddings to the one provided by the user's uploaded file. When a user initiates a search, the system queries the Atlas collection, searching through the indexed embeddings to find and return a list of songs with the most similar sound profiles. This combination of technologies — from the AWS S3 storage and signed URL generation to the processing power of the Panns-inference model, all the way to the search capabilities of MongoDB Atlas — creates a seamless experience. Users can not only upload their favorite tracks but also discover new ones that carry a similar auditory essence, all within an architecture built for scale, speed, and accuracy.
“Get Songs” functionality
The “Get Songs” feature is the cornerstone of the music catalog, enabling users to find songs with a similar auditory profile to their chosen track. When a user uploads a song, the system doesn't just store the file; it actively searches for and suggests tracks with similar sound embeddings. This is achieved through a similarity search, which uses the sound embeddings stored in the MongoDB Atlas collection.
// Serverless function to perform a similarity search on the 'songs' collection in MongoDB Atlas
exports = async function({ query, body }, response) {
  // Initialize the connection to MongoDB Atlas
  const mongodb = context.services.get('mongodb-atlas');
  // Connect to the specific database
  const db = mongodb.db('YourDatabaseName'); // Replace with your actual database name
  // Connect to the specific collection within the database
  const songsCollection = db.collection('YourSongsCollectionName'); // Replace with your actual collection name

  // Parse the incoming request body to extract the embedding vector
  const parsedBody = JSON.parse(body.text());
  console.log(JSON.stringify(parsedBody)); // Log the parsed body for debugging

  // Perform a vector search using the parsed embedding vector
  let foundSongs = await songsCollection.aggregate([
    { "$vectorSearch": {
        "index": "default",
        "queryVector": parsedBody.embedding,
        "path": "embeddings",
        "numCandidates": 15,
        "limit": 15
      }
    }
  ]).toArray();

  // Map the found songs to a more readable format by stripping file extensions
  let searchableSongs = foundSongs.map((song) => {
    // Extract a cleaner, more readable song title from the "file" field
    let shortName = song.file.replace('.mp3', '');
    return shortName.replace('.wav', ''); // Handle both .mp3 and .wav file extensions
  });

  // Prepare an array of $unionWith stages to combine results from multiple collections if needed
  let unionWithStages = searchableSongs.slice(1).map((songTitle) => {
    return {
      $unionWith: {
        coll: 'RelatedSongsCollection', // Name of the other collection to union with
        pipeline: [
          { $match: { "songTitleField": songTitle } }, // Match the song titles against the related collection
        ],
      },
    };
  });

  // Execute the aggregation query with a $match stage for the first song, followed by any $unionWith stages
  const relatedSongsCollection = db.collection('YourRelatedSongsCollectionName'); // Replace with your actual related collection name
  const locatedSongs = await relatedSongsCollection.aggregate([
    { $match: { "songTitleField": searchableSongs[0] } }, // Start with the first song's match stage
    ...unionWithStages, // Include additional stages for related songs
  ]).toArray();

  // Return the array of located songs as the response
  return locatedSongs;
};
Since embeddings are stored together with the song data, we can use the embedding field when performing a lookup of the N nearest neighbours. This approach implements the "More Like This" button.
// Get the input song's 3 nearest neighbours, excluding the song itself ("More Like This")
let foundSongs = await songsCollection.aggregate([
  { "$vectorSearch": {
      "index": "default",
      "queryVector": songDetails.embeddings,
      "path": "embeddings",
      "filter": { "file": { "$ne": fullSongName } },
      "numCandidates": 15,
      "limit": 3
    }
  }
]).toArray();
The filter excludes the searched song itself from the results.
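The same query can be issued from the Python side with PyMongo — a hedged sketch reusing the field and index names above, where `song_doc` is an assumed document already fetched from the collection:

# "More Like This" from PyMongo: exclude the source song via the token-indexed "file" field
results = list(songs.aggregate([
    {"$vectorSearch": {
        "index": "default",
        "queryVector": song_doc["embeddings"],
        "path": "embeddings",
        "filter": {"file": {"$ne": song_doc["file"]}},
        "numCandidates": 15,
        "limit": 3,
    }},
    # Project the file name and the similarity score for display
    {"$project": {"file": 1, "score": {"$meta": "vectorSearchScore"}}},
]))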
The backend code responsible for the similarity search is a serverless function within MongoDB Atlas. It executes an aggregation pipeline that begins with a vector search stage, leveraging the $vectorSearch operator with queryVector to perform a K-nearest neighbor search. The search is conducted on the "embeddings" field, comparing the uploaded track's embedding with those in the collection to find the closest matches. The results are then mapped to a more human-readable format, omitting unnecessary file path information for the user's convenience.
let foundSongs = await songsCollection.aggregate([
  { "$vectorSearch": {
      "index": "default",
      "queryVector": parsedBody.embedding,
      "path": "embeddings",
      "numCandidates": 15,
      "limit": 15
    }
  }
]).toArray();

Frontend functionality

Uploading and searching for similar songs
The front end provides a drag-and-drop interface for users to upload their MP3 files easily. Once a file is selected and uploaded, the front end communicates with the back end to initiate the search for similar songs based on the generated embedding. This process is made transparent to the user through real-time status updates.
User interface and feedback mechanisms
The user interface is designed to be intuitive, with clear indications of the current process — whether it's uploading, searching, or displaying results. Success and error popups inform the user of the status of their request. A success popup confirms the upload and successful search, while an error popup alerts the user to any issues that occurred during the process. These popups are designed to auto-dismiss after a short duration to keep the interface clean and user-friendly.

Challenges and solutions

Developmental challenges

One of the challenges faced was ensuring the seamless integration of various services, such as AWS S3, MongoDB Atlas, and the Python service for sound embeddings. Handling large audio files and processing them efficiently required careful consideration of file management and server resources.

Overcoming the challenges

To overcome these issues, we utilized temporary storage for processing and optimized the Python service to handle large files without significant memory overhead. Additionally, the use of serverless functions within MongoDB Atlas allowed us to manage compute resources effectively, scaling with the demand as needed.
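One concrete tactic, as a sketch: stream the S3 download to the temporary file in chunks instead of buffering the whole response in memory (a hypothetical refinement of the download_and_embed view above):

import tempfile
import requests

def download_to_tempfile(url, chunk_size=1024 * 1024):
    # Stream the response so only one chunk is held in memory at a time
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with tempfile.NamedTemporaryFile(delete=False, suffix=".mp3") as tmp:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                tmp.write(chunk)
            return tmp.name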

Conclusion

This music catalog represents a fusion of cloud storage, advanced audio processing, and modern database search capabilities. It offers an innovative way to explore music by sound rather than metadata, providing users with a uniquely tailored experience.
Looking ahead, potential improvements could include enhancing the Panns-inference model for even more accurate embedding generation and expanding the database to accommodate a greater variety of audio content. Further refinements to the user interface could also be made, such as incorporating user feedback to continually improve the recommendation algorithm. In conclusion, the system stands as a testament to the possibilities of modern audio technology and database management, offering users a powerful tool for music discovery and promising avenues for future development.
Special thanks to Ran Shir and the Cues Assets group for their work, research efforts, and materials.
Want to continue the conversation? Meet us over in the MongoDB Community forums!