You can configure MongoDB Vector Search to automatically generate and manage vector embeddings for the text data in your cluster. With a one-click AI semantic search index that uses Voyage AI embedding models, you can simplify indexing, updating, and querying with vectors.
When you enable Automated Embedding, MongoDB Vector Search uses the specified embedding model to generate embeddings automatically: at index-time for the specified text field in your collection, and at query-time for the text string you query against that field.
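For orientation, the shape of such an index definition can be sketched as plain data. The option names shown here ("type": "text" paired with a "model" name) are illustrative assumptions based on the behavior described above, not authoritative syntax; see the index reference for the exact format.

```python
# Hypothetical sketch of an Automated Embedding ("vectorSearch") index
# definition. Option names below are assumptions for illustration only.
def auto_embed_index_definition(text_path, model="voyage-3-large",
                                filter_paths=()):
    """Pair a text field with a Voyage AI embedding model so that
    MongoDB generates and manages the vectors for that field."""
    fields = [{"type": "text", "path": text_path, "model": model}]
    # Optional filter fields; their values are also copied into the
    # generated embeddings collection.
    fields += [{"type": "filter", "path": p} for p in filter_paths]
    return {"fields": fields}

definition = auto_embed_index_definition("fullplot", filter_paths=["genres"])
```

You would pass a definition like this to your driver's search-index creation helper (for example, PyMongo's create_search_index with an index of type "vectorSearch").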
Initial Sync
When you create a MongoDB Vector Search index for Automated Embedding, MongoDB performs an initial synchronization to generate embeddings for all existing documents in your collection:
Scans documents.
MongoDB Vector Search scans all documents in the collection that contain the indexed text field.
Generates embeddings.
For each document, MongoDB Vector Search sends the text from the indexed field to the Voyage AI embedding model to generate vector embeddings.
Stores embeddings.
MongoDB Vector Search stores the generated embeddings in a dedicated internal database (__mdb_internal_search) on the same cluster, which keeps the embeddings isolated from your application data while maintaining data locality.
Builds index.
After embeddings are generated, MongoDB Vector Search builds the index using the generated embeddings to enable vector search.
During the initial sync, MongoDB processes documents in batches and uses a special Flex inference processing tier to optimize throughput.
Note
The initial sync duration depends on the number of documents, the length of text in the indexed field, and the available rate limit quota. It might take several hours to complete the initial sync for large collections.
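As a back-of-envelope illustration of that note, sync time scales roughly with collection size divided by effective embedding throughput. The throughput figure below is an arbitrary assumption for illustration, not a documented rate:

```python
def estimated_sync_hours(num_docs, docs_per_minute):
    """Rough lower bound on initial-sync duration, assuming embedding
    generation (not scanning or index build) is the bottleneck."""
    return num_docs / docs_per_minute / 60

# 1M documents at an assumed effective rate of 1,000 documents/minute:
hours = estimated_sync_hours(1_000_000, 1_000)
print(f"{hours:.1f} hours")  # ~16.7 hours
```

Longer text in the indexed field and rate-limit contention both lower the effective rate, so real syncs can take considerably longer than this kind of estimate.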
Ongoing Updates
After the initial sync, MongoDB Vector Search keeps embeddings automatically synchronized with your data as it changes.
Document Inserts
When you insert a new document with the indexed text field, MongoDB Vector Search automatically:
Detects the new document through change streams.
Generates embeddings for the text field using the configured model.
Stores the embeddings in the system collection.
Updates the MongoDB Vector Search index to include the new embeddings.
Document Updates
When you update a document and the indexed text field changes, MongoDB Vector Search automatically:
Detects the field change through change streams.
Generates new embeddings for the updated text.
Replaces the old embeddings in the system collection.
Updates the MongoDB Vector Search index with the new embeddings.
Note
MongoDB Vector Search doesn't trigger embedding regeneration for updates to fields that aren't indexed for Automated Embedding.
Document Deletes
When you delete a document, MongoDB Vector Search automatically removes the corresponding embeddings from the system collection and updates the index.
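Taken together, the rules above amount to a simple predicate: a write triggers embedding regeneration only when it touches a field indexed for Automated Embedding. A minimal sketch (field names are illustrative):

```python
def needs_reembedding(updated_paths, indexed_paths):
    """Return True if an update touches any field indexed for
    Automated Embedding, mirroring the change-stream rule above."""
    return any(path in indexed_paths for path in updated_paths)

# Updating the indexed field triggers regeneration...
hit = needs_reembedding({"fullplot"}, indexed_paths={"fullplot"})
# ...but updating unrelated fields does not.
miss = needs_reembedding({"title", "year"}, indexed_paths={"fullplot"})
```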
Model Hosting and Multi-Tenancy
Automated Embedding uses Voyage AI's embedding models, which are hosted and managed by MongoDB in a multi-tenant environment:
Model Infrastructure
Hosted Service: All embedding models are hosted and maintained by MongoDB. The model inference platform runs on MongoDB's infrastructure in Google Cloud in a US region. You don't need to deploy, configure, or manage any model infrastructure.
API-Based Access: For self-managed deployments that are configured to use a Voyage AI API key, MongoDB sends text to Voyage AI's API endpoints to generate embeddings. The embeddings are returned to MongoDB and stored in your cluster.
Multi-Tenant Architecture: The embedding service is shared across multiple users. This multi-tenant model provides:
Cost efficiency through shared infrastructure
Automatic model updates and improvements
High availability and scalability
Data Privacy
Text sent to the Automated Embedding service is used only to generate embeddings and is not stored or used for model training.
Embeddings are returned to your MongoDB cluster and stored within your own database.
All communication with the Automated Embedding service occurs over encrypted connections.
Rate Limits
The embedding service is multi-tenant. Therefore, MongoDB enforces rate limits to ensure fair usage across all customers. To learn more about rate limits and how they affect Automated Embedding operations, see Rate Limits.
Query Processing
When you run a vector search query using Automated Embedding, MongoDB automatically handles embedding generation for your query text:
Query Text Submission: You provide a text string in the query field of the $vectorSearch stage instead of a pre-generated vector.
Embedding Generation: MongoDB sends your query text to the Automated Embedding service to generate embeddings using the same model specified in the index (or a compatible model if you override it with the model option).
Vector Search: The generated query embeddings are used to search the indexed embeddings using the configured similarity function (cosine, dotProduct, or euclidean).
Results Returned: MongoDB returns documents ranked by similarity to your query.
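The query flow above can be sketched as pipeline construction: you pass raw text in the query field and MongoDB embeds it server-side. Option names such as numCandidates follow the standard $vectorSearch stage; treat the exact shape as a sketch and confirm it against the $vectorSearch reference:

```python
def text_vector_search(index_name, path, query_text,
                       num_candidates=100, limit=10):
    """Build a $vectorSearch pipeline that submits text instead of a
    pre-generated vector (Automated Embedding embeds it at query time)."""
    return [{
        "$vectorSearch": {
            "index": index_name,
            "path": path,
            "query": query_text,  # raw text, not a queryVector
            "numCandidates": num_candidates,
            "limit": limit,
        }
    }]

pipeline = text_vector_search("auto-embed-index", "fullplot",
                              "heist movies set in Europe")
# Pass the pipeline to your collection: db.movies.aggregate(pipeline)
```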
Query Rate Limits
Each query that uses Automated Embedding counts toward your Automated Embedding rate limits because it requires an API call to generate embeddings. To learn more about managing query throughput and costs, see Rate Limits.
Impact on Operations
Initial Sync
Large collections might take significant time to complete initial sync if you hit rate limits.
MongoDB automatically retries failed embedding requests and implements exponential backoff.
You can monitor sync progress through Atlas Search monitoring.
Ongoing Updates
Document updates are processed as they occur, subject to rate limits.
If updates exceed rate limits, they are queued and processed when capacity becomes available.
Your application continues to function normally; only embedding generation might be delayed.
Queries
Query rate limits affect how many concurrent searches you can perform.
If you exceed query rate limits, queries return an error indicating the rate limit has been exceeded.
Consider caching frequently used query results or upgrading to a paid tier for higher throughput.
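One way to smooth over transient rate-limit errors on the client is to retry with exponential backoff and jitter. This generic helper is a sketch, not part of any MongoDB driver; in practice you would classify your driver's rate-limit exception in is_rate_limit and wrap the aggregate call:

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=0.5, is_rate_limit=None):
    """Call fn(), retrying with exponential backoff plus jitter when a
    rate-limit error is raised. Re-raises on other errors or when the
    attempt budget is exhausted."""
    is_rate_limit = is_rate_limit or (lambda exc: "rate limit" in str(exc))
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1 or not is_rate_limit(exc):
                raise
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Demo with a fake query that fails twice before succeeding:
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limit exceeded")
    return "ok"

result = with_backoff(flaky_query, base_delay=0.01)  # succeeds on attempt 3
```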
Generated Embeddings Collection
Automated Embedding uses a separate reserved database to store vector embeddings. You can find the generated embeddings collection for an index and retrieve the embeddings from it.
- Embeddings Storage
MongoDB stores the generated embeddings asynchronously and persists them in an internal generated embeddings collection. This generated embeddings collection exists in a dedicated internal database named __mdb_internal_search on the same cluster. Every auto-embedding index in the cluster has exactly one corresponding generated embeddings collection inside this database. To learn more, see Generated Embeddings Collection.
Warning
The __mdb_internal_search database is a reserved internal namespace created and managed by MongoDB. Don't manipulate this database or its collections. If you modify this reserved namespace, it could result in index failures and inconsistent search results.
- Structure of the Generated Embeddings Collection
The generated embeddings collection contains one document per source-collection document. Each generated embeddings collection document has the same _id as the source, copies of the source's filter fields, and the generated embedding vector for each Automated Embedding field.
You can see the following fields:
| Field | Type | Description |
|---|---|---|
| _id | ObjectId | Same _id as the source document. |
| <filter-field> | Any | Copy of the filter field from the source document. |
| _autoEmbed | Object | Contains the embedding vector for each Automated Embedding field. |
| _autoEmbed.<fieldPath> | Array of float or quantized vector | Contains the generated embedding vector for the Automated Embedding field. |
Find the Generated Embeddings Collection
Warning
The __mdb_internal_search database is a reserved internal namespace
created and managed by MongoDB. Don't manipulate this database or its
collections. If you modify this reserved namespace, it could result in
index failures and inconsistent search results.
Get the ID of the index.
Run the following query after replacing the following placeholders:
<database_name> - Name of the database that contains the Automated Embedding index.
<collection_name> - Name of the collection that contains the Automated Embedding index.
<index_name> - Name of the Automated Embedding index.
```javascript
use <database_name>
db.<collection_name>.aggregate( [ { $listSearchIndexes: { name: "<index_name>" } } ] )
```

```javascript
[
  {
    id: '69f382ecd6fa583100184fe7',
    name: 'auto-embed-index',
    type: 'vectorSearch',
    status: 'READY',
    numDocs: 0,
    latestDefinition: { ... },
    statusDetail: [ ... ]
  }
]
```
Get the generated embeddings collection:
Run the following query after replacing <index_id> with the ID of the
Automated Embedding index returned by the command in the preceding step.
```javascript
use __mdb_internal_search
db.getCollectionNames().filter(n => n.startsWith("<index_id>"))
```

```javascript
[ '69f382ecd6fa583100184fe7-96dad03b0a735a19fd9f1a22f9694efc-1-0' ]
```
The output is the name of the generated embeddings collection.
Check how many documents have been created in the generated embeddings collection.
Run the following query after replacing the following placeholders:
<generated_embeddings_collection_name> - Name of the generated embeddings collection.
```javascript
use __mdb_internal_search
const mvColl = "<generated_embeddings_collection_name>"
db.getCollection(mvColl).countDocuments()
```

```javascript
100
```
Check the Storage Size of the Generated Embeddings Collection
You can check the storage size of generated embeddings collections to understand disk and index space consumption from generated embeddings. This is useful for capacity planning, debugging unexpected growth, and validating cleanup after dropping or redefining an index.
Important
Before checking storage size, find your generated embeddings collection name. To learn more, see Find the Generated Embeddings Collection.
Log in to Atlas and navigate to the Data Explorer.
If it's not already displayed, select your desired organization from the Organizations menu in the navigation bar.
If it's not already displayed, select your desired project from the Projects menu in the navigation bar.
In the sidebar, click Data Explorer under the Database section.
In the Data Explorer page, the __mdb_internal_search database
is displayed alongside your other databases. This database view shows:
Storage size - Total disk space used by the database.
Data size - Total size of all documents in the database.
Collections - Number of collections in the database.
Indexes - Number of indexes in the database.
Sort by Storage size to see the amount of cluster disk space used by Automated Embedding.
View individual collection metrics.
Click __mdb_internal_search to see each generated embeddings
collection with:
Storage size - Total disk space used by the collection.
Data size - Total size of all documents in the collection.
Documents - Number of documents in the collection.
Avg. document size - Average size of a document in the collection.
Indexes - Number of indexes on the collection.
Total index size - Total size of all indexes on the collection.
Note
The auto_embedding_leases collection is used for
leader election and typically uses only kilobytes of
storage.
Check the storage size of the generated embeddings collection.
Run the following query after replacing the <generated_embeddings_collection_name>
with the name of the generated embeddings collection:
```javascript
use __mdb_internal_search
const mvColl = "<generated_embeddings_collection_name>"
db.getCollection(mvColl).stats()
```
The collStats command provides detailed storage metrics for
generated embeddings collections. Use this approach when you need
scripted access, sharded-cluster aggregation, or scheduled
monitoring.
The following collStats fields provide storage information:
| Field | Description |
|---|---|
| count | Number of documents in the generated embeddings collection. One document exists for each source document with generated embeddings. |
| size | Uncompressed logical size of all documents, in bytes. |
| storageSize | On-disk size of the collection's data files after WiredTiger compression, in bytes. |
| totalIndexSize | On-disk size of all MongoDB indexes on the generated embeddings collection, in bytes. |
| totalSize | Sum of storageSize and totalIndexSize. |
| avgObjSize | Average uncompressed document size. Useful for validating per-document embedding size. |
Note
storageSize and totalIndexSize reflect actual disk usage. size is the uncompressed logical view and is typically larger. These metrics show storage in the MongoDB cluster only. They don't include disk used by the Lucene vector index on the mongot host.
Run the following commands against the __mdb_internal_search
database:
```javascript
use __mdb_internal_search

const mvColl = "<generated_embeddings_collection_name>";

db.runCommand({ collStats: mvColl }).count
db.runCommand({ collStats: mvColl, scale: 1024 * 1024 })
```
```javascript
{
  ns: '__mdb_internal_search.69f382ecd6fa583100184fe7-96dad03b0a735a19fd9f1a22f9694efc-1-0',
  size: 5142,
  count: 1250000,
  avgObjSize: 4312,
  numOrphanDocs: 0,
  storageSize: 1830,
  freeStorageSize: 7,
  capped: false,
  wiredTiger: { ... },
  nindexes: 1,
  indexDetails: { ... },
  indexBuilds: [],
  totalIndexSize: 42,
  indexSizes: { _id_: 0 },
  totalSize: 1872,
  scaleFactor: 1048576,
  ok: 1,
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1777646199, i: 1 }),
    signature: {
      hash: Binary.createFromBase64('pomqluUIpiZzLro3VWhO4dt2LKE=', 0),
      keyId: Long('7634583163557117960')
    }
  },
  operationTime: Timestamp({ t: 1777646199, i: 1 })
}
```
For a formatted summary, run:
```javascript
const s = db.runCommand({ collStats: mvColl, scale: 1024 * 1024 });
({
  count: s.count,
  avgObjSizeKB: (s.avgObjSize / 1024).toFixed(2),
  dataMB: s.size,
  storageMB: s.storageSize,
  indexesMB: s.totalIndexSize,
  totalMB: s.totalSize,
})
```
```javascript
{
  "count": 1250000,
  "avgObjSizeKB": "4.21",
  "dataMB": 5142,
  "storageMB": 1830,
  "indexesMB": 42,
  "totalMB": 1872
}
```
```python
from pymongo import MongoClient

MV_DATABASE = "__mdb_internal_search"
MB = 1024 * 1024

def get_mv_storage_stats(client, mv_collection_name):
    """Return storage metrics for a generated embeddings collection."""
    db = client[MV_DATABASE]
    stats = db.command("collStats", mv_collection_name, scale=MB)
    return {
        "count": stats["count"],
        "avg_obj_kb": round(stats["avgObjSize"] / 1024, 2),
        "data_mb": stats["size"],
        "storage_mb": stats["storageSize"],
        "indexes_mb": stats["totalIndexSize"],
        "total_mb": stats["totalSize"],
    }

client = MongoClient("mongodb+srv://<user>:<pwd>@<cluster>/")
print(get_mv_storage_stats(client, "<generated_embeddings_collection_name>"))
```
Sample output:
{'count': 1250000, 'avg_obj_kb': 4.21, 'data_mb': 5142, 'storage_mb': 1830, 'indexes_mb': 42, 'total_mb': 1872}
On sharded source collections, each shard has its own generated
embeddings collection in the __mdb_internal_search database.
mongos doesn't see these collections, so you must query each
shard's mongod directly and sum the results.
The following script connects to each shard, queries the generated embeddings collection, and returns per-shard and total metrics:
```python
from pymongo import MongoClient

MV_DATABASE = "__mdb_internal_search"
MB = 1024 * 1024

def _resolve_mv_name(client, source_db, source_collection, index_name):
    """Find the generated embeddings collection name for an index."""
    src = client[source_db][source_collection]
    indexes = list(src.aggregate([{"$listSearchIndexes": {"name": index_name}}]))
    if not indexes:
        raise LookupError(f"No search index named {index_name!r}")
    index_id = indexes[0]["id"]
    matches = [n for n in client[MV_DATABASE].list_collection_names()
               if n.startswith(index_id)]
    if not matches:
        return None
    matches.sort(reverse=True)
    return matches[0]

def get_mv_storage_per_shard(shard_uris, source_db, source_collection, index_name):
    """Get per-shard and total storage for a sharded cluster."""
    per_shard = {}
    totals = {"count": 0, "data_mb": 0, "storage_mb": 0,
              "indexes_mb": 0, "total_mb": 0}

    for shard_name, uri in shard_uris.items():
        client = MongoClient(uri)
        mv_name = _resolve_mv_name(client, source_db, source_collection, index_name)

        if mv_name is None:
            per_shard[shard_name] = {"note": "no MV found (still building?)"}
            continue

        s = client[MV_DATABASE].command("collStats", mv_name, scale=MB)
        row = {
            "mv": mv_name,
            "count": s["count"],
            "data_mb": s["size"],
            "storage_mb": s["storageSize"],
            "indexes_mb": s["totalIndexSize"],
            "total_mb": s["totalSize"],
        }
        per_shard[shard_name] = row

        for k in totals:
            totals[k] += row[k]

    return {"per_shard": per_shard, "totals": totals}

# Usage
shard_uris = {
    "shard-00": "mongodb://<user>:<pwd>@shard-00.example.net:27017/?replicaSet=shard-00",
    "shard-01": "mongodb://<user>:<pwd>@shard-01.example.net:27017/?replicaSet=shard-01",
    "shard-02": "mongodb://<user>:<pwd>@shard-02.example.net:27017/?replicaSet=shard-02",
}

result = get_mv_storage_per_shard(
    shard_uris,
    source_db="<source_db>",
    source_collection="<source_collection>",
    index_name="<index_name>",
)

for shard, row in result["per_shard"].items():
    print(shard, row)
print("TOTAL:", result["totals"])
```
```
shard-00 {'mv': '69e183...-1-3', 'count': 416000, 'data_mb': 1714, 'storage_mb': 612, 'indexes_mb': 14, 'total_mb': 626}
shard-01 {'mv': '69e183...-1-3', 'count': 418200, 'data_mb': 1721, 'storage_mb': 615, 'indexes_mb': 14, 'total_mb': 629}
shard-02 {'mv': '69e183...-1-3', 'count': 415800, 'data_mb': 1707, 'storage_mb': 603, 'indexes_mb': 14, 'total_mb': 617}
TOTAL: {'count': 1250000, 'data_mb': 5142, 'storage_mb': 1830, 'indexes_mb': 42, 'total_mb': 1872}
```
To check storage for all Automated Embedding indexes on a cluster,
sum collStats for every collection in __mdb_internal_search.
This is useful for capacity reviews and identifying orphaned generated
embeddings collections.
Run the following in mongosh on a single replica set or on each
shard for sharded clusters:
```javascript
use __mdb_internal_search

const MB = 1024 * 1024;
const rows = db.getCollectionNames().map(name => {
  const s = db.runCommand({ collStats: name, scale: MB });
  return {
    collection: name,
    count: s.count,
    storageMB: s.storageSize,
    indexesMB: s.totalIndexSize,
    totalMB: s.totalSize,
  };
});

const total = rows.reduce((a, r) => ({
  storageMB: a.storageMB + r.storageMB,
  indexesMB: a.indexesMB + r.indexesMB,
  totalMB: a.totalMB + r.totalMB,
}), { storageMB: 0, indexesMB: 0, totalMB: 0 });

print("Per-collection:");
printjson(rows);
print("Cluster total:");
printjson(total);
```

```
Per-collection:
[
  { "collection": "69e183...-1-3", "count": 1250000, "storageMB": 1830, "indexesMB": 42, "totalMB": 1872 },
  { "collection": "71fa42...-1-1", "count": 84000, "storageMB": 121, "indexesMB": 3, "totalMB": 124 }
]
Cluster total:
{ "storageMB": 1951, "indexesMB": 45, "totalMB": 1996 }
```
Note
On sharded clusters, run this command on each shard and sum the results.
Retrieve Embeddings from the Generated Embeddings Collection
Log in to Atlas.
Go to the Data Explorer page for your project.
If it's not already displayed, select your desired organization from the Organizations menu in the navigation bar.
If it's not already displayed, select your desired project from the Projects menu in the navigation bar.
In the sidebar, click Data Explorer under the Database header.
The Data Explorer page displays.
Retrieve embeddings for a document.
Run the following query after replacing the following placeholders:
<generated_embeddings_collection_name> - Name of the generated embeddings collection.
<document_id> - _id of the document in the source collection.
<auto_embed_field> - Name of the field indexed for Automated Embedding.
```javascript
use __mdb_internal_search
const mvColl = "<generated_embeddings_collection_name>"
db.getCollection(mvColl).findOne(
  { _id: "<document_id>" },
  { _id: 1, "_autoEmbed.<auto_embed_field>": 1 }
)
```
```javascript
{
  _autoEmbed: {
    fullplot: Binary.fromInt8Array(new Int8Array([
      5, -30, 16, 4, -57, -8, -17, -13, 16, 11, -22, 15,
      -7, 13, 8, -2, -1, -14, 27, 10, -9, 20, 14, -2,
      3, -56, -21, 10, -24, 12, 10, 9, 12, 7, 4, 14,
      -7, -24, -15, 16, 13, 21, -4, -16, -12, -15, 3, -33,
      5, -21, 2, -1, 0, 16, 7, 13, 19, 4, 5, -14,
      -34, 7, -16, 38, 4, 4, 7, -22, 8, 14, 15, -14,
      -4, 6, 22, -17, 8, 27, 8, 13, 46, -12, -7, -9,
      -20, 13, 10, 4, -14, -11, 31, -7, 0, -3, 1, 16,
      9, 5, 6, -2,
      ... 924 more items
    ]))
  },
  _id: ObjectId('573a1390f29313caabcd5c0f')
}
```
Retrieve embeddings for multiple documents.
Run the following query after replacing the following placeholders:
<generated_embeddings_collection_name> - Name of the generated embeddings collection.
<auto_embed_field> - Name of the field indexed for Automated Embedding.
<number_of_documents> - Number of documents to return.
```javascript
use __mdb_internal_search
const mvColl = "<generated_embeddings_collection_name>"
db.getCollection(mvColl).find(
  {},
  { _id: 1, "_autoEmbed.<auto_embed_field>": { $slice: 5 } }
).limit(<number_of_documents>)
```
```javascript
[
  { _autoEmbed: {}, _id: ObjectId('573a1390f29313caabcd5c0f') },
  { _autoEmbed: {}, _id: ObjectId('573a1390f29313caabcd5c0f') },
  {
    _autoEmbed: {
      fullplot: Binary.fromInt8Array(new Int8Array([
        5, -30, 16, 4, -57, -8, -17, -13, 16, 11, -22, 15,
        -7, 13, 8, -2, -1, -14, 27, 10, -9, 20, 14, -2,
        3, -56, -21, 10, -24, 12, 10, 9, 12, 7, 4, 14,
        -7, -24, -15, 16, 13, 21, -4, -16, -12, -15, 3, -33,
        5, -21, 2, -1, 0, 16, 7, 13, 19, 4, 5, -14,
        -34, 7, -16, 38, 4, 4, 7, -22, 8, 14, 15, -14,
        -4, 6, 22, -17, 8, 27, 8, 13, 46, -12, -7, -9,
        -20, 13, 10, 4, -14, -11, 31, -7, 0, -3, 1, 16,
        9, 5, 6, -2,
        ... 924 more items
      ]))
    },
    _id: ObjectId('573a1390f29313caabcd5c0f')
  },
  {
    _autoEmbed: {
      fullplot: Binary.fromInt8Array(new Int8Array([
        -5, -22, 22, -6, -43, -13, -5, 4, 5, 2, 4, 13,
        0, -3, -3, -50, -5, -2, -2, 27, -5, 36, 27, 12,
        -12, -6, -1, 9, -7, 25, 4, -28, 3, 9, 3, 23,
        8, 11, 11, 25, -19, 27, 17, 18, -1, 0, 5, -12,
        13, -5, -3, 3, -17, 16, -15, 43, -1, 1, 1, -6,
        -26, 16, -11, 13, 14, 0, -9, -23, 25, -16, 11, -25,
        7, 9, -1, 0, 33, -8, -3, -18, 3, 4, -20, -14,
        17, -2, -2, -10, 17, -25, -11, 9, 1, 2, -8, 7,
        20, 18, 17, -2,
        ... 924 more items
      ]))
    },
    _id: ObjectId('573a1390f29313caabcd5c0f')
  },
  {
    _autoEmbed: {
      fullplot: Binary.fromInt8Array(new Int8Array([
        0, -1, 47, 6, -20, -14, 29, -2, 13, -1, 20, 11,
        -18, -7, 12, -10, -25, 10, 7, -15, 11, 9, -14, 12,
        -9, -22, 16, 0, 18, 5, 9, -26, 14, -27, 6, 20,
        -19, -8, 1, -5, 21, 13, -37, -7, 0, -21, -51, 1,
        -38, -14, 4, 6, -23, 15, 19, 33, 8, 0, -7, -3,
        -25, 8, -29, 25, -1, 12, 4, -21, -1, 0, -14, -3,
        -6, -3, 7, 30, 8, -8, 34, -19, -12, -29, -15, -14,
        1, -4, 6, -2, -36, -18, -2, 4, 23, 17, -13, 1,
        0, 7, 25, -19,
        ... 924 more items
      ]))
    },
    _id: ObjectId('573a1390f29313caabcd5c0f')
  }
]
```
To retrieve embeddings from the generated embeddings collection, you can use the following Python script. To run the script, install the PyMongo Driver.
Copy and paste the following code into the get_embedding.py file.
```python
from pymongo import MongoClient

MV_DATABASE = "__mdb_internal_search"

def get_mv_collection(client, source_db, source_collection, index_name):
    """Resolve the MV collection for an auto-embedding index."""
    # 1. Look up the index ID via $listSearchIndexes on the source collection.
    src = client[source_db][source_collection]
    indexes = list(src.aggregate([{"$listSearchIndexes": {"name": index_name}}]))
    if not indexes:
        raise LookupError(f"No search index named {index_name!r} on {source_db}.{source_collection}")
    index_id = indexes[0]["id"]

    # 2. Find the MV collection in __mdb_internal_search whose name starts with the index ID.
    mv_db = client[MV_DATABASE]
    matches = [n for n in mv_db.list_collection_names() if n.startswith(index_id)]
    if not matches:
        raise LookupError(f"No MV collection found for index {index_id} (index may still be building)")
    if len(matches) > 1:
        # Possible briefly during an auto-embed field update; pick the newest.
        matches.sort(reverse=True)
    return mv_db[matches[0]]

def get_embedding(client, source_db, source_collection, index_name, embed_path, source_id):
    """Fetch the embedding for a single source document."""
    mv = get_mv_collection(client, source_db, source_collection, index_name)
    doc = mv.find_one(
        {"_id": source_id},
        {"_id": 1, f"_autoEmbed.{embed_path}": 1},
    )
    if doc is None:
        return None
    return doc["_autoEmbed"][embed_path]

# --- Usage ---
client = MongoClient("mongodb+srv://<user>:<pwd>@<cluster>/")

embedding = get_embedding(
    client,
    source_db="<source_db>",
    source_collection="<source_collection>",
    index_name="<auto_embed_index_name>",
    embed_path="<auto_embed_field>",
    source_id="<document_id>",
)

print(f"dims: {len(embedding)}")
print(f"first 5: {embedding[:5]}")
```
Replace the following placeholders in the get_embedding.py file:
| Placeholder | Description |
|---|---|
| <user> | Your username for your MongoDB deployment. |
| <pwd> | Your password for your MongoDB deployment. |
| <cluster> | Your cluster connection string for your MongoDB deployment. |
| <source_db> | Name of the database that contains the source collection. |
| <source_collection> | Name of the source collection. |
| <auto_embed_index_name> | Name of the Automated Embedding index. |
| <auto_embed_field> | Name of the field indexed for Automated Embedding. |
| <document_id> | _id of the document in the source collection. |
To stream embeddings from the generated embeddings collection, you can use the following Python script.
Copy and paste the following code into the stream_embedding.py file.
```python
from pymongo import MongoClient

# Reuses the get_mv_collection helper defined in get_embedding.py.
from get_embedding import get_mv_collection

# --- Usage ---
client = MongoClient("mongodb+srv://<user>:<pwd>@<cluster>/")

mv = get_mv_collection(client, "<source_db>", "<source_collection>", "<auto_embed_index_name>")

cursor = mv.find(
    {},
    {"_id": 1, "_autoEmbed.<auto_embed_field>": 1},
    batch_size=500,
)

for doc in cursor:
    src_id = doc["_id"]
    vec = doc["_autoEmbed"]["<auto_embed_field>"]
```
Replace the following placeholders in the stream_embedding.py file:
| Placeholder | Description |
|---|---|
| <user> | Your username for your MongoDB deployment. |
| <pwd> | Your password for your MongoDB deployment. |
| <cluster> | Your cluster connection string for your MongoDB deployment. |
| <source_db> | Name of the database that contains the source collection. |
| <source_collection> | Name of the source collection. |
| <auto_embed_index_name> | Name of the Automated Embedding index. |
| <auto_embed_field> | Name of the field indexed for Automated Embedding. |
Troubleshooting
The following sections provide guidance for troubleshooting common issues with Automated Embedding.
- No generated embeddings collection matching the index ID
- Your index might still be in a Building or Pending state. The generated embeddings collection is created lazily on the first write. Check the index status with $listSearchIndexes.
- Document missing for a source _id
- The embedding for the specified document hasn't been generated yet, or the document was filtered out by the index's filter expression.
- More than one collection matches the index ID
- The auto-embed field configuration has been updated. Although a new generated embeddings collection has been created, the old one might linger briefly until cleanup.