Automated Embedding uses Voyage AI's embedding models, which Atlas hosts and manages in the Data Plane in a multi-tenant environment.
Supported Models
Automated Embedding supports the following Voyage AI embedding models:
| Embedding Model | Description | Price Per 1M Tokens |
|---|---|---|
| | Optimized for high-volume, cost-sensitive applications. | $0.02 |
| | (Recommended) Balanced performance for general text search. | $0.06 |
| | Maximum accuracy for complex semantic relationships. | $0.12 |
| | Specialized for code search and technical documentation. | $0.18 |
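As a rough illustration of where the model choice lives, the following is a hypothetical index-definition fragment expressed as a Python dict. The `autoEmbed` field type comes from this page, but the option names (`path`, `model`) and their values are assumptions for illustration, not the documented schema.

```python
# Hypothetical sketch of a Vector Search index definition that uses
# Automated Embedding. "autoEmbed" is the field type named in this page;
# "path" and "model" are assumed option names for illustration only.
index_definition = {
    "fields": [
        {
            "type": "autoEmbed",            # field indexed with Automated Embedding
            "path": "description",          # assumed: the text field to embed
            "model": "<embedding-model>",   # assumed: one of the supported models above
        }
    ]
}
```

Such a definition would then be supplied when creating the search index on the collection.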
Context Window Sizes
A context window is the maximum amount of text (measured in tokens, not characters) that an embedding or LLM model can consider in a single request. The maximum context window size for each model is as follows:
| Embedding Model | Context Window Size |
|---|---|
| | 32,000 tokens |
| | 32,000 tokens |
| | 32,000 tokens |
| | 32,000 tokens |
If the indexed text field is longer than the context window, Atlas automatically truncates the text to the model's context window size. If your query text exceeds the model's context window, the $vectorSearch query fails with a context-limit-exceeded error.
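The truncation behavior for indexed fields can be sketched as follows. This is a conceptual illustration only: it uses a rough heuristic of ~4 characters per token, whereas the real tokenizer is model-specific.

```python
# Conceptual sketch of context-window truncation for indexed fields.
# The 4-characters-per-token ratio is an assumption for illustration;
# the actual tokenizer differs.
CONTEXT_WINDOW_TOKENS = 32_000
CHARS_PER_TOKEN = 4

def truncate_to_context_window(text: str) -> str:
    """Approximate the documented behavior: keep only the leading text
    that fits within the model's context window."""
    max_chars = CONTEXT_WINDOW_TOKENS * CHARS_PER_TOKEN
    return text[:max_chars]

doc_text = "x" * 200_000
print(len(truncate_to_context_window(doc_text)))  # 128000
```

Note that queries are treated differently: query text over the limit is rejected rather than truncated.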
Cost of Models
Model tokens are consumed during index operations (initial creation, inserts, and updates) and query operations. For index operations, only the fields in the MongoDB document that are indexed as the autoEmbed type are used for embedding generation and incur token usage. For query operations, the query text you provide is used for embedding generation and incurs token usage.
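The per-1M-token prices translate into workload cost with simple arithmetic. A back-of-the-envelope sketch, where the document count, average token count, and price are assumed values for illustration:

```python
# Estimate embedding cost from total tokens and a per-1M-token price.
def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    return total_tokens / 1_000_000 * price_per_million

# Assumed workload: 100,000 documents averaging 500 tokens each in the
# autoEmbed-indexed field, at $0.06 per 1M tokens.
index_tokens = 100_000 * 500  # 50,000,000 tokens
print(round(embedding_cost_usd(index_tokens, 0.06), 2))  # 3.0
```

The same formula applies to query-side usage, with the query text's token count in place of the indexed fields' tokens.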
The cost for tokens for each model is as follows:
| Embedding Model | Cost per 1K Tokens | Cost per 1M Tokens |
|---|---|---|
| | $0.00012 | $0.12 |
| | $0.00006 | $0.06 |
| | $0.00002 | $0.02 |
| | $0.00018 | $0.18 |
Free Tokens
For each model, Atlas includes a one-time allocation of 200 million free tokens at the organization level. Free tokens are shared across all Atlas projects and clusters within the organization and do not refresh.
Rate Limits
Rate limits are restrictions on the frequency and number of tokens you can request from Automated Embedding within a specified period of time. MongoDB enforces rate limits on embedding generation to ensure fair usage across all users in the multi-tenant environment. Rate limits are based on requests per minute (RPM) and tokens per minute (TPM). These limits apply at the MongoDB cluster level and are shared across all indexes on that cluster that use Automated Embedding. To request higher rate limits, contact your MongoDB account team or MongoDB Support.
Rate limits are applied separately to queries, first-time index builds, and index update operations (document inserts and updates), providing traffic isolation. Index build operations are strictly isolated from real-time query traffic.
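When a client does hit a rate limit, a common client-side pattern is exponential backoff with jitter. A minimal sketch, assuming a generic retryable error; the specific exception surfaced by a rate-limited operation is not specified here, so `RuntimeError` is a placeholder:

```python
import random
import time

# Generic backoff sketch for operations that may hit RPM/TPM limits.
# RuntimeError stands in for whatever rate-limit error your driver raises.
def with_backoff(operation, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return operation()
        except RuntimeError:  # placeholder for a rate-limit error
            if attempt == max_retries - 1:
                raise
            # Exponential delay with multiplicative jitter.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Backoff smooths transient bursts; sustained limit errors are better addressed by pacing the workload or requesting a higher limit.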
First-Time Index Build Rate Limits
The first-time index build rate limits restrict the maximum frequency and number of tokens at which embeddings are generated. For large workloads during the first-time index build (initial sync), Automated Embedding uses a separate inference mechanism that is not bound by standard rate limits. This mechanism is optimized for throughput to handle the initial index build and provides the following benefits:
Faster Initial Synchronization: Scales embedding generation throughput dynamically to handle massive bursts.
Unbounded Throughput: Bursts up to available GPU capacity, eliminating manual rate-limit increase requests.
Fair Resource Sharing: Competing index builds converge to a similar tokens-per-second allocation, avoiding starvation.
Safe Ramp-Up: Starts at low concurrency and grows only on explicit internal success signals.
Index Insert and Update Rate Limits
The index rate limits restrict the maximum frequency and number of tokens at which embeddings are generated during certain operations on MongoDB Vector Search Automated Embedding indexes. These operations include inserts (adding new data to your index) and updates (changes to existing data that require re-embedding).
| Model | Requests Per Minute (RPM) | Tokens Per Minute (TPM) |
|---|---|---|
| | 2,000 | 3,000,000 |
| | 2,000 | 8,000,000 |
| | 2,000 | 16,000,000 |
| | 2,000 | 3,000,000 |
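For bulk ingestion, one way to stay under a TPM budget is to track token usage per one-minute window and pause when the budget would be exceeded. A client-side sketch, where per-document token counts are assumed estimates; the injectable `clock`/`sleep` parameters exist only to make the sketch testable:

```python
import time

# Pace documents so estimated token usage stays under a TPM budget,
# e.g. the 3,000,000 TPM limit in the table above. Token counts per
# document are caller-supplied estimates.
def pace_inserts(doc_token_counts, tpm_limit, clock=time.monotonic, sleep=time.sleep):
    """Yield each document's token count, sleeping out the rest of the
    current one-minute window whenever the budget would be exceeded."""
    window_start = clock()
    used = 0
    for tokens in doc_token_counts:
        if clock() - window_start >= 60:
            window_start, used = clock(), 0     # new minute, reset budget
        if used + tokens > tpm_limit:
            sleep(60 - (clock() - window_start))  # wait out this window
            window_start, used = clock(), 0
        used += tokens
        yield tokens
```

RPM can be budgeted the same way, counting requests instead of tokens.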
Query Operations Rate Limits
The query rate limits control the maximum embedding generation frequency and the number of tokens for all queries that use $vectorSearch operations on your MongoDB Vector Search Automated Embedding indexes.
| Model | Requests Per Minute (RPM) | Tokens Per Minute (TPM) |
|---|---|---|
| | 3 | 2,000 |
| | 3 | 2,000 |
| | 3 | 2,000 |
| | 3 | 2,000 |
Best Practices
To optimize performance within rate limits:
Use Shorter Text: Limit indexed text to relevant content to reduce token consumption.
Batch Updates: If you are performing bulk updates, space them out to avoid hitting rate limits.
Monitor Usage: Track your embedding generation usage through the Voyage AI dashboard to identify patterns and optimize your workload.
Upgrade When Needed: If you consistently hit rate limits, consider upgrading to a paid tier for higher quotas.
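The "Batch Updates" practice above can be sketched as splitting a bulk update into batches and spacing the requests evenly against an RPM budget. The batch size and RPM value here are assumptions for illustration:

```python
import time

# Split a bulk update into batches spaced evenly under an RPM budget.
# The caller applies each yielded batch (e.g. with a bulk write).
def spaced_batches(items, batch_size, rpm_limit, sleep=time.sleep):
    interval = 60.0 / rpm_limit  # seconds between requests
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
        if i + batch_size < len(items):
            sleep(interval)      # no sleep after the final batch
```

Spacing requests this way trades latency for predictability: the bulk job takes longer, but it avoids bursts that trip the rate limiter.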