
Models for Automated Embedding

Automated Embedding uses Voyage AI's embedding models, which Atlas hosts and manages in the Data Plane in a multi-tenant environment.

Automated Embedding supports the following Voyage AI embedding models:

Embedding Model   Description                                                   Price Per 1M Tokens
voyage-4-lite     Optimized for high-volume, cost-sensitive applications.       $0.02
voyage-4          (Recommended) Balanced performance for general text search.   $0.06
voyage-4-large    Maximum accuracy for complex semantic relationships.          $0.12
voyage-code-3     Specialized for code search and technical documentation.      $0.18

A context window is the maximum amount of text (measured in tokens, not characters) that an embedding model or LLM can consider in a single request. The maximum context window size for each model is as follows:

Embedding Model   Context Window Size
voyage-4-large    32,000 tokens
voyage-4          32,000 tokens
voyage-4-lite     32,000 tokens
voyage-code-3     32,000 tokens

If the indexed text field is longer than the context window, the text is automatically truncated to the model's context window size. If your query text exceeds the model's context window, the $vectorSearch query fails with a context-limit-exceeded error.
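The asymmetry described above (indexed text is silently truncated, query text fails) can be modeled with a short sketch. The 4-characters-per-token heuristic and all helper names here are illustrative assumptions, not an Atlas API:

```python
# Sketch of the documented behavior: indexed text is truncated to the
# model's context window, while over-long query text causes an error.
# The chars-per-token heuristic is a rough assumption for illustration.

CONTEXT_WINDOW_TOKENS = 32_000  # same limit for all four Voyage AI models
CHARS_PER_TOKEN = 4             # rough heuristic for English text

def estimate_tokens(text: str) -> int:
    """Very rough token estimate; real billing uses the model tokenizer."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def prepare_indexed_text(text: str) -> str:
    """Indexed fields are silently truncated to the context window."""
    max_chars = CONTEXT_WINDOW_TOKENS * CHARS_PER_TOKEN
    return text[:max_chars]

def prepare_query_text(text: str) -> str:
    """Query text that exceeds the window fails instead of truncating."""
    if estimate_tokens(text) > CONTEXT_WINDOW_TOKENS:
        raise ValueError("context-limit-exceeded")
    return text
```

Estimating tokens up front lets you shorten query text client-side rather than discover the limit through a failed $vectorSearch query.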

Model tokens are consumed during index operations (first-time creation, inserts, and updates) and query operations. For index operations, only the fields in the MongoDB document that are indexed as the autoEmbed type are used for embedding generation and incur token usage. For query operations, the query text you provide is used for embedding generation and incurs token usage. The cost of tokens for each model is as follows:

Embedding Model   Cost per 1K Tokens   Cost per 1M Tokens
voyage-4-large    $0.00012             $0.12
voyage-4          $0.00006             $0.06
voyage-4-lite     $0.00002             $0.02
voyage-code-3     $0.00018             $0.18
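The per-token arithmetic behind the table is straightforward; the following sketch works an example, with the free-token deduction shown as an assumption about how the allocation applies:

```python
# Sketch: token cost arithmetic for the per-model prices listed above.
PRICE_PER_MILLION_TOKENS = {
    "voyage-4-lite": 0.02,
    "voyage-4": 0.06,
    "voyage-4-large": 0.12,
    "voyage-code-3": 0.18,
}

def embedding_cost(model: str, tokens: int) -> float:
    """Cost in USD for a given number of billable tokens."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

# Example: indexing 500M tokens with voyage-4, assuming the one-time
# 200M free-token allocation is applied first:
billable = 500_000_000 - 200_000_000
print(embedding_cost("voyage-4", billable))  # → 18.0
```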

For each model, Atlas includes a one-time allocation of 200 million free tokens at the organization level. The free tokens are shared across all Atlas projects and clusters within the organization, and they do not refresh.

Rate limits are restrictions on the frequency and number of tokens you can request from Automated Embedding within a specified period of time. MongoDB enforces rate limits on embedding generation to ensure fair usage across all users in the multi-tenant environment. Rate limits are based on requests per minute (RPM) and tokens per minute (TPM). They apply at the MongoDB cluster level and are shared by all indexes on that cluster that use Automated Embedding. To request higher rate limits, contact your MongoDB account team or MongoDB Support.

Rate limits are applied separately to queries, first-time index builds, and index update operations (document inserts and updates), providing traffic isolation. Indexing build operations are strictly isolated from the real-time query traffic.
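To see how per-minute RPM and TPM limits interact, here is a client-side sketch of a sliding-window throttle. This is not the server-side mechanism; everything here beyond the documented per-minute semantics is an assumption:

```python
# Illustrative client-side throttle mirroring how RPM/TPM limits behave.
from collections import deque

class MinuteRateLimiter:
    """Tracks requests and tokens over a sliding 60-second window."""

    def __init__(self, rpm: int, tpm: int):
        self.rpm, self.tpm = rpm, tpm
        self.events = deque()  # (timestamp, tokens) per admitted request

    def allow(self, now: float, tokens: int) -> bool:
        # Drop events older than 60 seconds.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()
        requests = len(self.events)
        used_tokens = sum(t for _, t in self.events)
        # Refuse if admitting this request would exceed either limit.
        if requests + 1 > self.rpm or used_tokens + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        return True
```

A throttle like this, configured below the published limits (for example `MinuteRateLimiter(rpm=2000, tpm=8_000_000)` for voyage-4), lets a bulk-update job pace itself instead of retrying on server-side rejections.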

The first-time index build rate limits restrict the maximum frequency and number of tokens at which embeddings are generated. For large workloads during the first-time index build (initial sync), Automated Embedding uses a separate inference mechanism that is not bound by standard rate limits. This mechanism is optimized for throughput to handle the initial index build and provides the following benefits:

  • Faster Initial Synchronization: Scales embedding generation throughput dynamically to handle massive bursts.

  • Unbounded Throughput: Bursts up to available GPU capacity, eliminating manual rate-limit increase requests.

  • Fair Resource Sharing: Competing index builds converge to a similar tokens-per-second allocation, avoiding starvation.

  • Safe Ramp-Up: Starts at low concurrency and grows dynamically only on explicit internal success signals.
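The safe ramp-up behavior can be illustrated with a toy policy: grow concurrency multiplicatively on success signals and back off on failure. The real internal algorithm is not documented; this sketch and its constants are assumptions that only mirror the stated behavior:

```python
# Toy sketch of a "safe ramp-up" policy: start at low concurrency,
# grow on success signals, back off on failure. All constants are
# illustrative assumptions, not the documented internal algorithm.

def ramp_concurrency(signals, start=1, ceiling=64):
    """Replay success/failure signals and return the concurrency level."""
    level = start
    for ok in signals:
        if ok:
            level = min(level * 2, ceiling)   # grow only on success
        else:
            level = max(level // 2, start)    # back off on failure
    return level
```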

The index rate limits restrict the maximum frequency and number of tokens at which embeddings are generated during certain operations on MongoDB Vector Search Automated Embedding indexes. These operations include inserts (new data added to your index) and updates (changes to existing data that require re-embedding).

Model             Requests Per Minute (RPM)   Tokens Per Minute (TPM)
voyage-4-large    2,000                       3,000,000
voyage-4          2,000                       8,000,000
voyage-4-lite     2,000                       16,000,000
voyage-code-3     2,000                       3,000,000
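The TPM column gives a lower bound on how long a bulk re-embedding job can take. The following back-of-the-envelope sketch computes that bound; real elapsed time also depends on RPM, batching, and retries:

```python
# Back-of-the-envelope sketch: minimum time to push a bulk update's
# tokens through the per-model TPM limits in the table above.
import math

TPM = {
    "voyage-4-large": 3_000_000,
    "voyage-4": 8_000_000,
    "voyage-4-lite": 16_000_000,
    "voyage-code-3": 3_000_000,
}

def min_minutes(model: str, total_tokens: int) -> int:
    """Minimum whole minutes to embed total_tokens under the TPM limit."""
    return math.ceil(total_tokens / TPM[model])

print(min_minutes("voyage-4", 40_000_000))  # → 5
```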

The query rate limits control the maximum embedding generation frequency and the number of tokens for all queries that use the $vectorSearch operations on your MongoDB Vector Search Automated Embedding indexes.

To optimize performance within rate limits:

  1. Use Shorter Text: Limit indexed text to relevant content to reduce token consumption.

  2. Batch Updates: If you are performing bulk updates, space them out to avoid hitting rate limits.

  3. Monitor Usage: Track your embedding generation usage through the Voyage AI dashboard to identify patterns and optimize.

  4. Upgrade When Needed: If you consistently hit rate limits, consider upgrading to a paid tier for higher quotas.
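Tip 2 above (spacing out bulk updates) can be sketched as a simple planner: partition documents into batches that each fit a per-minute token budget, then issue one batch per minute. Per-document token counts are assumed to be known or estimated; the function name is illustrative:

```python
# Sketch of tip 2: partition a bulk update into batches that each stay
# under a per-minute token budget, so batches can be spaced a minute apart.

def plan_batches(doc_tokens, budget):
    """Greedily pack documents into batches whose token sum <= budget."""
    batches, current, used = [], [], 0
    for tokens in doc_tokens:
        if used + tokens > budget and current:
            batches.append(current)   # start a new batch for the next minute
            current, used = [], 0
        current.append(tokens)
        used += tokens
    if current:
        batches.append(current)
    return batches
```

Setting the budget below the model's TPM limit (for example, half of it) leaves headroom for concurrent query traffic on the same cluster.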
