How to Index Vector Embeddings for Vector Search

You can use the knnVector type to index vector embeddings. The vector field can be represented as an array of numbers of the following types:

  • BSON int32, int64, or double data types for querying using the knnBeta operator.

  • BSON double data type for querying using the $vectorSearch stage.

You can use the knnBeta operator, which is now deprecated, and the $vectorSearch stage in your aggregation pipeline to query fields indexed as knnVector.

Note

You can't use the Atlas Search Visual Editor in the Atlas UI to configure fields of type knnVector. Instead, use the Atlas Search JSON Editor to configure fields of type knnVector.

You can also use Atlas Vector Search with local Atlas deployments that you create with the Atlas CLI. To learn more, see Create a Local Atlas Deployment.

You can't index fields inside arrays of documents or fields inside arrays of objects (Atlas Search embeddedDocuments type) as knnVector type.

The following is the JSON syntax for the knnVector type. Replace the default index definition with the following. To learn more about the fields, see Field Properties.

{
  "mappings": {
    "name": "<index-name>",
    "dynamic": true|false,
    "fields": {
      "<field-name>": {
        "type": "knnVector",
        "dimensions": <number-of-dimensions>,
        "similarity": "euclidean | cosine | dotProduct"
      }
    }
  }
}
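
As an illustration of filling in the placeholders above, the following Python sketch builds an index definition for a single knnVector field and checks the values before you paste the result into the Atlas Search JSON Editor. The helper name `build_knn_vector_index` and its parameters are hypothetical and are not part of any MongoDB library. The sketch assumes the three similarity values shown in the syntax and the 4096-dimension limit documented for this type, and it also assumes a lower bound of 1 dimension.

```python
import json

# Values taken from this page: the three similarity functions and the
# documented upper bound on dimensions.
ALLOWED_SIMILARITY = {"euclidean", "cosine", "dotProduct"}
MAX_DIMENSIONS = 4096


def build_knn_vector_index(field_name, dimensions, similarity, dynamic=False):
    """Hypothetical helper: return an index definition for one knnVector field."""
    is_int = isinstance(dimensions, int) and not isinstance(dimensions, bool)
    # The lower bound of 1 is an assumption; the page only states the upper bound.
    if not is_int or not 1 <= dimensions <= MAX_DIMENSIONS:
        raise ValueError(f"dimensions must be an int between 1 and {MAX_DIMENSIONS}")
    if similarity not in ALLOWED_SIMILARITY:
        raise ValueError(f"similarity must be one of {sorted(ALLOWED_SIMILARITY)}")
    return {
        "mappings": {
            "dynamic": dynamic,
            "fields": {
                field_name: {
                    "type": "knnVector",
                    "dimensions": dimensions,
                    "similarity": similarity,
                }
            },
        }
    }


definition = build_knn_vector_index("plot_embedding", 1536, "euclidean", dynamic=True)
print(json.dumps(definition, indent=2))
```

Validating locally like this only catches the constraints listed on this page; Atlas itself remains the authority on whether a definition is accepted.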

The knnVector type has the following options:

| Option | Type | Necessity | Purpose |
|---|---|---|---|
| type | string | Required | Human-readable label that identifies this field type. Value must be knnVector. |
| dimensions | int | Required | Number of vector dimensions, which Atlas Search enforces at index- and query-time. This value can't be greater than 4096. |
| similarity | string | Required | Vector similarity function to use to search for top K-nearest neighbors. Value can be one of the following: |

  • euclidean - measures the distance between ends of vectors. This allows you to measure similarity based on varying dimensions. To learn more, see Euclidean.

  • cosine - measures similarity based on the angle between vectors. This allows you to measure similarity that isn't scaled by magnitude. You can't use zero magnitude vectors with cosine. To measure cosine similarity, we recommend that you normalize your vectors and use dotProduct instead. To learn more, see Cosine.

  • dotProduct - measures similarity like cosine, but takes into account the magnitude of the vector. This allows you to efficiently measure similarity based on both angle and magnitude. To use dotProduct, you must normalize the vector to unit length at index- and query-time. To learn more, see Dot Product.

Note

If you normalize the magnitude, cosine and dotProduct are almost identical in measuring similarity.
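
To make the relationship between the three functions concrete, here is a minimal pure-Python sketch of the underlying math. It shows the textbook definitions only; the function names are chosen for illustration, and the sketch does not reproduce how Atlas converts these measures into search scores.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; sensitive to both angle and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def magnitude(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot_product(v, v))


def euclidean_distance(a, b):
    """Distance between the ends of two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; undefined for zero magnitude."""
    mag_a, mag_b = magnitude(a), magnitude(b)
    if mag_a == 0 or mag_b == 0:
        raise ValueError("cosine similarity is undefined for zero-magnitude vectors")
    return dot_product(a, b) / (mag_a * mag_b)


def normalize(v):
    """Scale a vector to unit length, as required before using dotProduct."""
    mag = magnitude(v)
    if mag == 0:
        raise ValueError("cannot normalize a zero-magnitude vector")
    return [x / mag for x in v]


a = [3.0, 4.0]
b = [4.0, 3.0]

cosine = cosine_similarity(a, b)
unit_dot = dot_product(normalize(a), normalize(b))

# After normalization, the dot product matches the cosine similarity
# (up to floating-point rounding).
print(math.isclose(cosine, unit_dot))
```

This is why normalizing once at index time and once per query lets dotProduct stand in for cosine: the division by the two magnitudes has already been done.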

The following index definition for the sample_mflix.embedded_movies collection dynamically indexes all the dynamically indexable fields in the collection and statically indexes the plot_embedding field as the knnVector type. The plot_embedding field contains embeddings created using OpenAI's text-embedding-ada-002 embeddings model. The index definition specifies 1536 vector dimensions and measures similarity using euclidean.

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "plot_embedding": {
        "type": "knnVector",
        "dimensions": 1536,
        "similarity": "euclidean"
      }
    }
  }
}

If you load the sample data on your cluster and create the preceding Atlas Search index for this collection, you can run $vectorSearch queries against this collection. To learn more about the sample queries that you can run, see $vectorSearch Examples.
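
As a rough sketch of what such a query could look like from Python, the following builds an aggregation pipeline with a $vectorSearch stage against the plot_embedding field. The helper name `build_vector_search_pipeline` is hypothetical, the index name `default` is an assumption (use whatever name you gave the index), and the query vector here is a placeholder rather than a real embedding. The check that `limit` does not exceed `numCandidates` reflects our understanding of the stage's constraints and should be confirmed against the $vectorSearch reference.

```python
def build_vector_search_pipeline(
    query_vector,
    index_name="default",      # assumption: replace with your index name
    path="plot_embedding",
    dimensions=1536,           # must match the index definition
    num_candidates=100,
    limit=10,
):
    """Hypothetical helper: return a pipeline with a $vectorSearch stage."""
    if len(query_vector) != dimensions:
        raise ValueError(f"query vector must have exactly {dimensions} dimensions")
    if limit > num_candidates:
        raise ValueError("limit must not exceed num_candidates")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                # Python floats are stored as BSON doubles, the type that
                # $vectorSearch expects for vector elements.
                "queryVector": [float(x) for x in query_vector],
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Placeholder vector for illustration; a real query would use an embedding
# produced by the same model that created plot_embedding.
pipeline = build_vector_search_pipeline([0.1] * 1536)

# With PyMongo installed and a connection string in `uri`, the pipeline
# could be run like this:
#   from pymongo import MongoClient
#   movies = MongoClient(uri)["sample_mflix"]["embedded_movies"]
#   for doc in movies.aggregate(pipeline):
#       print(doc)
```

Keeping the pipeline construction separate from the database call makes it easy to check the dimension count before sending a query that Atlas would reject.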
