Creates vector embeddings for multimodal inputs consisting of text, images, or a combination of both. The endpoint accepts inputs containing text and images in any combination and returns their vector representations.
Body
Required
- `inputs`
  A list of multimodal inputs to be vectorized. Each input in the list is a dictionary containing a single key, `"content"`, whose value represents a sequence of text and images.
  - The value of `"content"` is a list of dictionaries, each representing a single piece of text or an image. The dictionaries have four possible keys:
    - `type`: Specifies the type of the content piece. Allowed values are `text`, `image_url`, or `image_base64`.
    - `text`: Only present when `type` is `text`. The value should be a text string.
    - `image_base64`: Only present when `type` is `image_base64`. The value should be a Base64-encoded image in the data URL format `data:[<mediatype>];base64,<data>`. Currently supported media types are `image/png`, `image/jpeg`, `image/webp`, and `image/gif`.
    - `image_url`: Only present when `type` is `image_url`. The value should be a URL linking to the image. PNG, JPEG, WEBP, and GIF images are supported.
  - Note: Only one of the keys `image_base64` or `image_url` should be present in each dictionary for image data. Consistency is required within a request: each request should use either `image_base64` or `image_url` exclusively for images, not both.
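As an illustrative sketch (the helper name is hypothetical, not part of the API), a data URL of the form accepted by `image_base64` can be built from a local image file with Python's standard library:

```python
import base64
import mimetypes

# Media types the endpoint documents as supported for image_base64.
SUPPORTED = {"image/png", "image/jpeg", "image/webp", "image/gif"}

def to_data_url(path: str) -> str:
    # Hypothetical helper: encode a local image file as
    # data:<mediatype>;base64,<data>.
    media_type, _ = mimetypes.guess_type(path)
    if media_type not in SUPPORTED:
        raise ValueError(f"unsupported media type: {media_type}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{media_type};base64,{encoded}"
```

The resulting string can be used directly as the value of an `image_base64` key.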
Example payload where `inputs` contains an image as a URL. The `inputs` list contains a single input, which consists of a piece of text and an image provided via a URL:

```json
{
  "inputs": [
    {
      "content": [
        {
          "type": "text",
          "text": "This is a banana."
        },
        {
          "type": "image_url",
          "image_url": "https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg"
        }
      ]
    }
  ],
  "model": "voyage-multimodal-3.5"
}
```

Example payload where `inputs` contains a Base64 image. The example below is equivalent to the one above, except that the image content is a Base64 image instead of a URL. (Base64 images can be lengthy, so the example shows only a shortened version.)

```json
{
  "inputs": [
    {
      "content": [
        {
          "type": "text",
          "text": "This is a banana."
        },
        {
          "type": "image_base64",
          "image_base64": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAA..."
        }
      ]
    }
  ],
  "model": "voyage-multimodal-3.5"
}
```

The following constraints apply to the `inputs` list:
- The list must not contain more than 1000 inputs.
- Each image must not contain more than 16 million pixels or be larger than 20 MB in size.
- Every 560 pixels of an image counts as one token. Each input in the list must not exceed 32,000 tokens, and the total number of tokens across all inputs must not exceed 320,000.
At least `1` but not more than `1000` elements.
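Under the 560-pixels-per-token rule above, a client-side budget check might look like the following sketch (the constants mirror the documented limits; the helper names are not part of the API):

```python
import math

PIXELS_PER_TOKEN = 560        # every 560 pixels of an image counts as one token
MAX_TOKENS_PER_INPUT = 32_000
MAX_TOTAL_TOKENS = 320_000

def image_tokens(width: int, height: int) -> int:
    # Convert an image's pixel count into its token cost, rounding up.
    return math.ceil(width * height / PIXELS_PER_TOKEN)

def within_budget(token_counts_per_input: list[int]) -> bool:
    # token_counts_per_input: total tokens (text + image) for each input.
    return (all(t <= MAX_TOKENS_PER_INPUT for t in token_counts_per_input)
            and sum(token_counts_per_input) <= MAX_TOTAL_TOKENS)
```

For example, a 1024x1024 image costs `image_tokens(1024, 1024)` tokens, well under the per-input limit.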
- `model`
  The multimodal embedding model to use. Recommended model: `voyage-multimodal-3.5`.
  Values are `voyage-multimodal-3.5` or `voyage-multimodal-3`.
- `input_type`
  Type of the input. Defaults to `null`. Other options: `query`, `document`.
  - When `input_type` is `null`, the embedding model directly converts the `inputs` into numerical vectors. For retrieval or search purposes, where a "query" (which can be text or an image in this case) searches for relevant information among a collection of data referred to as "documents," specify whether your `inputs` are queries or documents by setting `input_type` to `query` or `document`, respectively. In these cases, Voyage automatically prepends a prompt to your `inputs` before vectorizing them, creating vectors more tailored for retrieval or search tasks. Since inputs can be multimodal, "queries" and "documents" can be text, images, or an interleaving of both modalities. Embeddings generated with and without the `input_type` argument are compatible.
  - For transparency, the following prompts are prepended to your input:
    - For `query`, the prompt is "Represent the query for retrieving supporting documents: ".
    - For `document`, the prompt is "Represent the document for retrieval: ".
  Values are `query`, `document`, or `null`.
- `truncation`
  Whether to truncate the inputs to fit within the context length. Defaults to `true`.
  - If `true`, over-length inputs are truncated to fit within the context length before vectorization by the embedding model. If the truncation happens in the middle of an image, the entire image is discarded.
  - If `false`, an error occurs if any input exceeds the context length.
  Default value is `true`.
- `output_encoding`
  Format in which the embeddings are encoded. Defaults to `null`.
  - If `null`, the embeddings are represented as a list of floating-point numbers.
  - If `base64`, the embeddings are represented as a Base64-encoded NumPy array of single-precision floats.
  Values are `base64` or `null`.
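When `output_encoding` is `base64`, the returned string can be decoded back into floats. A stdlib-only sketch, assuming the bytes are little-endian float32 values (the layout NumPy uses on common platforms):

```python
import base64
import struct

def decode_embedding(b64: str) -> list[float]:
    # Decode the Base64 payload, then reinterpret the raw bytes as
    # little-endian single-precision floats (4 bytes each).
    raw = base64.b64decode(b64)
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))
```

With NumPy available, `numpy.frombuffer(base64.b64decode(b64), dtype=numpy.float32)` achieves the same result.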
Example request:

```shell
curl \
  --request POST 'https://ai.mongodb.com/v1/multimodalembeddings' \
  --header "Authorization: Bearer $ACCESS_TOKEN" \
  --header "Content-Type: application/json" \
  --data '{"inputs":[{"content":[{"type":"text","text":"string"}]}],"model":"voyage-multimodal-3.5","input_type":"query","truncation":true,"output_encoding":"base64"}'
```
Request body:

```json
{
  "inputs": [
    {
      "content": [
        {
          "type": "text",
          "text": "string"
        }
      ]
    }
  ],
  "model": "voyage-multimodal-3.5",
  "input_type": "query",
  "truncation": true,
  "output_encoding": "base64"
}
```
Example response:

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        42.0
      ],
      "index": 42
    }
  ],
  "model": "string",
  "usage": {
    "text_tokens": 42,
    "image_pixels": 42,
    "total_tokens": 42
  }
}
```
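Given a success response shaped like the example above, the embeddings can be pulled out in input order (a sketch; `extract_embeddings` is illustrative, and the sample response below is fabricated for demonstration):

```python
def extract_embeddings(response: dict) -> list[list[float]]:
    # Sort by "index" so each embedding lines up with its original input.
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

# A fabricated response in the documented shape, with "data" out of order.
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "embedding": [0.2, 0.3], "index": 1},
        {"object": "embedding", "embedding": [0.0, 0.1], "index": 0},
    ],
    "model": "voyage-multimodal-3.5",
    "usage": {"text_tokens": 5, "image_pixels": 0, "total_tokens": 5},
}
```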
Error responses return a body of the form:

```json
{
  "detail": "string"
}
```