voyage-code-3: More Accurate Code Retrieval With Lower Dimensional, Quantized Embeddings
December 4, 2024 | Updated: September 9, 2025
TL;DR – Introducing voyage-code-3, our next-generation embedding model optimized for code retrieval. It outperforms OpenAI-v3-large and CodeSage-large by an average of 13.80% and 16.81%, respectively, on a suite of 32 code retrieval datasets. By supporting smaller dimensions with Matryoshka learning and quantized formats such as int8 and binary, voyage-code-3 can also dramatically reduce storage and search costs with minimal impact on retrieval quality.
Note to readers: voyage-code-3 is available through the Voyage AI APIs directly. For access, sign up for Voyage AI.
Since its launch in January, voyage-code-2 has been our most heavily used model, with exponentially increasing adoption by code assistant and agent startups for their code retrieval. Today, we’re thrilled to announce voyage-code-3, which:
- Outperforms OpenAI-v3-large and CodeSage-large by an average of 13.80% and 16.81%, respectively, on a suite of 32 code retrieval datasets.
- Supports embeddings of 2048, 1024, 512, and 256 dimensions.
- Offers multiple embedding quantization options, including float (32-bit floating point), int8 (8-bit signed integer), uint8 (8-bit unsigned integer), binary (bit-packed int8), and ubinary (bit-packed uint8); see the example request after this list.
- Supports a 32K-token context length, compared to OpenAI (8K) and CodeSage-large (1K).
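To make these options concrete, here is a minimal request sketch using the Voyage Python client. The parameter names output_dimension and output_dtype reflect our reading of the feature list above; treat them as assumptions and consult the Voyage AI docs for the exact interface.

```python
# Minimal sketch, assuming the Voyage Python client accepts `output_dimension`
# and `output_dtype` for voyage-code-3; check the official docs for exact names.
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

result = vo.embed(
    ["def binary_search(arr, target):\n    ..."],
    model="voyage-code-3",
    input_type="document",
    output_dimension=1024,   # one of 2048, 1024, 512, 256
    output_dtype="binary",   # one of float, int8, uint8, binary, ubinary
)
print(len(result.embeddings[0]))  # bit-packed: 1024 dims -> 128 int8 values
```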
Matryoshka embeddings and quantization
Storage and search costs in vector-based search can become significant for large corpora, such as in code retrieval over massive repositories. These costs scale linearly with the embedding dimensionality and precision (i.e., the number of bits used to encode each number). voyage-code-3 supports much lower-dimensional embeddings along with binary and int8 quantization to dramatically lower these costs without losing much retrieval quality. Both are enabled by Matryoshka learning and quantization-aware training.
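As a back-of-the-envelope illustration of that linear scaling, the snippet below compares float32 2048-dimensional embeddings with bit-packed binary 1024-dimensional embeddings for a hypothetical corpus of 10 million code chunks; the corpus size is an assumption chosen only for illustration.

```python
# Back-of-the-envelope storage comparison for a hypothetical 10M-chunk corpus.
n_docs = 10_000_000

float32_2048_bytes = n_docs * 2048 * 4   # 4 bytes per dimension
binary_1024_bytes = n_docs * 1024 // 8   # 1 bit per dimension, bit-packed

print(f"float32, 2048-d: {float32_2048_bytes / 1e9:.1f} GB")    # ~81.9 GB
print(f"binary, 1024-d:  {binary_1024_bytes / 1e9:.2f} GB")     # ~1.28 GB
print(f"reduction: {float32_2048_bytes / binary_1024_bytes:.0f}x")  # 64x
```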
Matryoshka embeddings: Matryoshka learning produces a nested family of embeddings of varying lengths within a single vector. Concretely, for each k in {256, 512, 1024}, the first k entries of the 2048-dimensional embedding also form a valid k-dimensional embedding, with only a slight loss of retrieval quality. Users can therefore vectorize their documents into full 2048-dimensional vectors in advance and later shorten them (by taking the first k entries) without re-invoking the embedding model.
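For example, the sketch below shortens a stored 2048-dimensional embedding by taking its first k entries; the re-normalization step is our assumption, useful when similarity is computed with normalized vectors.

```python
import numpy as np

def shorten_embedding(full_emb: np.ndarray, k: int = 256) -> np.ndarray:
    """Take the first k entries of a Matryoshka embedding.

    Re-normalization is assumed here so cosine / dot-product scores stay
    on a comparable scale; adjust if your index normalizes for you.
    """
    short = full_emb[:k]
    return short / np.linalg.norm(short)

# Shrink a pre-computed 2048-d vector to 256-d without re-invoking the model.
full = np.random.randn(2048).astype(np.float32)
print(shorten_embedding(full, k=256).shape)  # (256,)
```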
Quantization: Quantized embeddings use lower precision, with 8 bits or 1 bit per dimension, reducing storage costs by 4x or 32x compared to 32-bit floats, respectively. voyage-code-3 can return lower-precision embeddings in several data types: int8 (8-bit signed integer), uint8 (8-bit unsigned integer), binary (bit-packed int8), and ubinary (bit-packed uint8). Most vector databases support storing and searching quantized embeddings directly, including Milvus, Qdrant, Weaviate, Elasticsearch, and Vespa AI.
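To illustrate how bit-packed embeddings are typically searched, the sketch below ranks documents by Hamming distance over ubinary (uint8) vectors; this is a generic example, not a description of any particular vector database's implementation.

```python
import numpy as np

def hamming_distances(query_bits: np.ndarray, doc_bits: np.ndarray) -> np.ndarray:
    """Hamming distances between one bit-packed query and many bit-packed docs.

    Shapes: query_bits (n_bytes,), doc_bits (n_docs, n_bytes), both uint8,
    e.g. 128 bytes per vector for 1024-dimensional ubinary embeddings.
    """
    xor = np.bitwise_xor(doc_bits, query_bits)       # differing bits
    return np.unpackbits(xor, axis=1).sum(axis=1)    # popcount per document

# Toy example: rank 5 documents against a query (smaller distance = closer).
rng = np.random.default_rng(0)
doc_bits = rng.integers(0, 256, size=(5, 128), dtype=np.uint8)
query_bits = rng.integers(0, 256, size=128, dtype=np.uint8)
print(np.argsort(hamming_distances(query_bits, doc_bits)))
```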
Storage cost vs. retrieval quality tradeoff: Quantization and shorter embeddings inevitably come with some reduction in retrieval quality. Voyage focuses intently on minimizing this quality loss. The following graph plots retrieval quality versus relative storage cost, showing only a limited reduction in quality down to binary 1024-dimensional embeddings compared with float32 2048-dimensional embeddings.

The green data points above the green line represent voyage-code-3; those below are uniquely colored per model and labeled with quantization and dimension. Lines connecting data points represent the same model and data type at different embedding dimensions. The evaluation results used to generate this plot are available in this spreadsheet.
Optimized for code retrieval
Code retrieval presents unique challenges compared to general text retrieval due to the need for algorithmic reasoning and the nuanced syntax rules such as keywords, control structures, nesting, and formatting. These challenges are further complicated by several retrieval subtasks, including text-to-code (e.g., retrieve code snippets using natural language queries), code-to-code (e.g., identify semantically similar code snippets), and docstring-to-code (e.g., retrieve code snippets using function docstring queries).
Curated, massive code training data: We curated a larger, more diverse, and higher-quality code corpus for training voyage-code-3 than the one used for voyage-code-2. First, we assembled a broad corpus of trillions of tokens comprising text, code, and mathematical content with a carefully tuned code-to-text ratio. Next, we developed a comprehensive dataset of positive pairs for contrastive learning based on public GitHub repositories, containing docstring-code and code-code pairs across 300+ programming languages. This dataset was combined with the general text pair dataset used to train our leading general-purpose voyage-3 model. Finally, we collected additional real-world query-code pairs, covering a wide range of tasks in code assistant use cases, to ensure robust coverage of real-world scenarios.
Evaluation: We evaluated voyage-code-3 using an enhanced suite of evaluation datasets designed to address the shortcomings of existing benchmarks and deliver practical, robust results. Existing datasets can suffer from noisy labels, overly simplistic tasks, and data contamination risks, making them ill-suited for real-world applications. For instance, the original CoSQA dataset was found to have 51% of its queries paired with mismatched code. Our evaluation incorporated diverse tasks, such as text-to-code and code-to-code, repurposed question-answer datasets for retrieval, and introduced complex, real-world repositories and scenarios that challenge embedding models to achieve deeper understanding. For a deeper dive into code retrieval evaluation, check out our previous blog post.
Evaluation details
Datasets: We evaluate voyage-code-3 across 32 datasets spanning five categories that cover various code retrieval tasks, real-world use cases, and challenging code scenarios. These datasets are discussed at length in our code retrieval evaluation blog post. The table below summarizes the key datasets.

Models: We evaluate voyage-code-3 alongside several general-purpose and code-specific alternatives, including: OpenAI-v3-large (text-embedding-3-large), OpenAI-v3-small (text-embedding-3-small), CodeSage-large, CodeRankEmbed (cornstack/CodeRankEmbed), Jina-v2-code (jina-embeddings-v2-base-code), voyage-code-2, voyage-3, and voyage-3-lite.
Results
The table below summarizes the key results from the code retrieval quality versus relative storage costs plot above. voyage-code-3 outperforms OpenAI-v3-large on average by:
- 14.64% and 17.66% at 1024 and 256 dimensions, respectively
- 13.80% at 1/3 the storage costs (1024 vs 3072 dimensions)
- 4.81% at 1/384 the storage costs (binary 256 vs float 3072 dimensions)

The bar charts below show the average retrieval quality for each group of datasets (see spreadsheet for a full list of datasets and the grouping). voyage-code-3 outperforms all other models in every group, exceeding OpenAI-v3-large on average by 16.30%.

Binary rescoring: Finally, users sometimes first retrieve a larger set of candidate documents (e.g., 100 in our evaluation) with binary embeddings and then rescore those candidates with full-precision embeddings. For voyage-code-3, as shown in the table, binary rescoring yields up to a 4.25% improvement in retrieval quality over standard binary retrieval.
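A minimal sketch of this two-stage pattern, assuming both binary and full-precision embeddings of the corpus are available: retrieve candidates by Hamming distance on the binary vectors, then re-rank them with full-precision dot products. The function and parameter names here are illustrative.

```python
import numpy as np

def binary_retrieve_then_rescore(
    query_float: np.ndarray,   # (d,) full-precision query embedding
    query_bits: np.ndarray,    # (d // 8,) bit-packed binary query embedding
    doc_bits: np.ndarray,      # (n_docs, d // 8) bit-packed binary doc embeddings
    doc_float: np.ndarray,     # (n_docs, d) full-precision doc embeddings
    k_candidates: int = 100,   # candidate pool size (100 in the evaluation above)
    k_final: int = 10,
) -> np.ndarray:
    # Stage 1: cheap binary retrieval by Hamming distance.
    xor = np.bitwise_xor(doc_bits, query_bits)
    hamming = np.unpackbits(xor, axis=1).sum(axis=1)
    candidates = np.argsort(hamming)[:k_candidates]

    # Stage 2: rescore only the candidates with full-precision similarity.
    scores = doc_float[candidates] @ query_float
    return candidates[np.argsort(-scores)[:k_final]]
```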
All the evaluation results are available in this spreadsheet.
Try voyage-code-3!
voyage-code-3 is available today! The first 200 million tokens are free. To get started, head over to our docs to learn more. If you’re also interested in fine-tuned embedding models, we’d love to hear from you—please email us at contact@voyageai.com.