rerank-2.5 and rerank-2.5-lite: Instruction-Following Rerankers
August 11, 2025 | Updated: September 9, 2025
Note to readers: rerank-2.5 and rerank-2.5-lite are available through the Voyage AI APIs directly. For access, sign up for Voyage AI.
TL;DR – We are excited to introduce the rerank-2.5 series, which significantly improves upon rerank-2’s performance while also introducing instruction-following capabilities for the first time. On our standard suite of 93 retrieval datasets spanning multiple domains, rerank-2.5 and rerank-2.5-lite improve retrieval accuracy over Cohere Rerank v3.5 by 7.94% and 7.16%, respectively.
Furthermore, the new instruction-following feature allows users to steer the model’s output relevance scores using natural language. On the Massive Instructed Retrieval Benchmark (MAIR), rerank-2.5 and rerank-2.5-lite outperform Cohere Rerank v3.5 by 12.70% and 10.36%, respectively, and by similar margins on our in-house evaluation datasets.
Both models now support a 32K token context length – 8x that of Cohere Rerank v3.5 and double that of rerank-2 – enabling more accurate retrieval across longer documents.
Rerankers are a critical component in sophisticated retrieval systems, refining initial search results to deliver superior accuracy. Today, we are excited to announce rerank-2.5 and rerank-2.5-lite. Both models outperform LLMs used as rerankers – a topic we will explore in more depth in an upcoming blog post. These models are the product of an improved mixture of training data and advanced distillation from our larger, in-house instruction-following models.
Both rerank-2.5 and rerank-2.5-lite now support a 32K token context length, an 8x increase over Cohere Rerank v3.5. This allows for the reranking of much longer documents without truncation and comes with no change in pricing.
For an introduction to rerankers, check out our previous post.
Instruction-following capability
A key feature of the rerank-2.5 series is its instruction-following capability. This allows users to dynamically steer the reranking process by providing explicit instructions alongside their query. These instructions can define the user’s notion of relevance or specify the desired characteristics of the documents to be retrieved. Leveraging the new instruction-following capability is straightforward. Users can simply append or prepend natural language instructions to their queries. The model is designed to understand these instructions and adjust the output relevance score accordingly.
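As a concrete illustration, below is a minimal sketch of passing an instruction together with a query through the Voyage AI Python client. The instruction wording, the prepending format, and the example documents are illustrative assumptions rather than a prescribed pattern; consult the docs for the recommended usage in your application.

```python
import voyageai

# Assumes the VOYAGE_API_KEY environment variable is set.
vo = voyageai.Client()

documents = [
    "The Jaguar XF is a mid-size luxury sedan with a turbocharged engine.",
    "Jaguars are large cats native to the Americas, known for their powerful bite.",
    "Our dealership offers certified pre-owned Jaguar models with extended warranties.",
]

# Prepend a natural-language instruction to steer the relevance scores
# toward the e-commerce (car) interpretation of "Jaguar".
instruction = "This is an e-commerce application about cars."
query = f"{instruction} Query: Jaguar reviews"

reranking = vo.rerank(
    query=query,
    documents=documents,
    model="rerank-2.5",  # or "rerank-2.5-lite"
    top_k=2,
)

for result in reranking.results:
    print(f"{result.relevance_score:.3f}  {result.document}")
```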
Examples of instructions – Instructions can include, but are not limited to, the following:
- Emphasizing document components: Specify which parts of a document are most important. For an application that searches academic papers, a standing instruction could be “Prioritize the title and ignore the abstract” to consistently surface the most relevant research based on titles.
- Defining document types: Direct the reranker to retrieve a specific type of document. For example, a legal research tool handling queries such as “legal implications of AI” could be configured with the standing instruction “Retrieve regulatory documents and legal statutes, not court cases” to ensure that statutory law is prioritized over case law.
- Disambiguating queries with context: Provide complementary information so that ambiguous queries can be clarified. For example, an instruction could be “This is an e-commerce application about cars” so that the word “Jaguar” is interpreted as the car brand rather than the animal.
Concrete examples of instructions and the impact of instructions on search results are available in Appendix A.
Accuracy gains from instruction following: The instruction-following feature is particularly useful for search and retrieval tasks where user intent can be nuanced. To demonstrate this, we built an in-house evaluation suite of 24 domain-specific instruction-following datasets across 7 domains (web, tech, legal, finance, conversational, medical, and code). On this domain-specific data, the accuracy of rerank-2.5 and rerank-2.5-lite increases by an average of 8.13% and 7.55%, respectively, when leveraging instructions.

Evaluation details
Datasets: For standard results without instruction following, we conducted an evaluation across 9 domains: technical documentation, code, law, finance, web reviews, multilingual, long documents, medical, and conversations. The multilingual domain is composed of 51 datasets from 31 languages. Detailed information about each of the domains and languages can be found in the rerank-2 release blog.
To evaluate instruction-following capabilities, we utilize a set of in-house domain-specific and real-world instruction-following datasets (detailed in the previous section) as well as the MAIR (Massive Instructed Retrieval) benchmark, an academic benchmark with task-specific instructions in domains such as web, legal, and biomedical search.
Method and metrics: We evaluate the retrieval quality of various rerankers on top of four first-stage search methods: (1) lexical search with BM25, (2) OpenAI v3 large (text-embedding-3-large), (3) voyage-3-large, and (4) voyage-3.5. For each query, the first-stage method retrieves up to 100 candidate documents. The reranker then re-orders these documents, and we retrieve the top 10. We report the normalized discounted cumulative gain (NDCG@10), the standard metric for retrieval quality.
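For reference, here is a small, generic sketch of how NDCG@10 can be computed from graded relevance labels. It uses the common linear-gain formulation and normalizes by the ideal ordering of the judged candidates; it is a simplification for illustration, not our exact evaluation harness.

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: gains are discounted by log2 of the rank position.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (best possible) ordering of the same labels.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance labels of the reranked top-10 documents, in rank order.
print(round(ndcg_at_k([3, 2, 3, 0, 1, 2, 0, 0, 1, 0], k=10), 4))
```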
Baselines: We compare our models against rerank-2-lite, rerank-2, Cohere Rerank v3.5, and Qwen3-Reranker-8B.
Results
rerank-2.5 and rerank-2.5-lite collectively set a new cost-to-performance frontier. Specifically, rerank-2.5 outperforms rerank-2 by 1.85% at the same price per token, while rerank-2.5-lite outperforms rerank-2-lite by 3.40% at the same price per token. Furthermore, rerank-2.5-lite outperforms Qwen3-Reranker-8B, the best open-source reranker, despite being more than an order of magnitude smaller.

Real-world instruction following: In addition to the 24 domain-specific instruction-following datasets, we also curated 3 instruction-following datasets from real-world applications. Evaluation on these datasets shows that the accuracy of rerank-2.5 and rerank-2.5-lite increases by an average of 11.48% and 7.83%, respectively, when leveraging instructions.

Results without instruction following: The first bar chart below shows the average accuracy of each reranker when evaluated across 9 domains without instruction following. rerank-2.5 and rerank-2.5-lite consistently emerge as the top-performing rerankers, regardless of the first-stage retrieval method used. This is not the case for Cohere Rerank v3.5, which hurts retrieval quality when applied on top of voyage-3-large (the most powerful first-stage retrieval method). In particular:
- Averaged across the four first-stage retrieval methods, rerank-2.5 outperforms Cohere Rerank v3.5, Qwen3-Reranker-8B, and rerank-2 by 7.94%, 2.25%, and 1.85%, respectively.
- rerank-2.5-lite, while optimized for latency, still outperforms Cohere Rerank v3.5, Qwen3-Reranker-8B, and rerank-2 by 7.16%, 1.47%, and 1.08%, respectively.
- Both rerank-2.5 and rerank-2.5-lite provide a significant quality improvement on top of all first-stage retrieval results.

The bar charts below illustrate NDCG@10 across different languages. Both rerank-2.5 and rerank-2.5-lite consistently increase performance across the board for all languages and first-stage retrieval methods. Specifically:
- Averaged across the four first-stage retrieval methods, rerank-2.5 outperforms Cohere Rerank v3.5, Qwen3-Reranker-8B, and rerank-2 by 3.26%, 2.34%, and 1.35%, respectively.
- Likewise, rerank-2.5-lite outperforms Cohere Rerank v3.5, Qwen3-Reranker-8B, and rerank-2-lite by 1.93%, 1.01%, and 2.70%, respectively.

Detailed domain-specific and multilingual results using BM25, voyage-3-large, and voyage-3.5 as first-stage retrieval methods can be found in Appendix B.
MAIR benchmark - The figures below illustrate the accuracy gains attained by rerank-2.5 and rerank-2.5-lite on MAIR. Both rerank-2.5 and rerank-2.5-lite consistently improve atop all first-stage search results. Specifically:
- rerank-2.5 outperforms Cohere Rerank v3.5 and rerank-2 by an average of 12.70% and 4.90%, respectively, when evaluated atop the four first-stage retrieval methods.
- rerank-2.5-lite outperforms Cohere Rerank v3.5 and rerank-2 by an average of 10.36% and 2.57%, respectively, when evaluated atop the four first-stage retrieval methods.

Detailed results: Numeric results for all evaluations are available in this spreadsheet.
Try rerank-2.5 and rerank-2.5-lite today!
Both rerank-2.5 and rerank-2.5-lite are available today with flexible, token-based pricing. For existing rerank-2 and rerank-2-lite users, we recommend upgrading to rerank-2.5 and rerank-2.5-lite, respectively. This upgrade provides better quality and double the context length at the same cost. We will continue to offer the rerank-2 series for existing users who do not wish to upgrade to rerank-2.5.
For new users, head over to our docs to get started and learn more; the first 200M tokens are free. As our results show, combining Voyage embedding models with Voyage rerankers delivers the highest possible retrieval accuracy.
Appendix A – Examples of instruction following
| Query & Instruction | Model Results (Top-1 Document) |
|---|---|
| Query: Who is at highest risk for Hand-foot-and-mouth disease? Instruction: Focus on age-related risk factors and the most common age group. | rerank-2 (No Instruction): Children are at the highest risk of getting the disease. rerank-2.5 (With Instruction): The most important risk factor is age. The infection occurs most often in children under age 10, but can be seen in adolescents and occasionally adults. The outbreaks occur most often in the summer and early fall. |
| Query: what does unlock my device mean? Instruction: I am an international business person and need to unlock my work phone for use with multiple carriers. My search should prioritize the implications of unlocking a device provided by my employer, focusing on adherence to my company’s BYOD policies and the impact on global connectivity. | rerank-2 (No Instruction): The term “unlock my device” typically means making a mobile phone or tablet work with a different service provider’s network… It’s like having a key to open up your selection of providers. rerank-2.5 (With Instruction): Gaining access to other network services on a phone initially configured for a specific telecommunications company… For a professional who travels across borders on business transactions, this process allows seamless switching to local service providers, ensuring constant connection to corporate networks and clients, given that this does not contravene any pre-established protocols… |
| Query: Why are historical prices of stocks different on different websites? Which one should I believe? Instruction: Explain the process and challenges of collecting and reconciling historical stock price data across different financial websites. | rerank-2 (No Instruction): I still can’t understand why there is a price discrepancy. There isn’t. It’s the same stock and price differences between such major exchanges will always be minimal… rerank-2.5 (With Instruction): The cause of incomplete/inaccurate financial data’s appearing on free sites is that it is both complicated and expensive to obtain and parse these data. Even within a single country, different pieces of financial data are handled by different authorities… There are some companies (e.g. Bloomberg) whose entire business model is to do the above… |
Appendix B – Figures for domain-specific and multilingual results
The figures below show results on domain-specific datasets without instructions for BM25, voyage-3-large, and voyage-3.5 as the first-stage retrieval method, respectively:

The figures below show results on multilingual datasets when using BM25, voyage-3-large, and voyage-3.5 as the first-stage retrieval method, respectively:
