What are Rerankers?
FAQs
What are the best reranker models?
The best reranking model depends on your search system's specific needs, such as the precision of the initial retrieval stage and the latency budget of the user experience. A larger reranking model offers best-in-class precision and accuracy, but at the cost of more computational resources and increased downstream latency. Conversely, a smaller reranker is faster but less precise. Many factors come into play, so extensive testing and experimentation should guide your choice of reranker model.
What is the difference between reranking and embedding models?
Embedding models are bi-encoders, meaning they encode the query and the documents separately. They produce embeddings that capture semantic and contextual meaning in high-dimensional vectors. Because these embeddings reflect what the model learned during training and depend only on the input text, they can be computed once and then reused for vector search via efficient vector similarity computations.
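For illustration, here is a minimal bi-encoder sketch in Python, assuming the sentence-transformers library is available; the model name, example documents, and query are illustrative choices, not prescriptions:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any bi-encoder embedding model works similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Rerankers score query-document pairs for relevance.",
    "Vector databases store embeddings for similarity search.",
]

# Document embeddings are computed once and can be cached or indexed.
doc_embeddings = model.encode(documents)

# At query time, only the query itself needs to be embedded.
query_embedding = model.encode("How does a reranker work?")

# Cosine similarity between the query vector and every document vector.
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)  # one similarity score per document
```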
Reranker models, on the other hand, are cross-encoders: they encode each query-document pair together and produce a relevance score. Unlike embedding models, rerankers do not produce embeddings; instead, every pair must pass through the model at query time. This makes reranking computationally expensive, but it yields more precise relevance scoring and a better understanding of the contextual nuances and relationships between the query and the documents.
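A minimal cross-encoder sketch, again assuming sentence-transformers; the cross-encoder model name is an illustrative choice:

```python
from sentence_transformers import CrossEncoder

# Illustrative cross-encoder model; swap in your reranker of choice.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does a reranker work?"
documents = [
    "Rerankers score query-document pairs for relevance.",
    "Vector databases store embeddings for similarity search.",
]

# Each (query, document) pair passes through the model together,
# producing a relevance score rather than an embedding.
pairs = [(query, doc) for doc in documents]
scores = reranker.predict(pairs)

# Order documents from most to least relevant.
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")
```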
What is the role of a reranker in a RAG pipeline?
Rerankers are usually found in the multi-stage retrieval systems of RAG pipelines. Retrieval is arguably the most important component of RAG, as the quality of the retrieved documents directly impacts the accuracy of the downstream response from the generative model. The reranking step ensures that only the most relevant of the initially retrieved documents are included in the prompt, improving the accuracy of the generated answer.
A typical RAG system ingests unstructured data chunks, converts them into embeddings, and loads them into a vector database. At query time, the query is embedded and its vector is compared against all document embeddings to compute similarity scores, retrieving an initial set of relevant chunks. This vector search is fast and efficient, but it can miss nuance and context, because embeddings compress semantic information into fixed numerical representations. A reranker is then used to refine the broad set of initial chunks into a smaller set of highly relevant ones. The reranked documents are concatenated and sent together with the query to the generative model to produce the response, as sketched below.
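As a rough sketch of this two-stage flow, the following Python example combines an in-memory vector search with a cross-encoder rerank. The model names, chunk contents, and the build_context function are all illustrative assumptions; a production system would use a vector database for the first stage:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Illustrative models; a real system would pick these through testing.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Toy corpus standing in for ingested chunks; a real pipeline would store
# these embeddings in a vector database rather than in memory.
chunks = [
    "Rerankers refine an initial set of retrieved documents.",
    "Embeddings compress semantic meaning into numerical vectors.",
    "Cross-encoders score each query-document pair jointly.",
]
chunk_embeddings = embedder.encode(chunks)  # computed once at ingestion time

def build_context(query: str, top_k: int = 20, final_k: int = 2) -> str:
    # Stage 1: fast, broad vector search over all chunk embeddings.
    query_embedding = embedder.encode(query)
    hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=top_k)[0]
    candidates = [chunks[hit["corpus_id"]] for hit in hits]

    # Stage 2: the reranker narrows the candidates to the most relevant.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), reverse=True)
    top_chunks = [c for _, c in ranked[:final_k]]

    # The reranked chunks and the query together form the prompt context
    # that is sent to the generative model.
    return "\n\n".join(top_chunks) + "\n\nQuestion: " + query

print(build_context("What does a reranker do?"))
```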