
How to Optimize LLM Applications With Prompt Compression Using LLMLingua and LangChain

Richmond Alake • 13 min read • Published Jun 18, 2024 • Updated Jun 18, 2024
Imagine this scenario: You're basking in the glory of your company-wide recognition for developing a groundbreaking GenAI application when suddenly, a dreaded notification from your manager pops up – the operational costs are skyrocketing. After a swift investigation, you uncover the culprit: LLM inference costs are increasing due to high input token usage caused by the growing number of users leveraging your application.
You can avoid this nightmare situation of skyrocketing operational costs with the right approach. As AI stack engineers and practitioners, we can appreciate the consistent expansion of the context window of LLMs and foundation models provided by the likes of OpenAI, Anthropic, Cohere, and Google.
I’m sure there is a future where you can place an entire library into a single prompt of OpenAI’s GPT model. However, operational efficiency and cost are still top considerations for software applications. By implementing prompt compression techniques, you can significantly reduce the token count of your LLM application's inputs, leading to substantial cost savings without compromising response quality.
This tutorial explores techniques for optimizing token usage within LLM applications, specifically reducing the token footprint of inputs or prompts to LLMs without compromising response quality. 
Here’s what we’ll cover:
  • Definition and overview of prompt compression
  • Implementing prompt compression in LLM applications with LangChain and LLMLingua
  • Applying prompt compression in RAG pipelines
  • Utilizing prompt compression in AI agents
All code presented in this tutorial is found in the repository.

What is prompt compression?

Prompt compression is the process of systematically reducing the number of tokens fed into a large language model to retain or closely match the output quality comparable to that of the original, uncompressed prompt.
The general premise of developing LLM applications is to provide extensive prompts crafted and designed to condition the LLM into providing outputs that follow a specification for structure, reasoning process, information inclusion, exclusion, etc. The positive results of providing LLMs with comprehensive prompts have led to the design of systematic prompt structures such as chain-of-thought, in-context learning, and ReAct prompting.
Although extensive prompting does produce desirable LLM results, there is a tradeoff between the output quality gained from extensive prompts and factors such as increased token count, computational overhead, and response latency.
To limit the tradeoffs developers and engineers have to make when building LLM applications, techniques for reducing the token count of input prompts began to emerge. Focus was placed on techniques for determining which aspects of an extensive prompt are important enough to retain in a compressed prompt version.
The guiding principle behind most prompt compression efforts is the notion that, although extensive prompts elicit desirable output from LLMs, a prompt's length or descriptiveness is not the sole contributor to response quality. Rather, it is the inclusion of key information and context that steers LLMs into a favorable response space. While humans may benefit from an extensive explanation of an idea, concept, or problem to aid comprehension, LLMs require only a small amount of key information to gain understanding.
The practical advantage of employing prompt compression techniques is the ability to maintain the general information in an uncompressed prompt while significantly reducing the number of tokens. However, it's important to note that prompt compression techniques may result in information loss, mainly information mentioned once or rarely in the uncompressed prompt. Despite this, the overall benefits of prompt compression, such as improved token optimization and reduced computational overhead, make it a valuable strategy for LLM applications.
The paper “Prompt Compression and Contrastive Conditioning for Controllability and Toxicity Reduction in Language Models,” published on October 6, 2022, introduced prompt compression as an official term and technique.
Fast-forward just a few years, and several prompt compression techniques have emerged. These techniques collectively tackle the challenge of token optimization within LLM applications while maintaining LLM controllability and output quality. Microsoft’s LLMLingua Python library presents one method that has gained significant traction.

Implementing prompt compression in LLM applications with LLMLingua

The paper “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models” by Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu from Microsoft Corporation introduces LLMLingua, a framework for compressing prompts fed into large language models (LLMs). It addresses the rising computational costs and inference latency caused by prompt engineering techniques that improve LLM output quality via extensive, descriptive prompts.
The results presented in the paper show that LLMLingua compresses prompts by a significant factor and still retains output quality similar to that of the uncompressed prompt.
Figure: Overview of LLMLingua-2, from the LLMLingua-2 paper.
Two subsequent papers followed from the Microsoft research team that improved on LLMLingua prompt compression capabilities. “LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression” introduced a method of determining important information within uncompressed prompts that led to better quality compression. “LLMLingua-2: Data Distillation for Efficient and Faithful Task-agnostic Prompt Compression” introduced techniques to make prompt compression generalizable and task-agnostic.

How to implement prompt compression in RAG pipelines

Implementing prompt compression techniques using LLMLingua in retrieval-augmented generation (RAG) pipelines can be relatively straightforward. The LLMLingua Python package provides intuitive methods and class constructors, enabling seamless access to compression techniques and models designed for efficient prompt compression.
One significant advantage of using LLMLingua is its integration with widely adopted LLM abstraction frameworks such as LangChain and LlamaIndex. In this tutorial section, you'll observe the implementation of compression techniques within a simple RAG pipeline and a RAG pipeline implementation with LangChain that leverages LLMLingua for prompt compression.
Note that the code snippets in this section are part of a broader notebook. The snippets below highlight the key parts of the implementation, first using just the LLMLingua Python package and then leveraging the LangChain integration.
Below is a code snippet that illustrates the initialization of a prompt compressor object from the LLMLingua library. First, we import the necessary class and initialize an instance of PromptCompressor with a specific model configuration.
The steps in the above code snippet are as follows:
  • Importing PromptCompressor: The PromptCompressor class is imported from the llmlingua module. This class compresses the prompts using a specified model, accompanying configuration and other specific compressor details.
  • Creating an instance of PromptCompressor: An instance of PromptCompressor is created and assigned to the variable llm_lingua. The parameters provided during initialization are:
  1. model_name: Specifies the model to be used for the task of compression, in this case, "microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank"
  2. model_config: Configuration settings for the model, here specifying the revision to use as "main"
  3. use_llmlingua2: A boolean flag indicating whether to use the LLMLingua2, which leverages a task-agnostic approach to prompt compression
  4. device_map: Specifies the device on which to load the compressor model, in this case, "cpu"; if utilizing hardware accelerators such as a GPU, specify “cuda” as the value for this argument
The next step is to define a function that takes a simple string containing the uncompressed prompt to be fed into the LLM and passes it as input to the instance of the prompt compressor created in the previous code. The function will also specify configuration parameters that steer the compression algorithm.
  • Function definition: A function named compress_query_prompt is defined and takes a single parameter context.
  • Compressing the prompt: Inside the function, the compress_prompt method of the PromptCompressor instance, llm_lingua, is called with the following arguments:
  1. context: The input context, converted to a string format
  2. rate: Specifies the compression rate; here, 0.33 indicates that the compressor will aim to compress the prompt to 33% of its original uncompressed size
  3. force_tokens: A list of tokens (["!", ".", "?", "\n"]) that should be preserved in the compressed prompt and not removed during compression
  4. drop_consecutive: A boolean flag indicating whether to drop tokens from the force_tokens list when they appear consecutively in the compressed prompt
The LLMLingua library's PromptCompressor class provides a powerful compress_prompt method that returns a comprehensive dictionary containing crucial information about the prompt compression process. This dictionary encapsulates the following key elements:
  • compressed_prompt: The reduced, optimized version of the original prompt, achieved through LLMLingua's advanced compression techniques
  • origin_tokens: The initial token count calculated from the uncompressed, original prompt, serving as a baseline for evaluating the compression efficacy
  • compressed_tokens: The token count of the compressed prompt, providing a quantitative measure of the reduction achieved
  • ratio: A metric representing the ratio of the compressed token count to the original token count, offering a comparison of the compression level
  • rate: The rate at which compression was achieved, expressed as a ratio
  • saving: A monetary value indicating the projected cost savings resulting from the reduced token usage, calculated based on the current pricing model for GPT-4
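As an illustration, the returned dictionary has roughly this shape (the values below are made up for demonstration):

```python
# Illustrative output of llm_lingua.compress_prompt (values are invented)
result = {
    "compressed_prompt": "MongoDB Atlas fully managed cloud database...",
    "origin_tokens": 2236,     # token count of the original prompt
    "compressed_tokens": 758,  # token count after compression
    "ratio": "2.9x",           # original tokens relative to compressed tokens
    "rate": "33.9%",           # compressed tokens as a share of the original
    "saving": "Saving $0.09 in GPT-4.",  # projected cost saving
}
```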
Incorporating the compress_query_prompt function from LLMLingua into an existing RAG pipeline is a straightforward process that can significantly enhance the efficiency and cost-effectiveness of your LLM applications. By inserting the prompt compression operation right before the input is fed to the LLM to generate a response, you can leverage LLMLingua's advanced compression techniques to optimize token usage and reduce computational overhead.
Here's a code snippet illustrating the integration of the compress_query_prompt function within a RAG pipeline. The complete code is available in the notebook.
To an AI stack engineer, leveraging abstraction frameworks such as LangChain is a common practice for streamlining the development and deployment of LLM applications. In this context, incorporating prompt compression into your existing RAG pipeline is a process made simple by integrating LangChain and LLMLingua.
Notably, the improvements introduced by LLMLingua extend beyond enhancing prompt efficiency; it also enables the compression of documents within retrieval systems. By harnessing LLMLingua's advanced compression techniques, you can optimize the storage and retrieval of contextual information, leading to more efficient and cost-effective operations.
Here's how you can leverage LLMLingua within a LangChain retrieval pipeline:
Note that the code snippet below is part of an extensive implementation in the notebook.
The steps in the code snippet above are as follows:
  • Importing necessary modules: The code imports ContextualCompressionRetriever from langchain.retrievers and LLMLinguaCompressor from langchain_community.document_compressors.
  • Initializing the compressor: An instance of LLMLinguaCompressor is created with the model "openai-community/gpt2" and set to load the model on the CPU.
  • Creating the compression retriever: The ContextualCompressionRetriever is initialized with the LLMLinguaCompressor and a base retriever.
  • Invoking the retriever: The compression_retriever invokes a query, "Who is the CEO?", which retrieves and compresses relevant documents.
  • Output: The compressed documents are printed, showing the relevant information in a compressed format.
To further enhance the retrieval and query capabilities, you can integrate the prompt compression setup with RetrievalQA from LangChain. In the example below, RetrievalQA creates a question-answering chain that leverages the compression_retriever. 

How to implement prompt compression in AI agents

Let’s take things a step further.
Agentic systems are on the horizon, or perhaps you already have a few demos on AI agents. If you don’t, check out the repository for notebooks on agents with various models and frameworks using MongoDB as the agent’s memory provider.
An AI agent is an artificial computational entity that perceives its environment through inputs, acts using tools, and processes information with foundation models supported by long-term and short-term memory. These agents are developed to accomplish specified tasks or objectives, leveraging available resources.
The operation of an AI agent is characterized by a recursive or cyclical nature, often involving one or more interconnected systems. Each step in this iterative process requires input to an LLM, functioning as the agent's cognitive core or brain. It’s easy to see how the input fed into an LLM can grow within every cycle or iteration of an agent’s operation, especially if conversational memory is integrated into the agent's capabilities and required to be used as input to the LLM. A system such as this can increase an agentic system's operational cost and response times.
You now understand that prompt compression techniques reduce the token utilization of inputs to LLM applications, such as RAG applications. However, the advantages of prompt compression extend beyond these applications, proving beneficial for AI agents as well. The key takeaway is that AI agents executing in extensive operational windows and likely to utilize the full extent of an LLM's context window can significantly benefit from prompt compression.
AI agents require various input components, including extensive conversational histories, operational data, system definitions, and prompts. Prompt compression allows AI agents to manage and compactly organize the context provided as input, enabling efficient and scalable operations. By implementing prompt compression techniques, AI agents can intelligently compress and optimize the token count of their input components, such as conversational histories and operational data. 
Furthermore, prompt compression allows AI agents to manage the total token count of their inputs, ensuring that the combined context remains within the specified limits of the underlying LLM. This is particularly crucial for agents operating in complex environments or handling extensive conversational histories, where the cumulative token count can quickly escalate, leading to increased computational overhead and potential performance bottlenecks.
There are two ways you can provide agents with the ability to compress prompts using LangChain:
  1. Prompt compression as a tool: Define the prompt compression operation as a tool the agent can use during its operation.
  2. Retriever with compression: Create a LangChain retrieval tool object with a defined compression logic.
The code snippet below demonstrates implementing prompt compression logic as a tool definition for an agent built using LangChain. The code below is just an outline; get the complete code.
The code above creates a prompt compression tool and provides the tool to the agent. Now, the agent is aware of the tool and can determine when to utilize it during its operation. The disadvantage is that the compression of prompts or input to the LLM is left to the agent's discretion. This means the agent must effectively assess the need for compression, which could introduce variability in performance if the agent's decision-making process is not optimal. Additionally, there may be scenarios where the agent overuses or underuses the compression tool, potentially leading to inefficiencies or loss of important contextual information. 
Creating the retriever with base compression capabilities can significantly enhance consistency and effectiveness by reducing the variability in the AI agent's use of compression logic. By embedding the compression mechanism directly within the retriever, the agent can uniformly apply compression to all relevant data, ensuring prompts and inputs are consistently optimized for efficient processing. The code snippet below demonstrates how this logic can be implemented. View the entire implementation.
This approach minimizes the need for the agent to independently assess and decide when to apply compression, thereby reducing the potential for inconsistent performance and enhancing overall reliability.
Yet again, the agent is responsible for utilizing the retriever tool.


Conclusion

This tutorial has shown, from both an explanatory and a technical perspective, that prompt compression is crucial for optimizing LLM applications such as RAG applications and AI agents. The benefits of prompt compression explored in this tutorial include optimized token usage, reduced computational overhead, and minimized operational costs.
Through technical demonstration, you’ve explored the LLMLingua library, built by Microsoft Research. This library offers a robust framework for compressing prompts while preserving response quality. By integrating LLMLingua with popular abstraction frameworks like LangChain, AI stack engineers can effortlessly incorporate prompt compression into their existing RAG pipelines and AI agent systems.
As generative AI projects move from prototype to production, the demand for efficient and cost-effective solutions will only increase. Leveraging prompt compression and similar techniques within prototyping stages enables developers and engineers to understand the optimization areas of LLM applications, the tools best suited to optimize these areas, and some evaluative metrics to measure overall performance. 
For your next journey, you can explore one of the below:
Happy Hacking!


FAQs

1. What is prompt compression? Prompt compression is the process of systematically reducing the number of tokens fed into a large language model to retain or closely match the output quality comparable to that of the original, uncompressed prompt. This helps in optimizing operational costs and efficiency.
2. Why is prompt compression important for LLM applications?  As the use of LLMs grows, the token count of inputs can quickly escalate, leading to increased operational costs and latency. Prompt compression allows developers to optimize token usage, reducing computational overhead, response latency, and associated expenses.
3. How does LLMLingua facilitate prompt compression? LLMLingua is a Python library introduced by Microsoft that provides a framework for compressing prompts fed into LLMs. It uses advanced compression techniques to significantly reduce the token count while preserving essential information and response quality. It integrates seamlessly with frameworks like LangChain and LlamaIndex, making it easy to implement in various applications.
4. How can prompt compression benefit AI agents? AI agents often require various input components, including conversational histories, operational data, and prompts. Prompt compression allows agents to manage and compact this context, enabling efficient and scalable operations while minimizing computational overhead and potential performance bottlenecks.

