
Get Started with the Haystack Integration

On this page

  • Background
  • Prerequisites
  • Set Up the Environment
  • Create the Atlas Vector Search Index
  • Store Custom Data in Atlas
  • Answer Questions on Your Data
  • Next Steps

You can integrate Atlas Vector Search with Haystack to build custom applications with LLMs and implement retrieval-augmented generation (RAG). This tutorial demonstrates how to start using Atlas Vector Search with Haystack to perform semantic search on your data and build a RAG implementation. Specifically, you perform the following actions:

  1. Set up the environment.

  2. Create an Atlas Vector Search index.

  3. Store custom data on Atlas.

  4. Implement RAG by using Atlas Vector Search to answer questions on your data.

Background

Haystack is a framework for building custom applications with LLMs, embedding models, and vector search. By integrating Atlas Vector Search with Haystack, you can use Atlas as a vector database and use Atlas Vector Search to implement RAG by retrieving semantically similar documents from your data. To learn more about RAG, see Key Concepts.

Prerequisites

To complete this tutorial, you must have the following:

  • An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list.

  • An OpenAI API Key. You must have a paid OpenAI account with credits available for API requests.

  • An environment to run interactive Python notebooks, such as Colab.

Set Up the Environment

You must first set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension, and then run the following code snippets in the notebook:

1. Install and import dependencies.

  1. Run the following command to install the required packages:

    pip --quiet install mongodb-atlas-haystack pymongo

  2. Run the following code to import the required packages:

    import getpass, os
    from haystack import Pipeline, Document
    from haystack.document_stores.types import DuplicatePolicy
    from haystack.components.writers import DocumentWriter
    from haystack.components.generators import OpenAIGenerator
    from haystack.components.builders.prompt_builder import PromptBuilder
    from haystack.components.embedders import OpenAITextEmbedder, OpenAIDocumentEmbedder
    from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
    from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever
2. Set your environment variables.

  Run the following code, and when prompted, provide your OpenAI API key and your Atlas cluster's connection string:

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
os.environ["MONGO_CONNECTION_STRING"]=getpass.getpass("MongoDB Atlas Connection String:")

Note

Your connection string should use the following format:

mongodb+srv://<username>:<password>@<clusterName>.<hostname>.mongodb.net
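Before continuing, you can optionally verify that the connection string works. The following snippet isn't part of the core tutorial; it's a minimal sanity check that assumes the pymongo package installed earlier and the MONGO_CONNECTION_STRING variable you just set:

import os
from pymongo import MongoClient

# Optional check (not part of the original tutorial): ping the cluster.
# This raises an exception if Atlas is unreachable or the credentials
# in the connection string are wrong.
client = MongoClient(os.environ["MONGO_CONNECTION_STRING"])
client.admin.command("ping")
print("Successfully connected to Atlas")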

Create the Atlas Vector Search Index

In this section, you create the haystack_db database and test collection to store your custom data. Then, to enable vector search queries on your data, you create an Atlas Vector Search index.

To create an Atlas Vector Search index, you must have Project Data Access Admin or higher access to the Atlas project.

1. Navigate to the Clusters page for your project.
  1. If it is not already displayed, select the organization that contains your desired project from the Organizations menu in the navigation bar.

  2. If it is not already displayed, select your desired project from the Projects menu in the navigation bar.

  3. If the Clusters page is not already displayed, click Database in the sidebar.

2. Create the haystack_db database and the test collection.
  1. From the Atlas Clusters view, click the Browse Collections button for your cluster.

  2. Click the + Create Database button.

  3. For the Database name, enter haystack_db.

  4. For the Collection name, enter test.

  5. Click Create to create the database and its first collection.

3. Define the Atlas Vector Search index.
  1. Click Create Search Index.

  2. Under Atlas Vector Search, select JSON Editor and then click Next.

  3. In the Database and Collection section, find the haystack_db database, and select the test collection.

  4. In the Index Name field, enter vector_index.

  5. Replace the default definition with the following index definition and then click Next.

    This index definition specifies indexing the embedding field in an index of the vectorSearch type. The embedding field contains the embeddings that you'll create using OpenAI's text-embedding-ada-002 embedding model. The index definition specifies 1536 vector dimensions and measures similarity using cosine.

    {
      "fields": [
        {
          "type": "vector",
          "path": "embedding",
          "numDimensions": 1536,
          "similarity": "cosine"
        }
      ]
    }
4. Click Create Search Index.

  A modal window displays to let you know that your index is building.

5. Check the status of your index.

  The index should take about one minute to build. While it builds, the Status column reads Initial Sync. When it finishes building, the Status column reads Active.
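If you prefer to manage indexes from code instead of the Atlas UI, you can create the same index programmatically. This optional sketch isn't part of the UI-based steps above; it assumes PyMongo 4.7 or later, which adds the type parameter to SearchIndexModel:

import os
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

# Connect to the haystack_db.test collection created above.
collection = MongoClient(os.environ["MONGO_CONNECTION_STRING"])["haystack_db"]["test"]

# Create the vector_index index with the same definition as the JSON Editor step.
# Assumes PyMongo 4.7+ for the type="vectorSearch" parameter.
collection.create_search_index(
    SearchIndexModel(
        name="vector_index",
        type="vectorSearch",
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": "embedding",
                    "numDimensions": 1536,
                    "similarity": "cosine",
                }
            ]
        },
    )
)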

Store Custom Data in Atlas

In this section, you instantiate Atlas as a vector database, also called a document store. Then, you create vector embeddings from custom data and store these documents in a collection in Atlas. Paste and run the following code snippets in your notebook.

1. Run the following code to instantiate Atlas as a document store. This code establishes a connection to your Atlas cluster and specifies the following:

  • haystack_db and test as the Atlas database and collection used to store the documents.

  • vector_index as the index used to run semantic search queries.

document_store = MongoDBAtlasDocumentStore(
    database_name="haystack_db",
    collection_name="test",
    vector_search_index="vector_index",
)
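To confirm that the document store can reach your cluster, you can call its count_documents() method, which is part of Haystack's document store interface. This optional check assumes the test collection is still empty:

# Optional: verify connectivity. A newly created, empty collection returns 0.
print(document_store.count_documents())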
2. Run the following code to embed the sample documents and load them into your collection. This code defines a few sample documents and runs a pipeline with the following components:

  • An embedder from OpenAI to convert your document into vector embeddings.

  • A document writer to populate your document store with the sample documents and their embeddings.

# Create some example documents
documents = [
    Document(content="My name is Jean and I live in Paris."),
    Document(content="My name is Mark and I live in Berlin."),
    Document(content="My name is Giorgio and I live in Rome."),
]

# Initialize a document embedder to convert text content into vectorized form.
doc_embedder = OpenAIDocumentEmbedder()

# Set up a document writer to handle the insertion of documents into the MongoDB collection.
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

# Create a pipeline for indexing documents. The pipeline includes embedding and writing documents.
indexing_pipe = Pipeline()
indexing_pipe.add_component(instance=doc_embedder, name="doc_embedder")
indexing_pipe.add_component(instance=doc_writer, name="doc_writer")

# Connect the components of the pipeline for document flow.
indexing_pipe.connect("doc_embedder.documents", "doc_writer.documents")

# Run the pipeline with the list of documents to index them in MongoDB.
indexing_pipe.run({"doc_embedder": {"documents": documents}})
Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00, 4.16it/s]
{'doc_embedder': {'meta': {'model': 'text-embedding-ada-002',
'usage': {'prompt_tokens': 32, 'total_tokens': 32}}},
'doc_writer': {'documents_written': 0}}

Tip

After running the sample code, you can view your vector embeddings in the Atlas UI by navigating to the haystack_db.test collection in your cluster.
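You can also inspect the stored documents programmatically. The following optional sketch assumes the document store writes each Haystack document's content and embedding fields to the collection, which you can confirm in the Atlas UI:

import os
from pymongo import MongoClient

# Illustrative sketch: fetch one stored document and check its text and
# embedding dimensions. Field names assume the document store's default layout.
doc = MongoClient(os.environ["MONGO_CONNECTION_STRING"])["haystack_db"]["test"].find_one()
print(doc["content"])
print(len(doc["embedding"]))  # 1536 for OpenAI's text-embedding-ada-002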

Answer Questions on Your Data

This section demonstrates how to implement RAG in your application with Atlas Vector Search and Haystack.

The following code defines and runs a pipeline with the following components:

  • An embedder from OpenAI to convert your query into a vector embedding.

  • A retriever to fetch documents from Atlas that are semantically similar to the query embedding.

  • A prompt builder to combine the retrieved documents with your query into a prompt for the LLM.

  • A generator from OpenAI to produce a response based on the prompt.

In this example, you prompt the LLM with the sample query "Where does Mark live?". The LLM generates an accurate, context-aware response from the custom data you stored in Atlas.

# Template for generating prompts based on the retrieved documents.
prompt_template = """
You are an assistant allowed to use the following context documents.\nDocuments:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
\nQuery: {{query}}
\nAnswer:
"""

# Set up a retrieval-augmented generation (RAG) pipeline for generating responses.
rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", OpenAITextEmbedder())

# Add a component for retrieving related documents from MongoDB based on the query embedding.
rag_pipeline.add_component(instance=MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=15), name="retriever")

# Build prompts based on retrieved documents to be used for generating responses.
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")

# Add a language model generator to produce the final text output.
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")

# Connect the components of the RAG pipeline to ensure proper data flow.
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

# Run the pipeline with a sample query.
query = "Where does Mark live?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": query},
        "prompt_builder": {"query": query},
    }
)
print(result["llm"]["replies"][0])
Mark lives in Berlin.
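If you want only the retrieval half of the pipeline, you can run the embedder and retriever directly and skip the LLM. This optional sketch reuses the components already imported above:

# Semantic-search-only sketch: embed the query text, then retrieve the
# most similar documents from Atlas without calling the LLM.
embedder = OpenAITextEmbedder()
retriever = MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=2)

query_embedding = embedder.run(text="Where does Mark live?")["embedding"]
for doc in retriever.run(query_embedding=query_embedding)["documents"]:
    print(doc.content, doc.score)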

Next Steps

MongoDB also provides developer resources for building AI applications with Atlas Vector Search.

