Perform Parent Document Retrieval with MongoDB and LangChain

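You can integrate MongoDB Vector Search with LangChain to perform parent document retrieval. In this tutorial, you complete the following steps:

Set up the environment.
Prepare the data.
Instantiate the parent document retriever.
Create the MongoDB Vector Search index.
Use the retriever in a RAG pipeline.

Work with a runnable version of this tutorial as a Python notebook.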

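Background

Parent document retrieval is a retrieval technique that chunks large documents into smaller child documents. The technique queries the smaller chunks, then returns the full parent document to the LLM. This can improve the responses of RAG agents and applications, because the search runs at fine granularity over small chunks while the LLM still receives the full context of the parent document.

Parent document retrieval with MongoDB lets you store both parent and child documents in a single collection. This supports efficient retrieval, because only the child documents' embeddings need to be computed and indexed.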
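The flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the LangChain implementation: a word-overlap score stands in for vector similarity, the sample texts are made up, and `split_into_children`, `score`, and `retrieve_parents` are hypothetical helper names.

```python
# Toy parent documents keyed by ID (illustrative data, not from the tutorial).
parents = {
    "p1": "MongoDB stores documents. Vector search finds similar embeddings.",
    "p2": "LangChain builds chains. Retrievers fetch context for an LLM.",
}


def split_into_children(parent_id, text):
    """Split a parent into sentence-sized children that remember their parent."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [{"doc_id": parent_id, "text": s} for s in sentences]


def score(query, text):
    """Word-overlap score; a stand-in for vector similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, k=2):
    """Search the children, then return the distinct parents of the top-k hits."""
    children = [
        child
        for parent_id, text in parents.items()
        for child in split_into_children(parent_id, text)
    ]
    ranked = sorted(children, key=lambda c: score(query, c["text"]), reverse=True)
    hits = [c for c in ranked[:k] if score(query, c["text"]) > 0]
    # dict.fromkeys removes duplicate parent IDs while keeping rank order.
    parent_ids = list(dict.fromkeys(c["doc_id"] for c in hits))
    return [parents[parent_id] for parent_id in parent_ids]
```

The search runs over the small children, but the caller receives whole parents, and several matching children from one parent collapse into a single result.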

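Prerequisites

To complete this tutorial, you must have the following:

One of the following MongoDB cluster types:
- An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later. Ensure that your IP address is included in your Atlas project's access list.
- A local Atlas deployment created using the Atlas CLI. To learn more, see Create a Local Atlas Deployment.
- A MongoDB Community or Enterprise cluster with Search and Vector Search installed.
A Voyage AI API key. To create an account and API key, see the Voyage AI website.
An OpenAI API key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.
An environment to run interactive Python notebooks, such as Colab.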

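Set Up the Environment

Set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension. The notebook lets you run Python code snippets individually, and you use it to run the code in this tutorial.

To set up your notebook environment:

Install and import dependencies.

Run the following command: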

pip install --quiet --upgrade langchain langchain-community langchain-core langchain-mongodb langchain-voyageai langchain-openai pymongo pypdf

Define environment variables.

Run the following code, replacing the placeholders with the following values:

Your Voyage AI and OpenAI API keys.
Your MongoDB cluster's connection string.

import os
os.environ["VOYAGE_API_KEY"] = "<voyage-api-key>"
os.environ["OPENAI_API_KEY"] = "<openai-api-key>"
MONGODB_URI = "<connection-string>"

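Note

Replace <connection-string> with the connection string for your Atlas cluster or local Atlas deployment.

For an Atlas cluster, your connection string should use the following format:

mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net

To learn more, see Connect to a Cluster via Drivers.

For a local Atlas deployment, your connection string should use the following format:

mongodb://localhost:<port-number>/?directConnection=true

To learn more, see Connection Strings.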
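A quick sanity check can catch a placeholder that was never replaced before any network call is made. The following is a minimal sketch using only the standard library; `describe_connection_string` is a hypothetical helper, and it inspects only the URI scheme and leftover angle brackets rather than validating the full connection string.

```python
from urllib.parse import urlparse


def describe_connection_string(uri):
    """Return the kind of MongoDB connection string, or raise ValueError.

    Hypothetical helper for illustration; it inspects only the URI scheme
    and looks for angle-bracket placeholders that were never replaced.
    """
    if "<" in uri or ">" in uri:
        raise ValueError("Connection string still contains a <placeholder>")
    scheme = urlparse(uri).scheme
    if scheme == "mongodb+srv":
        return "srv"
    if scheme == "mongodb":
        return "standard"
    raise ValueError(f"Unexpected scheme: {scheme!r}")
```

For example, `describe_connection_string(MONGODB_URI)` would raise a `ValueError` while `MONGODB_URI` is still set to `"<connection-string>"`.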

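Prepare the Data

Paste and run the following code in your notebook to load and chunk a sample PDF that contains a recent MongoDB earnings report.

This code uses a text splitter to chunk the PDF data into smaller parent documents. It specifies the chunk size (number of characters) and chunk overlap (number of overlapping characters between consecutive chunks) for each document.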

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
# Load the PDF
loader = PyPDFLoader("https://investors.mongodb.com/node/12881/pdf") 
data = loader.load()
# Chunk into parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=20)
docs = parent_splitter.split_documents(data)
# Print a document
docs[0]

Document(metadata={'source': 'https://investors.mongodb.com/node/12881/pdf', 'page': 0, 'page_label': '1'}, page_content='MongoDB, Inc. Announces Third Quarter Fiscal 2025 Financial Results\nDecember 9, 2024\nThird Quarter Fiscal 2025 Total Revenue of $529.4 million, up 22% Year-over-Year\nContinued Strong Customer Growth with Over 52,600 Customers as of October 31, 2024\nMongoDB Atlas Revenue up 26% Year-over-Year; 68% of Total Q3 Revenue\nNEW YORK , Dec. 9, 2024 /PRNewswire/ -- MongoDB, Inc. (NASDAQ: MDB) today announced its financial results for the third quarter ended October\n31, 2024.\n\xa0\n  \xa0\n"MongoDB\'s third quarter results were significantly ahead of expectations on the top and bottom line, driven by better-than-expected EA performance\nand 26% Atlas revenue growth.\xa0 We continue to see success winning new business due to the superiority of MongoDB\'s developer data platform in\naddressing a wide variety of mission-critical use cases," said Dev Ittycheria, President and Chief Executive Officer of MongoDB .\n"We continue to invest in our legacy app modernization and AI offerings as our document model and distributed architecture are exceptionally well\nsuited for customers looking to build highly-performant, modern applications.\xa0 MongoDB  is in a great position to be a central pillar of the emerging AI\ntech stack and benefit from the next wave of application development in the years ahead."\nThird Quarter Fiscal 2025 Financial Highlights\nRevenue: Total revenue was $529.4 million for the third quarter of fiscal 2025, an increase of 22% year-over-year.\nSubscription revenue was $512.2 million, an increase of 22% year-over-year, and services revenue was $17.2 million, an\nincrease of 18% year-over-year.\nGross Profit: Gross profit was $394.0 million for the third quarter of fiscal 2025, representing a 74% gross margin\ncompared to 75% in the year-ago period. 
Non-GAAP gross profit was $405.7 million, representing a 77% non-GAAP gross\nmargin, consistent with a non-GAAP gross margin of 77% in the year-ago period.')
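To make the chunk size and chunk overlap settings concrete, here is a minimal fixed-window sketch in plain Python. It is not the algorithm that RecursiveCharacterTextSplitter uses (that splitter prefers to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper that only shows how the two numbers interact.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not just a repeat of the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

A larger overlap repeats more text across chunk boundaries, which reduces the chance that a sentence is cut off from its context at the cost of more chunks to store and embed.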

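Instantiate the Retriever

In this section, you instantiate the parent document retriever and use it to ingest data into MongoDB.

MongoDBAtlasParentDocumentRetriever chunks parent documents into smaller child documents, embeds the child documents, and then ingests both parent and child documents into the same collection in MongoDB. Internally, this retriever creates the following:

An instance of MongoDBAtlasVectorSearch, a vector store that handles vector search queries over the child documents.
An instance of MongoDBDocStore, a document store that stores and retrieves the parent documents.

Instantiate the retriever.

The fastest way to configure MongoDBAtlasParentDocumentRetriever is to use the from_connection_string method. This code specifies the following parameters:

connection_string: Your Atlas connection string for connecting to your cluster.
child_splitter: The text splitter to use to split the parent documents into smaller child documents.
embedding_model: The embedding model to use to embed the child documents.
database_name and collection_name: The database and collection names into which to ingest the documents.
The following optional parameters to configure the MongoDBAtlasVectorSearch vector store:
- text_key: The field in the documents that contains the text to embed.
- relevance_score_fn: The relevance score to use for the vector search query.
- search_kwargs: The number of child documents to retrieve in the initial search.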

from langchain_mongodb.retrievers import MongoDBAtlasParentDocumentRetriever
from langchain_voyageai import VoyageAIEmbeddings
# Define the embedding model to use
embedding_model = VoyageAIEmbeddings(model="voyage-3-large")
# Define the chunking method for the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
# Specify the database and collection name
database_name = "langchain_db"
collection_name = "parent_document"
# Create the parent document retriever
parent_doc_retriever = MongoDBAtlasParentDocumentRetriever.from_connection_string(
    connection_string = MONGODB_URI,
    child_splitter = child_splitter,
    embedding_model = embedding_model,
    database_name = database_name,
    collection_name = collection_name,
    text_key = "page_content",
    relevance_score_fn = "dotProduct",
    search_kwargs = { "k": 10 },
)
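The `dotProduct` relevance score is normally paired with unit-length embeddings, where it ranks results the same way as cosine similarity. The following standard-library sketch shows that equivalence on two made-up vectors; it assumes the embedding model returns normalized vectors, which you should confirm for the model you use.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity, which divides out both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Illustrative vectors, not real embeddings.
a, b = [3.0, 4.0], [4.0, 3.0]
unit_dot = dot(normalize(a), normalize(b))
print(math.isclose(unit_dot, cosine(a, b)))  # True
```

If the vectors are not normalized, the dot product also rewards longer vectors, so the two measures can rank results differently.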

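Ingest the data.

Next, run the following code to ingest the documents into Atlas by using the retriever's add_documents method. The retriever takes the parent documents as input and ingests both the parent and child documents according to how you configured it.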

parent_doc_retriever.add_documents(docs)

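(Optional) Review your documents.

After running the sample code, you can view the documents in the Atlas UI by navigating to the langchain_db.parent_document collection in your cluster.

Both parent and child documents have a page_content field that contains the chunked text. Child documents also have an embedding field that contains the vector embedding of the chunked text, and a doc_id field that corresponds to the _id of the parent document.

Run the following queries in the Atlas UI, replacing the <id> placeholder with a valid document ID.

To view the child documents that share the same parent document ID: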
```
{ doc_id: "<id>" }
```
To view the parent document of those child documents:
```
{ _id: "<id>" }
```
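The relationship between the two filters can be sketched with an in-memory list of documents. All IDs, text, and embedding values below are made up for illustration, and `find` is a hypothetical helper that mimics only simple equality filters, not the MongoDB query language.

```python
# Illustrative documents shaped like the fields described above.
sample_docs = [
    {"_id": "parent-1", "page_content": "Revenue grew. Atlas revenue grew faster."},
    {"_id": "child-1", "doc_id": "parent-1",
     "page_content": "Revenue grew.", "embedding": [0.1, 0.9]},
    {"_id": "child-2", "doc_id": "parent-1",
     "page_content": "Atlas revenue grew faster.", "embedding": [0.8, 0.2]},
]


def find(docs, query_filter):
    """Return the documents whose fields equal every pair in the filter."""
    return [
        doc for doc in docs
        if all(doc.get(key) == value for key, value in query_filter.items())
    ]


children = find(sample_docs, {"doc_id": "parent-1"})  # both child documents
parent = find(sample_docs, {"_id": "parent-1"})       # the single parent
```

Against a real deployment, the same filters could be passed to `find` or `find_one` on a PyMongo collection object.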

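Create the MongoDB Vector Search Index

To enable vector search queries on the langchain_db.parent_document collection, you must create a MongoDB Vector Search index. You can use either the LangChain helper method or the PyMongo driver method. Run the code for your preferred method in your notebook; the two code blocks that follow are alternatives.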

# Get the vector store instance from the retriever
vector_store = parent_doc_retriever.vectorstore
# Use helper method to create the vector search index
vector_store.create_vector_search_index(
   dimensions = 1024 # The number of dimensions to index
)

Tip

See the create_vector_search_index API reference.

from pymongo import MongoClient
from pymongo.operations import SearchIndexModel
# Connect to your cluster
client = MongoClient(MONGODB_URI)
collection = client[database_name][collection_name]
# Create your vector search index model, then create the index
vector_index_model = SearchIndexModel(
   definition={
      "fields": [
         {
         "type": "vector",
         "path": "embedding",
         "numDimensions": 1024,
         "similarity": "dotProduct"
         }
      ]
   },
   name="vector_index",
   type="vectorSearch"
)
collection.create_search_index(model=vector_index_model)

The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
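Because the index is not usable right away, a notebook can poll until it is ready rather than waiting a fixed time. The sketch below is one possible approach: `wait_until_queryable` is a hypothetical helper, and it assumes the index descriptions it receives carry a boolean `queryable` field, as the documents returned by PyMongo's `list_search_indexes` do.

```python
import time


def wait_until_queryable(list_indexes, timeout_s=120.0, poll_s=5.0, sleep=time.sleep):
    """Poll until the first index returned by list_indexes() is queryable.

    list_indexes is any zero-argument callable that returns an iterable of
    index description dicts. The sleep function is injectable for testing.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        indexes = list(list_indexes())
        if indexes and indexes[0].get("queryable") is True:
            return indexes[0]
        if time.monotonic() >= deadline:
            raise TimeoutError("Search index was not queryable before the timeout")
        sleep(poll_s)
```

With the PyMongo objects from this tutorial, a call might look like `wait_until_queryable(lambda: collection.list_search_indexes("vector_index"))`.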

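Use the Retriever in Your RAG Pipeline

After MongoDB builds your index, you can run vector search queries on your data and use the retriever in your RAG pipeline. Paste and run the following code in your notebook to implement a sample RAG pipeline that performs parent document retrieval.

Run a vector search query.

To see the most relevant documents for a given query, paste and run the following code to perform a sample vector search query on the collection. The retriever searches for relevant child documents that are semantically similar to the string AI technology, and then returns the corresponding parent documents of those child documents.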

parent_doc_retriever.invoke("AI technology")

[Document(metadata={'_id': '492a138c-1309-4791-a0d0-282d34ea1e55', 'source': 'https://investors.mongodb.com/node/12881/pdf', 'page': 1, 'page_label': '2'}, page_content='downturns and/or the effects of rising interest rates, inflation and volatility in the global economy and financial markets on our business and future\noperating results; our potential failure to meet publicly announced guidance or other expectations about our business and future operating results; our\nlimited operating history; our history of losses; failure of our platform to satisfy customer demands; the effects of increased competition; our\ninvestments in new products and our ability to introduce new features, services or enhancements; our ability to effectively expand our sales and\nmarketing organization; our ability to continue to build and maintain credibility with the developer community; our ability to add new customers or\nincrease sales to our existing customers; our ability to maintain, protect, enforce and enhance our intellectual property; the effects of social, ethical and\nregulatory issues relating to the use of new and evolving technologies, such as artificial intelligence, in our offerings or partnerships; the growth and\nexpansion of the market for database products and our ability to penetrate that market; our ability to integrate acquired businesses and technologies\nsuccessfully or achieve the expected benefits of such acquisitions; our ability to maintain the security of our software and adequately address privacy\nconcerns; our ability to manage our growth effectively and successfully recruit and retain additional highly-qualified personnel; and the price volatility of'),
 Document(metadata={'_id': 'a937204a-0e85-4827-ac63-124735529d51', 'source': 'https://investors.mongodb.com/node/12881/pdf', 'page': 1, 'page_label': '2'}, page_content='that it obtained the AWS Modernization Competency designation and launched a MongoDB University course focused on\nbuilding AI applications with MongoDB  and AWS. At Microsoft Ignite, MongoDB  announced new technology integrations for\nAI, data analytics, and automating database deployments across on-premises, cloud, and edge environments.\nLaunched in July 2024, the MongoDB AI Applications Program (MAAP) is designed to help companies unleash the power\nof their data and to take advantage of rapidly advancing AI technologies. We recently announced that Capgemini,\nConfluent, IBM, Unstructured, and QuantumBlack, AI by McKinsey have joined the MAAP ecosystem, offering customers\nadditional integration and solution options.\nExecutive Leadership Update\nMichael Gordon, MongoDB\'s Chief Operating Officer and Chief Financial Officer, will be stepping down at the end of the Company\'s fiscal year on\nJanuary 31, 2025, and afterwards will serve as an advisor to ensure a smooth transition. The Company has commenced an executive search process\nfor a new CFO and will evaluate internal and external candidates.\xa0 Serge Tanjga, MongoDB\'s Senior Vice President of Finance, will serve as interim\nCFO starting February 1st if a permanent successor has not been named by that date.\nDev Ittycheria commented, "On behalf of everyone at MongoDB , I want to thank Michael for everything he has done to contribute to our success in his\nnearly 10 years with the company.\xa0 In Michael\'s time here, MongoDB  had a successful IPO, has grown revenue nearly 50x and has successfully\nscaled the business model to generate meaningful operating leverage. 
Michael has also built out a world-class finance team that I am confident will\ndeliver a smooth transition to a new CFO in the coming months."\nMichael Gordon said, "I am incredibly proud of what we have accomplished as a team in my almost ten years with the company.\xa0 While we have')]

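To learn more about vector search query examples with LangChain, see Run Vector Search Queries.

Create and run a RAG pipeline.

To create and run a RAG pipeline with the parent document retriever, paste and run the following code. This code does the following:

Defines a LangChain prompt template to instruct the LLM to use the retrieved parent documents as context for your query. LangChain passes these documents to the {context} input variable and your query to the {query} variable.
Constructs a chain that specifies the following:
- The parent document retriever you configured to retrieve relevant parent documents.
- The prompt template that you defined.
- An LLM from OpenAI to generate a context-aware response. By default, this is the gpt-3.5-turbo model.
Prompts the chain with a sample query and returns the response. The generated response might vary.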

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
# Define a prompt template
template = """
   Use the following pieces of context to answer the question at the end.
   {context}
   Question: {query}?
"""
prompt = PromptTemplate.from_template(template)
model = ChatOpenAI()
# Construct a chain to answer questions on your data
chain = (
   {"context": parent_doc_retriever, "query": RunnablePassthrough()}
   | prompt
   | model
   | StrOutputParser()
)
# Prompt the chain
query = "In a list, what are MongoDB's latest AI announcements?"
answer = chain.invoke(query)
print(answer)

1. MongoDB obtained the AWS Modernization Competency designation.
2. MongoDB launched a MongoDB University course focused on building AI applications with MongoDB and AWS.
3. MongoDB announced new technology integrations for AI, data analytics, and automating database deployments across various environments.
4. MongoDB launched the MongoDB AI Applications Program (MAAP) to help companies harness the power of data and future AI technologies.
5. Capgemini, Confluent, IBM, Unstructured, and QuantumBlack joined the MAAP ecosystem to offer customers additional integration and solution options.
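To see what the prompt step contributes, the following plain-Python sketch fills a similar template by hand. It is an illustration, not LangChain's internals: `Doc` is a minimal stand-in for a LangChain Document, and `format_docs` and `build_prompt` are hypothetical helpers that join only the text of each retrieved document.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n"
    "Question: {query}\n"
)


def format_docs(docs):
    """Join the text of the retrieved documents, dropping their metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(docs, query):
    """Fill the template with the joined context and the user's query."""
    return TEMPLATE.format(context=format_docs(docs), query=query)


prompt_text = build_prompt([Doc("First parent."), Doc("Second parent.")], "What changed?")
```

In a LangChain chain, a helper like `format_docs` is commonly piped after the retriever (for example `parent_doc_retriever | format_docs` for the context key) so that the LLM receives clean text rather than the string form of a list of Document objects.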

Learn by Watching

Follow along with this video about parent document retrieval with LangChain and MongoDB.

Duration: 27 minutes
