本页提供有关如何使用大语言模型 (LLM) 从自然语言为数据生成MongoDB查询的指导。
示例,考虑以下自然语言查询在 mongosh
中为Atlas sample_mflix数据库生成的查询:
给定以下自然语言查询:
| 这会生成以下
|
可用方法
除了使用开箱即用的 LLM 之外,您还可以使用MongoDB构建的以下工具从自然语言生成MongoDB查询:
选择模型
在一般任务上表现良好的模型通常在MongoDB查询生成上也表现良好。在选择 LLM 来生成MongoDB查询时,请参考 MMLU-Pro 和 Chatbot Arena ELO 等流行基准来评估模型之间的性能。
有效的提示
本节概述了提示 LLM 生成MongoDB查询的有效策略。
注意
以下提示策略基于MongoDB创建的基准。要学习;了解更多信息,请参阅我们在mongosh
Hugging Face 上对 代码进行自然语言转换的公开基准测试。
基本提示
基本提示符(也称为系统提示符)应提供任务的清晰概述,包括:
要生成的查询类型。
有关预期输出结构的信息,例如执行查询的驾驶员语言或工具。
以下基本提示示例演示了如何为 mongosh
: 生成MongoDB读取操作或聚合:
You are an expert data analyst experienced at using MongoDB. Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query. Format the mongosh query in the following structure: `db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate({/* query */})`
一般指导
为了提高查询质量,请在基本提示中添加以下指导,为模型提供生成有效MongoDB查询的常见提示:
Some general query-authoring tips: 1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate) 2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.) 3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible 4. Include sorting (.sort()) and limiting (.limit()), when appropriate, for result set management 5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays 6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null 7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`. 8. For Decimal128 operations, prefer range queries over exact equality 9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks
思路
您可以在生成响应之前提示模型“大声思考”,以提高响应质量。这种被称为“思路提示链”的技术可以提高性能,但会增加生成时间和成本。
为了鼓励模型在生成查询之前逐步思考,请将以下文本添加到基本提示中:
Think step by step about the code in the answer before providing it. In your thoughts, consider: 1. Which collections are relevant to the query. 2. Which query operation to use (find vs aggregate) and what specific operators ($match, $group, $project, etc.) are needed. 3. What fields are relevant to the query. 4. Which indexes you can use to improve performance. 5. What specific transformations or projections are required. 6. What data types are involved and how to handle them appropriately (ObjectId, Decimal128, Date, etc.). 7. What edge cases to consider (empty results, null values, missing fields). 8. How to handle any array fields that require special operators ($elemMatch, $all, $size). 9. Any other relevant considerations.
包括示例文档
要显着提高查询质量,请从集合中包含一些具有代表性的示例文档。两到三个代表性文档通常为模型提供有关数据结构的足够上下文。
提供示例文档时,请遵循以下准则:
使用BSON.EJSON.serialize()函数将BSON文档转换为提示的EJSON字符串。
截断长字段或深嵌套对象。
排除长字符串值。
对于大型数组(例如向量嵌入),仅包含几个元素。
示例,对于 sample_mflix
数据库和 movies
集合,您可以在提示中包含以下文档:
[ { _id: { $oid: "573a13bbf29313caabd526d0", }, plot: "Van Erp shows us what the Dutch do in their spare time and takes a look at the industry behind all t...", genres: ["Documentary"], runtime: 90, title: "Pretpark Nederland", num_mflix_comments: 0, poster: "https://m.media-amazon.com/images/M/MV5BMTUwNjU0ODg3N15BMl5BanBnXkFtZTcwMzg3NjYxNA@@._V1_SY1000_SX67...", countries: ["Netherlands"], fullplot: "Van Erp displays the mechanics behind the Dutch tourism industry. Key figures behind events and dest...", languages: ["Dutch", "Mandarin"], released: { $date: "2006-10-18T00:00:00.000Z", }, directors: ["Michiel van Erp"], writers: ["Renè van 't Erve (scenario)", "Michiel van Erp (scenario)"], awards: { wins: 0, nominations: 1, text: "1 nomination.", }, lastupdated: "2015-02-26T00:48:24.883Z", year: 2006, imdb: { rating: 7.3, votes: 237, id: 882800, }, type: "movie", tomatoes: { viewer: { rating: 2.2, numReviews: 19, }, dvd: { $date: "2010-06-22T00:00:00.000Z", }, lastUpdated: { $date: "2014-11-24T14:15:50.000Z", }, }, hash: { low: -1866172407, high: -2147460187, unsigned: false, }, }, { _id: { $oid: "573a13caf29313caabd7c4e0", }, fullplot: "A drama centered on a rising country-music songwriter (Hedlund) who sparks with a fallen star (Paltr...", imdb: { rating: 6.3, votes: 14066, id: 1555064, }, year: 2010, plot: "A rising country-music songwriter works with a fallen star to work their way fame, causing romantic ...", genres: ["Drama", "Music"], rated: "PG-13", metacritic: 45, title: "Country Strong", lastupdated: "2015-09-03T00:39:54.710Z", languages: ["English"], writers: ["Shana Feste"], type: "movie", tomatoes: { website: "http://www.countrystrong-movie.com/?hs308=CST6186", viewer: { rating: 3.3, numReviews: 32825, meter: 53, }, dvd: { $date: "2011-04-12T00:00:00.000Z", }, critic: { rating: 4.5, numReviews: 130, meter: 22, }, boxOffice: "$20.2M", consensus: "The cast gives it their all, and Paltrow handles her songs with aplomb, but Country Strong's cliched...", rotten: 101, production: "Screen Gems", lastUpdated: { $date: "2015-08-17T18:04:40.000Z", }, fresh: 29, }, poster: "https://m.media-amazon.com/images/M/MV5BMTUxMjQ0NjE3OV5BMl5BanBnXkFtZTcwODIxNDEwNA@@._V1_SY1000_SX67...", num_mflix_comments: 0, released: { $date: "2011-01-07T00:00:00.000Z", }, awards: { wins: 2, nominations: 6, text: "Nominated for 1 Oscar. Another 1 win & 6 nominations.", }, countries: ["USA"], cast: [ "Gwyneth Paltrow", "Tim McGraw", "Garrett Hedlund", "...and 1 more items", ], }, ];
最佳实践
从自然语言生成MongoDB查询时,针对特定使用案例应用以下提示最佳实践。
包括索引信息
在提示中包含集合索引,以鼓励 LLM 生成性能更高的查询。MongoDB驱动程序和 mongosh
提供了获取索引信息的方法。示例,Node.js驾驶员提供listIndexes() 方法来获取提示的索引。
基于时间的查询
大多数 LLM 工具的系统提示符中都包含日期。但是,如果您使用的是开箱即用的 LLM,则该模型不知道当前日期或时间。因此,在使用基本模型或构建自己的MongoDB工具自然语言时,请在提示中包含最新日期。使用适用于您的编程语言的方法以字符串形式获取当前日期,例如 JavaScript 的 new Date().toString()
或 Python 的 str(datetime.now())
。
带注释的数据库模式
在提示中包含相关数据库集合的带注释模式。虽然没有一种表示方法最适合所有法学硕士,但有些方法比其他方法更有效。
我们建议使用描述数据形状的编程语言原生类型来表示集合,例如Typescript类型、 Python Pydantic 模型或Go结构体。 如果您通过这些语言使用MongoDB ,则可能已经定义了数据形状。为了指南LLM 并减少歧义,请在提示中添加注释以描述每个字段。
以下示例显示了 sample_mflix.movies
集合的Typescript类型:
interface Movie { /** * Unique identifier for the movie document. */ _id: ObjectId; /** * Brief description of the movie's plot. */ plot: string; /** * List of genres associated with the movie. */ genres: string[]; /** * Duration of the movie in minutes. */ runtime: number; /** * Title of the movie. */ title: string; /** * Number of comments on the movie in the mflix system. */ num_mflix_comments: number; /** * URL to the movie's poster image. */ poster: string; /** * List of countries where the movie was produced. */ countries: string[]; /** * Detailed description of the movie's plot. */ fullplot: string; /** * Languages spoken in the movie. */ languages: string[]; /** * Release date of the movie. */ released: Date; /** * List of directors of the movie. */ directors: string[]; /** * List of writers of the movie. */ writers: string[]; /** * Awards received by the movie. */ awards: { /** * Number of awards won by the movie. */ wins: number; /** * Number of award nominations received by the movie. */ nominations: number; /** * Textual description of the awards. */ text: string; }; /** * Last updated timestamp for the movie document. */ lastupdated: string; /** * Year the movie was released. */ year: number; /** * IMDb information for the movie. */ imdb: { /** * IMDb rating of the movie. */ rating: number; /** * Number of votes the movie received on IMDb. */ votes: number; /** * IMDb identifier for the movie. */ id: number; }; /** * Type of the movie (e.g., movie, series). */ type: string; /** * Rotten Tomatoes information for the movie. */ tomatoes: { /** * Viewer ratings on Rotten Tomatoes. */ viewer?: { /** * Viewer rating score. */ rating: number; /** * Number of reviews by viewers. */ numReviews: number; /** * Viewer meter score. */ meter: number; }; /** * DVD release date. */ dvd?: Date; /** * Last updated timestamp for Rotten Tomatoes data. */ lastUpdated?: Date; /** * Official website for the movie. */ website?: string; /** * Critic ratings on Rotten Tomatoes. */ critic?: { /** * Critic rating score. */ rating: number; /** * Number of reviews by critics. */ numReviews: number; /** * Critic meter score. */ meter: number; }; /** * Box office earnings. */ boxOffice?: string; /** * Consensus statement from Rotten Tomatoes. */ consensus?: string; /** * Number of rotten reviews. */ rotten?: number; /** * Production company. */ production?: string; /** * Number of fresh reviews. */ fresh?: number; }; /** * Hash value for the movie document. */ hash: Long; /** * MPAA rating of the movie. */ rated?: string; /** * Metacritic score of the movie. */ metacritic?: number; /** * List of main cast members in the movie. */ cast: string[]; }
Prompt Template
以下示例演示了使用本页描述的从自然语言生成 mongosh
代码的策略的完整提示。
基本提示示例
使用以下系统提示符示例作为MongoDB查询生成任务的模板。示例提示包括以下组件:
任务概述和预期输出格式
MongoDB查询创作一般指导
You are an expert data analyst experienced at using MongoDB. Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query. Format the mongosh query in the following structure: `db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate({/* query */})` Some general query-authoring tips: 1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate). 2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.). 3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible. 4. Include sorting (.sort()) and limiting (.limit()) when appropriate for result set management. 5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays. 6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null. 7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`. Use the provided 'Latest Date' field to inform dates in queries. 8. For Decimal128 operations, prefer range queries over exact equality. 9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks.
注意
您还可以添加思路提示,以鼓励在代码生成之前逐步思考。
用户消息模板
然后,使用以下用户消息模板为模型提供有关数据库和所需查询的必要上下文:
Generate MongoDB Shell (mongosh) queries for the following database and natural language query: ## Database Information Name: {{Database name}} Description: {{database description}} Latest Date: {{latest date}} (use this to inform dates in queries) ### Collections #### Collection `{{collection name. Do for each collection you want to query over}}` Description: {{collection description}} Schema: ``` {{interpreted or annotated schema here}} ``` Example documents: ``` {{truncated example documents here}} ``` Indexes: ``` {{collection index descriptions here}} ``` Natural language query: {{Natural language query here}}