Natural Language to MongoDB Query

This page provides guidance on generating MongoDB queries for your data from natural language using a large language model (LLM).

For example, consider how a natural language query maps to mongosh code for the Atlas sample_mflix database.

Given the following natural language query:

Show me the genres and runtime of
10 movies from 2015 that have
the most comments

This generates the following mongosh code:

db.movies.aggregate([
  {
    $match: {
      year: 2015,
    },
  },
  {
    $sort: {
      num_mflix_comments: -1,
    },
  },
  {
    $limit: 10,
  },
  {
    $project: {
      _id: 0,
      genres: 1,
      runtime: 1,
    },
  },
]);

Available Methods

In addition to using off-the-shelf LLMs, you can use tools built by MongoDB to generate MongoDB queries from natural language.

Model Selection

Models that perform well on general tasks usually also perform well at generating MongoDB queries. When selecting an LLM to generate MongoDB queries, refer to popular benchmarks such as MMLU-Pro and Chatbot Arena ELO to compare performance across models.

Effective Prompting

This section describes effective strategies for prompting an LLM to generate MongoDB queries.

Note

The following prompting strategies are based on benchmarks created by MongoDB. To learn more, see our public natural language to mongosh code benchmark on Hugging Face.

Base Prompt

Your base prompt, also called the system prompt, should give a clear overview of the task, including the following:

The type of query to generate.
Information about the expected output structure, such as the driver language or the tool that runs the query.

The following example base prompt demonstrates how to generate a MongoDB read operation or aggregation for mongosh:

You are an expert data analyst experienced at using MongoDB.
Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.
Format the mongosh query in the following structure:
`db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate({/* query */})`

General Guidance

To improve query quality, add the following guidance to your base prompt to give the model common tips for writing effective MongoDB queries:

Some general query-authoring tips:
1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate)
2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.)
3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible
4. Include sorting (.sort()) and limiting (.limit()), when appropriate, for result set management
5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays
6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null
7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`. 
8. For Decimal128 operations, prefer range queries over exact equality
9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks

Chain of Thought

You can ask the model to "think out loud" before it generates a response, which improves response quality. This technique, called chain-of-thought prompting, improves performance but increases generation time and cost.

To have the model think step by step before generating the query, add the following text to your base prompt:

Think step by step about the code in the answer before providing it. In your thoughts, consider:
1. Which collections are relevant to the query.
2. Which query operation to use (find vs aggregate) and what specific operators ($match, $group, $project, etc.) are needed.
3. What fields are relevant to the query.
4. Which indexes you can use to improve performance.
5. What specific transformations or projections are required.
6. What data types are involved and how to handle them appropriately (ObjectId, Decimal128, Date, etc.).
7. What edge cases to consider (empty results, null values, missing fields).
8. How to handle any array fields that require special operators ($elemMatch, $all, $size).
9. Any other relevant considerations.
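
As a rough sketch of how these pieces might be combined, the following JavaScript assembles a system prompt from a base prompt and an optional chain-of-thought section. The function name `buildSystemPrompt` and both constants are illustrative assumptions, not part of any MongoDB API, and the prompt strings are abbreviated; substitute the full text from this page.

```javascript
// Hypothetical helper: composes a system prompt from the sections on this page.
// The strings are abbreviated here; paste in the full text of each section.
const BASE_PROMPT = [
  "You are an expert data analyst experienced at using MongoDB.",
  "Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.",
].join("\n");

const CHAIN_OF_THOUGHT_PROMPT = [
  "Think step by step about the code in the answer before providing it. In your thoughts, consider:",
  "1. Which collections are relevant to the query.",
  "2. Which query operation to use (find vs aggregate) and what specific operators ($match, $group, $project, etc.) are needed.",
].join("\n");

// Returns the base prompt, followed by the chain-of-thought section when requested.
function buildSystemPrompt({ chainOfThought = false } = {}) {
  const sections = [BASE_PROMPT];
  if (chainOfThought) sections.push(CHAIN_OF_THOUGHT_PROMPT);
  return sections.join("\n\n");
}
```

Keeping chain of thought behind a flag makes it easy to compare quality against the added generation time and cost for your own workload.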

Include Sample Documents

To significantly improve query quality, include a few representative sample documents from your collection. Two or three representative documents usually give the model enough context about the structure of the data.

When providing sample documents, follow these guidelines:

Use the BSON.EJSON.serialize() function to convert BSON documents to Extended JSON (EJSON), then stringify the result for the prompt.
Truncate long fields or deeply nested objects.
Exclude long string values.
For large arrays, such as vector embeddings, include only a few elements.
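
The truncation guidelines above can be sketched as a small recursive function. This is a minimal illustration, not an official utility: the name `truncateForPrompt`, its option names, and the default limits are all assumptions. It expects plain Extended JSON-style objects, such as the output of BSON.EJSON.serialize().

```javascript
// Hypothetical helper for trimming a sample document before adding it to a prompt.
// Input should be plain EJSON-style data (objects, arrays, strings, numbers),
// not raw BSON or Date instances. The limits below are arbitrary defaults.
function truncateForPrompt(value, opts = {}, depth = 0) {
  const { maxString = 100, maxArray = 3, maxDepth = 4 } = opts;

  // Shorten long strings and mark the cut with an ellipsis.
  if (typeof value === "string") {
    return value.length > maxString ? value.slice(0, maxString) + "..." : value;
  }

  if (value !== null && typeof value === "object") {
    // Replace anything nested deeper than maxDepth with a placeholder.
    if (depth >= maxDepth) {
      return Array.isArray(value) ? "[truncated array]" : "[truncated object]";
    }
    // Keep only the first few array elements and note how many were dropped.
    if (Array.isArray(value)) {
      const kept = value
        .slice(0, maxArray)
        .map((item) => truncateForPrompt(item, opts, depth + 1));
      const omitted = value.length - kept.length;
      if (omitted > 0) kept.push(`...and ${omitted} more items`);
      return kept;
    }
    // Recurse into each field of a nested object.
    return Object.fromEntries(
      Object.entries(value).map(([key, fieldValue]) => [
        key,
        truncateForPrompt(fieldValue, opts, depth + 1),
      ])
    );
  }

  return value; // numbers, booleans, null, undefined
}
```

The function returns a new value and leaves its input unchanged, so the same documents can still be used elsewhere in your application.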

Example Sample Documents

Sample documents from the movies collection to include in a prompt

For example, for the sample_mflix database and the movies collection, you might include the following documents in your prompt:

[
  {
    _id: {
      $oid: "573a13bbf29313caabd526d0",
    },
    plot: "Van Erp shows us what the Dutch do in their spare time and takes a look at the industry behind all t...",
    genres: ["Documentary"],
    runtime: 90,
    title: "Pretpark Nederland",
    num_mflix_comments: 0,
    poster:
      "https://m.media-amazon.com/images/M/MV5BMTUwNjU0ODg3N15BMl5BanBnXkFtZTcwMzg3NjYxNA@@._V1_SY1000_SX67...",
    countries: ["Netherlands"],
    fullplot:
      "Van Erp displays the mechanics behind the Dutch tourism industry. Key figures behind events and dest...",
    languages: ["Dutch", "Mandarin"],
    released: {
      $date: "2006-10-18T00:00:00.000Z",
    },
    directors: ["Michiel van Erp"],
    writers: ["Renè van 't Erve (scenario)", "Michiel van Erp (scenario)"],
    awards: {
      wins: 0,
      nominations: 1,
      text: "1 nomination.",
    },
    lastupdated: "2015-02-26T00:48:24.883Z",
    year: 2006,
    imdb: {
      rating: 7.3,
      votes: 237,
      id: 882800,
    },
    type: "movie",
    tomatoes: {
      viewer: {
        rating: 2.2,
        numReviews: 19,
      },
      dvd: {
        $date: "2010-06-22T00:00:00.000Z",
      },
      lastUpdated: {
        $date: "2014-11-24T14:15:50.000Z",
      },
    },
    hash: {
      low: -1866172407,
      high: -2147460187,
      unsigned: false,
    },
  },
  {
    _id: {
      $oid: "573a13caf29313caabd7c4e0",
    },
    fullplot:
      "A drama centered on a rising country-music songwriter (Hedlund) who sparks with a fallen star (Paltr...",
    imdb: {
      rating: 6.3,
      votes: 14066,
      id: 1555064,
    },
    year: 2010,
    plot: "A rising country-music songwriter works with a fallen star to work their way fame, causing romantic ...",
    genres: ["Drama", "Music"],
    rated: "PG-13",
    metacritic: 45,
    title: "Country Strong",
    lastupdated: "2015-09-03T00:39:54.710Z",
    languages: ["English"],
    writers: ["Shana Feste"],
    type: "movie",
    tomatoes: {
      website: "http://www.countrystrong-movie.com/?hs308=CST6186",
      viewer: {
        rating: 3.3,
        numReviews: 32825,
        meter: 53,
      },
      dvd: {
        $date: "2011-04-12T00:00:00.000Z",
      },
      critic: {
        rating: 4.5,
        numReviews: 130,
        meter: 22,
      },
      boxOffice: "$20.2M",
      consensus:
        "The cast gives it their all, and Paltrow handles her songs with aplomb, but Country Strong's cliched...",
      rotten: 101,
      production: "Screen Gems",
      lastUpdated: {
        $date: "2015-08-17T18:04:40.000Z",
      },
      fresh: 29,
    },
    poster:
      "https://m.media-amazon.com/images/M/MV5BMTUxMjQ0NjE3OV5BMl5BanBnXkFtZTcwODIxNDEwNA@@._V1_SY1000_SX67...",
    num_mflix_comments: 0,
    released: {
      $date: "2011-01-07T00:00:00.000Z",
    },
    awards: {
      wins: 2,
      nominations: 6,
      text: "Nominated for 1 Oscar. Another 1 win & 6 nominations.",
    },
    countries: ["USA"],
    cast: [
      "Gwyneth Paltrow",
      "Tim McGraw",
      "Garrett Hedlund",
      "...and 1 more items",
    ],
  },
];

Best Practices

Apply the following prompting best practices for specific use cases when generating MongoDB queries from natural language.

Include Index Information

Include collection indexes in your prompt so that the LLM generates more performant queries. MongoDB drivers and mongosh provide methods to get index information. For example, the Node.js driver provides the listIndexes() method to retrieve indexes for your prompt.
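
One possible way to turn index information into prompt text is sketched below. The helper name `describeIndexes` and the one-line-per-index output format are assumptions made for illustration. The input is assumed to have the shape of the documents returned by the Node.js driver's listIndexes() cursor, which include `name` and `key` fields.

```javascript
// Hypothetical helper: formats index documents as one line of prompt text each.
// `indexes` is assumed to look like the result of
// collection.listIndexes().toArray(), for example:
//   { v: 2, key: { year: 1 }, name: "year_1" }
function describeIndexes(indexes) {
  return indexes
    .map((index) => {
      const flags = index.unique ? " (unique)" : "";
      return `${index.name}: ${JSON.stringify(index.key)}${flags}`;
    })
    .join("\n");
}

// Example usage with the driver (assumes a connected MongoClient named `client`):
//   const indexes = await client
//     .db("sample_mflix")
//     .collection("movies")
//     .listIndexes()
//     .toArray();
//   const indexText = describeIndexes(indexes);
```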

Time-Based Queries

Most LLM tools include the date in their system prompt. However, an off-the-shelf LLM does not know the current date or time. Therefore, when working with base models or building your own natural language to MongoDB tools, include the latest date in your prompt. Use your programming language's method for getting the current date as a string, such as JavaScript's new Date().toString() or Python's str(datetime.now()).
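
A minimal sketch of building the date line for a prompt follows. The helper name `latestDateLine` is hypothetical, and the "Latest Date" wording mirrors the user message template on this page. This sketch uses toISOString() rather than toString() so that the output does not depend on the local time zone; either produces a usable string.

```javascript
// Hypothetical helper: builds the "Latest Date" line for a prompt.
// Accepting `now` as a parameter keeps the function deterministic in tests.
function latestDateLine(now = new Date()) {
  return `Latest Date: ${now.toISOString()} (use this to inform dates in queries)`;
}
```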

Annotated Database Schemas

Include annotated schemas of the relevant database collections in your prompt. While no single representation works best for every LLM, some are more effective than others.

We recommend representing collections using programming language native types that describe the shape of the data, such as TypeScript types, Python Pydantic models, or Go structs. If you use MongoDB from these languages, you likely already have the data shape defined. To guide the LLM and reduce ambiguity, add comments that describe each field.

The following example shows a TypeScript type for the sample_mflix.movies collection:

Example TypeScript Schema

Example annotated TypeScript schema for the sample_mflix.movies collection

import { ObjectId, Long } from "mongodb";

interface Movie {
  /**
   * Unique identifier for the movie document.
   */
  _id: ObjectId;
  /**
   * Brief description of the movie's plot.
   */
  plot: string;
  /**
   * List of genres associated with the movie.
   */
  genres: string[];
  /**
   * Duration of the movie in minutes.
   */
  runtime: number;
  /**
   * Title of the movie.
   */
  title: string;
  /**
   * Number of comments on the movie in the mflix system.
   */
  num_mflix_comments: number;
  /**
   * URL to the movie's poster image.
   */
  poster: string;
  /**
   * List of countries where the movie was produced.
   */
  countries: string[];
  /**
   * Detailed description of the movie's plot.
   */
  fullplot: string;
  /**
   * Languages spoken in the movie.
   */
  languages: string[];
  /**
   * Release date of the movie.
   */
  released: Date;
  /**
   * List of directors of the movie.
   */
  directors?: string[];
  /**
   * List of writers of the movie.
   */
  writers: string[];
  /**
   * Awards received by the movie.
   */
  awards: {
    /**
     * Number of awards won by the movie.
     */
    wins: number;
    /**
     * Number of award nominations received by the movie.
     */
    nominations: number;
    /**
     * Textual description of the awards.
     */
    text: string;
  };
  /**
   * Last updated timestamp for the movie document.
   */
  lastupdated: string;
  /**
   * Year the movie was released.
   */
  year: number;
  /**
   * IMDb information for the movie.
   */
  imdb: {
    /**
     * IMDb rating of the movie.
     */
    rating: number;
    /**
     * Number of votes the movie received on IMDb.
     */
    votes: number;
    /**
     * IMDb identifier for the movie.
     */
    id: number;
  };
  /**
   * Type of the movie (e.g., movie, series).
   */
  type: string;
  /**
   * Rotten Tomatoes information for the movie.
   */
  tomatoes: {
    /**
     * Viewer ratings on Rotten Tomatoes.
     */
    viewer?: {
      /**
       * Viewer rating score.
       */
      rating: number;
      /**
       * Number of reviews by viewers.
       */
      numReviews: number;
      /**
       * Viewer meter score.
       */
      meter?: number;
    };
    /**
     * DVD release date.
     */
    dvd?: Date;
    /**
     * Last updated timestamp for Rotten Tomatoes data.
     */
    lastUpdated?: Date;
    /**
     * Official website for the movie.
     */
    website?: string;
    /**
     * Critic ratings on Rotten Tomatoes.
     */
    critic?: {
      /**
       * Critic rating score.
       */
      rating: number;
      /**
       * Number of reviews by critics.
       */
      numReviews: number;
      /**
       * Critic meter score.
       */
      meter: number;
    };
    /**
     * Box office earnings.
     */
    boxOffice?: string;
    /**
     * Consensus statement from Rotten Tomatoes.
     */
    consensus?: string;
    /**
     * Number of rotten reviews.
     */
    rotten?: number;
    /**
     * Production company.
     */
    production?: string;
    /**
     * Number of fresh reviews.
     */
    fresh?: number;
  };
  /**
   * Hash value for the movie document.
   */
  hash?: Long;
  /**
   * MPAA rating of the movie.
   */
  rated?: string;
  /**
   * Metacritic score of the movie.
   */
  metacritic?: number;
  /**
   * List of main cast members in the movie.
   */
  cast?: string[];
}

Prompt Template

The following example demonstrates a complete prompt that uses the strategies described on this page to generate mongosh code from natural language.

Example Base Prompt

Use the following example system prompt as a template for your MongoDB query generation tasks. The sample prompt includes the following components:

Task overview and expected output format
General guidance for writing MongoDB queries

You are an expert data analyst experienced at using MongoDB.
Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.
Format the mongosh query in the following structure:
`db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate({/* query */})`
Some general query-authoring tips:
1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate).
2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.).
3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible.
4. Include sorting (.sort()) and limiting (.limit()) when appropriate for result set management.
5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays.
6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null.
7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`. Use the provided 'Latest Date' field to inform dates in queries.
8. For Decimal128 operations, prefer range queries over exact equality.
9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks.

Note

You can also add the chain-of-thought prompt to encourage step-by-step thinking before code generation.

User Message Template

Then, use the following user message template to give the model the necessary context about the database and the desired query:

Generate MongoDB Shell (mongosh) queries for the following database and natural language query:
## Database Information
Name: {{Database name}}
Description: {{database description}}
Latest Date: {{latest date}} (use this to inform dates in queries)
### Collections
#### Collection `{{collection name. Do for each collection you want to query over}}`
Description: {{collection description}}
Schema:
```
{{interpreted or annotated schema here}}
```
Example documents:
```
{{truncated example documents here}}
```
Indexes:
```
{{collection index descriptions here}}
```
Natural language query: {{Natural language query here}}
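
As one way to fill in this template programmatically, the sketch below renders the user message from plain strings. The function name `buildUserMessage` and every parameter name are assumptions made for illustration. The schema, example documents, and index text are expected to be prepared beforehand as strings.

```javascript
// Hypothetical helper: renders the user message template for one or more collections.
// The triple-backtick fence is built with repeat() only to keep this example readable.
const FENCE = "`".repeat(3);

function buildUserMessage({
  databaseName,
  databaseDescription,
  latestDate,
  collections,
  naturalLanguageQuery,
}) {
  // Wraps a block of text in a fenced section, as the template does.
  const fenced = (text) => `${FENCE}\n${text}\n${FENCE}`;

  return [
    "Generate MongoDB Shell (mongosh) queries for the following database and natural language query:",
    "## Database Information",
    `Name: ${databaseName}`,
    `Description: ${databaseDescription}`,
    `Latest Date: ${latestDate} (use this to inform dates in queries)`,
    "### Collections",
    // Repeat the collection section for each collection you want to query over.
    ...collections.flatMap((collection) => [
      `#### Collection \`${collection.name}\``,
      `Description: ${collection.description}`,
      "Schema:",
      fenced(collection.schema),
      "Example documents:",
      fenced(collection.exampleDocuments),
      "Indexes:",
      fenced(collection.indexes),
    ]),
    `Natural language query: ${naturalLanguageQuery}`,
  ].join("\n");
}
```

Send the rendered string as the user message, with the base prompt from this page as the system prompt.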
