How to $match before $search

I’m currently using MongoDB’s $search stage with the knnBeta operator to run a k-nearest neighbours search that retrieves the 10 most similar text documents based on their egVector field. Then I apply a $match stage to filter the texts by a specific file_name, “Test.txt”, and finally a $project stage to return only the information I need. Here’s my current query:

let text = await Text.aggregate([
    {
      $search: {
        index: "default",
        knnBeta: {
          vector: resp.data.data[0].embedding,
          path: "egVector",
          k: 10,
        },
      },
    },
    {
      $match: {
        file_name: "Test.txt",
      },
    },
    {
      $project: {
        egVector: 0,
      },
    },
])

The issue I’m running into is that if a “Test.txt” document isn’t among the initial 10 documents retrieved by $search, it’s never considered by my query, even though it exists in my database. This happens when “Test.txt” would only be part of the top-k results with a larger k (like k=20). However, I’m only interested in the top 10 results for this specific file name.

As such, I’m trying to figure out how to apply a $match filter on file_name before running $search, so that only the documents where file_name equals “Test.txt” are considered. However, I have found out that $search needs to be the first stage in a MongoDB aggregation pipeline. Given this, how can I modify my query so that it returns the top 10 most similar documents (based on their egVector field) where file_name is equal to “Test.txt”? Is there an alternative approach to this problem? Any help would be much appreciated!
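
The difference between filtering after and before the top-k step can be illustrated with a small in-memory sketch. This is not MongoDB code: the documents, the precomputed `score` field, and the `topK` helper are all hypothetical and only stand in for the similarity ranking that $search performs.

```javascript
// Toy data: "score" stands in for vector similarity (hypothetical values).
const docs = [
  { file_name: "A.txt", score: 0.9 },
  { file_name: "B.txt", score: 0.8 },
  { file_name: "Test.txt", score: 0.7 },
  { file_name: "C.txt", score: 0.6 },
  { file_name: "Test.txt", score: 0.5 },
];

// Hypothetical helper: return the k highest-scoring items without mutating the input.
function topK(items, k) {
  return [...items].sort((a, b) => b.score - a.score).slice(0, k);
}

const isTestFile = (doc) => doc.file_name === "Test.txt";

// Post-filtering: take the top k first, then match on file_name.
const postFiltered = topK(docs, 2).filter(isTestFile);

// Pre-filtering: match on file_name first, then take the top k.
const preFiltered = topK(docs.filter(isTestFile), 2);

console.log(postFiltered.length); // 0
console.log(preFiltered.length);  // 2
```

With k = 2, post-filtering returns nothing because neither of the two best-scoring documents is “Test.txt”, while pre-filtering returns both “Test.txt” documents.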

Hi @Josh_Sang_Hoon_Cho - Welcome to the community.

Thanks for providing the $search query you’ve attempted initially.

Have you tried using the filter option noted in the knnBeta operator documentation? There’s a filter example in the documentation too. As per the documentation for the example, I believe it somewhat matches what you’ve described for what you are expecting:

The following query filters the documents for cheese produced before or in the year 2021, then searches the egVector field in the filtered documents for vector dimensions, and requests up to 3 nearest neighbors in the results.

i.e. Filtering first then performing the vector search.

You could try it with the text or phrase operator but let me know if those do not work for you.
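
As a rough sketch of that suggestion, the filter can be placed inside knnBeta using the phrase operator. The `buildFilteredKnnPipeline` helper and the sample vector are hypothetical and only build the pipeline so its shape can be inspected; the index name and field names are taken from the original query. Whether phrase matches the file name exactly depends on how file_name is indexed and analyzed, so treat this as a starting point rather than a confirmed solution.

```javascript
// Hypothetical helper: builds the aggregation pipeline without running it.
function buildFilteredKnnPipeline(queryVector, fileName, k = 10) {
  return [
    {
      $search: {
        index: "default",
        knnBeta: {
          vector: queryVector,
          path: "egVector",
          k,
          // Restrict the candidate documents before nearest-neighbour ranking.
          filter: {
            phrase: { query: fileName, path: "file_name" },
          },
        },
      },
    },
    // Drop the large vector field from the returned documents.
    { $project: { egVector: 0 } },
  ];
}

// Placeholder vector; in the original query this is resp.data.data[0].embedding.
const pipeline = buildFilteredKnnPipeline([0.1, 0.2, 0.3], "Test.txt");
console.log(JSON.stringify(pipeline, null, 2));
```

The resulting array could then be passed to `Text.aggregate(pipeline)` in place of the original stages, with no separate $match stage.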

Look forward to hearing from you.

Regards,
Jason

    let text = await Text.aggregate([
      {
        $search: {
          index: "default",
          knnBeta: {
            vector: resp.data.data[0].embedding,
            path: "egVector",
            k: 10,
            filter: {
              regex: {
                query: "TEST_FILE_NAME.txt",
                path: "file_name",
                allowAnalyzedField: true,
              },
            },
          },
        },
      },
      {
        $project: {
          egVector: 0,
        },
      },
    ]);

Good, I got it to work like this. Thank you very much!

Thanks for posting your updated aggregation with the filter option used. Glad to hear it works for you.
