knnBeta search with filter/must on array

Zacheusz_Siedlecki · October 8, 2023, 1:44am

I’m using the sample_mflix.movies collection from the Sample Mflix Dataset with knnBeta and a knnVector index. First, I followed this tutorial on semantic search, and as a next step, I want to filter the movies by the genres array field before doing the semantic search.

(In the following queries, {{queryEmbedding}} is the embedding array. It’s an environment variable in Postman.)

The regular vector search works fine:

{
  "collection": "movies",
  "database": "sample_mflix",
  "dataSource": "Cluster0",
  "pipeline": [
    {
      "$search": {
        "index": "vector_01",
        "knnBeta": {
          "vector": {{queryEmbedding}},
          "path": "plot_embedding",
          "k": 5
        }
      }
    },
    {
      "$set": {
        "score": {
          "$meta": "searchScore"
        }
      }
    },
    {
      "$project": {
        "plot_embedding": 0
      }
    }
  ]
}

Some of the documents have Comedy in their genres array.
When I try to filter on the genres field, I get an empty result. This query doesn’t work, and I don’t know why:

{
  "collection": "movies",
  "database": "sample_mflix",
  "dataSource": "Cluster0",
  "pipeline": [
    {
      "$search": {
        "index": "vector_01",
        "knnBeta": {
          "vector": {{queryEmbedding}},
          "path": "plot_embedding",
          "k": 5,
          "filter": {
            "in": {
              "path": "genres",
              "value": [
                "Comedy"
              ]
            }
          }
        }
      }
    },
    {
      "$set": {
        "score": {
          "$meta": "searchScore"
        }
      }
    },
    {
      "$project": {
        "plot_embedding": 0
      }
    }
  ]
}

Interestingly, the text filter works, so I can filter on the rated field:

{
  "collection": "movies",
  "database": "sample_mflix",
  "dataSource": "Cluster0",
  "pipeline": [
    {
      "$search": {
        "index": "vector_01",
        "knnBeta": {
          "vector": {{queryEmbedding}},
          "path": "plot_embedding",
          "k": 5,
          "filter": {
            "text": {
              "path": "rated",
              "query": "PASSED"
            }
          }
        }
      }
    },
    {
      "$set": {
        "score": {
          "$meta": "searchScore"
        }
      }
    },
    {
      "$project": {
        "plot_embedding": 0
      }
    }
  ]
}

Additionally, I tried modifying the index to include the genres field. This is the definition of the vector_01 index:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "genres": {
        "analyzer": "lucene.keyword",
        "type": "string"
      },
      "plot_embedding": {
        "dimensions": 1536,
        "similarity": "cosine",
        "type": "knnVector"
      }
    }
  },
  "storedSource": {
    "include": [
      "title",
      "plot",
      "genres"
    ]
  }
}

I tried both a single filter, as in the examples above, and a must clause in a compound filter. The results are the same.

How can I filter on arrays while using knnBeta at the same time?

Zacheusz_Siedlecki · October 8, 2023, 11:45am

It appears that text and phrase filters on the array return the expected results, but how can I filter using multiple values? Do I have to use a compound filter with multiple must statements?

This is a query for a single value (works fine):

{
  "collection": "movies",
  "database": "sample_mflix",
  "dataSource": "Cluster0",
  "pipeline": [
    {
      "$search": {
        "index": "vector_01",
        "knnBeta": {
          "vector": {{queryEmbedding}},
          "path": "plot_embedding",
          "k": 5,
          "filter": {
            "text": {
              "query": "Comedy",
              "path": "genres"
            }
          }
        }
      }
    },
    {
      "$set": {
        "score": {
          "$meta": "searchScore"
        }
      }
    },
    {
      "$project": {
        "plot_embedding": 0
      }
    }
  ]
}

And this is a compound filter with multiple values (it also works):

{
  "collection": "movies",
  "database": "sample_mflix",
  "dataSource": "Cluster0",
  "pipeline": [
    {
      "$search": {
        "index": "vector_01",
        "knnBeta": {
          "vector":  {{queryEmbedding}},
          "path": "plot_embedding",
          "k": 5,
          "filter": {
            "compound": {
              "must": [
                {
                  "text": {
                    "query": "Comedy",
                    "path": "genres"
                  }
                },
                {
                  "text": {
                    "query": "Drama",
                    "path": "genres"
                  }
                }
              ]
            }
          }
        }
      }
    },
    {
      "$set": {
        "score": {
          "$meta": "searchScore"
        }
      }
    },
    {
      "$project": {
        "plot_embedding": 0
      }
    }
  ]
}

The phrase filter works well too, and I suppose that’s a better choice:

{
  "collection": "movies",
  "database": "sample_mflix",
  "dataSource": "Cluster0",
  "pipeline": [
    {
      "$search": {
        "index": "vector_01",
        "knnBeta": {
          "vector":  {{queryEmbedding}},
          "path": "plot_embedding",
          "k": 5,
          "filter": {
            "compound": {
              "must": [
                {
                  "phrase": {
                    "query": "Comedy",
                    "path": "genres"
                  }
                },
                {
                  "phrase": {
                    "query": "Drama",
                    "path": "genres"
                  }
                }
              ]
            }
          }
        }
      }
    },
    {
      "$set": {
        "score": {
          "$meta": "searchScore"
        }
      }
    },
    {
      "$project": {
        "plot_embedding": 0
      }
    }
  ]
}

Is this the right approach, or should I use the in operator for filtering an array?
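One possible explanation for the empty result with in, not confirmed anywhere in this thread: to my knowledge, the Atlas Search in operator only matches string values on fields indexed with the token type, while genres above is mapped as string with lucene.keyword. The sketch below assumes that reading is right and shows a modified index definition (genres mapped as both string and token, so the text and phrase filters keep working) plus the matching filter. The helper name in_filter is made up for illustration.

```python
import json
from typing import Dict, List

# ASSUMPTION: the "in" operator needs genres indexed as "token".
# Mapping the field with two types keeps the existing string mapping as well.
index_definition: Dict = {
    "mappings": {
        "dynamic": True,
        "fields": {
            "genres": [
                {"type": "string", "analyzer": "lucene.keyword"},
                {"type": "token"},
            ],
            "plot_embedding": {
                "type": "knnVector",
                "dimensions": 1536,
                "similarity": "cosine",
            },
        },
    }
}


def in_filter(values: List[str]) -> Dict:
    """Hypothetical helper: build an "in" filter on the genres field."""
    return {"in": {"path": "genres", "value": list(values)}}


if __name__ == "__main__":
    # Print both fragments as JSON, ready to paste into an index editor or request body.
    print(json.dumps(index_definition, indent=2))
    print(json.dumps(in_filter(["Comedy", "Drama"]), indent=2))
```

If this works as assumed, the in filter would match documents whose genres array contains any of the listed values, which differs from the must examples above that require all of them.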

Aasawari · October 9, 2023, 9:33am

Hi @Zacheusz_Siedlecki and welcome to the MongoDB community forums!

As you mentioned in the post above, if using “phrase” solves the issue, returns the documents you are looking for, and does not cause performance issues, you can continue to use it.
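If the list of genres varies per request, the compound filter can be generated rather than written by hand. The following is only a sketch: the function names genre_filter and knn_search_stage are hypothetical, and the "any genre" variant assumes that a compound filter with should clauses and minimumShouldMatch set to 1 behaves as an OR inside the knnBeta filter.

```python
from typing import Dict, List


def genre_filter(genres: List[str], match_all: bool = True) -> Dict:
    """Build a compound filter with one phrase clause per genre.

    match_all=True  -> every genre is required (clauses go under "must").
    match_all=False -> any single genre is enough (clauses go under "should").
    """
    clauses = [{"phrase": {"query": g, "path": "genres"}} for g in genres]
    if match_all:
        return {"compound": {"must": clauses}}
    return {"compound": {"should": clauses, "minimumShouldMatch": 1}}


def knn_search_stage(query_embedding: List[float], genres: List[str],
                     match_all: bool = True, k: int = 5) -> Dict:
    """Wrap the filter in the same $search/knnBeta stage used in the thread."""
    return {
        "$search": {
            "index": "vector_01",
            "knnBeta": {
                "vector": query_embedding,
                "path": "plot_embedding",
                "k": k,
                "filter": genre_filter(genres, match_all),
            },
        }
    }
```

Calling genre_filter(["Comedy", "Drama"]) reproduces the filter from the phrase query above; passing match_all=False switches it to matching either genre.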

If you do not wish to use a compound filter with multiple must statements, you can use the $vectorSearch aggregation pipeline stage.
This stage allows you to query the indexed vector data in your Atlas cluster. Through the stage's filter option, you can also use comparison operators and aggregation pipeline operators to pre-filter the data that you perform the semantic search on.
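As a rough illustration, the pre-filter is written inside the $vectorSearch stage itself, so the stage can still be the first one in the pipeline. The sketch below makes several assumptions: the index name vector_index is hypothetical, genres must be indexed as a filter field in that index, and the $in operator must be supported in the filter on your Atlas version. The function name vector_search_pipeline is also made up for this example.

```python
from typing import Dict, List


def vector_search_pipeline(query_embedding: List[float], genres: List[str],
                           limit: int = 5, num_candidates: int = 100) -> List[Dict]:
    """Build a pipeline whose first stage is $vectorSearch with a pre-filter."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",  # hypothetical index name
                "path": "plot_embedding",
                "queryVector": query_embedding,
                "numCandidates": num_candidates,
                "limit": limit,
                # ASSUMPTION: genres is indexed as a filter field and $in is supported.
                "filter": {"genres": {"$in": genres}},
            }
        },
        # $vectorSearch exposes its score under "vectorSearchScore".
        {"$set": {"score": {"$meta": "vectorSearchScore"}}},
        {"$project": {"plot_embedding": 0}},
    ]
```

The returned list can be used as the "pipeline" value in the same request body as the queries above, or passed to a driver's aggregate call.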

Please note that vector search is in public preview; therefore, it is not recommended for production deployments and is subject to change in the future.

Please reach out in case of any questions.

Warm regards
Aasawari

Zacheusz_Siedlecki · October 23, 2023, 10:04am

@Aasawari, thank you. How can I use an aggregation pipeline to pre-filter the data before $vectorSearch? The documentation says that $vectorSearch must be the first stage of any pipeline where it appears.