Vector search pre-filter using the in clause

Giovanni_Birkelund · December 27, 2023, 12:05am

I have the following pipeline that I’m trying to execute, but it isn’t working.

My idea: I want to filter on a certain field (userId), which works correctly, and then further filter on the fileName field, where fileName (a string) must be in the model_documents list (a list of strings). I can’t find documentation on how to properly use the in clause for the knnBeta filter.

Here is the error I am receiving:

pymongo.errors.OperationFailure: This analyzer is expected to produce exactly one token, but got many, full error: {'ok': 0.0, 'errmsg': 'This analyzer is expected to produce exactly one token, but got many', 'code': 8, 'codeName': 'UnknownError', '$clusterTime': {'clusterTime': Timestamp(1703635448, 7), 'signature': {'hash': b'R\xff\xe5jJw\xb1\xca \xf5;\x1b\x97A\xbbt\xaf\xa2\xaf^', 'keyId': 7274371884303515650}}, 'operationTime': Timestamp(1703635448, 7)}

similar_docs = document_collection.aggregate([
            {
                "$search": {
                    "index": 'default',
                    "knnBeta": {
                        "vector": input_embedding,
                        "path": 'embedding',
                        "k": top_k,
                        "filter": {
                            "compound": {
                                "must": {
                                    "text": {
                                        "path": "userId",
                                        "query": user_id
                                    }
                                },
                                "must": [
                                    {
                                         "in": {
                                            "path": "fileName",
                                            "value": model_documents,
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        ])

I would appreciate any help forming this correctly, thanks.
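A side note on the pipeline above, separate from the analyzer error: the compound filter is a Python dict literal with two "must" keys. Python keeps only the last value for a repeated key, so the userId clause never reaches the server. The sketch below illustrates this with hypothetical placeholder values for user_id and model_documents, and shows both clauses merged into a single "must" list. It is not claimed to resolve the analyzer error itself.

```python
# Hypothetical stand-ins for the values used in the pipeline above.
user_id = "user-123"
model_documents = ["a.pdf", "b.pdf"]

# Two "must" keys in one dict literal: Python silently keeps only the last.
duplicated = {
    "must": {"text": {"path": "userId", "query": user_id}},
    "must": [{"in": {"path": "fileName", "value": model_documents}}],
}

# A single "must" list that carries both clauses.
merged = {
    "must": [
        {"text": {"path": "userId", "query": user_id}},
        {"in": {"path": "fileName", "value": model_documents}},
    ]
}

print(duplicated["must"])   # only the fileName clause survives
print(len(merged["must"]))  # both clauses are present
```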

Benjamin_Flast · December 28, 2023, 3:06am

Hey @Giovanni_Birkelund, thanks for the question!

Is there any reason that you can’t use the new “Vector Search Index” with the $vectorSearch aggregation stage?

The reason I ask is that it should be much more concise to implement the logic you’ve got here in the MQL Syntax we have for the $vectorSearch stage.

It would look something like this:

{
    "$vectorSearch": {
      "index": "default",
      "path": "embedding",
      "filter": {
        "$and": [{
          "userId": {
            "$eq": user_id
          },
          "fileName": {
            "$in": model_documents
          }
        }]
      },
      "queryVector": input_embedding,
      "numCandidates": 150,
      "limit": 10
    }
  }

The way to set up this new index definition is captured in the documentation here:

Giovanni_Birkelund · January 5, 2024, 6:13am

Hey Benjamin, thanks for the reply.

I did try to play around with the new documentation earlier, but couldn’t set it up properly with the $and operator.

I tried the code you pasted, but I get an error related to the $and operator. Possibly a syntax issue? Surely the $and operator takes more than one argument.

pymongo.errors.OperationFailure: PlanExecutor error during aggregation :: caused by :: "filter.$and[0]" more than 1 filter, full error: {'ok': 0.0, 'errmsg': 'PlanExecutor error during aggregation :: caused by :: "filter.$and[0]" more than 1 filter', 'code': 8, 'codeName': 'UnknownError', '$clusterTime': {'clusterTime': Timestamp(1704434548, 13), 'signature': {'hash': b'\xda\x9f\xe2\xefS\x9e\xd1hy\x00\x14K\xe0\xc7\xf0\xef/\x1f\x12\xd7', 'keyId': 7274371884303515650}}, 'operationTime': Timestamp(1704434548, 13)}

I tried removing one of the operands in the $and operator, as the error suggests, just to see if it works with a single condition, like this:

similar_docs = document_collection.aggregate([
        {
            "$vectorSearch": {
            "index": "default",
            "path": "embedding",
            "filter": {
                "$and": [{
                # "userId": {
                #     "$eq": user_id
                # },
                "fileName": {
                    "$in": model_documents
                }
                }]
            },
            "queryVector": input_embedding,
            "numCandidates": 150,
            "limit": top_k,
            }
        }
    ])

but I get an error saying that the field fileName needs to be indexed as a token. Is this something I must do for all fields I want to filter on in vector search?

pymongo.errors.OperationFailure: PlanExecutor error during aggregation :: caused by :: Path 'fileName' needs to be indexed as token, full error: {'ok': 0.0, 'errmsg': "PlanExecutor error during aggregation :: caused by :: Path 'fileName' needs to be indexed as token", 'code': 8, 'codeName': 'UnknownError', '$clusterTime': {'clusterTime': Timestamp(1704434725, 8), 'signature': {'hash': b'\xb7P#\x8a\x1c\x19&h\xd1\xfd\xdem\xcaZ\x16\x98\xd2\x8c\xbe\xc7', 'keyId': 7274371884303515650}}, 'operationTime': Timestamp(1704434725, 8)}

Apologies for all the questions, I’m new to MongoDB. I have searched the docs but can’t find a clear answer. All I want to do is apply multiple filters before the cosine similarity lookup…
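For reference, the "needs to be indexed as token" error points at the index definition rather than the query. A minimal sketch of what such a mapping might look like is below. It assumes the "default" index is an Atlas Search index that maps embedding as knnVector. The 1536 dimensions and cosine similarity are placeholders, not values taken from this thread. Dedicated vector search indexes declare filter fields differently, so the official docs should be checked for the index type in use.

```python
import json

# Assumption: 1536 dimensions and cosine similarity are placeholders;
# substitute the values that match your embedding model.
index_definition = {
    "mappings": {
        "dynamic": True,
        "fields": {
            "embedding": {
                "type": "knnVector",
                "dimensions": 1536,
                "similarity": "cosine",
            },
            # Each field used in the pre-filter is mapped as a token.
            "userId": {"type": "token"},
            "fileName": {"type": "token"},
        },
    }
}

# Print JSON that could be pasted into an index editor.
print(json.dumps(index_definition, indent=2))
```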

Giovanni_Birkelund · January 5, 2024, 6:36am

I was wondering if maybe I’m not using the latest version, but I was already on the latest pymongo, pymongo==4.6.1.

I see that my cluster’s MongoDB version (6.0) is not the latest, and it says I must use 6.0.11 to use the new vector search. If you think this may be the issue, is it possible to upgrade an existing cluster to the latest version? I’m using the free tier.

Giovanni_Birkelund · January 6, 2024, 5:59am

I was eventually able to solve this myself. I had to index the fields as tokens and update the syntax according to the error message. The syntax suggested in the documentation is slightly incorrect; each condition has to be its own object inside the $and array, like this:

            {
                "$vectorSearch": {
                "index": "default",
                "path": "embedding",
                "filter": {
                    "$and": [
                        {
                            "userId": {
                                "$eq": user_id
                            },
                        },
                        {
                            "fileName": {
                                "$in": model_documents
                            }
                        }
                    ]
                },
                "queryVector": input_embedding,
                "numCandidates": 150,
                "limit": top_k,
                }
            }
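The working stage above can be wrapped in a small helper so that each filter condition always ends up as its own object inside $and. This is only a sketch: the function name build_vector_search_stage and the sample values are hypothetical, and the numCandidates check reflects my understanding that it must not be smaller than limit.

```python
from typing import Any, Dict, List, Sequence


def build_vector_search_stage(
    user_id: str,
    model_documents: Sequence[str],
    input_embedding: Sequence[float],
    top_k: int,
    num_candidates: int = 150,
    index: str = "default",
    path: str = "embedding",
) -> Dict[str, Any]:
    """Build a $vectorSearch stage with one condition per $and element."""
    if num_candidates < top_k:
        # Assumption: numCandidates may not be smaller than limit.
        raise ValueError("num_candidates must be >= top_k")

    conditions: List[Dict[str, Any]] = [
        {"userId": {"$eq": user_id}},
        {"fileName": {"$in": list(model_documents)}},
    ]
    return {
        "$vectorSearch": {
            "index": index,
            "path": path,
            "filter": {"$and": conditions},
            "queryVector": list(input_embedding),
            "numCandidates": num_candidates,
            "limit": top_k,
        }
    }


# Hypothetical sample values, for illustration only.
stage = build_vector_search_stage(
    "user-123", ["a.pdf", "b.pdf"], [0.1, 0.2, 0.3], top_k=5
)
print(stage["$vectorSearch"]["filter"])
```

With a real collection, the stage would be passed as document_collection.aggregate([stage]).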
