Filter in knnBeta $search

Hi, I'm currently trying to run a $search pipeline on a collection with vector embeddings. The collection has multiple documents with an item id, and I'm trying to make sure the search runs only on a specific item id.
I'm using the following pipeline:

{
    '$search': {
        'index': search_index,
        'knnBeta': {
            'vector': vector,
            'path': embedding_path,
            'k': k,
            'filter': {
                'compound': {
                    'must': [{
                        'text': {
                            'path': item_id_path,
                            'query': item_id
                        }
                    }]
                }
            }
        }
    }
}

This pipeline returns an empty response. Without the filter I get k items that match the vector, but they come from a few different item ids. I'm not sure why the filter causes the pipeline to return nothing, since the id is present in the collection.

Hi @Guy_Machat,

To better assist you here, can you provide the following information:

  1. Output from the $search without filter
  2. Sample documents
  3. Expected output using filter based off the sample documents in 2.
  4. The search index definition

I have some ideas why it may be returning nothing, but it's difficult to say without the above information.

Look forward to hearing from you.

Regards,
Jason

Hi @Jason_Tran, thanks for reaching out.

Let's assume the following documents:

doc #1:

item_id: 9f41e31c-882f-42ef-add4-18688e810e01
embeddings: <some array>
text: foo

doc #2:

item_id: 6f539716-00f0-42ea-b4af-fdf1db09183e
embeddings: <some array>
text: foo

doc #3:

item_id: 9f41e31c-882f-42ef-add4-18688e810e01
embeddings: <some array 2>
text: foooo

Some notes: item_id is a string representation of a UUID v4, all embeddings are the same length (384 in this example), and they are all indexed in the same search index.

For k=2, where the input is the embedding for "foo", I get docs #1 and #2 as expected.
However, I would like to get docs #1 and #3, since they both have the same item_id, but adding the filter returns nothing.

The mapping definition looks something like this:

{
  "mappings": {
    "fields": {
      "embeddings": [
        {
          "dimensions": 384,
          "similarity": "cosine",
          "type": "knnVector"
        }
      ]
    }
  }
}

Thanks for providing those details @Guy_Machat,

As a note for future posts, it would be easier for users (including myself) to troubleshoot if the documents and any code snippets were copy-and-paste-able, in a valid format, and contained the actual values that produce the behaviour described. I have tested with those documents but had to guess the array values. That said, I believe you may be receiving nothing in return because of the index definition, which does not include the item_id field used in the filter.

Can you try with the following? You may need to wait a few minutes after saving the changes before running the $search query. You will also need to alter the dimensions value, as I've changed it to match the documents in my test environment:

{
  "mappings": {
    "fields": {
      "embeddings": [
        {
          "dimensions": 4,
          "similarity": "cosine",
          "type": "knnVector"
        }
      ],
      "item_id": {
        "type": "string"
      }
    }
  }
}
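For anyone scripting this, the definition above can also be generated from Python. This is only a minimal sketch: `build_index_definition` and its parameter names are made up for illustration and are not part of any MongoDB driver. It assumes the index uses static mappings (no `dynamic: true`), in which case a field that is not listed under `fields`, such as `item_id` in the original definition, is not indexed and a filter on it would match nothing.

```python
import json


def build_index_definition(vector_path, dimensions, filter_path, similarity="cosine"):
    """Return an Atlas Search index definition as a plain dict.

    Hypothetical helper: it only assembles the JSON shown above and
    does not talk to Atlas.
    """
    return {
        "mappings": {
            "fields": {
                # Vector field queried by knnBeta.
                vector_path: [
                    {
                        "dimensions": dimensions,
                        "similarity": similarity,
                        "type": "knnVector",
                    }
                ],
                # Field used inside 'filter'; it has to be indexed as well.
                filter_path: {"type": "string"},
            }
        }
    }


# 384 matches the embedding length mentioned earlier in the thread.
definition = build_index_definition("embeddings", 384, "item_id")
print(json.dumps(definition, indent=2))
```

The printed JSON can be pasted into the index editor, with `dimensions` set to whatever length your embeddings actually have.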

For reference, in my test environment with the below sample documents:

db.vectors.find({},{_id:0})
[
  {
    item_id: '9f41e31c-882f-42ef-add4-18688e810e01',
    embeddings: [ -0.01, -0.02, -0.03, -0.04 ],
    text: 'foo'
  },
  {
    item_id: '6f539716-00f0-42ea-b4af-fdf1db09183e',
    embeddings: [ -0.01, -0.02, -0.03, -0.04 ],
    text: 'foo'
  },
  {
    item_id: '9f41e31c-882f-42ef-add4-18688e810e01',
    embeddings: [ -0.011, -0.021, -0.031, -0.041 ],
    text: 'fooo'
  }
]

I was able to run the following knnBeta $search with a filter on "item_id" to return documents 1 and 3 as you have mentioned:

db.vectors.aggregate({
  '$search': {
    'index': 'default',
    'knnBeta': {
      'vector': [-0.01,-0.02,-0.03,-0.04],
      'path': 'embeddings',
      'k': 2,
      'filter': {
        'text': {
          'path': 'item_id',
          'query': '9f41e31c-882f-42ef-add4-18688e810e01'
        }
      }
    }
  }
})
[
  {
    _id: ObjectId("64d18ff706683323f56ba731"),
    item_id: '9f41e31c-882f-42ef-add4-18688e810e01',
    embeddings: [ -0.01, -0.02, -0.03, -0.04 ],
    text: 'foo'
  },
  {
    _id: ObjectId("64d18ff706683323f56ba733"),
    item_id: '9f41e31c-882f-42ef-add4-18688e810e01',
    embeddings: [ -0.011, -0.021, -0.031, -0.041 ],
    text: 'fooo'
  }
]
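Since the original question builds the stage from Python variables, here is a rough equivalent of the query above as a Python dict. It is a sketch rather than a definitive implementation: `build_knn_search_stage` and its parameters are hypothetical names, and the `operator` argument is only there so the filter operator can be swapped without rewriting the stage.

```python
def build_knn_search_stage(index, vector, path, k, filter_path, filter_value,
                           operator="text"):
    """Build a $search stage that runs knnBeta restricted by a filter.

    Hypothetical helper; it only assembles the dict and runs no query.
    """
    if operator not in ("text", "phrase"):
        raise ValueError("operator must be 'text' or 'phrase'")
    return {
        "$search": {
            "index": index,
            "knnBeta": {
                "vector": vector,
                "path": path,
                "k": k,
                # The filter field must be present in the search index.
                "filter": {
                    operator: {"path": filter_path, "query": filter_value}
                },
            },
        }
    }


pipeline = [
    build_knn_search_stage(
        index="default",
        vector=[-0.01, -0.02, -0.03, -0.04],
        path="embeddings",
        k=2,
        filter_path="item_id",
        filter_value="9f41e31c-882f-42ef-add4-18688e810e01",
    )
]
# With pymongo, this list would be passed to collection.aggregate(pipeline).
print(pipeline)
```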

If you’re still running into issues with the filter, can you share the documents (redacting any sensitive information) as well as the $search stage and index definition? I assume the index definition will change between tests, which is why I am requesting it again if further help is required.

Look forward to hearing from you.

Regards,
Jason

Just wanted to also add: because of the way the search query is analyzed, the following document (a new one, added alongside the 3 sample documents) will probably also be returned by the same query above:

{
    _id: ObjectId("64d1b4a8ee4ddb16520b78a0"),
    embeddings: [ -0.011, -0.021, -0.031, -0.041 ],
    text: 'fooo',
    item_id: 'abcde-18688e810e01-defgh' // <--- matches the `text` operator value within the `filter` portion of my previous reply
}

Take note of the item_id value. The middle portion "18688e810e01" will match the filter used in my previous example. You may wish to consider using phrase instead of text in the filter to help with these scenarios.
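To make the difference concrete, here is a small Python simulation. It is an approximation, not the real analyzer: it assumes the field is analyzed by lowercasing and splitting on non-alphanumeric characters such as hyphens, that `text` matches when any query term appears in the field, and that `phrase` needs all query terms adjacent and in order. The function names are made up for this example.

```python
import re


def tokenize(value):
    """Rough stand-in for the analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def text_matches(query, field_value):
    """Approximate `text`: true if any query term occurs in the field."""
    return bool(set(tokenize(query)) & set(tokenize(field_value)))


def phrase_matches(query, field_value):
    """Approximate `phrase`: query terms must appear adjacent and in order."""
    q, f = tokenize(query), tokenize(field_value)
    n = len(q)
    return n > 0 and any(f[i:i + n] == q for i in range(len(f) - n + 1))


query = "9f41e31c-882f-42ef-add4-18688e810e01"
lookalike = "abcde-18688e810e01-defgh"

print(tokenize(lookalike))               # ['abcde', '18688e810e01', 'defgh']
print(text_matches(query, lookalike))    # True: shares the term 18688e810e01
print(phrase_matches(query, lookalike))  # False: the full term sequence is absent
print(phrase_matches(query, query))      # True
```

Under these assumptions, `phrase` rejects the lookalike id while still matching the exact one, which is the behaviour you want when filtering on a single item_id.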

Please also note that I’ve only tested this on the 3 sample documents plus the document noted here, so I am not aware of any other cases where documents with different item_id values may be returned.

Regards,
Jason