Hi,
Is it possible to use the search operator knnBeta on nested fields?
I’m using the new search operator knnBeta to find and retrieve similar texts. I have a couple of thousand documents that I have split into subdocuments in order to compute an embedding for each subdocument. I want to search for and retrieve the k most similar subdocuments, which are nested within their respective documents. I have tried this on a flat database, and it worked. Now I want to try it in a nested setting, because I want to avoid repeating each document’s metadata for every subdocument. Here is a simplified example of the structure of my data (I have omitted some fields and cut all but the beginning and end of the arrays; the embeddings actually have 768 dimensions):
[{'dok_id': 'h50377',
  'sender': 'Finansdepartementet',
  'titel': 'Nya regler om betaltjänster',
  'doctype': 'prop',
  'subdocuments': [{'page': 1, 'nr': 0, 'embedding': [0.542523, ..., 0.343321]},
                   {'page': 2, 'nr': 1, 'embedding': [0.1455423, ..., 0.543325]},
                   {'page': 692, 'nr': 980, 'embedding': [0.1455423, ..., 0.543325]}]},
 {'dok_id': 'h503185d2',
  'sender': '',
  'titel': 'prop 2017/18 185 d2',
  'doctype': 'prop',
  'subdocuments': [{'page': 1, 'nr': 0, 'embedding': [0.192523, ..., 0.113321]},
                   {'page': 2, 'nr': 1, 'embedding': [0.5655423, ..., 0.013325]},
                   {'page': 645, 'nr': 864, 'embedding': [0.522423, ..., 0.145325]}]}
]
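For comparison, a flat layout can be derived from this nested structure roughly as follows. This is a minimal sketch in plain Python, not the exact code used for the flat database: the helper name `flatten_documents` is hypothetical, the embeddings are shortened to two placeholder values, and it assumes each flat record simply copies all document-level fields. It illustrates the metadata duplication that the nested layout is meant to avoid.

```python
def flatten_documents(documents):
    """Return one flat record per subdocument, copying the parent's metadata."""
    flat = []
    for doc in documents:
        # Everything except the nested array counts as document-level metadata.
        metadata = {k: v for k, v in doc.items() if k != 'subdocuments'}
        for sub in doc.get('subdocuments', []):
            # Merge the repeated metadata with the subdocument's own fields.
            flat.append({**metadata, **sub})
    return flat


# Shortened stand-in for the data above (2-dimensional placeholder embeddings).
nested = [
    {'dok_id': 'h50377', 'sender': 'Finansdepartementet', 'doctype': 'prop',
     'subdocuments': [{'page': 1, 'nr': 0, 'embedding': [0.5, 0.3]},
                      {'page': 2, 'nr': 1, 'embedding': [0.1, 0.5]}]},
]
flat_records = flatten_documents(nested)
```

Every flat record carries its own copy of `dok_id`, `sender`, and `doctype`, which is the repetition in question.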
And this is my search index:
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "doctype": {
        "type": "string"
      },
      "embedding": [
        {
          "dimensions": 768,
          "similarity": "dotProduct",
          "type": "knnVector"
        }
      ],
      "nr": {
        "type": "number"
      },
      "sender": {
        "type": "string"
      },
      "year": {
        "type": "number"
      }
    }
  }
}
I’m using pymongo. This is the aggregation pipeline that unfortunately gives me an empty list:
cursor = collection.aggregate([
    {'$search': {
        'knnBeta': {
            'vector': embedding,
            'path': 'subdocuments.embedding',
            'k': 10}}},
    {'$addFields': {'subdocuments.score': {'$meta': 'searchScore'}}},
    {'$project': {'_id': 0}},
])
results = list(cursor)
results
This is a simplified aggregation pipeline. What I actually want is to filter out subdocuments that do not score above a threshold, group the results at the document level, and compute a max score per document. But since even this simple pipeline does not work, I suspect that vector search is not feasible on nested documents. Is that the case?
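To make the intended post-processing concrete, here is a minimal client-side sketch in plain Python of the threshold filter and the per-document max score. It assumes the search has already returned flat hits, each carrying its `dok_id` and a `score` field. The function name `max_score_per_document`, the `score` field name, and the sample values are all hypothetical.

```python
def max_score_per_document(hits, threshold):
    """Keep hits scoring above threshold and return the best score per dok_id."""
    best = {}
    for hit in hits:
        score = hit['score']
        if score <= threshold:
            continue  # drop subdocuments at or below the threshold
        doc_id = hit['dok_id']
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return best


# Made-up hits for illustration; real scores would come from the search stage.
hits = [
    {'dok_id': 'h50377', 'nr': 0, 'score': 0.91},
    {'dok_id': 'h50377', 'nr': 1, 'score': 0.75},
    {'dok_id': 'h503185d2', 'nr': 0, 'score': 0.40},
]
best_scores = max_score_per_document(hits, threshold=0.5)
```

In an aggregation pipeline the same logic would presumably map to a `$match` on the score followed by a `$group` with `$max`, but that depends on the search stage returning per-subdocument scores in the first place.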