knnBeta on field nested in an array

Hi,

Is it possible to use the search operator knnBeta on nested fields?

I’m using the new search operator knnBeta to find and retrieve similar texts. I have a couple of thousand documents that I have split into subdocuments in order to compute an embedding for each subdocument. I want to search for and retrieve the k most similar subdocuments, which are nested within their respective documents. I have tried this on a flat database (and it worked); now I want to try it in a nested setting, because I want to avoid repeating the document-level metadata for every subdocument. Here is a simplified example of the structure of my data (I have omitted some fields and sliced out all but the beginning and end of the arrays; the embeddings actually have 768 dimensions):

[{'dok_id': 'h50377',
  'sender': 'Finansdepartementet',
  'titel': 'Nya regler om betaltjänster',
  'doctype': 'prop',
  'subdocuments': [{'page': 1, 'nr': 0, 'embedding':[0.542523,..., 0.343321]},
   {'page': 2, 'nr': 1, 'embedding':[0.1455423,..., 0.543325]},
   {'page': 692, 'nr': 980, 'embedding':[0.1455423,..., 0.543325]}]},
 {'dok_id': 'h503185d2',
  'sender': '',
  'titel': 'prop 2017/18 185 d2',
  'doctype': 'prop',
  'subdocuments': [{'page': 1, 'nr': 0, 'embedding':[0.192523,..., 0.113321]},
   {'page': 2, 'nr': 1,'embedding':[0.5655423,..., 0.013325]},
   {'page': 645, 'nr': 864,'embedding':[0.522423,..., 0.145325]}]}
]

And this is my search index:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "doctype": {
        "type": "string"
      },
      "embedding": [
        {
          "dimensions": 768,
          "similarity": "dotProduct",
          "type": "knnVector"
        }
      ],
      "nr": {
        "type": "number"
      },
      "sender": {
        "type": "string"
      },
      "year": {
        "type": "number"
      }
    }
  }
}

I’m using pymongo. This is the aggregation pipeline that unfortunately gives me an empty list:

cursor=collection.aggregate([
    {'$search': {
     'knnBeta': {'vector': embedding, 
            'path': 'subdocuments.embedding', 
            'k': 10}}
     },
    {'$addFields': {'subdocuments.score': {'$meta': 'searchScore'}}},
    {'$project': {'_id': 0}},
])
results=list(cursor)
results

This is a simplified aggregation pipeline. I actually want to filter out subdocuments that do not score above a threshold, group the results at document level, and compute a max score per document. But since this simple pipeline does not work, I suspect that vector search is not feasible on nested documents?
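
For reference, the filter-and-group step described above could be sketched as follows for the flat layout that already worked. This is only a sketch under assumptions: it supposes a flat collection with one record per subdocument carrying `dok_id`, `page`, `nr` and `embedding`; the function name, the default threshold and the output field names are made up for illustration.

```python
def build_flat_pipeline(query_vector, k=10, threshold=0.8):
    """Build a pipeline for a flat collection (assumed: one record per subdocument)."""
    return [
        # k nearest neighbours on the top-level embedding field
        {'$search': {'knnBeta': {'vector': query_vector,
                                 'path': 'embedding',
                                 'k': k}}},
        # expose the search score as a regular field
        {'$addFields': {'score': {'$meta': 'searchScore'}}},
        # keep only hits that score above the (hypothetical) threshold
        {'$match': {'score': {'$gt': threshold}}},
        # group hits per parent document and keep the best score
        {'$group': {
            '_id': '$dok_id',
            'max_score': {'$max': '$score'},
            'hits': {'$push': {'page': '$page', 'nr': '$nr', 'score': '$score'}},
        }},
        {'$sort': {'max_score': -1}},
    ]

# Usage (needs a live Atlas collection, so it is left commented out):
# results = list(collection.aggregate(build_flat_pipeline(embedding, k=10)))
```

Because the score is copied into an ordinary field before `$match`, the threshold and grouping stages are plain aggregation stages and do not depend on the search operator used.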

Hi Joakim,
Could you please try using embeddedDocument to index and query your documents? Similar to other operators, it supports knnBeta.

Let me know if that works for you!
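
For readers following along, one possible shape for such an index is sketched below as a Python dict that can be dumped to JSON and pasted into the Atlas index editor. It has not been verified against the cluster in this thread. It assumes the `subdocuments` array is mapped with the `embeddedDocuments` field type and that the `knnVector` mapping for `embedding` moves underneath it; the variable name is hypothetical.

```python
import json

# Sketch of an index definition for the nested layout (assumption, not the poster's actual index)
nested_index_definition = {
    'mappings': {
        'dynamic': True,
        'fields': {
            'subdocuments': {
                # index each array element as its own embedded document
                'type': 'embeddedDocuments',
                'dynamic': True,
                'fields': {
                    'embedding': {
                        'type': 'knnVector',
                        'dimensions': 768,
                        'similarity': 'dotProduct',
                    },
                },
            },
        },
    },
}

# Print as JSON for the index editor
print(json.dumps(nested_index_definition, indent=2))
```

Note that the index field type is spelled `embeddedDocuments` (plural), while the query operator is `embeddedDocument` (singular).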


Thank you Alexander! I actually also just found that solution. However, I still have a problem. I need to retrieve the score for each embedded document (my subdocument). As of now I only get a document-level aggregate (e.g. the maximum). I would like both the document score and the individual subdocument scores.

Here is my current pipeline:

knn_dict = {'knnBeta': {'vector': embedding,
                        'path': 'subdocuments.embedding',
                        'k': 100}}

cursor = collection.aggregate([
    {'$search': {
        'embeddedDocument': {
            'path': 'subdocuments',
            'operator': knn_dict,
            'score': {'embedded': {'aggregate': 'maximum'}}
        }
    }},
    {'$addFields': {'score': {'$meta': 'searchScore'}}},
    {'$project': {'_id': 0, 'dok_id': 1, 'score': 1,
                  'subdocuments.page': 1, 'subdocuments.nr': 1}},
])

I tried {'$addFields': {'subdocument.score': {'$meta': 'searchScore'}}} but I only got the aggregate score repeated on all subdocuments.

Kind regards
Joakim

Hi Joakim,
Today we only support sum/max/min/mean scoring options for embedded documents, see the details here:

Unfortunately, it’s not possible to output the scores of individual embedded documents. Feel free to request that on https://feedback.mongodb.com/forums/924868-atlas-search

Thank you Alexander for the timely response. I’ve made a feature request regarding this now. In our small application this is of strategic importance, because it determines the structure of our database (flat/long structure or thick/nested). If we go with the thick/nested structure, we will have to calculate the dot product a second time outside Mongo for the retrieved documents (naturally, that is something I want to avoid).

Kind regards
Joakim
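
For anyone facing the same trade-off, the client-side recomputation mentioned above might look roughly like the sketch below. It assumes the pipeline's `$project` stage also returns `subdocuments.embedding`; the function names, the threshold and the toy data are invented for illustration. The raw dot products computed here are not necessarily on the same scale as `searchScore`.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rescore(documents, query_vector, threshold):
    """Score every subdocument, drop those not above threshold, rank parents by best hit."""
    results = []
    for doc in documents:
        hits = []
        for sub in doc.get('subdocuments', []):
            score = dot(query_vector, sub['embedding'])
            if score > threshold:
                hits.append({'page': sub['page'], 'nr': sub['nr'], 'score': score})
        if hits:
            results.append({
                'dok_id': doc['dok_id'],
                'max_score': max(h['score'] for h in hits),
                'subdocuments': hits,
            })
    # best-matching parent documents first
    results.sort(key=lambda r: r['max_score'], reverse=True)
    return results

# Toy data standing in for documents returned by the $search pipeline
docs = [
    {'dok_id': 'a', 'subdocuments': [
        {'page': 1, 'nr': 0, 'embedding': [1.0, 0.0]},
        {'page': 2, 'nr': 1, 'embedding': [0.0, 1.0]},
    ]},
    {'dok_id': 'b', 'subdocuments': [
        {'page': 1, 'nr': 0, 'embedding': [0.5, 0.5]},
    ]},
]
ranked = rescore(docs, [1.0, 0.0], threshold=0.25)
```

With 768-dimensional embeddings and many subdocuments per document, a vectorized library would likely be faster than this pure-Python loop, and returning the embeddings adds to the response size.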
