Search highlights, keep only hits

Hi,

I made an aggregation pipeline including a “search” stage on several fields of a collection containing client information (name, first name, street name, locality, …).

In the “project” stage at the end of the pipeline, I include the “highlights” metadata:

        {
            "$project": {
                "_id": 0,
                "object_id": "$customerId",
                "object_infos": {
                    "$concat": [
                        {"$toString": "$customerId"}, 
                        " - ", 
                        "$firstName", 
                        " - ", 
                        "$name", 
                        " - ", 
                        "$streetName", 
                        " - ", 
                        "$locality"
                    ]
                },
                "score": { "$meta": "searchScore"},
                "highlights": { "$meta": "searchHighlights" }
            }
        }

During my tests, I provided two strings as input (“Becker” and “Ketangi”), so that some results match on the name (at least partially) and on the locality:

                {
                    "object_id": 750445,
                    "object_infos": "750445 - Madelena - O'Connell and Becker - Clarendon Street -  Ketangi",
                    "score": 25.159177780151367,
                    "highlights": [
                        {
                            "score": 6.748023509979248,
                            "path": "name",
                            "texts": [
                                {
                                    "value": "O'Connell and ",
                                    "type": "text"
                                },
                                {
                                    "value": "Becker",
                                    "type": "hit"
                                }
                            ]
                        },
                        {
                            "score": 7.059268474578857,
                            "path": "locality",
                            "texts": [
                                {
                                    "value": "Ketangi",
                                    "type": "hit"
                                }
                            ]
                        }
                    ]
                },

The highlights metadata currently provides a lot of information that I don’t actually need. I would like to reduce it to only the values for which there was a hit. For my example above, I would like to have something like this:

                {
                    "object_id": 750445,
                    "object_infos": "750445 - Madelena - O'Connell and Becker - Clarendon Street -  Ketangi",
                    "score": 25.159177780151367,
                    "highlights": ["Becker", "Ketangi"]
                },

The goal is to help identify which value produced a hit when that value doesn’t correspond exactly to my input string (as with fuzzy search).

Removing the “highlights.score” and “highlights.path” is easy (by just adding a “project” stage and setting the fields to 0). However, in my example, I still need three more steps, and so far I haven’t found a way to do them:

  • remove the complete “highlights.texts” entry for which the type is “text”
  • remove the “type” field in the 2 remaining “highlights.texts” entries
  • merge the 2 remaining “highlights.texts.value” in a single array (this step could be optional if too complex to do)
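
To pin down the intended transformation, the three steps above can be sketched in plain JavaScript, outside of any pipeline. This is only an illustration of the desired reshaping: `keepHits` is a hypothetical helper name, and the sample data is the `highlights` array from the result shown earlier.

```javascript
// Plain JavaScript, outside MongoDB: illustrates the desired reshaping only.
// `keepHits` is a hypothetical helper name, not part of any MongoDB API.
function keepHits(highlights) {
  // flatMap merges the per-highlight arrays into one flat array (step 3)
  return highlights.flatMap((h) =>
    h.texts
      .filter((t) => t.type === "hit") // step 1: drop the "text" entries
      .map((t) => t.value)             // step 2: keep the value, drop "type"
  );
}

// The highlights array from the example result above
const sample = [
  {
    score: 6.748023509979248,
    path: "name",
    texts: [
      { value: "O'Connell and ", type: "text" },
      { value: "Becker", type: "hit" }
    ]
  },
  {
    score: 7.059268474578857,
    path: "locality",
    texts: [{ value: "Ketangi", type: "hit" }]
  }
];

console.log(keepHits(sample)); // [ 'Becker', 'Ketangi' ]
```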

Adding an “unwind” stage to split the array of highlights is not an option, as I want to keep everything in one single document. I already tried conditional removal, like this:

        {
            "$project" : {
                "object_id": 1,
                "object_infos": 1,
                "highlightsNEW": {
                    "$cond": {
                        "if": { "$eq": [ "$highlights.texts.type", "text" ] },
                        "then": "$$REMOVE",
                        "else": "$highlights.texts.value"
                    }
                }
            }
        }

… but it doesn’t work. Here’s the result that I get:

                {
                    "object_id": 750445,
                    "object_infos": "750445 - Madelena - O'Connell and Becker - Clarendon Street -  Ketangi",
                    "highlightsNEW": [
                        [
                            "White, O'Connell and ",
                            "Becker"
                        ],
                        [
                            "Ketangi"
                        ]
                    ]
                },

Would someone have an idea how I could do that?

You could do it with two unwinds and a re-group:

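db.collection.aggregate([
  // One document per entry of the highlights array
  {
    $unwind: "$highlights"
  },
  // One document per text fragment inside each highlight
  {
    $unwind: "$highlights.texts"
  },
  // Keep only the fragments that were actual hits
  {
    $match: {
      "highlights.texts.type": "hit"
    }
  },
  // Re-group into one document per original result;
  // add any other field you want to keep (e.g. object_id) to this key
  {
    $group: {
      _id: {
        _id: "$_id",
        "object_infos": "$object_infos",
        "score": "$score"
      },
      highlights: {
        $push: "$highlights.texts.value"
      }
    }
  },
  // Move the grouped fields back to the top level
  {
    $project: {
      _id: "$_id._id",
      "object_infos": "$_id.object_infos",
      "score": "$_id.score",
      highlights: 1
    }
  }
])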
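
If splitting the results with `$unwind` is undesirable, as the question states, the same reshaping can probably be done in a single stage with array expressions. The `$cond` attempt in the question fails because `"$highlights.texts.type"` resolves to an array of arrays, which never equals the string `"text"`, so the `else` branch always runs and returns every value. The sketch below has not been run against the original data. It assumes the stage is appended after the `$project` that fills `highlights` from `$meta: "searchHighlights"`, and `keepOnlyHits` is just a name chosen for illustration.

```javascript
// Sketch of a single stage that flattens highlights to the "hit" values only.
// `keepOnlyHits` is an illustrative name; append the stage after the $project
// that populates `highlights` with { $meta: "searchHighlights" }.
const keepOnlyHits = {
  $set: {
    highlights: {
      // Walk the highlights array, accumulating hit values into one flat array
      $reduce: {
        input: "$highlights",
        initialValue: [],
        in: {
          $concatArrays: [
            "$$value",
            {
              // Keep only the `value` of each remaining entry
              $map: {
                input: {
                  // Drop the entries whose type is not "hit"
                  $filter: {
                    input: "$$this.texts",
                    as: "t",
                    cond: { $eq: ["$$t.type", "hit"] }
                  }
                },
                as: "t",
                in: "$$t.value"
              }
            }
          ]
        }
      }
    }
  }
};

// Usage (assumption: `pipeline` is the existing array of stages):
// db.collection.aggregate([...pipeline, keepOnlyHits]);
```

`$set` is available from MongoDB 4.2 onward; `$addFields` with the same body should behave identically. Because nothing is unwound, every result stays a single document and no re-grouping is needed.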
Works perfectly! Thank you!
