Search highlight is incorrect when using wildcard query

We are using a search query:

[
  {
    $search: {
      index: "datasets",
      compound: {
        must: [
          {
            text: {
              query: "subscription",
              path: "name",
              fuzzy: {
                maxEdits: 1,
                prefixLength: 3,
              },
            },
          },
        ],
        should: [
          {
            wildcard: {
              path: "tags",
              query: "*",
              allowAnalyzedField: true,
              score: {
                constant: { value: 100 },
              },
            },
          },
        ],
      },
      highlight: {
        path: ["name"],
      },
    },
  },
  {
    $addFields: {
      highlight: {
        $meta: "searchHighlights",
      },
    },
  },
]

This query finds "subscription" in the entity name and boosts the score if the entity has any tag. Both name and tags are string fields. The highlight should return only the matches in name.
However, for the name cs_subscription_v2, the highlight response is:

- value: "cs", type: "hit"
- value: "_", type: "text"
- value: "subscription", type: "hit"
- value: "_", type: "text"
- value: "v2", type: "hit"

There are 3 hits, but there should be only 1.

We also noticed that if we remove the should section with the wildcard match from the query, the highlight works correctly.

Does anyone know why this happens and how to work around it? Is this a bug? Thanks!

Hi @Yi_Wang,

Thanks for providing those details. Can you also share the following:

  1. Sample documents (preferably at least one document that has { 'name' : 'cs_subscription_v2' })
  2. The index definition used

Regards,
Jason

Hi Jason, thanks for looking into this issue!

Some sample documents:

[{
  "entityId": "DATASET~17127EC430CE8D0D89D0EEDB808B2A40",
  "name": "metaphor-data.test.k_20220112",
  "description": "gator baits",
  "tags": null
},
{
  "entityId": "DATASET~17127EC430CE8D0D89D0EEDB808B2A40",
  "name": "metaphor-data.test.cs_subscription_v2",
  "description": "subscriptions v2",
  "tags": ["customer"]
},
{
  "entityId": "DATASET~17127EC430CE8D0D89D0EEDB808B2A40",
  "name": "metaphor-data.prod.subscription_replacement",
  "description": "subscription replacement",
  "tags": ["customer", "GOLD"]
}]

Index spec:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "description": {
        "type": "string"
      },
      "tags": [
        {
          "type": "string"
        },
        {
          "type": "stringFacet"
        }
      ],
      "name": [
        {
          "analyzer": "delimiter_pattern",
          "multi": {
            "keyword": {
              "analyzer": "keyword_lowercase",
              "type": "string"
            }
          },
          "type": "string"
        }
      ]
    }
  },
  "analyzers": [
    {
      "name": "delimiter_pattern",
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ],
      "tokenizer": {
        "pattern": "[ \\.\\/\\-_,;:]+",
        "type": "regexSplit"
      }
    },
    {
      "name": "keyword_lowercase",
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ],
      "tokenizer": {
        "type": "keyword"
      }
    }
  ]
}

Hi @Yi_Wang,

Thanks for providing those details and sample documents.

I believe the behaviour you’re experiencing can also be seen in an example from our documentation. In that example, the query itself only matches on:

"text": {
        "path": "description",
        "query": "varieties"
   }

i.e. documents that match the text "varieties" for the path "description".

Yet, since the highlight is on both "description" and "summary", the example shows hits for both:

"highlights" : [
    {
      "path" : "summary", /// <--- summary path
      "texts" : [
        {
          "value" : "Pear ",
          "type" : "text"
        },
        {
          "value" : "varieties",
          "type" : "hit" /// <--- hit on summary path even though it is not part of the `text` operator path
        }
      ],
      "score" : 1.3891443014144897
    },
    {
      "path" : "description",
      "texts" : [
        {
          "value" : "Bosc and Bartlett are the most common ",
          "type" : "text"
        },
        {
          "value" : "varieties",
          "type" : "hit"
        },
        {
          "value" : " of pears.",
          "type" : "text"
        }
      ],
      "score" : 1.2691514492034912
    }
]

In short, the search highlighting metadata feature will, by design, return a "hit" for any term matched by the query that occurs in ANY of the highlight paths of the matching result set.

As for your particular example, the should clause with the wildcard query value "*" matches every term, so each term in the highlight path "name" is reported as a "hit", which is what your highlight response shows. If I change the wildcard to "c*", the highlight response becomes (for the sample documents you provided):

[
  {
    _id: ObjectId("645c6c375d5c23c575e620b8"),
    entityId: 'DATASET~17127EC430CE8D0D89D0EEDB808B2A40',
    name: 'metaphor-data.test.cs_subscription_v2',
    description: 'subscriptions v2',
    tags: [ 'customer' ],
    highlight: [
      {
        score: 1.321254014968872,
        path: 'name',
        texts: [
          { value: 'metaphor-data.test.', type: 'text' },
          { value: 'cs', type: 'hit' },
          { value: '_subscription_v2', type: 'text' }
        ]
      }
    ]
  },
  {
    _id: ObjectId("645c6c375d5c23c575e620b9"),
    entityId: 'DATASET~17127EC430CE8D0D89D0EEDB808B2A40',
    name: 'metaphor-data.prod.subscription_replacement',
    description: 'subscription replacement',
    tags: [ 'customer', 'GOLD' ],
    highlight: []
  }
]
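
To make the "cs" hit above easier to see, here is a rough JavaScript simulation of the reasoning. It is only a sketch: Atlas Search runs on Lucene, not on this code, and the helper names (tokenize, wildcardToRegExp, matchingTokens) are hypothetical. It assumes the delimiter_pattern analyzer from the index spec behaves like a regex split followed by lowercasing, and that a wildcard term is compared against each resulting token.

```javascript
// Rough simulation only; this is not how Atlas Search is implemented.
// Approximates the "delimiter_pattern" analyzer from the index spec:
// a regexSplit tokenizer followed by a lowercase token filter.
function tokenize(name) {
  return name
    .split(/[ \.\/\-_,;:]+/)
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase());
}

// Converts a wildcard pattern ("*" = any run of characters,
// "?" = exactly one character) into an anchored regular expression.
function wildcardToRegExp(pattern) {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*").replace(/\?/g, ".") + "$");
}

// Hypothetical helper: which tokens of `name` does the wildcard pattern match?
function matchingTokens(name, pattern) {
  const re = wildcardToRegExp(pattern);
  return tokenize(name).filter((token) => re.test(token));
}

const name = "metaphor-data.test.cs_subscription_v2";
console.log(tokenize(name));
// [ 'metaphor', 'data', 'test', 'cs', 'subscription', 'v2' ]
console.log(matchingTokens(name, "*"));
// [ 'metaphor', 'data', 'test', 'cs', 'subscription', 'v2' ]
console.log(matchingTokens(name, "c*"));
// [ 'cs' ]
```

Under these assumptions, "*" matches every token of the name, while "c*" matches only "cs", which lines up with the single wildcard hit in the response above.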

Hope the above helps. If this is not the desired behaviour, you can raise a feedback post regarding your use case.
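
In the meantime, one possible client-side workaround is to post-process the highlight array and keep only the hits that are close to the text query term. The sketch below is an assumption-laden illustration, not an Atlas Search feature: editDistance and keepQueryHits are hypothetical helpers, the filter handles a single-term query only, and it mirrors maxEdits: 1 from the original query while ignoring prefixLength.

```javascript
// Levenshtein edit distance using a single rolling row.
function editDistance(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // value from the previous row, previous column
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value from the previous row, same column
      row[j] = Math.min(
        above + 1, // deletion
        row[j - 1] + 1, // insertion
        diagonal + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution or match
      );
      diagonal = above;
    }
  }
  return row[b.length];
}

// Hypothetical helper: demotes any "hit" fragment that is more than
// `maxEdits` edits away from the query term to plain "text".
function keepQueryHits(highlights, query, maxEdits = 1) {
  const target = query.toLowerCase();
  return highlights.map((h) => ({
    ...h,
    texts: h.texts.map((t) =>
      t.type === "hit" && editDistance(t.value.toLowerCase(), target) > maxEdits
        ? { ...t, type: "text" }
        : t
    ),
  }));
}

// Highlight fragments as reported in the original question.
const highlight = [
  {
    path: "name",
    texts: [
      { value: "cs", type: "hit" },
      { value: "_", type: "text" },
      { value: "subscription", type: "hit" },
      { value: "_", type: "text" },
      { value: "v2", type: "hit" },
    ],
  },
];

const filtered = keepQueryHits(highlight, "subscription");
console.log(filtered[0].texts.map((t) => `${t.type}:${t.value}`).join(" "));
// text:cs text:_ hit:subscription text:_ text:v2
```

The demoted fragments are left as separate "text" entries rather than merged with their neighbours, so concatenating the values still reproduces the original field value.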

Regards,
Jason