Strange behavior (bug?) in Atlas Search Scoring

Patrick_Zenhausern · February 9, 2022, 12:26pm

The score (and therefore the sort order) isn’t working as I expect.
Everything that matches in the title should get a higher score, and as far as I can tell that is what the query asks for (though I may be missing something).

However, for some reason the “Body” match gets a higher score than the “Title” match, and it’s not clear to me why.

But first let’s see the data and the query:

Database seed

/* 1 */
{
    "_id" : ObjectId("62037e3d24fc15277c2bfa07"),
    "SubscriptionId" : "b31a0037-5316-4220-958b-1f8c1a4c2759",
    "ItemId" : "matched-by-body",
    "GroupIds" : [ 
        "default-group-id"
    ],
    "Contents" : {
        "en" : {
            "Language" : "en",
            "Active" : true,
            "BodyPlainText" : "I have a dream"
        }
    }
}

/* 2 */
{
    "_id" : ObjectId("62037e3d24fc15277c2bfa08"),
    "SubscriptionId" : "b31a0037-5316-4220-958b-1f8c1a4c2759",
    "ItemId" : "matched-by-title",
    "GroupIds" : [ 
        "default-group-id"
    ],
    "Contents" : {
        "en" : {
            "Language" : "en",
            "Active" : true,
            "Title" : "I have a dream"
        }
    }
}

/* 3 */
{
    "_id" : ObjectId("62037e3d24fc15277c2bfa0a"),
    "SubscriptionId" : "other-sub",
    "ItemId" : "matched-by-title-but-wrong-sub",
    "GroupIds" : [ 
        "default-group-id"
    ],
    "Contents" : {
        "en" : {
            "Language" : "en",
            "Active" : true,
            "Title" : "I have a dream"
        }
    }
}

/* 4 */
{
    "_id" : ObjectId("62037e3d24fc15277c2bfa0b"),
    "SubscriptionId" : "b31a0037-5316-4220-958b-1f8c1a4c2759",
    "ItemId" : "matched-by-title-but-wrong-group",
    "GroupIds" : [ 
        "nobody-has-access-group"
    ],
    "Contents" : {
        "en" : {
            "Language" : "en",
            "Active" : true,
            "Title" : "I have a dream"
        }
    }
}

/* 5 */
{
    "_id" : ObjectId("62037e3d24fc15277c2bfa0c"),
    "SubscriptionId" : "b31a0037-5316-4220-958b-1f8c1a4c2759",
    "ItemId" : "no-match",
    "GroupIds" : [ 
        "default-group-id"
    ],
    "Contents" : {
        "en" : {
            "Language" : "en",
            "Active" : true,
            "BodyPlainText" : "So lonely"
        }
    }
}

The query

db.getCollection('news').aggregate([
  {
    "$search": {
      "index": "my-search",
      "compound": {
        "filter": [
          {
            "text": {
              "path": "SubscriptionId",
              "query": "b31a0037-5316-4220-958b-1f8c1a4c2759"
            }
          },
          {
            "text": {
              "path": "GroupIds",
              "query": [
                "default-group-id",
                "hr-group-id",
                "everybody-is-editor-group"
              ]
            }
          }
        ],
        "should": [
          {
            "phrase": {
              "query": "dream",
              "path": {
                "wildcard": "Contents.*.Title"
              },
              "score": {
                "boost": {
                  "value": 7
                }
              }
            }
          },
          {
            "phrase": {
              "query": "dream",
              "path": {
                "wildcard": "Contents.*.BodyPlainText"
              },
              "score": {
                "boost": {
                  "value": 3
                }
              }
            }
          },
          {
            "text": {
              "query": "dream",
              "path": {
                "wildcard": "Contents.*.Title"
              },
              "score": {
                "boost": {
                  "value": 5
                }
              }
            }
          },
          {
            "text": {
              "query": "dream",
              "path": {
                "wildcard": "Contents.*"
              },
              "fuzzy": {
                "maxEdits": 2
              }
            }
          }
        ],
        "minimumShouldMatch": 1
      }
    }
  },
  {
    $project: {
      ItemId: 1,
      "Contents.en.BodyPlainText": 1,
      "Contents.en.Title": 1,
      "Contents.en.Body": 1,
      score: {
        $meta: "searchScore"
      }
    }
  }
])

Search index definition:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "GroupIds": [
        {
          "dynamic": true,
          "type": "document"
        },
        {
          "analyzer": "lucene.keyword",
          "norms": "omit",
          "searchAnalyzer": "lucene.keyword",
          "type": "string"
        }
      ],
      "SubscriptionId": [
        {
          "dynamic": true,
          "type": "document"
        },
        {
          "analyzer": "lucene.keyword",
          "norms": "omit",
          "searchAnalyzer": "lucene.keyword",
          "type": "string"
        }
      ]
    }
  }
}

Execute the query

If we execute the query now, we get back the following:

/* 1 */
{
    "_id" : ObjectId("62037e3d24fc15277c2bfa07"),
    "ItemId" : "matched-by-body",
    "Contents" : {
        "en" : {
            "BodyPlainText" : "I have a dream"
        }
    },
    "score" : 1.10903561115265
}

/* 2 */
{
    "_id" : ObjectId("62037e3d24fc15277c2bfa08"),
    "ItemId" : "matched-by-title",
    "Contents" : {
        "en" : {
            "Title" : "I have a dream"
        }
    },
    "score" : 0.514158070087433
}

As you can see, for some reason the body match has the higher score.

However - and this is the really strange part - if we delete every document except the two that match and run the same query again, we get back the following:

/* 1 */
{
    "_id" : ObjectId("62037e3d24fc15277c2bfa08"),
    "ItemId" : "matched-by-title",
    "Contents" : {
        "en" : {
            "Title" : "I have a dream"
        }
    },
    "score" : 1.69993948936462
}

/* 2 */
{
    "_id" : ObjectId("62037e3d24fc15277c2bfa07"),
    "ItemId" : "matched-by-body",
    "Contents" : {
        "en" : {
            "BodyPlainText" : "I have a dream"
        }
    },
    "score" : 0.523058295249939
}

As you can see, suddenly it behaves as expected. But I don’t understand why or how this makes sense.

Patrick_Zenhausern · February 16, 2022, 7:51am

I took this question to support, and this was the answer:

The behaviour you have observed is expected. One of the factors in the score of a term is how frequent it is in the corpus (collection) and how frequent it is in a specific document.
For instance, if you add many documents containing the term X, that term becomes more common and less special, so the documents that previously matched X are not as “relevant” as they were before.
You can read more about how a score is calculated in Lucene by default here
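
To make the support answer concrete, here is a rough sketch of a Lucene-style BM25 term score. It assumes the default parameters k1 = 1.2 and b = 0.75, assumes “I have a dream” is indexed as 4 tokens, and assumes every matching should clause simply adds boost × BM25. The helper names (idf, tfNorm, score) are made up for illustration and are not part of any API. Under those assumptions it lands very close to three of the four scores reported above. The fourth (the title match in the full collection) depends on how many indexed Title fields contain “dream” at query time, which the post does not pin down; as far as I can tell, documents excluded by the filter clauses still count towards that number.

```javascript
// Sketch of a Lucene-style BM25 score for one term in one field.
// Assumed defaults: k1 = 1.2, b = 0.75. Helper names are hypothetical.
function idf(docFreq, docCount) {
  // docFreq: documents whose field contains the term
  // docCount: documents that have the field at all
  return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
}

function tfNorm(freq, fieldLen, avgFieldLen, k1 = 1.2, b = 0.75) {
  // Term-frequency part, normalised by field length relative to the average.
  return freq / (freq + k1 * (1 - b + b * (fieldLen / avgFieldLen)));
}

function score(boostSum, docFreq, docCount, fieldLen, avgFieldLen) {
  // Assumption: each matching "should" clause adds boost * BM25, so boosts sum.
  return boostSum * idf(docFreq, docCount) * tfNorm(1, fieldLen, avgFieldLen);
}

// Boosts taken from the query: phrase + text + fuzzy text (default boost 1).
const TITLE_BOOSTS = 7 + 5 + 1;
// phrase + fuzzy text
const BODY_BOOSTS = 3 + 1;

// Two-document collection: each field exists once and contains "dream".
const titleTwoDocs = score(TITLE_BOOSTS, 1, 1, 4, 4); // ≈ 1.69994
const bodyTwoDocs = score(BODY_BOOSTS, 1, 1, 4, 4);   // ≈ 0.52306

// Five-document seed: BodyPlainText exists in 2 documents, 1 contains "dream",
// average length (4 + 2) / 2 = 3 tokens.
const bodyFiveDocs = score(BODY_BOOSTS, 1, 2, 4, 3);  // ≈ 1.10904

// IDF when "dream" is in every Title field, for 1, 3 and 5 such documents.
const titleIdfByCount = [1, 3, 5].map((n) => idf(n, n));

console.log(titleTwoDocs.toFixed(5), bodyTwoDocs.toFixed(5), bodyFiveDocs.toFixed(5));
console.log(titleIdfByCount.map((x) => x.toFixed(4)));
```

The last line shows the effect support describes: the more Title fields contain “dream”, the smaller its IDF, so even a boost total of 13 can end up below the body match, whose term is rarer within its own field.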

I suppose that in a real-world application, with more data than my test set above, you most likely wouldn’t see this behaviour, because the data has far more variety.
I can still imagine situations where this behaviour looks strange, but I guess it is what it is…
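
For cases where a title match must outrank a body match regardless of how common the term is, one possible workaround (not something support suggested) is to swap the boost options for constant scores. This is an untested sketch that assumes the Atlas Search constant score option behaves as documented; the shouldClauses and pipeline variable names are made up for illustration.

```javascript
// Hypothetical variant of the original query: "constant" replaces "boost",
// so each clause contributes a fixed value independent of term frequency.
const shouldClauses = [
  {
    phrase: {
      query: "dream",
      path: { wildcard: "Contents.*.Title" },
      score: { constant: { value: 7 } }
    }
  },
  {
    phrase: {
      query: "dream",
      path: { wildcard: "Contents.*.BodyPlainText" },
      score: { constant: { value: 3 } }
    }
  },
  {
    text: {
      query: "dream",
      path: { wildcard: "Contents.*.Title" },
      score: { constant: { value: 5 } }
    }
  },
  {
    // Left on the default relevance score, so it still varies with the corpus.
    text: {
      query: "dream",
      path: { wildcard: "Contents.*" },
      fuzzy: { maxEdits: 2 }
    }
  }
];

const pipeline = [
  {
    $search: {
      index: "my-search",
      compound: {
        filter: [
          { text: { path: "SubscriptionId", query: "b31a0037-5316-4220-958b-1f8c1a4c2759" } },
          { text: { path: "GroupIds", query: ["default-group-id", "hr-group-id", "everybody-is-editor-group"] } }
        ],
        should: shouldClauses,
        minimumShouldMatch: 1
      }
    }
  },
  { $project: { ItemId: 1, score: { $meta: "searchScore" } } }
];

// In the shell: db.getCollection('news').aggregate(pipeline)
```

A title match would then collect 7 + 5 from the constant clauses against 3 for a body match, with only the fuzzy clause still depending on corpus statistics.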
