Score varies by length

Hi, how can I make the returned score independent of length? I'm not sure whether it is affected by the length of the search term or of the stored value.

I’m trying to use Atlas Search to do text similarity.
Then I use a score threshold to determine if the text is similar.
Is this the best way? Or should I use Text Index?

Here is my pipeline:

[
  {
    $search: {
      index: "cache_chat_message",
      compound: {
        must: [
          {
            search: {
              query:
                "Help me choose a good fund",
              path: "context",
            },
          },
          {
            equals: {
              value: ObjectId(
                "649e487b6465e9fa440db8f5"
              ),
              path: "projectId",
              score: { "boost": { "value": 1 } }
            },
          },
        ],
      },
    },
  },
  {
    $project: {
      _id: 1,
      slots: 1,
      context: 1,
      score: {
        $meta: "searchScore",
      },
    },
  },
  {
    $limit: 2,
  },
]

An example is attached. If I change the query and context, the score changes even though they are an exact match, which makes it difficult to determine similarity.

Hi @HS-Law,

What’s your current search index definition?

I’m not entirely sure of your expected output or use case with the single document output you’ve provided but you can consider looking at:

  1. The string field type properties. More specifically the norms:
  • include - to include the field length when scoring.
  • omit - to omit the field length when scoring.
  2. Modifying the score using one of the available options.

If you need further help, please provide sample documents and the index definition along with the current output + expected output (what you want the scores to be for example for a particular search term).

Regards,
Jason


Hi @Jason_Tran

My index definition and sample docs: gist:6e4214f72106c73c5209452ab9ddb2f7 · GitHub

I am trying to use search to find similar questions (context field) so that I can group their answers together (slot array field).

For example, the questions below are all similar to “What is the best performing fund of ABC Company in July?”
So, when the app returns the answer, it should update the answers’ slot of the matching doc.

-What is the best performing fund of ABC Company in July?
-What is the best performing fund in July?
-Show me best performing fund in July

If it cannot find a similar question (context field), then it should create a new doc.


Current situation:
My pipeline: gist:575558af6ddfeaccc4d14793a9f3dfbb · GitHub

  1. Query “What is the best performing fund of ABC Company in July?” , gets score of 4.621779441833496

  2. Query “Who is the CTO of ABC Company” gets score of 2.6413586139678955

  3. If I delete the “best performing fund” doc, my second query “Who is the CTO of ABC Company” returns score of 2.409642219543457


My questions:

  1. How can I configure the search so that an exact match returns a maximum constant score?
  2. What score threshold should I use to decide that a question is similar to one of the contexts in the docs?
  3. Is Atlas Search the best choice for this task?

Thanks.

As per the scoring documentation:

Every document returned by an Atlas Search query is assigned a score based on relevance, and the documents included in a result set are returned in order from highest score to lowest.

Many factors can influence a document’s score, including:

  • The position of the search term in the document,
  • The frequency of occurrence of the search term in the document,
  • The type of operator the query uses,
  • The type of analyzer the query uses.

You can use the searchScoreDetails option to help analyse the scoring, but I believe in this particular case the idf value changed because one or both of the following values changed when you deleted the document:

  • N is the total number of documents with the field.
  • n is the number of documents containing the term.
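
To illustrate how N and n feed into the score, here is a minimal sketch of the idf term, assuming the scoring follows the BM25 idf formula used by Lucene. The function name bm25Idf is hypothetical, and the sketch only shows the direction of the change; it does not reproduce the exact scores from your environment.

```javascript
// BM25 inverse document frequency, in the form used by Lucene (assumption:
// Atlas Search applies the same formula for this field).
// N: total number of documents with the field
// n: number of documents containing the term
function bm25Idf(N, n) {
  return Math.log(1 + (N - n + 0.5) / (n + 0.5));
}

// The same term, matched in 1 document, becomes "rarer" as the collection grows.
const idfOneDoc = bm25Idf(1, 1);   // only document in the collection
const idfFiveDocs = bm25Idf(5, 1); // 1 matching document out of 5

console.log(idfOneDoc.toFixed(4), idfFiveDocs.toFixed(4)); // prints "0.2877 1.3863"
```

So adding or deleting documents that do not even match the query can still move the score of the documents that do.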

Have you tried putting a constant scoring option in the search / text operator portion of your query? I can see your pipeline has a constant scoring option for the equals operator but not the search / text portion. Please see the score field details for the text operator here.
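
As a rough sketch of what that could look like for your pipeline: the helper name buildConstantScorePipeline is hypothetical, the text operator is used in place of search, and the equals clause is moved into filter (my choice, not something from your pipeline) because filter clauses do not contribute to the score. Treat it as a starting point under those assumptions rather than a tested configuration.

```javascript
// Builds a $search pipeline where any text match on "context" scores exactly 1.
// projectId is passed in by the caller, e.g. ObjectId("...") in mongosh.
function buildConstantScorePipeline(queryText, projectId) {
  return [
    {
      $search: {
        index: "cache_chat_message",
        compound: {
          must: [
            {
              text: {
                query: queryText,
                path: "context",
                // Constant score: every matching document gets the same value.
                score: { constant: { value: 1 } },
              },
            },
          ],
          // filter restricts results without affecting the score.
          filter: [{ equals: { value: projectId, path: "projectId" } }],
        },
      },
    },
    {
      $project: { _id: 1, slots: 1, context: 1, score: { $meta: "searchScore" } },
    },
    { $limit: 2 },
  ];
}

// Usage in mongosh (collection name is a placeholder):
// db.collection.aggregate(
//   buildConstantScorePipeline("Help me choose a good fund", ObjectId("649e487b6465e9fa440db8f5"))
// )
```

Note that with a constant score, a partial match and an exact match both score 1, so the score only tells you that something matched.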

Regards,
Jason

image

In this example, why does the search “tell me a joke” return “tell me more about the World Series Fund” with a very high score?

My index:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "context": {
        "norms": "omit",
        "type": "string"
      },
      "projectId": {
        "type": "objectId"
      }
    }
  }
}

As noted in the previous post: Every document returned by an Atlas Search query is assigned a score based on relevance, and the documents included in a result set are returned in order from highest score to lowest.

It may be that in your environment there are many other documents that do not contain matching terms, so relative to those, the returned document you have shown may have a higher score. Let’s take a look at an example using the same document in my test environment (using the same index definition you provided):

Only 1 document in this collection:

search> db.collection.find({},{_id:0})
[ { context: 'tell me more about the World Series Fund' } ]

search> db.collection.aggregate({$search:{text:{path:'context',query:'tell me a joke'}}},{$project:{_id:0,context:1,score:{$meta:'searchScore'}}})
[
  {
    context: 'tell me more about the World Series Fund',
    score: 0.4073374569416046
  }
]

We can see that the score value is 0.4073374569416046.

Now, I insert another 4 documents that do not contain any matching terms and perform the exact same $search:

search> db.collection.find({},{_id:0})
[
  { context: 'tell me more about the World Series Fund' },
  { context: 'random string' },
  { context: 'random string abcdef' },
  { context: 'random string testing' },
  { context: 'random string new' }
]

search> db.collection.aggregate({$search:{text:{path:'context',query:'tell me a joke'}}},{$project:{_id:0,context:1,score:{$meta:'searchScore'}}})
[
  {
    context: 'tell me more about the World Series Fund',
    score: 1.804081678390503
  }
]

We can now see the same document is returned, but with a score of 1.804081678390503 this time.

As noted previously as well, you can see more into the scoring using searchScoreDetails for your environment.
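
For reference, a minimal sketch of requesting the score breakdown, assuming the default search index as in the examples above. The helper name buildScoreDetailsPipeline is hypothetical; scoreDetails: true and the searchScoreDetails metadata are the parts that enable and project the breakdown.

```javascript
// Builds a pipeline that returns each document's score together with
// the detailed breakdown of how that score was computed.
function buildScoreDetailsPipeline(path, query) {
  return [
    {
      $search: {
        text: { path, query },
        scoreDetails: true, // ask Atlas Search to record the breakdown
      },
    },
    {
      $project: {
        _id: 0,
        [path]: 1,
        score: { $meta: "searchScore" },
        scoreDetails: { $meta: "searchScoreDetails" },
      },
    },
  ];
}

const detailsPipeline = buildScoreDetailsPipeline("context", "tell me a joke");
// Usage in mongosh: db.collection.aggregate(detailsPipeline)
```

The returned breakdown should make it possible to see how much of the score comes from the idf part for each matched term.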

In terms of the use case from the example you provided - Is it that you believe the score is too high or is it that you believe the document should not be returned at all based off the search term?

Regards,
Jason

Thanks, I guess this is not suitable for my use case of identifying similarity. I will explore other options.

Just to try to get a bit of feedback here: if there are “similar documents”, are you wanting these to have the same score?

For example, 2 documents which hit a match for the search term "tell me a joke" with the constant scoring option:

search> a
[
  {
    '$search': {
      text: {
        path: 'context',
        query: 'tell me a joke',
        score: { constant: { value: 1 } }
      }
    }
  },
  {
    '$project': { _id: 0, context: 1, score: { '$meta': 'searchScore' } }
  }
]
search> db.collection.find({},{_id:0})
[
  { context: 'tell me more about the World Series Fund' },
  { context: 'random string new' },
  { context: 'random string new' },
  { context: 'random string new' },
  { context: 'tell me a joke' }
]

Output: 2 matching documents with same score due to constant scoring option.

search> db.collection.aggregate(a)
[
  { context: 'tell me more about the World Series Fund', score: 1 },
  { context: 'tell me a joke', score: 1 }
]

Thanks in advance.

Jason

My dataset would have many questions from customers such as:

  1. Who is the CTO of company ABC?
  2. Who is the CEO of company ABC?
  3. What product options do you have for a beginner in hiking?
  4. What product does your company offer?

So, these four are considered unique.

When users ask more questions such as:

  • I’m a beginner looking for hiking products, can you make some recommendation?
    We want to match it with question 3, so that we can pick a random reply from those set by customer service.

Thanks.

I added those as documents to my test environment and used the search term you provided - Question 3’s document was returned as the highest result.

In regards to my above statement, I have a few questions:

  1. Is this what you are seeing in your environment?
  2. If so, are you wanting it to be the only result? If this is the case, you could just take the top result.

Test documents:

search> db.collection.find({},{_id:0})
[
  { context: 'tell me more about the World Series Fund' },
  { context: 'random string new' },
  { context: 'random string new' },
  { context: 'random string new' },
  { context: 'tell me a joke' },
  { context: 'Who is the CTO of company ABC?' },
  { context: 'Who is the CEO of company ABC?' },
  {
    context: 'What product options do you have for a beginner in hiking?'
  },
  { context: 'What product does your company offer?' }
]
search> db.collection.aggregate({$search:{text:{path:'context',query:'I’m a beginner looking for hiking products, can you make some recommendation?'}}},{$project:{_id:0,score:{$meta:'searchScore'},context:1}})
[
  {
    context: 'What product options do you have for a beginner in hiking?',
    score: 6.164970397949219
  },
  { context: 'tell me a joke', score: 0.9522761702537537 }
]

Document with the highest score correlates with the “Question 3” you mentioned.
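
If it helps to compare results on a common scale, one option is to divide each score by the top score in the same result set. The sketch below does this in application code; the function name addRelativeScores and the relativeScore field are hypothetical. Keep in mind the top result always gets 1, so this only shows how far the other candidates are behind it, not how similar the top result is to the query in absolute terms.

```javascript
// Adds a relativeScore in the range 0..1 to each result by dividing
// by the highest score in the same result set.
function addRelativeScores(results) {
  if (results.length === 0) return [];
  const top = Math.max(...results.map((r) => r.score));
  return results.map((r) => ({
    ...r,
    relativeScore: top > 0 ? r.score / top : 0,
  }));
}

// The two results from the output above.
const scored = addRelativeScores([
  { context: "What product options do you have for a beginner in hiking?", score: 6.164970397949219 },
  { context: "tell me a joke", score: 0.9522761702537537 },
]);

console.log(scored.map((r) => r.relativeScore.toFixed(2))); // prints [ '1.00', '0.15' ]
```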

Looking forward to hearing from you.

Regards,
Jason

Yup, the issue is the range of the highest score.
In our use case, the questions are populated and grouped by similarity, starting from zero records.

For example, if we change the question to expert instead of beginner:

-what products do you have for expert in hiking?,
-I’m an expert looking for hiking products, can you make some recommendation?

it returns a very high score as well.

We need to know that the question for the expert variant does not exist yet, so that we can create a new doc for customer service to add answers to. The next time, it can then pick answers for questions related to “products for hiking experts.”

We need a controlled range to identify whether the user’s question carries the same meaning. Maybe something like 0-1, where 0.9 can be considered very similar?

Thanks.