How to use Atlas Search to query in list of strings

williamwjs · April 27, 2023, 1:43am

Hi Team,

What I am trying to do with Atlas search is essentially something similar to the $in query.

So my documents are like:

[
   { "item": "Pens", "quantity": 350, "tags": [ "school", "office" ] },
   { "item": "Erasers", "quantity": 15, "tags": [ "home", "school" ] },
   { "item": "Maps", "tags": [ "office", "storage" ] },
   { "item": "Books", "quantity": 5, "tags": [ "school" ] },
   { "item": "Another Maps", "quantity": 5, "tags": [ "storage" ] }
]

And my search index is configured as:

      "tags": {
        "analyzer": "lucene.keyword",
        "indexOptions": "docs",
        "norms": "omit",
        "searchAnalyzer": "lucene.keyword",
        "type": "string"
      },

And my query is like:

{
    $search: {
      "compound": {
        "should": [{
          "queryString": {
            "defaultPath": "tags",
            "query": "school OR storage"
          }
        }]
      }
    }
}

Expected results:
Returning all the results with the same score

Actual results:
There are two score groups: records with the “school” tag and records with the “storage” tag have different scores.

(My actual data is a little different from the above, but the query is the same)

If I want all those records having the same score, given that I am using an OR query, how would I change my configuration here?

Thank you!

Jason_Tran · April 27, 2023, 2:58am

Hi @williamwjs,

Have you tried using the constant scoring option? I believe the behaviour you’ve described in terms of the scoring is expected (at least from analyzing the sample documents). Based off your example, there are a total of 5 documents of which:

2 documents contain the value "storage" in the "tags" array field
3 documents contain the value "school" in the "tags" array field

As per the Score the Documents in the Results documentation:

Many factors can influence a document’s score, including:

The position of the search term in the document,

The frequency of occurrence of the search term in the document,

The type of operator the query uses,

The type of analyzer the query uses.

In this particular case, I believe how often each term occurs across the collection is one of the main reasons the results have different scores, even though each document contains its matching term only once. Example below from my test environment, based on the sample documents you provided:

5 documents total, for the tags array - 2 documents containing “storage”, 3 documents containing “school”:

[
  { tags: [ 'office', 'storage' ], score: 0.47005030512809753 },
  { tags: [ 'storage' ], score: 0.47005030512809753 },
  { tags: [ 'school', 'office' ], score: 0.2893940806388855 },
  { tags: [ 'home', 'school' ], score: 0.2893940806388855 },
  { tags: [ 'school' ], score: 0.2893940806388855 }
]

We can see here the first 2 documents have a higher score (probably what you are experiencing).

Now, let’s add in another document that contains the "storage" value inside of the "tags" array and perform the same search.

6 documents total, for the tags array - 3 documents containing “storage”, 3 documents containing “school”:

[
  { tags: [ 'school', 'office' ], score: 0.3767103850841522 },
  { tags: [ 'home', 'school' ], score: 0.3767103850841522 },
  { tags: [ 'office', 'storage' ], score: 0.3767103850841522 },
  { tags: [ 'school' ], score: 0.3767103850841522 },
  { tags: [ 'storage' ], score: 0.3767103850841522 },
  { tags: [ 'storage', 'home' ], score: 0.3767103850841522 }
]

We can see here that the scores are now all the same for this result set.

(The test environment I’m using for this has the index named "tagsindex".) Reverting to the original 5 documents and using a constant score value of 1:

db.tags.aggregate([
  {
    "$search": {
      "index": "tagsindex",
      "compound": {
        "should": [{
          "queryString": {
            "defaultPath": "tags",
            "query": "school OR storage",
            "score": { "constant": { "value": 1 } }
          }
        }]
      }
    }
  },
  {
    "$project": {
      "_id": 0,
      "tags": 1,
      "score": { "$meta": "searchScore" }
    }
  }
])

Output:

[
  { tags: [ 'storage' ], score: 1 },
  { tags: [ 'school' ], score: 1 },
  { tags: [ 'office', 'storage' ], score: 1 },
  { tags: [ 'home', 'school' ], score: 1 },
  { tags: [ 'school', 'office' ], score: 1 }
]

Wondering if this would work for you / your use case and if the explanation above helps with the scoring differences you may be seeing.

Regards,
Jason

Jason_Tran · April 27, 2023, 3:11am

You may also wish to check out Return the Score Details, a relatively new feature that provides the scoreDetails boolean option in your $search stage. It returns a detailed breakdown of the score for each document in the query results.
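
A minimal sketch of how scoreDetails could be enabled for the query in this thread, assuming the "tagsindex" index name from the test environment above. It only builds and prints the pipeline; in mongosh you would pass it to db.tags.aggregate(...) against a real collection.

```javascript
// Pipeline that asks Atlas Search for a per-document score breakdown.
const pipeline = [
  {
    $search: {
      index: "tagsindex", // index name from the test environment in this thread
      queryString: {
        defaultPath: "tags",
        query: "school OR storage"
      },
      scoreDetails: true // request the breakdown alongside the score
    }
  },
  {
    $project: {
      _id: 0,
      tags: 1,
      score: { $meta: "searchScore" },
      // The breakdown is exposed through its own $meta keyword.
      scoreDetails: { $meta: "searchScoreDetails" }
    }
  }
];

console.log(JSON.stringify(pipeline, null, 2));
// In mongosh: db.tags.aggregate(pipeline)
```

The scoreDetails field in each result should then show which factors contributed to that document's score.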

williamwjs · April 27, 2023, 3:43am

Jason_Tran:

Now, let’s add in another document that contains the "storage" value inside of the "tags" array and perform the same search.

6 documents total, for the tags array - 3 documents containing “storage”, 3 documents containing “school”:
[
  { tags: [ 'school', 'office' ], score: 0.3767103850841522 },
  { tags: [ 'home', 'school' ], score: 0.3767103850841522 },
  { tags: [ 'office', 'storage' ], score: 0.3767103850841522 },
  { tags: [ 'school' ], score: 0.3767103850841522 },
  { tags: [ 'storage' ], score: 0.3767103850841522 },
  { tags: [ 'storage', 'home' ], score: 0.3767103850841522 }
]
We can see here that the scores are now all the same for this result set.

@Jason_Tran Thank you for your reply!!!

Interesting to see that by adding one more document, it could return all records with the same score. Do you know why that would make a difference here?

Jason_Tran · April 27, 2023, 3:52am

Hi @williamwjs

As per my previous reply:

I believe in the example with 6 documents (3 docs containing "storage" and 3 docs containing "school"), the number of docs matching "storage" is the same as the number of docs matching "school" out of the total of 6 docs (3/6 vs 3/6). In the example with 5 docs, it's 2/5 for “storage” and 3/5 for “school”.

I believe you will then get a different set of results yet again if I add another document with "storage" (total of 7 documents, 4 containing "storage").

You can test to verify if you wish.

Regards,
Jason

Jason_Tran · April 27, 2023, 3:57am

Tested it out with 7 docs just now (3 containing “school”, 4 containing “storage”):

[
  { tags: [ 'school' ], score: 0.44143030047416687 },
  { tags: [ 'home', 'school' ], score: 0.44143030047416687 },
  { tags: [ 'school', 'office' ], score: 0.44143030047416687 },
  { tags: [ 'storage' ], score: 0.3072332739830017 },
  { tags: [ 'office', 'storage' ], score: 0.3072332739830017 },
  { tags: [ 'storage' ], score: 0.3072332739830017 },
  { tags: [ 'storage', 'test' ], score: 0.3072332739830017 }
]
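
The three result sets can be checked against an inverse-document-frequency term. The sketch below assumes Atlas Search scores these matches with Lucene's BM25 formula, where idf = ln(1 + (N - n + 0.5) / (n + 0.5)) for N total documents and n documents containing the term; that assumption is not confirmed in this thread. The function and variable names are illustrative only. It compares the predicted school/storage score ratio with the ratio of the scores reported above.

```javascript
// Lucene-style BM25 inverse document frequency: rarer terms get a larger weight.
// (Assumed scoring model; not confirmed by the thread.)
function bm25Idf(totalDocs, docsWithTerm) {
  return Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
}

// Document counts and scores taken from the three test runs in this thread.
const runs = [
  { totalDocs: 5, school: 3, storage: 2,
    schoolScore: 0.2893940806388855, storageScore: 0.47005030512809753 },
  { totalDocs: 6, school: 3, storage: 3,
    schoolScore: 0.3767103850841522, storageScore: 0.3767103850841522 },
  { totalDocs: 7, school: 3, storage: 4,
    schoolScore: 0.44143030047416687, storageScore: 0.3072332739830017 }
];

// For each run, compare the IDF ratio with the observed score ratio.
const comparison = runs.map((run) => ({
  totalDocs: run.totalDocs,
  predictedRatio: bm25Idf(run.totalDocs, run.school) / bm25Idf(run.totalDocs, run.storage),
  observedRatio: run.schoolScore / run.storageScore
}));

for (const row of comparison) {
  console.log(row.totalDocs, row.predictedRatio.toFixed(4), row.observedRatio.toFixed(4));
}
```

If the assumption holds, the predicted and observed ratios should agree closely in every run, which would explain why the scores are equal only when both terms appear in the same number of documents.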

williamwjs · April 27, 2023, 4:01am

Also, one issue with "score": { "constant" : { "value" : 1} } is that it would set the same score for every matching document.

But my intention is to do a search with multiple factors, for example:

{
    $search: {
      "compound": {
        "should": [{
          "near": {
            "path": "quantity",
            "origin": 350,
            "pivot": 400
          }
        }, {
          "queryString": {
            "defaultPath": "tags",
            "query": "school OR storage"
          }
        }]
      }
    }
}

so that the resulting score combines the two “should” clauses here.
(Let me know if this makes sense to you.)

Any suggestions on this?

Thank you!

Jason_Tran · April 27, 2023, 4:13am

Are you saying that the constant scoring option would cause the near to return the same results? I think you would just need the constant on the queryString portion and not the near operator. For example, output using constant only on queryString:

[
  { quantity: 350, tags: [ 'school', 'office' ], score: 2 },
  {
    tags: [ 'office', 'storage' ],
    quantity: 50,
    score: 1.5714285373687744
  },
  {
    quantity: 15,
    tags: [ 'home', 'school' ],
    score: 1.5442177057266235
  },
  { quantity: 5, tags: [ 'school' ], score: 1.5369126796722412 },
  { quantity: 5, tags: [ 'storage' ], score: 1.5369126796722412 }
]

The $search used for above:

db.tags.aggregate([
  {
    "$search": {
      "index": "tagsindex",
      "compound": {
        "should": [{
          "near": {
            "path": "quantity",
            "origin": 350,
            "pivot": 400
          }
        }, {
          "queryString": {
            "defaultPath": "tags",
            "query": "school OR storage",
            "score": { "constant": { "value": 1 } }
          }
        }]
      }
    }
  },
  {
    "$project": {
      "_id": 0,
      "tags": 1,
      "quantity": 1,
      "score": { "$meta": "searchScore" }
    }
  }
])
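
The scores in that output can be reproduced by hand. This is a sketch that assumes the documented near scoring formula, pivot / (pivot + distance from origin), and that compound adds the scores of the matching should clauses; the helper and constant names are illustrative only.

```javascript
// near scoring for numbers: pivot / (pivot + |origin - value|).
function nearScore(value, origin, pivot) {
  return pivot / (pivot + Math.abs(origin - value));
}

const ORIGIN = 350;
const PIVOT = 400;
const CONSTANT_TAG_SCORE = 1; // constant score set on the queryString clause

// Quantities and scores from the output above.
const observed = [
  { quantity: 350, score: 2 },
  { quantity: 50, score: 1.5714285373687744 },
  { quantity: 15, score: 1.5442177057266235 },
  { quantity: 5, score: 1.5369126796722412 }
];

// Expected combined score = near score + constant tag score.
const combined = observed.map((doc) => ({
  quantity: doc.quantity,
  expected: nearScore(doc.quantity, ORIGIN, PIVOT) + CONSTANT_TAG_SCORE,
  observed: doc.score
}));

console.log(combined);
```

Because the near clause contributes at most 1 under this formula, a constant of 1 on the tags clause keeps both clauses on a comparable scale.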

williamwjs · April 27, 2023, 5:29am

You are right! The combined score does reflect both clauses.
Initially I set the constant value to 10 and, for some reason, all the results had the same score. After changing it to 1, it works now.

Thank you so much for all your help!!!
