Analyzer not working as expected in Atlas Search phrase query?

I am trying to search for terms like “on time”, “on-time” and “on.” I am using the whitespace analyzer to get exact matches, and the phrase operator rather than text because I need the words to appear together. However, the results are not accurate: some of the returned documents contain only “on”.

{
  "analyzer": "lucene.whitespace",
  "searchAnalyzer": "lucene.whitespace",
  "mappings": {
    "dynamic": false,
    "fields": {
      "client": {
        "type": "string"
      },
      "fulltext": {
        "type": "string"
      },
      "dateRange": {
        "type": "date"
      }
    }
  }
}

My search query looks like this:

[
    {
        "$search":{
            "index":"fulltext",
            "compound":{
                "filter":{
                    "range":{
                        "path":"dateRange",
                        "gte":ISODate("2023-03-01T00:00:00.000Z"),
                        "lte":ISODate("2023-03-31T18:29:59.000Z")
                    }
                },
                "must":{
                    "phrase":{
                        "query":"ABC",
                        "path":[
                            "client"
                        ]
                    }
                },
                "should":[
                    {
                        "phrase":{
                            "query":"on-time",
                            "path":[ 
                                "fulltext"
                            ]
                        }
                    },
                    {
                        "phrase":{
                            "query":"on time",
                            "path":[ 
                                "fulltext"
                            ]
                        }
                    }
                ]
            }
        }
    }
]

Why am I also getting results that do not contain the phrase “on time”? Documents that contain only “on” are included in the results as well. Why?

Hi Utsav,

Can you provide the output you’re getting from the search query above, and identify which documents you are and aren’t expecting in the results?

I have tried with 3 sample documents:

db.collection.find({},{_id:0})
[
  { fulltext: 'on time' },
  { fulltext: 'on' },
  { fulltext: 'on-time' }
]

With a similar $search query and the same analyzers (but without the range and must options), I received:

db.collection.aggregate([ { $search: { index: 'ftindex', compound: { "should": [ { "phrase": { "query": "on-time", "path": [ "fulltext"] } }, { "phrase": { "query": "on time", "path": [ "fulltext"] } }] } } },{$project:{_id:0}}])
[
 { fulltext: 'on time' },
 { fulltext: 'on-time' }
]

I left out the range and must options because no sample documents were available for me to test with. In my test environment I created only 3 simple documents, each with a fulltext field holding one of the example strings from your post.
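
As a rough illustration of why the 'on' document is excluded above, here is a minimal sketch in plain JavaScript that models the whitespace analyzer and the phrase operator. The helper names whitespaceTokens and phraseMatches are made up for this illustration. This is not how Atlas Search is implemented; it is only a simplified model, assuming that lucene.whitespace splits on whitespace and leaves case and punctuation untouched.

```javascript
// Simplified model only; NOT the Atlas Search implementation.
// Assumption: lucene.whitespace splits on whitespace and keeps
// case and punctuation as-is, so "on-time" stays a single token.
function whitespaceTokens(text) {
  return text.split(/\s+/).filter((t) => t.length > 0);
}

// A phrase matches when all query tokens appear consecutively,
// in order, in the tokenized field value.
function phraseMatches(fieldText, queryText) {
  const field = whitespaceTokens(fieldText);
  const query = whitespaceTokens(queryText);
  if (query.length === 0) return false;
  for (let i = 0; i + query.length <= field.length; i++) {
    if (query.every((tok, j) => field[i + j] === tok)) return true;
  }
  return false;
}

// The three sample values from this post.
const docs = ['on time', 'on', 'on-time'];
const queries = ['on-time', 'on time'];

// Keep a document if at least one of the two phrases matches it.
const hits = docs.filter((d) => queries.some((q) => phraseMatches(d, q)));
console.log(hits); // [ 'on time', 'on-time' ]
```

Under this model a bare 'on' never satisfies either phrase: 'on-time' is one token that differs from 'on', and 'on time' needs two consecutive tokens.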

Regards,
Jason

@Jason_Tran, below are some documents in which I am trying to find phrases like “on time” and “on-time”:

  1. I think airlines have always underpriced their product Tony Fernandes, CEO, Air Asia, on airlines taking advantage of the post-pandemic travel boom to lock in higher airfares.
  2. AI Express, AirAsia India move to unified reservation system
  3. The ad film that captures the essence perfectly as it shows a group of friends on a road trip in a yellow vintage Microbus is conceptualised by Makani Creatives
  4. CIVIL Aviation Minister Jyotiraditya Scindia, who was addressing the first-time voters at ‘Yuva Samvada’ organised by East Point College of Engineering on Tuesday, said that youth are the future of India, and they should make their political choices carefully

Above are 4 documents. If you search for both phrases, “on time” and “on-time”, together with the should operator, you get all of these documents in the results. I do not understand why, since I am using phrase and the whitespace analyzer.

Thanks for providing those strings. However, in future, please provide sample documents that match the aggregation you provided at the start. For example, you include "client" and "dateRange" fields in your initial aggregation which could affect the results. It also makes it easier to reproduce the behaviour you’re experiencing as the sample documents could be more readily imported into any test environments.

That said, I created several test documents, one of which contains "client" : "ABC":

testdb> db.collection.find({},{_id:0})
[
  {
    fulltext: 'I think airlines have always underpriced their product Tony Fernandes, CEO, Air Asia, on airlines taking advantage of the post-pandemic travel boom to lock in higher airfares.',
    client: 'ABC'
  },
  {
    fulltext: 'AI Express, AirAsia India move to unified reservation system'
  },
  {
    fulltext: 'The ad film that captures the essence perfectly as it shows a group of friends on a road trip in a yellow vintage Microbus is conceptualised by Makani Creatives'
  },
  {
    fulltext: 'CIVIL Aviation Minister Jyotiraditya Scindia, who was addressing the first-time voters at ‘Yuva Samvada’ organised by East Point College of Engineering on Tuesday, said that youth are the future of India, and they should make their political choices carefully'
  }
]

Please take a look at the two individual $search stages I’ve used. The main difference is that minimumShouldMatch is used in one of them (var c):

testdb> b /// NO minimumShouldMatch used
[
  {
    '$search': {
      index: 'ftindex',
      compound: {
        must: [
          { phrase: { query: 'ABC', path: [ 'client' ] } }
        ],
        should: [
          { phrase: { query: 'on-time', path: [ 'fulltext' ] } },
          { phrase: { query: 'on time', path: [ 'fulltext' ] } }
        ]
      }
    }
  },
  { '$project': { _id: 0 } }
]
testdb> c /// minimumShouldMatch of 1 used
[
  {
    '$search': {
      index: 'ftindex',
      compound: {
        must: [
          { phrase: { query: 'ABC', path: [ 'client' ] } }
        ],
        should: [
          { phrase: { query: 'on-time', path: [ 'fulltext' ] } },
          { phrase: { query: 'on time', path: [ 'fulltext' ] } }
        ],
        minimumShouldMatch: 1
      }
    }
  },
  { '$project': { _id: 0 } }
]

Output using var b:

testdb> db.collection.aggregate(b)
[
  {
    fulltext: 'I think airlines have always underpriced their product Tony Fernandes, CEO, Air Asia, on airlines taking advantage of the post-pandemic travel boom to lock in higher airfares.',
    client: 'ABC'
  }
]

Output using var c:

testdb> db.collection.aggregate(c)
/// no documents returned
testdb>

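It may simply be that minimumShouldMatch is required for your use case. Without it, the should clauses are optional once the must clause matches; they only raise the score of documents that contain them. Please let me know if it works.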
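
To make the difference between var b and var c concrete, here is a minimal sketch in plain JavaScript that models how a compound query with must, should and minimumShouldMatch could be evaluated. Everything in it is hypothetical and for illustration only: the helper names, the single-string path (the real queries use an array), and the second sample document, which is not from this thread. It is a simplified model, not the Atlas Search implementation.

```javascript
// Simplified model only; NOT the Atlas Search implementation.
function whitespaceTokens(text) {
  return text.split(/\s+/).filter((t) => t.length > 0);
}

// True when the query tokens appear consecutively in the field tokens.
function phraseMatches(fieldText, queryText) {
  const field = whitespaceTokens(fieldText);
  const query = whitespaceTokens(queryText);
  if (query.length === 0) return false;
  for (let i = 0; i + query.length <= field.length; i++) {
    if (query.every((tok, j) => field[i + j] === tok)) return true;
  }
  return false;
}

// Hypothetical evaluator. Each clause is { path, query }, with path
// simplified to a single field name.
function compoundMatches(doc, { must = [], should = [], minimumShouldMatch = 0 }) {
  const hit = (clause) => phraseMatches(doc[clause.path] || '', clause.query);
  if (!must.every(hit)) return false;
  const shouldHits = should.filter(hit).length;
  // Assumption: with no must clause, at least one should clause has to
  // match, which mirrors the should-only result earlier in this thread.
  const required = Math.max(minimumShouldMatch, must.length === 0 ? 1 : 0);
  return shouldHits >= required;
}

const sampleDocs = [
  // Taken from this thread: contains "on" but neither phrase.
  {
    client: 'ABC',
    fulltext: 'I think airlines have always underpriced their product Tony Fernandes, CEO, Air Asia, on airlines taking advantage of the post-pandemic travel boom to lock in higher airfares.',
  },
  // Hypothetical document, added only to show a positive match.
  { client: 'ABC', fulltext: 'The flight departed on time' },
];

const clauses = {
  must: [{ path: 'client', query: 'ABC' }],
  should: [
    { path: 'fulltext', query: 'on-time' },
    { path: 'fulltext', query: 'on time' },
  ],
};

// Like var b: should is optional, so the must match alone is enough.
const withoutMin = sampleDocs.filter((d) => compoundMatches(d, clauses));
// Like var c: at least one should clause has to match as well.
const withMin = sampleDocs.filter((d) =>
  compoundMatches(d, { ...clauses, minimumShouldMatch: 1 })
);

console.log(withoutMin.length); // 2
console.log(withMin.length); // 1
```

In this model the airline document is returned without minimumShouldMatch purely because of the must clause on client, which matches the output of var b.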

Are you expecting none of the documents you provided to be returned?

Regards,
Jason

Solved, thanks. I used minimumShouldMatch.

Glad to hear it, and thanks for marking the solution, Utsav.
