Atlas Text Search highlights results - sentence-centric?

Bill_Reynolds · February 28, 2020, 8:25pm

In reviewing the information at
https://docs.atlas.mongodb.com/reference/full-text-search/highlighting/
the examples with the fruits all work fine. I was using FTS to search sample stock analyst
reports. For the matching documents, the highlights texts element values seem to be
based only on sentences. For example, if there are 4 sentences before the sentence containing the
search term, I was expecting to see the text of those 4 sentences as part of the surrounding text
of the match. Instead, I see only pieces of sentence 5. The fruit examples are all single-sentence
in nature. I could not find any mention of such sentence-centric behavior. Maybe I missed something?

Also, treating every dot as the end of a sentence is incorrect when the text has something like “MongoDB, Inc. blah blah blah”. The sentence stops after “Inc.”. Perhaps the same issue arises with salutations like Mr. and Mrs.?

timfrietas · February 28, 2020, 11:39pm

Hi Bill,

I don’t totally follow the behavior you’re describing. If you could share a piece of a document you are querying and the results you are receiving hopefully I can help.

Bill_Reynolds · February 29, 2020, 8:27pm

Here is an example to show the unexpected results, which seem very “sentence centric”.

db.stock_news.remove( {} );
db.stock_news.insertMany([
	{ author: "Nasdaq Technology Sector Update", 
		text: "Technology giants were gaining Thursday. Early movers include MongoDB, Inc. which gained more than 20%. Microsoft also gained 8%."},
	{ author: "PRNewswire",
		text: "Hello world. MongoDB, Inc. (NASDAQ: MDB) is the leading modern database platform. Where is this text?"},
] );

The FTS index was defined with the name “myFtsIndex” as

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "text": {
        "type": "string",
        "analyzer": "lucene.standard",
        "multi": {
          "keywordAnalyzer": {
            "type": "string",
            "analyzer": "lucene.keyword"
          }
        }
      }
    }
  }
}

A query to demonstrate the unexpected results is
(Note I did not use the default index name for some reason so you see the index name “myFtsIndex” below)

db.stock_news.aggregate([
  {
    $searchBeta: {
      index: "myFtsIndex",
      "search": {
        "query": "MongoDB",
        "path": "text"
      },
      "highlight": {
        "path": "text"
      }
    }
  },
  {
    $project: {
      "text": 1,
      "_id": 0,
      "highlights": { "$meta": "searchHighlights" }
    }
  }
]).pretty()

This shows

{
  "text" : "Hello world. MongoDB, Inc. (NASDAQ: MDB) is the leading modern database platform. Where is this text?",
  "highlights" : [
        {
            "path" : "text",
            "texts" : [
                {
                    "value" : "MongoDB",
                    "type" : "hit"
                },
                {
                    "value" : ", Inc. ",
                    "type" : "text"
                }
            ],
            "score" : 1.8908861875534058
        }
    ]
}
{
 "text" : "Technology giants were gaining Thursday. Early movers include MongoDB, Inc. which gained more than 20%. Microsoft also gained 8%.",
 "highlights" : [
        {
            "path" : "text",
            "texts" : [
                {
                    "value" : "Early movers include ",
                    "type" : "text"
                },
                {
                    "value" : "MongoDB",
                    "type" : "hit"
                },
                {
                    "value" : ", Inc. which gained more than 20%. ",
                    "type" : "text"
                }
            ],
            "score" : 1.4883723258972168
        }
    ]
}

The first result document is seemingly missing:

the “Hello world.” sentence in front of the match,
the rest of the sentence containing the match, and
the subsequent sentence(s) after the sentence containing the match.

The second match shows similar gaps.
This is the basis for my confusion about the highlights data based on the doc at
https://docs.atlas.mongodb.com/reference/full-text-search/highlighting/

Thank you for your help,
Bill
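
For anyone who needs the surrounding text today, one possible client-side workaround is to locate the highlighted passage inside the full field value, which the query above already projects. This is only a sketch: `locatePassage` is a hypothetical helper name, and it assumes that the concatenated `value` strings of a highlight appear verbatim in the field, as they do in the output shown above.

```javascript
// Hypothetical helper (not part of any MongoDB API): finds where a highlight
// passage sits inside the full field text and returns the text around it.
// Assumption: joining the "value" strings reproduces a verbatim substring.
function locatePassage(fullText, texts) {
  const passage = texts.map((t) => t.value).join("");
  const start = fullText.indexOf(passage);
  if (start === -1) {
    return null; // the passage could not be located in the field text
  }
  const end = start + passage.length;
  return {
    before: fullText.slice(0, start), // everything ahead of the passage
    passage: passage,                 // the passage returned by the highlighter
    after: fullText.slice(end)        // everything following the passage
  };
}

// Usage with the first result document from the output above.
const doc = {
  text: "Hello world. MongoDB, Inc. (NASDAQ: MDB) is the leading modern database platform. Where is this text?",
  highlights: [
    {
      path: "text",
      texts: [
        { value: "MongoDB", type: "hit" },
        { value: ", Inc. ", type: "text" }
      ]
    }
  ]
};
const located = locatePassage(doc.text, doc.highlights[0].texts);
console.log(JSON.stringify(located, null, 2));
```

From `before` and `after` the application can then take as many characters or sentences of context as it wants to display.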

timfrietas · March 4, 2020, 5:51pm

Thanks for the detail, Bill. This is indeed the default behavior. It comes from a Lucene option called the Unified Highlighter, which defaults to sentence-level passages. This is our current sane default, but it is unfortunately also our only current option, as we haven’t made highlighting customizable yet.

However, I captured your feature request here:

Feel free to vote for it, follow along, and, if you like, comment describing the ideal behavior you would like to see.
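
To make the sentence-level behavior concrete, here is a toy splitter run against the two sample documents. It is not Lucene's algorithm. The rule it uses (end a sentence at ".", "?" or "!" followed by whitespace, unless the next character is a lowercase letter) is an assumption chosen only because it reproduces the passages seen in the output above, and `splitSentences` is a hypothetical name.

```javascript
// Toy sentence splitter for illustration only; NOT the Lucene implementation.
// Assumed rule: a sentence ends at ".", "?" or "!" followed by whitespace,
// unless the character after the whitespace is a lowercase letter.
function splitSentences(text) {
  const sentences = [];
  const boundary = /[.?!]\s+(?![a-z\s])/g;
  let start = 0;
  let match;
  while ((match = boundary.exec(text)) !== null) {
    const end = match.index + match[0].length;
    sentences.push(text.slice(start, end)); // keep the trailing whitespace
    start = end;
  }
  if (start < text.length) {
    sentences.push(text.slice(start)); // remainder after the last boundary
  }
  return sentences;
}

const sample1 = "Hello world. MongoDB, Inc. (NASDAQ: MDB) is the leading modern database platform. Where is this text?";
const sample2 = "Technology giants were gaining Thursday. Early movers include MongoDB, Inc. which gained more than 20%. Microsoft also gained 8%.";

// Keep only the sentence that contains the search term, as the highlighter does.
const hit1 = splitSentences(sample1).filter((s) => s.includes("MongoDB"));
const hit2 = splitSentences(sample2).filter((s) => s.includes("MongoDB"));
console.log(hit1); // [ 'MongoDB, Inc. ' ]
console.log(hit2); // [ 'Early movers include MongoDB, Inc. which gained more than 20%. ' ]
```

Under this rule, "Inc." followed by "(NASDAQ" ends a sentence while "Inc." followed by "which" does not, which matches the difference between the two results.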

Bill_Reynolds · March 5, 2020, 4:14pm

Hi Tim,
Thank you very much for the sanity check and for confirming my observations. All the FTS examples I found used a single sentence, which did not help my research. While the possible enhancement goes through the process to perhaps be implemented, I suggest adding some text to the documentation page noted above so that others are not confused and do not implement code that produces incorrect output.
Thanks again.

Bill