Different scores depending on partial match position

Let’s say I have an author field and I want to return any document partially containing the search input as a match.

So, for example, if I have author = Lionel Messi, either "Lio", "Me" or "ssi" should return documents having that value.

Now, suppose I search "Lio", and there are also some documents where author = Delio Valdez. I want those to be returned also (“lio” is a partial match), but the ones with author = Lionel Messi should have a higher score in this case, given that the match is at the beginning of the string.

What would be the best way to accomplish this in terms of index definition and search configuration?

Hi @German_Medaglia! Have you seen the Partial Match tutorial? I would recommend looking into the autocomplete field mapping and operator.

Hi @amyjian! Thanks for your quick reply. Yes, I’ve seen that tutorial and also thought autocomplete was the best option here. But I couldn’t figure out how to specify different tokenizations for a single field and assign a different score boost depending on the position of the match.

And, for example, if I have this on my index definition:

    "author": {
      "analyzer": "lucene.standard",
      "foldDiacritics": false,
      "maxGrams": 7,
      "minGrams": 2,
      "tokenization": "edgeGram",
      "type": "autocomplete"
    }

then when searching "lio", I will get documents with author = Lionel Messi but not those with author = Delio Valdez.

Since you are using the edgeGram tokenization strategy, Atlas Search creates tokens from your documents from left-to-right, with a minimum of 2 characters and a maximum of 7 characters.

For “Lionel Messi”, the token outputs would be: [li, lio, lion, lione, lionel, lionel[SPACE]]. Since the search term “lio” matches one of the token outputs, the document with author = Lionel Messi is returned.
Similarly, “Delio Valdez” will be tokenized from left to right to generate the following output tokens: [de, del, deli, delio, delio[SPACE], delio[SPACE]v]. Since the search term “lio” does not match any of the output tokens, the document with author = Delio Valdez is not returned.
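To make the two token lists above concrete, here is a minimal Python sketch that reproduces them. It is only a simplified model of the behavior described in this reply, not Atlas Search's actual analyzer (which may emit additional tokens); the helper name `edge_grams` is made up for illustration.

```python
def edge_grams(text, min_grams=2, max_grams=7):
    """Simplified model: left-anchored grams of the lowercased text.

    Assumption: grams are taken over the whole string, as in the
    token lists above. This is not the real Atlas Search analyzer.
    """
    s = text.lower()
    longest = min(max_grams, len(s))
    return [s[:n] for n in range(min_grams, longest + 1)]


lionel = edge_grams("Lionel Messi")
delio = edge_grams("Delio Valdez")

print(lionel)  # ['li', 'lio', 'lion', 'lione', 'lionel', 'lionel ']
print(delio)   # ['de', 'del', 'deli', 'delio', 'delio ', 'delio v']

# "lio" is a left-anchored gram of "Lionel Messi" only.
print("lio" in lionel, "lio" in delio)  # True False
```
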

To achieve the experience you are describing, you can use the nGram tokenization strategy, which would create the following tokens for “Delio Valdez”: [de, del, deli, delio, delio[SPACE], delio[SPACE]v, el, eli, elio, elio[SPACE], elio[SPACE]v, elio[SPACE]va, li, lio, lio[SPACE], lio[SPACE]v, lio[SPACE]va, lio[SPACE]val, io, io[SPACE], ..., va, val, vald, valde, valdez, al, ald, ..., ld, lde, ..., de, dez, ez]. As you can see, a search for “lio” would match the “lio” token generated by Atlas Search for this document, so it would be returned in the query results.

Note that the nGram tokenization strategy significantly increases the number of tokens generated and stored in your Atlas Search index, which in turn increases the size of the index.
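As a rough illustration of both points (nGram matching in the middle of a string, and the extra tokens it costs), here is a small Python sketch. It is a simplified model that slides over the whole lowercased string, not the real Atlas Search analyzer, and `n_grams` is a hypothetical helper name.

```python
def n_grams(text, min_grams=2, max_grams=7):
    """Simplified model: every substring of length min_grams..max_grams.

    Assumption: grams are taken over the whole lowercased string.
    This is not the real Atlas Search analyzer.
    """
    s = text.lower()
    grams = []
    for start in range(len(s)):
        for n in range(min_grams, max_grams + 1):
            if start + n <= len(s):
                grams.append(s[start:start + n])
    return grams


delio = n_grams("Delio Valdez")

# "lio" now appears as a gram even though it is not at the start.
print("lio" in delio)  # True

# In this model, left-anchored grams alone would give one token per
# length from 2 to 7, i.e. 6 tokens; the sliding window gives far more.
edge_count = 7 - 2 + 1
print(len(delio), "grams vs", edge_count)  # 51 grams vs 6
```

In this simplified model, "lio" also appears among the grams of "Lionel Messi", since every left-anchored gram is also one of the sliding-window grams.
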

Yes yes, I already know how edgeGram and nGram work. What I need is a combination of both, that’s what I’m asking for.

When I search for "lio", I need both documents where author is Lionel Messi and documents where author is Delio Valdez to be retrieved as results.

And the problem is that when using edgeGram I only get results for Lionel Messi, and when using nGram only for Delio Valdez.

I think wildcards with a keyword analyzer would probably be a better approach for this use case.