Different scores depending on partial match position

German_Medaglia · March 7, 2023, 4:51pm

Let’s say I have an author field and I want to return any document partially containing the search input as a match.

So, for example, if I have author = Lionel Messi, either "Lio", "Me" or "ssi" should return documents having that value.

Now, suppose I search "Lio", and there are also some documents where author = Delio Valdez. I want those to be returned also (“lio” is a partial match), but the ones with author = Lionel Messi should have a higher score in this case, given that the match is at the beginning of the string.

What would be the best way to accomplish this in terms of index definition and search configuration?

amyjian · March 7, 2023, 4:56pm

Hi @German_Medaglia ! Have you seen the Partial Match tutorial? I would recommend looking into the autocomplete field mapping and operator.

German_Medaglia · March 7, 2023, 5:27pm

Hi @amyjian! Thanks for your quick reply. Yes, I’ve seen that tutorial and also thought autocomplete was the best option here. But I couldn’t figure out how to specify different tokenizations for a single field and assign a different score boost depending on the position of the match.

For example, if I have this in my index definition:

    "author": {
      "analyzer": "lucene.standard",
      "foldDiacritics": false,
      "maxGrams": 7,
      "minGrams": 2,
      "tokenization": "edgeGram",
      "type": "autocomplete"
    }

then when searching "lio", I will get documents with author = Lionel Messi but not those with author = Delio Valdez.

amyjian · March 7, 2023, 8:25pm

Since you are using the edgeGram tokenization strategy, Atlas Search creates tokens from your documents from left-to-right, with a minimum of 2 characters and a maximum of 7 characters.

For “Lionel Messi”, the token outputs would be: [li, lio, lion, lione, lionel, lionel[SPACE]]. Since the search term “lio” matches one of the token outputs, the document with author = Lionel Messi is returned.
Similarly, “Delio Valdez” will be tokenized from left-to-right to generate the following output tokens: [de, del, deli, delio, delio[SPACE], delio v]. Since the search term “lio” does not match any of the output tokens, the document with author = Delio Valdez is not returned.

To achieve the experience you are describing, you can use the nGram tokenization strategy, which would create the following tokens for “Delio Valdez”: [de, del, deli, delio, delio[SPACE], delio v, el, eli, elio, elio[SPACE], elio v, elio va, li, lio, lio[SPACE], lio v, lio va, lio val, io, io[SPACE], ..., va, val, vald, valde, valdez, al, ald, ..., ld, lde, ..., de, dez, ez]. As you can see, a search for “lio” would match the “lio” token generated by Atlas Search for this document, so it would be returned in the query results.

Note that the nGram tokenization strategy significantly increases the number of tokens generated and stored in your Atlas Search index, which in turn increases the size of the index.
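To make the difference concrete, here is a minimal Python sketch of the two strategies as described above. It is a simplified model, not Atlas Search's actual implementation: the function names `edge_grams` and `n_grams` are made up for illustration, and the only analysis step it mimics is lowercasing.

```python
def edge_grams(text, min_grams=2, max_grams=7):
    """Prefixes of the lowercased text, min_grams to max_grams characters long."""
    text = text.lower()
    upper = min(max_grams, len(text))
    return [text[:n] for n in range(min_grams, upper + 1)]


def n_grams(text, min_grams=2, max_grams=7):
    """Every substring of the lowercased text with a length in min_grams..max_grams."""
    text = text.lower()
    tokens = []
    for start in range(len(text)):
        for n in range(min_grams, max_grams + 1):
            if start + n <= len(text):
                tokens.append(text[start:start + n])
    return tokens


for author in ("Lionel Messi", "Delio Valdez"):
    # Columns: author, "lio" found among edgeGram tokens, "lio" found among nGram tokens
    print(author, "lio" in edge_grams(author), "lio" in n_grams(author))
# Lionel Messi True True
# Delio Valdez False True
```

Comparing `len(edge_grams(author))` with `len(n_grams(author))` for the same string also gives a rough feel for how much larger an nGram index can get.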

German_Medaglia · March 9, 2023, 10:51am

Yes yes, I already know how edgeGram and nGram work. What I need is a combination of both, that’s what I’m asking for.

When I search for "lio", I need both documents where author is Lionel Messi and documents where author is Delio Valdez to be retrieved as results.

The problem is that when using edgeGram I only get results for Lionel Messi, and when using nGram only for Delio Valdez.

I think probably wildcards with a keyword analyzer would be a better approach for this use case.

Harsh_Taliwal · July 22, 2023, 11:48am

Subject: Need Assistance with Phone Number Search in MongoDB

Hey @German_Medaglia,

I hope you’re doing well. I came across your post on the forum and it seems like we have a similar use case. I’m dealing with a “phone” field of string type in my MongoDB documents and would like to implement a search functionality for phone numbers.

For Example, I want to perform a search for the digits “987” and retrieve all the documents that contain this sequence. However, I also want to rank the results in a way that gives higher scores to documents where “987” appears at the beginning, followed by occurrences at the end, and then finally occurrences in the middle.

I’ve been trying to implement this functionality, but I haven’t succeeded so far. Could you please share any insights or solutions you might have for achieving this kind of search behavior in MongoDB?

Thank you in advance for your help!

Best regards,

amyjian · August 1, 2023, 7:49pm

Hi @Harsh_Taliwal, sorry for the delayed response! Can you try adding a “string” field mapping to “phone”? The field mapping for “phone” would look something like this:

{
  "phone": [
    {
      "type": "autocomplete",
      "tokenization": "nGram",
      "minGrams": 2,
      "maxGrams": 5
    },
    {
      "type": "string"
    }
  ]
}
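The reply above does not spell out the query side, so the following is only one possible way to use such a mapping, written as a Python helper that builds the aggregation pipeline. It rests on several assumptions: the helper name `build_phone_search`, the index name `default`, and the score values are hypothetical; the `phone` values are assumed to be digit-only strings, so the `string` mapping indexes each one as a single term; and the query is assumed to contain no `*` or `?` characters. The `autocomplete` clause matches the digits anywhere, while the two `wildcard` clauses add extra score for a match at the start or at the end. It has not been run against a cluster, and the weights would likely need tuning.

```python
def build_phone_search(query, index_name="default"):
    """Build a $search pipeline ranking prefix > suffix > middle matches (sketch)."""
    return [
        {
            "$search": {
                "index": index_name,
                "compound": {
                    "should": [
                        # Matches the digits anywhere, via the nGram autocomplete mapping.
                        {
                            "autocomplete": {
                                "query": query,
                                "path": "phone",
                                "score": {"constant": {"value": 1}},
                            }
                        },
                        # Extra score when the phone number starts with the digits.
                        {
                            "wildcard": {
                                "query": f"{query}*",
                                "path": "phone",
                                "allowAnalyzedField": True,
                                "score": {"constant": {"value": 3}},
                            }
                        },
                        # Smaller extra score when the phone number ends with the digits.
                        {
                            "wildcard": {
                                "query": f"*{query}",
                                "path": "phone",
                                "allowAnalyzedField": True,
                                "score": {"constant": {"value": 2}},
                            }
                        },
                    ],
                    "minimumShouldMatch": 1,
                },
            }
        },
        # Expose the score so the ordering can be inspected.
        {"$project": {"phone": 1, "score": {"$meta": "searchScore"}}},
    ]


pipeline = build_phone_search("987")
```

With a driver such as PyMongo, the result would be passed to `collection.aggregate(pipeline)`. Constant scores are used here so that the ordering depends only on which clauses match, not on relevance scoring.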