uaxUrlEmail doesn't work if email is 'invalid'?

Hi,

I am using Atlas Search to index and search by email. I have the following index:

{
  "mappings": {
    "fields": {
      "email": {
        "analyzer": "basicEmailAddressAnalyzer",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "name": "basicEmailAddressAnalyzer",
      "tokenizer": {
        "type": "uaxUrlEmail"
      },
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ]
    }
  ]
}

And I use this query to search:

[
  {
    $search: {
      index: "v1",
      text: {
        query: "some.email@hotmail.ff", // --> .ff instead of .fr
        path: "email",
        fuzzy: { maxEdits: 2 }
      }
    }
  }
]

Now, if I have this email in my database: some.email@hotmail.fr, it won't be returned in the search results, even though the query has only 1 typo. But if I query for some.email@hotmail.es, it works, even though that query has 2 typos.
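To check the typo counts mentioned above, here is a minimal sketch that computes the plain Levenshtein distance between each query and the stored address. The `levenshtein` helper is a hypothetical illustration and not part of Atlas Search. It assumes the whole address is compared as one string, and it says nothing about how Atlas Search tokenizes the query.

```javascript
// Classic dynamic-programming Levenshtein distance
// (insertions, deletions, and substitutions each cost 1).
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

const stored = "some.email@hotmail.fr";
const distFf = levenshtein("some.email@hotmail.ff", stored);
const distEs = levenshtein("some.email@hotmail.es", stored);
console.log(distFf, distEs); // prints: 1 2
```

Both distances are within maxEdits: 2, so the edit count alone would not explain why the .ff query misses while the .es query matches.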

It looks like Atlas Search does not return results if it thinks the email is not valid. Is this a Lucene thing? It seems weird to me to allow typos and then filter out 'invalid' emails. I put 'invalid' in quotes because it is almost impossible to validate an email with 100% certainty, so why not relax the rules here?

Hi @pscl and welcome to the community forum!!

I tried to replicate the issue by generating some sample data like:

[
  {
    "_id": 1,
    "email": "john.doe@example.com"
  },
  {
    "_id": 2,
    "email": "jane.smith@example.com"
  },
  {
    "_id": 3,
    "email": "bob.jones@example.com"
  },
  {
    "_id": 4,
    "email": "alice.wonder@example.com"
  }
]

and created the index as above.

Now when I try:

  1. bob.jones@example.com
  2. bob.jone@example.com
  3. bob.jone@example.cm

I get the document with _id: 3. But when I use bob.jone@examle.cm, it does not return the document, as the query has more than two typos.

To understand your case better, could you share some sample data that I can test the index against?

Also, please let me know which MongoDB version you are using.

Best Regards
Aasawari

But I specifically mentioned replacing .fr with .ff, not .com with .cm.

Hi @pscl

Thank you for specifying.

As mentioned in the documentation for the uaxUrlEmail tokenizer:

we recommend using uaxUrlEmail tokenizer only when the indexed field value includes URLs and email addresses. For fields that don’t include URLs or email addresses, use the standard tokeniser to create tokens based on word break rules.

In my examples above, the email IDs were valid, so the tokens were formed differently. In your case, the emails are not valid, so the fuzzy search cannot match against the invalid addresses.
The recommendation is therefore to use the standard tokenizer rather than uaxUrlEmail for email addresses like these.
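
As a rough sketch of that recommendation, the index from the question could be redefined with the standard tokenizer. The analyzer name `standardEmailAnalyzer` is made up for illustration, and the mapping assumes `email` is a top-level string field. This has not been verified against the data in this thread.

```javascript
// Hypothetical index definition: same shape as the index in the question,
// but with the "standard" tokenizer in place of "uaxUrlEmail".
const indexDefinition = {
  mappings: {
    fields: {
      // Assumption: "email" is a top-level string field.
      email: { analyzer: "standardEmailAnalyzer", type: "string" }
    }
  },
  analyzers: [
    {
      name: "standardEmailAnalyzer", // illustrative name, not from the thread
      tokenizer: { type: "standard" },
      tokenFilters: [{ type: "lowercase" }]
    }
  ]
};

// Print the definition as JSON, e.g. to paste into the Atlas index editor.
console.log(JSON.stringify(indexDefinition, null, 2));
```

With the standard tokenizer, an address is presumably split into several tokens based on word break rules, so the fuzzy match would apply per token rather than to the address as a whole.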

I hope that answers the query. Please reach out in case of further questions.

Best Regards
Aasawari

The emails stored in the database are valid. It is only the query input that has a small typo. I would have expected the correct results to be returned, since the database contains the valid emails. But okay.