Another Special Character Question

Mark_Mann · August 28, 2024, 6:59pm

Apologies if this has already been answered. I did search and found similar topics, but Atlas Search behavior seems to depend heavily on the index settings.

I have a basic collection with a search index on the property “Name”, which is a string:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "Name": {
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "whitespaceLower",
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      }
    }
  ],
  "storedSource": true
}

I used a custom whitespace-plus-lowercase analyzer because I don’t really want “fuzzy” (human/phonetic/etc.) matches, I do not care about space, and, generally speaking, whitespace tokenization works fairly well for the type of data I am searching.
A common example would be searching for an invoice number. People know exactly what they are typing and looking for.
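
To make my expectation concrete, here is a rough plain-JavaScript approximation of what I understand a whitespace tokenizer followed by a lowercase filter to do. This is not Atlas Search or Lucene code, and the function name `whitespaceLower` is only borrowed from my analyzer name for illustration; the real analyzer may differ in details.

```javascript
// Rough approximation of a whitespace tokenizer plus a lowercase token filter.
// Plain JavaScript for illustration only; not the actual Lucene implementation.
function whitespaceLower(text) {
  return text
    .split(/\s+/)                          // split on runs of whitespace only
    .filter((token) => token.length > 0)   // drop empty pieces from leading/trailing spaces
    .map((token) => token.toLowerCase());  // lowercase each token
}

// The invoice-style value contains no whitespace, so under this approximation
// it stays a single token with the hyphens preserved and the letters lowercased.
const tokens = whitespaceLower("SS-SC-240827-43824320");
console.log(tokens);
```

If this approximation is close, I would expect the whole value to be indexed as one lowercased term rather than being split at the “-” characters.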

Anyway, I have the following document (extra fields removed):
{
  "_id": {
    "$oid": "66cdc230a1024453b04e0097"
  },
  "Name": "SS-SC-240827-43824320"
}

The following search works:

[
  {
    $search: {
      index: "TextSearch",
      compound: {
        must: [
          {
            compound: {
              should: [
                {
                  regex: {
                    query:
                      ".*43824320.*",
                    path: "Name",
                    allowAnalyzedField: true
                  }
                }
              ],
              minimumShouldMatch: 1
            }
          }
        ]
      },
      highlight: {
        path: [
          "Name"
        ]
      },
      returnStoredSource: true
    }
  }
]

However, when I try to search for the entire “phrase”, “SS-SC-240827-43824320”, the document is not found.

I have tried escaping the “-” character, replacing it with a match-anything regex (.*), etc.:

SS-SC-240827-43824320
SS.*SC.*240827.*43824320
SS\-SC\-240827\-43824320

I can only assume the analyzer is doing something funky with the “-”, but I cannot figure out what. The documentation is not clear on this: it just states that the whitespace tokenizer separates tokens based on whitespace and leaves the case untouched (hence my custom analyzer to force lowercase).
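
To narrow this down, here is a rough sketch of the two scenarios I can think of. It assumes, from my reading of the docs and possibly wrongly, that the regex query string is not analyzed and has to match a whole indexed term. Both tokenizers are plain-JavaScript approximations with hypothetical names, not the real Lucene analyzers: `whitespaceLower` mimics my custom analyzer, and `standardLike` is a guess at what a default analyzer that splits on punctuation would produce if my custom analyzer is not actually being applied to “Name”. The lowercase pattern at the end is one I am adding here for comparison.

```javascript
// Approximation of my custom analyzer: split on whitespace, then lowercase.
function whitespaceLower(text) {
  return text.split(/\s+/).filter(Boolean).map((t) => t.toLowerCase());
}

// Hypothetical stand-in for a default analyzer: split on anything that is
// not a letter or digit, then lowercase. The real analyzer is more complex.
function standardLike(text) {
  return text.split(/[^A-Za-z0-9]+/).filter(Boolean).map((t) => t.toLowerCase());
}

// Assumption: the pattern must match an entire term, case-sensitively,
// so the JavaScript regex is anchored at both ends.
function matchesAnyToken(pattern, tokens) {
  const re = new RegExp("^(?:" + pattern + ")$");
  return tokens.some((t) => re.test(t));
}

const name = "SS-SC-240827-43824320";
const patterns = [
  ".*43824320.*",             // the query that works for me
  "SS-SC-240827-43824320",    // full value, as typed
  "SS.*SC.*240827.*43824320", // hyphens replaced with .*
  "ss-sc-240827-43824320",    // lowercase variant, added for comparison
];

for (const [label, tokenize] of [
  ["whitespaceLower", whitespaceLower],
  ["standardLike", standardLike],
]) {
  const tokens = tokenize(name);
  for (const p of patterns) {
    console.log(label, JSON.stringify(p), matchesAnyToken(p, tokens));
  }
}
```

Under these assumptions, the uppercase patterns fail in both scenarios, so the “-” may not be the only factor. Only the lowercase full-value pattern would tell the two scenarios apart, since it can match a single unsplit term but not any of the split pieces.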

I could use some guidance here, or a link to a similar case.