Atlas Search + $regex

Escaping special characters does not seem to work with the Atlas Search `regex` operator. I am running this directly in the Atlas Search portal's index tester section — no code, no drivers.

This works and returns anything containing “to”:

[
  {
    $search: {
      index: "TextSearch",
      regex: {
        query: [".*to.*"],
        path: "Name",
        allowAnalyzedField : true
      }
    }
  }
]

This does not work; I would expect it to return anything containing “to?”:

[
  {
    $search: {
      index: "TextSearch",
      regex: {
        query: [".*to\\?.*"],
        path: "Name",
        allowAnalyzedField : true
      }
    }
  }
]

Actually, I may have found my answer. I believe the standard analyzer removes punctuation at index time, which includes most special regex characters — so the escaped `?` never exists in the index to be matched.
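To make that concrete, here is a rough Python sketch of what the standard analyzer does to the indexed text. This is only an approximation for illustration (the real Lucene standard analyzer uses Unicode word-break rules), but it shows why punctuation like `?` or `#` is simply not there to match against:

```python
import re

def approx_standard_analyzer(text):
    """Rough approximation of Lucene's standard analyzer:
    lowercase, then split on non-alphanumeric characters.
    Illustrative only; the real analyzer is more nuanced."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

# "to?" is indexed as just the token "to" -- the "?" is gone,
# so a regex that escapes "?" can never find it.
print(approx_standard_analyzer("to?"))      # ['to']
print(approx_standard_analyzer("PO#1234"))  # ['po', '1234']
```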

So maybe I can turn this into an advice thread on the same topic?

What should I use for a “fuzzy” search (meaning regex-style, not language or phonetic fuzziness) to fully utilize Atlas Search on a simple text field?

Example:
Data:
America Fast And Proud
International Fasteners
Heavy Bolts & Fasteners
Steadfast Industrial Manufacturing

I want a search for “fast” to return all four. The `text` operator does not work because it requires a full token match, so it misses the last three. `autocomplete` does not work because it requires “fast” to be the first four letters of a token. So I use `regex` and pad the start and end with “.*”, which seems to work.

However, what would be the appropriate analyzer and search operator to match ANY input from a user, including special characters? No “fuzzy” in the sense of swapping out characters or finding similar items — just literal matching, allowing for start and end wildcards and case insensitivity.

Example of “real-world” where special characters matter:
PO#1234
Lot-4567
Part Number 2346/2124
Heat Treat 45#999

In those cases, if someone searches “#999” or “46/21”, I want that treated literally, with a wildcard on each end.
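One way to build that query string safely is to escape the Lucene-regex metacharacters in the raw user input and then pad both ends with `.*`. A small Python sketch (the helper name and the exact escape set are my own; this assumes the field's analyzer preserves punctuation, since with the standard analyzer the punctuation is stripped at index time and escaping cannot help):

```python
# Characters with special meaning in Lucene regular expressions
# (an assumed list covering the common operators).
_LUCENE_REGEX_SPECIALS = '.?+*|{}[]()"\\#@&<>~'

def literal_wildcard_query(user_input):
    """Escape Lucene-regex metacharacters in user_input, then pad
    both ends with '.*' so the literal text matches anywhere in a
    token. Only useful if the indexed field keeps punctuation
    (e.g. keyword or whitespace analyzer)."""
    escaped = "".join(
        "\\" + ch if ch in _LUCENE_REGEX_SPECIALS else ch
        for ch in user_input
    )
    return ".*" + escaped + ".*"

print(literal_wildcard_query("#999"))   # .*\#999.*
print(literal_wildcard_query("46/21"))  # .*46/21.*
```

The returned string would then go into the `query` field of the `regex` operator, as in the pipelines above.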

Well, I created and solved my own thread :frowning: The info is out there but a bit buried…

What I needed was a custom analyzer. In my case, I wanted whitespace tokenization plus case insensitivity, which works out to the definition below. Then I just set my fields to use that analyzer. Success so far!

  "analyzers": [
    {
      "charFilters": [],
      "name": "whitespaceLower",
      "tokenFilters": [
        {
          "type": "lowercase"
        }
      ],
      "tokenizer": {
        "type": "whitespace"
      }
    }
  ]
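For anyone landing here later: the analyzer only takes effect once a field mapping references it. A full index definition using it might look roughly like the sketch below (the field name `Name` is carried over from my earlier examples; adjust to your own schema):

```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "Name": {
        "type": "string",
        "analyzer": "whitespaceLower"
      }
    }
  },
  "analyzers": [
    {
      "name": "whitespaceLower",
      "charFilters": [],
      "tokenizer": { "type": "whitespace" },
      "tokenFilters": [ { "type": "lowercase" } ]
    }
  ]
}
```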

Nice! Thanks for posting your solution @Mark_Mann - I’m sure it’ll be useful to others which encounter the same / similar use cases, especially with the examples you provided.
