Tokenizer / Token filter to add space between non-spaced Atlas search input

viraj_thakrar · May 26, 2024, 10:40am

I am configuring a search index for one of our projects and have been creating custom analysers with different combinations of character filters, tokenizers, and token filters.

I have a specific case that seems almost impossible to achieve, either at the analyser level with the set of tokenizers and token filters available on Atlas, or at the aggregation level while keeping $search as the first stage (we have no other option at the moment because, I believe, Atlas doesn’t support any other stage before the $search stage).

Case:

Stored data:

There are strings stored as “xxxx xxxx 1.5 mL xxxxxxx xxxxxxx”, “xxxx xxxx 1.0 cc xxxxxxx xxxxxxx”.

User input:

“1.5mL”, “1.0cc”, “1cc” (note that the input has no space, unlike what is stored in the data)

Things tried:

wordDelimiterGraph

wordDelimiterGraph configured to split words on case changes and numbers, and to concatenate words, numbers, and all parts

This returns results, but documents containing 15ml rank above those containing 1.5 ml
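To illustrate why 15ml can match a query for 1.5mL with this setup, here is a rough Python model of delimiter-style splitting. It is only a sketch under the assumption that the filter splits on non-alphanumeric characters, digit/letter boundaries, and case changes, and that the concatenate-all option joins every part; the function name is hypothetical and this is not the Atlas implementation.

```python
import re

def word_delimiter_sketch(text):
    """Simplified model of delimiter splitting plus 'concatenate all'.

    Assumption: parts are runs of digits, lowercase runs (optionally led by
    one capital), or runs of capitals; everything else is a delimiter.
    """
    parts = re.findall(r"[0-9]+|[A-Z]?[a-z]+|[A-Z]+", text)
    tokens = [part.lower() for part in parts]
    concatenated = "".join(tokens)
    return tokens, concatenated

# The decimal point acts as a delimiter, so it disappears from the
# concatenated token and "1.5mL" collapses to the same form as "15ml".
print(word_delimiter_sketch("1.5mL"))
print(word_delimiter_sketch("15ml"))
```

Under this model both inputs produce the same concatenated token, which would be consistent with 15ml documents scoring well for a 1.5mL query.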

regexSplit

Split on alphanumeric boundaries

This returns no results at all

I used the custom analyser as the search analyser only, not as the index analyser

When I look at the explain output, there is no term or phrase query for [“1.5 ml”]

The only approach I can see right now is to generate the combinations at the driver level and pass them to the query.

I want to manage this at the analyser level or at the aggregation level before resorting to generating combinations at the driver level.
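For reference, the driver-level fallback could look roughly like the Python sketch below. It is only an illustration: the helper names, the field name "description", and the index name "default" are assumptions, and the variant rules (insert a space, toggle a trailing ".0") cover just the inputs listed above.

```python
import re

# A number (optionally with decimals) followed by a unit, with or without a space.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)")

def quantity_variants(user_input):
    """Return spaced and unspaced spellings of a quantity such as '1.5mL'."""
    match = QUANTITY.fullmatch(user_input.strip())
    if match is None:
        return [user_input]
    number, unit = match.groups()
    numbers = [number]
    if "." not in number:
        numbers.append(number + ".0")   # "1" -> also try "1.0"
    elif number.endswith(".0"):
        numbers.append(number[:-2])     # "1.0" -> also try "1"
    variants = []
    for n in numbers:
        variants.append(f"{n} {unit}")
        variants.append(f"{n}{unit}")
    return variants

def build_search_stage(user_input, path="description", index="default"):
    """Build a $search stage with one phrase clause per variant (hypothetical field/index)."""
    clauses = [{"phrase": {"query": v, "path": path}} for v in quantity_variants(user_input)]
    return {"$search": {"index": index,
                        "compound": {"should": clauses, "minimumShouldMatch": 1}}}

print(quantity_variants("1.5mL"))
print(build_search_stage("1cc"))
```

The resulting stage can be passed as the first element of the pipeline given to the driver's aggregate call.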

Any thoughts / ideas / inputs would be helpful.

If we don’t find a way and have to handle this case in the driver, I think the feedback request for allowing stages before the $search stage would really help here.

Aasawari · May 27, 2024, 7:49pm

Hi @viraj_thakrar, it’s good to see you at the forum.

Could you help me clarify the above statement? Does this mean you would like to see the results in ascending order?

If I understand correctly, the user input mentioned describes a quantity. Would it be possible to query by making use of vector search?

$search must be the first stage of the pipeline because, under the covers, it instructs the system to execute a text search operation against an internally synchronised Lucene full-text index.
If you would like to have this feature, I would recommend raising the request through the MongoDB Feedback Engine.

Regards
Aasawari

viraj_thakrar · May 30, 2024, 7:13am

Hi @Aasawari ,

Thank you for taking a look at this scenario.

The order of results should be the default order returned by Atlas Search.

The main problem is this:

Data stored in the database: “xxxxxx 1.5 ml liquid xxxxxx xxxxx”

User searches for: “1.5mL”

Atlas returns results in the following order for the above search query:

xxxxx 15ml liquid xxxxxxxx
xxxxx xxxxxxxx 15 ml liquid
1.5 ml liquid xxxxx xxxxxx
xxxxx xxxxx 1.5 ml water xxxxx

Ideally, the 3rd and 4th results should rank above the 1st and 2nd, because they match the user’s input of 1.5mL. That’s the main goal I am trying to achieve.

Things tried in addition to what I mentioned in my post:

I created a custom analyser with edgeGram (min gram 3, max gram 15) and used it only as the search analyser for this field.
Surprisingly, it does not return a single result, even though edgeGram should have generated tokens that are already present in the database. I verified that by searching for the individual tokens generated by edgeGram.
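To make the tokens in question concrete, here is a minimal Python sketch of edge n-gram generation with the same min and max gram sizes. It is a plain model, not the Atlas implementation, and it assumes the whole query string is treated as a single input; the function name is hypothetical.

```python
def edge_grams(text, min_gram=3, max_gram=15):
    """Return prefixes of `text` whose length is between min_gram and max_gram."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]

# Prefixes generated for the unspaced user input.
print(edge_grams("1.5mL"))
```

Under this model, "1.5mL" yields "1.5", "1.5m", and "1.5mL". Whether any of these matches depends on the tokens the index analyser produced for the stored text, including their casing.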

Aasawari · May 30, 2024, 12:39pm

Hi @viraj_thakrar ,

Are you using scores to return the results in a specific order?
You can take a look at the documentation on scoring to order the results by their relevance to the search.
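As a rough illustration of score-based ordering, the Python sketch below builds a pipeline that boosts an exact phrase match over a general text match and projects the search score. The field name "description", the index name "default", the boost value, and the function name are assumptions for the example, not values taken from your setup.

```python
def boosted_pipeline(raw_query, spaced_query, path="description", boost=5):
    """Build an aggregation pipeline that favours the spaced phrase (hypothetical names)."""
    search_stage = {
        "$search": {
            "index": "default",
            "compound": {
                "should": [
                    # Exact phrase such as "1.5 ml" gets a higher score.
                    {"phrase": {"query": spaced_query, "path": path,
                                "score": {"boost": {"value": boost}}}},
                    # The raw user input still contributes with the default score.
                    {"text": {"query": raw_query, "path": path}},
                ]
            },
        }
    }
    # Expose the relevance score so the ordering can be inspected.
    project_stage = {"$project": {path: 1, "score": {"$meta": "searchScore"}}}
    return [search_stage, project_stage]

pipeline = boosted_pipeline("1.5mL", "1.5 ml")
print(pipeline)
```

Inspecting the projected score for each result should show whether the boosted phrase clause is moving the 1.5 ml documents up.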

Regards