Trying to make a full text search index with nGram and ignoring diacritics

Ben_Kuhne · February 27, 2024, 10:13am

Hello everyone,

I can’t get any further with my problem and hope to get help here.

I have a collection of documents that are structured as follows:

{
    "number": 0,
    "title": "The Great Adventure",
    "author": "Alex Johnson",
    "description": "A thrilling story of a journey through uncharted territories.",
    "location": "somewhere",
    "embeddedObj": {
      "bla1":"asdf",
      "bla2": "asdf"
    },
    "embeddedArr":[
      {
        "bla3":"asdf",
        "bla4": "asdf"
      },
      {
        "bla3":"asdf",
        "bla4": "asdf"
      } 
    ]
}

For example, if you search for “lodz”, documents should be found if they contain any of the following, no matter in which field:
“lodz”
“Lodz”
“Łódź”
“asdfŁódź”
“Łódźasdf”
“fdsaŁódźasdf”
“Łódź asdf”
“asdf Łódź”
“asdf Łódź asdf”
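
The matching rule in the list above can be approximated in plain JavaScript to make the expectation concrete. This is only a sketch of the intended behaviour, not how Atlas Search implements it; `fold` and `matches` are hypothetical helper names. One detail worth noting: Ł has no Unicode decomposition, so stripping combining marks alone does not turn it into l, and the sketch maps it by hand.

```javascript
// Hypothetical helper: approximates diacritic folding plus lowercasing.
function fold(text) {
  return text
    .normalize("NFD")          // split base letters from combining marks
    .replace(/\p{M}/gu, "")    // drop the combining marks (ó -> o, ź -> z)
    .toLowerCase()
    .replace(/ł/g, "l");       // ł does not decompose, so map it explicitly
}

// Hypothetical helper: true if the folded query occurs anywhere in the folded content.
function matches(query, content) {
  return fold(content).includes(fold(query));
}

const samples = [
  "lodz", "Lodz", "Łódź", "asdfŁódź", "Łódźasdf",
  "fdsaŁódźasdf", "Łódź asdf", "asdf Łódź", "asdf Łódź asdf",
];
const results = samples.map((s) => matches("lodz", s));
console.log(results);
```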

I have no experience with MongoDB and have been reading up on the subject for a week, asking “Ask MongoDB AI”, and trying and combining many things, such as the icuNormalize charFilter, the icuFolding token filter, and the nGram tokenizer. As a beginner, though, I find it difficult to get the right pieces out of all the tutorials, AI suggestions, etc. and put them together into an index that really works.

My question is: what does the search index look like in JSON form, and what does the search query look like in JSON form?

It should also be noted that this is a first step. The actual document structure will be more complex, and once everything is up and running, 200,000 new documents will be added per day in production, with about 10 times as many document updates. Old documents (older than one week) will no longer receive any changes. After 3 months, the docs will be deleted or tagged as deleted.

So performance is important. But for my proof of concept, any solution would be enough; performance would come later.

Many thanks in advance for your support.

Aasawari · February 27, 2024, 2:25pm

Hi @Ben_Kuhne and welcome to the MongoDB community forum!!

Firstly, the right way to learn would be to go through our MongoDB University Atlas Search course to understand the basics.

If my understanding is correct, you are trying to understand how to write the index definition and the search query in order to use the Atlas Search capability.
You can follow the steps mentioned in Atlas Search Index Syntax to see how the indexes are defined and Create and Run Atlas Search Queries to understand how search queries are written.

Could you kindly share the particular scenario where you intend to utilise Atlas search with the provided sample data? Understanding your expectations regarding the Atlas search query would greatly assist in tailoring the solution to your needs.

Best Regards
Aasawari

Ben_Kuhne · March 1, 2024, 9:06am

Hi @Aasawari , thank you for your input.

I completed the MongoDB University Atlas Search course today and will check whether I handled the “search analyzer” incorrectly in my last attempts at creating a search index.

The scenario for which I need to create the search index and query is the following:

A website is to be created in which data analysts can perform full-text searches on a collection.
On the one hand, it is important that words or numbers are found even if the search input is incomplete. For example, when searching for 1234, documents containing 12345678 or 99123499 should also be found. The same applies to words. On the other hand, the search should ignore diacritics.

Ben_Kuhne · March 4, 2024, 8:27am

Happily, I have now been able to implement the search index I needed:

{
    "analyzer": "indexAnalyzer",
    "searchAnalyzer": "queryFilter",
    "mappings": {
      "dynamic": true
    },
    "analyzers": [
      {
        "charFilters": [
          {
            "type": "icuNormalize"
          }
        ],
        "name": "queryFilter",
        "tokenFilters": [],
        "tokenizer": {
          "type": "standard"
        }
      },
      {
        "charFilters": [],
        "name": "indexAnalyzer",
        "tokenFilters": [
          {
            "type": "icuFolding"
          }
        ],
        "tokenizer": {
          "maxGram": 15,
          "minGram": 2,
          "type": "nGram"
        }
      }
    ]
}
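
The thread also asked what the search itself looks like. A minimal sketch of a query against an index like the one above could use the `text` operator with a wildcard path. The index name "default", the `$limit` of 20, the helper name `buildSearchPipeline`, and the collection name `myCollection` are assumptions for illustration; whether nested fields are covered depends on the index mappings.

```javascript
// Builds a $search aggregation pipeline.
// Assumption: the search index is named "default".
function buildSearchPipeline(query, indexName = "default") {
  return [
    {
      $search: {
        index: indexName,
        text: {
          query: query,
          path: { wildcard: "*" }, // search all indexed string fields
        },
      },
    },
    { $limit: 20 }, // keep the result set small for a proof of concept
  ];
}

const pipeline = buildSearchPipeline("lodz");
console.log(JSON.stringify(pipeline, null, 2));

// In mongosh (hypothetical collection name):
// db.myCollection.aggregate(pipeline)
```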

The next step in refining this index would be to consider the performance and resource consumption with this maxGram of 15 ^^°.
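
To get a rough feel for that cost, one can count the character n-grams produced for a single value. The sketch below is a plain JavaScript approximation: the `ngrams` helper is hypothetical and ignores any internal chunking or limits of the real tokenizer. It only illustrates how the number of tokens grows with maxGram.

```javascript
// Hypothetical helper: all character n-grams of text with minGram <= n <= maxGram.
function ngrams(text, minGram, maxGram) {
  const grams = [];
  const longest = Math.min(maxGram, text.length);
  for (let n = minGram; n <= longest; n++) {
    for (let i = 0; i + n <= text.length; i++) {
      grams.push(text.slice(i, i + n));
    }
  }
  return grams;
}

const title = "The Great Adventure"; // 19 characters
const wide = ngrams(title, 2, 15).length;  // settings from the index above
const narrow = ngrams(title, 2, 5).length; // a smaller maxGram for comparison
console.log({ wide, narrow });
```

A smaller maxGram shrinks the index, but it also means that longer search terms no longer correspond to a single indexed token, so the trade-off needs testing against real queries.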

But I have a question: is it possible to also include fields of type number in this index? When I search for 1234, I would also like to search numeric fields that contain the digit sequence 1234.

Aasawari · March 4, 2024, 12:50pm

Hi @Ben_Kuhne

I am glad you have found a solution for your search query.

Now, if I understand your question correctly, you are trying to search for 1234 in the “number” field on which the index is created.
Based on the sample document shared, I created sample documents in my collection as:
Atlas atlas-xp4gev-shard-0 [primary] test> db.searchNumber.find()

[
  {
    _id: ObjectId('65e5af33def449f02ac598fd'),
    number: 1234,
    title: 'The Great Adventure'
  },
  {
    _id: ObjectId('65e5af51def449f02ac598fe'),
    number: 12788,
    title: 'The Great Adventure'
  }
]

with a search index created on the “number” field as:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "number": {
        "representation": "int64",
        "type": "number"
      }
    }
  }
}

Finally, as mentioned in the documentation on How to Index Numeric Fields, you can use the equals operator to perform the search.

For example:

Atlas atlas-xp4gev-shard-0 [primary] test> db.searchNumber.aggregate([{ '$search': { equals: { value: 1234, path: 'number' } } }])
[
  {
    _id: ObjectId('65e5af33def449f02ac598fd'),
    number: 1234,
    title: 'The Great Adventure'
  }
]
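
Note that equals matches only the exact value, so this query would not return a document whose number is 12345678 when searching for 1234. One possible workaround, sketched here as an assumption rather than an official recommendation, is to store a string copy of the number so that a text analyzer such as nGram can index it like any other string. The field name `numberText` and the helper `withNumberText` are hypothetical.

```javascript
// Update pipeline that copies the numeric field into a string field.
// Assumption: the new field is called "numberText".
const addNumberText = [
  { $set: { numberText: { $toString: "$number" } } },
];
// In mongosh: db.searchNumber.updateMany({}, addNumberText)

// Local illustration of the same transformation on a plain object.
function withNumberText(doc) {
  return { ...doc, numberText: String(doc.number) };
}

const doc = withNumberText({ number: 12345678, title: "The Great Adventure" });
const found = doc.numberText.includes("1234"); // substring check on the string copy
console.log(doc.numberText, found);
```

The string field then has to be covered by the index's string mapping, and it must be kept in sync whenever the number changes.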

Please do not hesitate to reach out in case of any further questions.

Best Regards
Aasawari
