Search including special characters in MongoDB Atlas

Hello!

I am facing an issue when I try to search for several words including a special character (the section sign “§”). Example: AB § 32.
I want all of the words “AB”, “32” and the symbol “§” to be included in the found documents.
In some cases the document is found, in others it is not.
If my document contains the following text, then the search finds it:
Lagrum: 32 § 1 mom. första stycket a) kommunalskattelagen (1928:370) AB

But if the document contains this text, then the search doesn’t find it:
Lagrum: 32 § 1 mom. första stycket AB

For the symbol “§” I use the UTF-8 encoding “\xc2\xa7”.

The index uses the “lucene.swedish” analyzer.

      "Content": [
        {
          "analyzer": "lucene.swedish",
          "minGrams": 4,
          "tokenization": "nGram",
          "type": "autocomplete"
        },
        {
          "analyzer": "lucene.swedish",
          "type": "string"
        }
      ]

The query looks like this:

{
	"index": "test_index",
	"compound": {
		"filter": [
			{
				"text": {
					"query": [
						"111111111111"
					],
					"path": "ProductId"
				}
			}
		],
		"must": [
			{
				"autocomplete": {
					"query": [
						"AB"
					],
					"path": "Content"
				}
			},
			{
				"autocomplete": {
					"query": [
					"\xc2\xa7"
					],
					"path": "Content"
				}
			},
			{
				"autocomplete": {
					"query": [
						"32"
					],
					"path": "Content"
				}
			}
		]
	},
	"count": {
		"type": "lowerBound",
		"threshold": 500
	}
}

The question is: what is wrong with the search, and how can I make it work?

The first issue with this field definition for Content is that the autocomplete definition should use edgeGram rather than nGram; edgeGram is almost always the right choice for left-to-right languages that respect whitespace. Please also add a maxGrams value.
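For illustration, the autocomplete entry could look like the sketch below. The maxGrams value of 15 is only an example, not a prescribed setting; tune both gram sizes to the shortest and longest prefixes you expect users to type:

```json
"Content": [
  {
    "analyzer": "lucene.swedish",
    "type": "autocomplete",
    "tokenization": "edgeGram",
    "minGrams": 2,
    "maxGrams": 15
  },
  {
    "analyzer": "lucene.swedish",
    "type": "string"
  }
]
```

With edgeGram, grams are built from the left edge of each token, so a short query such as “AB” can match as long as minGrams is not larger than the query length.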

If you want to understand how the Lucene analysis works here, you can try this tool for understanding Atlas Search analysis. It’s not maintained by engineering but by a different team, so it could disappear and has no official support. With it, you will discover that the § symbol is stripped out as non-essential to search relevance by the Swedish analyzer. If you need to preserve the symbol, you need to index that field with the keyword or whitespace analyzer. The other option is a custom analyzer.

Let me know if any of these options work for you!


Hello!

I tried edgeGram, but it doesn’t work for us either.
In our project we would like to search Swedish text using the autocomplete operator with nGram tokenization, since we want to match at the beginning, in the middle, and at the end of a word (as mentioned here: https://www.mongodb.com/docs/atlas/atlas-search/autocomplete/). We also want special characters to be included in the found documents. Maybe you could give us an example of what a custom analyzer could look like for us?

Focusing only on the Content field, here is an index definition that should work for your requirements. The docs are here. Let me know if this works for you.

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "Content": [
        {
          "type": "autocomplete",
          "tokenization": "nGram",
          "minGrams": 4,
          "maxGrams": 7,
          "foldDiacritics": false,
          "analyzer": "lucene.whitespace"
        },
        {
          "analyzer": "lucene.swedish",
          "type": "string"
        }
      ]
    }
  }
}
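Since you asked about a custom analyzer, here is a sketch of how one could look. The analyzer name swedishKeepSymbols and the gram sizes are my own illustrative choices, not required values: the whitespace tokenizer keeps “§” as its own token instead of stripping it, and the lowercase token filter makes matching case-insensitive. Note that the autocomplete type places some restrictions on which custom analyzers it accepts, so verify this against the docs for your cluster version:

```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "Content": [
        {
          "type": "autocomplete",
          "tokenization": "nGram",
          "minGrams": 2,
          "maxGrams": 7,
          "foldDiacritics": false,
          "analyzer": "swedishKeepSymbols"
        }
      ]
    }
  },
  "analyzers": [
    {
      "name": "swedishKeepSymbols",
      "tokenizer": { "type": "whitespace" },
      "tokenFilters": [
        { "type": "lowercase" }
      ]
    }
  ]
}
```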

I tried it, but it doesn’t work for us either. No documents were found. Even if I don’t use a special character in the search, I get an empty result.

What specifically doesn’t work? If you search for AB and have minGrams: 4, you have not reached the minimum number of characters. Could you share the documents, query, and index definition?
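To make that concrete: with nGram tokenization and minGrams: 4, a two-character query such as “AB” produces no grams at all, so it can never match anything. A sketch of an autocomplete entry that would accept short queries (the gram sizes here are illustrative, not prescribed):

```json
{
  "type": "autocomplete",
  "tokenization": "nGram",
  "minGrams": 2,
  "maxGrams": 7,
  "foldDiacritics": false,
  "analyzer": "lucene.whitespace"
}
```

After changing gram sizes you need to rebuild the index before re-running the query.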