Search including stop words in MongoDB Atlas

Hello!

Our team is implementing search with MongoDB Atlas, and we are not getting the expected results when the query contains a stop word. The problem occurs when we try to get AND behavior via the compound operator.
By default, it seems that the text and autocomplete operators match each term in the query string separately, so typing more words returns more results. In effect this is an OR condition.
To work around that, we put each term in its own autocomplete operator, which gives us AND behavior, but this stops working when one of the terms is a stop word.
For example, we want every word to be present in the matched document, whether or not the query includes a stop word:
word1 stopWord word2

Our query looks like:

{
	"index": "test_index",
	"compound": {
		"filter": [
			{
				"text": {
					"query": [
						"111111111111"
					],
					"path": "ProductId"
				}
			}
		],
		"must": [
			{
				"autocomplete": {
					"query": [
						"word1"
					],
					"path": "fieldA"
				}
			},
			{
				"autocomplete": {
					"query": [
						"stopWord"
					],
					"path": "fieldA"
				}
			},
			{
				"autocomplete": {
					"query": [
						"word2"
					],
					"path":  "fieldA"
				}
			}
		]
	},
	"count": {
		"type": "lowerBound",
		"threshold": 500
	}
}

Expected result: documents containing “word1” and “word2” (with or without stopWord)
Actual result: no documents found

Our index uses the “lucene.swedish” analyzer:

      "Content": [
        {
          "analyzer": "lucene.swedish",
          "minGrams": 4,
          "tokenization": "nGram",
          "type": "autocomplete"
        },
        {
          "analyzer": "lucene.swedish",
          "type": "string"
        }
      ],

The question is: how do we get all documents that contain all the words, with or without a stop word?

Hi there, could you provide a concrete example or sample document we can use to try to replicate the issue ourselves?

Hello!

Yes, here is an example document:

{
  "_id": "111111111122",
  "ProductId": "111111111111",
  "Name": "Testdokument Jelena",
  "Url": "/test-portal/test-page-jelena",
  "Content": "Testdokument Jelena Vidare vill regeringen införa ändringar som medför skyldighet för Försäkringskassan och kommunerna att informera Inspektionen för vård och omsorg när en enskild kan antas bedriva verksamhet för personlig assistans utan tillstånd.",
  "Description": "Testdokument Jelena Vidare vill regeringen införa ändringar som medför skyldighet för Försäkringskassan och kommunerna att informera Inspektionen för vård och omsorg när en enskild...",
  "AccessItems": [
    "Admin"
  ],
  "FilterRoute": "test-page-jelena",
  "TypeOfContent": "page"
}

I can find this document if I search for: Testdokument kommuner
But I cannot find it if I search for any of these:
Testdokument kommuner att
Testdokument kommuner och
Testdokument kommuner för

We search the Content field using the index from the first post.

If the requirement is as written in the post, a simple index like:

{
	"mappings": {
		"dynamic": false,
		"fields": {
			"ProductId": {
				"type": "string"
			},
			"Content": [
				{
					"analyzer": "lucene.swedish",
					"minGrams": 4,
					"tokenization": "nGram",
					"type": "autocomplete"
				}
			]
		}
	}
}

Allows a search like:

compound: {
	filter: [
		{
			text: {
				query: ["111111111111"],
				path: "ProductId",
			},
		},
	],
	must: [
		{
			autocomplete: {
				query: "Testdokument kommuner <anything here>",
				path: "Content",
			},
		},
	],
},

To match the given document.
Note that the query doesn’t need to be split into separate terms beforehand.

If there are more requirements like:

  • All non-stopwords must exist in the matched text
  • Only the last word must match partially
  • etc

Please let us know, but in testing this should work for you!

Also thanks to @Alan_Reyes for helping with this one.

Thank you for the fast reply, but we split the query into separate terms to get AND behavior: we want all the words to be in the document (with or without the stop word). With your query we get OR behavior, meaning a matched document contains at least one of the words.
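
The per-term splitting described here can be sketched as a small helper. This is a minimal sketch with hypothetical names (`buildSearchStage` and its options are not part of any MongoDB API). The optional `stopWords` list is an assumption as well: it shows one possible client-side workaround, dropping stop words before the `must` clauses are built, and the three words in the example are an illustrative subset, not the list that `lucene.swedish` uses.

```javascript
// Hypothetical helper: builds one autocomplete clause per term so that
// compound.must requires every remaining term to match (AND behavior).
function buildSearchStage(input, { index, path, productId, stopWords = [] }) {
  // Terms listed in stopWords are skipped (case-insensitive comparison).
  const skip = new Set(stopWords.map((w) => w.toLowerCase()));
  const terms = input
    .trim()
    .split(/\s+/)
    .filter((t) => t.length > 0 && !skip.has(t.toLowerCase()));

  return {
    $search: {
      index,
      compound: {
        filter: [{ text: { query: productId, path: "ProductId" } }],
        must: terms.map((term) => ({ autocomplete: { query: term, path } })),
      },
    },
  };
}

const stage = buildSearchStage("Testdokument kommuner att", {
  index: "test_index",
  path: "Content",
  productId: "111111111111",
  stopWords: ["att", "och", "för"], // illustrative subset only (assumption)
});
console.log(JSON.stringify(stage, null, 2));
```

The returned object is meant to be used as the first stage of an aggregation pipeline. Leaving `stopWords` empty reproduces the original one-clause-per-term query.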

must works like an AND statement (see docs here). Does this work when we don’t take stop words into consideration? I wonder if that is specifically where the issue is. In other words, do you specifically WANT to index stop words?

As mentioned in the example above, with must we can find the document when the search contains no stop word, but when it does, the document is not found even though it contains all the words, including the stop word.
It looks like we need to index stop words. Is that possible, and how?

If you also want to match stop words, you can use the simple, whitespace, or even keyword analyzers (if you are already storing the individual words as a string).
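
As a rough illustration of that suggestion, the index from earlier in the thread could be changed to use `lucene.simple` for the autocomplete mapping. This is a sketch under assumptions and has not been tested against the data in this thread. `lucene.simple` lowercases text and splits it on non-letter characters without removing stop words, but it also drops the Swedish stemming that `lucene.swedish` provides. With `minGrams` left at 4, it is also worth verifying how terms shorter than four characters, such as "att", are handled.

```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "ProductId": {
        "type": "string"
      },
      "Content": [
        {
          "analyzer": "lucene.simple",
          "minGrams": 4,
          "tokenization": "nGram",
          "type": "autocomplete"
        }
      ]
    }
  }
}
```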