Trailing spaces on text search

Katsuhiro_Mihara · September 30, 2020, 10:46am

On text search, is text with trailing spaces handled not same as text without trailing spaces?

On SQL Server, because it bases on SQL-92, text with trailing spaces hits text with no trailing space.
https://support.microsoft.com/en-us/help/316626/inf-how-sql-server-compares-strings-with-trailing-spaces

The behavior of MongoDB seems to be obvious. But I can’t find the document that describe it.

Our customer had a trouble by the behavior of SQL DB and requests us to express MongoDB’s spec clearly.

Best Regards,
Katsuhiro Mihara

Steve_Hand · October 2, 2020, 5:27pm

Hi Katsuhiro,

At least with v4.0.18, the trailing text is significant and processed as part of a contains search on a field with a text index.

Regards,
Steve

Katsuhiro_Mihara · October 5, 2020, 4:47am

Dear Steve,

Thank you for declaring the spec. I can answer to the our customer.

Best Regards,
Katsuhiro Mihara

Stennie_X · October 6, 2020, 8:18pm

Welcome to the community @Katsuhiro_Mihara!

Per the Text Search documentation, whitespace characters are not part of the search terms:

$text will tokenize the search string using whitespace and most punctuation as delimiters, and perform a logical OR of all such tokens in the search string.

This means the behaviour will be similar to SQL-92 by default, however you can achieve a stricter result using phrase matching if appropriate.

You can confirm the behaviour by setting up some test data:

db.cuppa.insert(
   [
       { _id: 1, name: "coffee"   },
       { _id: 2, name: " coffee"  },
       { _id: 3, name: "coffee "  },
       { _id: 4, name: " coffee " },
       { _id: 5, name: "tea" },
   ]
)
db.cuppa.createIndex( { name: "text" } )

Default text search behaviour

A search on coffee will match all of these example documents:

db.cuppa.find( { $text: { $search: "coffee" } } )
{ "_id" : 4, "name" : " coffee " }
{ "_id" : 3, "name" : "coffee " }
{ "_id" : 2, "name" : " coffee" }
{ "_id" : 1, "name" : "coffee" }

A search with trailing spaces will perform identically:

> db.cuppa.find( { $text: { $search: "coffee " } } )
{ "_id" : 4, "name" : " coffee " }
{ "_id" : 3, "name" : "coffee " }
{ "_id" : 2, "name" : " coffee" }
{ "_id" : 1, "name" : "coffee" }

Phrase matching

You can match a trailing space by wrapping the search term in double quotes to perform a phrase match:

> db.cuppa.find( { $text: { $search: "\"coffee \"" } } )
{ "_id" : 4, "name" : " coffee " }
{ "_id" : 3, "name" : "coffee " }

Explaining the results

If you want to understand differences in how these text search queries are processed, you can explain() the queries and look at the parsedTextQuery outcome for the winning plan:

> db.cuppa.find( { $text: { $search: "coffee " } } ).explain().queryPlanner.winningPlan.parsedTextQuery
{
	"terms" : [
		"coffe"
	],
	"negatedTerms" : [ ],
	"phrases" : [ ],
	"negatedPhrases" : [ ]
}

In this example, coffe is the stemmed version of coffee in English (according to the Snowball stemming algorithm), and whitespace characters have been removed by default.

A query with phrase matching has the same stemming outcome but adds an additional phrase match filter:

> db.cuppa.find( { $text: { $search: "\"coffee \"" } } ).explain().queryPlanner.winningPlan.parsedTextQuery
{
	"terms" : [
		"coffe"
	],
	"negatedTerms" : [ ],
	"phrases" : [
		"coffee "
	],
	"negatedPhrases" : [ ]
}

If you are performing a different type of text search (for example a regular expression match or Atlas Search), please provide a sample document, example of your search query, and the version of your MongoDB server.

Regards,
Stennie

Katsuhiro_Mihara · October 8, 2020, 6:16am

Dear Stennie,

Thanks for describing. This information is helpful.

I associate it with a Internet search engine. Even if a user enter spaces, the engine searches documents by words only.

Best Regards,
Katsuhiro Mihara