Text Index Properties on Self-Managed Deployments

This page describes the behavior of version 3 text indexes.

Text indexes are case insensitive: they do not distinguish between uppercase and lowercase characters, such as e and E.

Text indexes support the following case foldings, as specified in the Unicode 8.0 Character Database Case Folding:

  • Common C

  • Simple S

  • Special T for Turkish languages

These case foldings extend the case insensitivity of the text index to characters with diacritics, such as é and É, and to characters from non-Latin alphabets, such as И and и in the Cyrillic alphabet.

Previous text index versions are case insensitive only for Latin letters without diacritics (A-Z and a-z). They treat all other characters as distinct.
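The difference between the two behaviors can be sketched in plain JavaScript. This is only an approximation: `foldCase` and `legacyFoldCase` are hypothetical helper names, and `String.prototype.toLowerCase()` stands in for Unicode case folding, so the sketch illustrates the idea rather than the server's implementation.

```javascript
// Approximation of version 3 behavior: lowercase every cased character.
// toLowerCase() is a stand-in for Unicode case folding, not MongoDB's code.
function foldCase(term) {
  return term.toLowerCase();
}

// Approximation of earlier versions: fold only unaccented Latin letters A-Z.
function legacyFoldCase(term) {
  return term.replace(/[A-Z]/g, (ch) => ch.toLowerCase());
}

console.log(foldCase("É") === foldCase("é"));             // true
console.log(foldCase("И") === foldCase("и"));             // true
console.log(legacyFoldCase("É") === legacyFoldCase("é")); // false
```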

Text indexes are diacritic insensitive. The text index does not distinguish between characters that contain diacritical marks and their non-marked counterparts, such as é, ê, and e. More specifically, the text index strips the markings categorized as diacritics in the Unicode 8.0 Character Database Prop List.

Previous versions of the text index treat characters with diacritics as distinct.
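As a rough illustration of diacritic stripping, the sketch below decomposes each character and removes code points that carry the Unicode `Diacritic` property. The helper name `stripDiacritics` is hypothetical, and the JavaScript engine uses a newer Unicode version than 8.0, so results may differ from the server in edge cases.

```javascript
// Decompose each character (NFD), then drop code points that carry the
// Unicode Diacritic property. This mimics the idea described above; it is
// not MongoDB's implementation.
function stripDiacritics(term) {
  return term.normalize("NFD").replace(/\p{Diacritic}/gu, "");
}

console.log(stripDiacritics("é")); // e
console.log(stripDiacritics("ê")); // e
console.log(stripDiacritics("e")); // e
```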

For tokenization, text indexes use the delimiters categorized under Dash, Hyphen, Pattern_Syntax, Quotation_Mark, Terminal_Punctuation, and White_Space in the Unicode 8.0 Character Database Prop List.

For example, in the phrase Il a dit qu'il «était le meilleur joueur du monde», the quotation marks («, ») and spaces are delimiters.

Previous versions of the index treat « as part of the term «était and » as part of the term monde».
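The delimiter classes can be approximated with Unicode property escapes in a JavaScript regular expression. This is a sketch under stated assumptions: `tokenize` is a hypothetical helper, the `Hyphen` class is omitted because JavaScript does not expose that property, and the whitespace-only split is just a rough stand-in for the older behavior.

```javascript
// Delimiter classes named above, expressed as Unicode property escapes.
// Assumption: Hyphen is left out because JavaScript regular expressions do
// not expose that property; Dash covers the common hyphen characters.
const DELIMITERS =
  /[\p{Dash}\p{Pattern_Syntax}\p{Quotation_Mark}\p{Terminal_Punctuation}\p{White_Space}]+/u;

function tokenize(text) {
  return text.split(DELIMITERS).filter((term) => term.length > 0);
}

const phrase = "Il a dit qu'il «était le meilleur joueur du monde»";

// The guillemets act as delimiters, so the bare terms appear.
const terms = tokenize(phrase);
console.log(terms.includes("était"), terms.includes("monde")); // true true

// Rough stand-in for earlier versions: the guillemets stay attached.
const legacyTerms = phrase.split(/\s+/);
console.log(legacyTerms.includes("«était"), legacyTerms.includes("monde»")); // true true
```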

Text indexes tokenize and stem the terms in the indexed fields for the index entries. The index uses simple language-specific suffix stemming. For each document in the collection, the text index stores one index entry for each unique stemmed term in each indexed field.
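The rule of one entry per unique stemmed term per indexed field can be sketched as follows. Everything here is illustrative: `toyStem` and `indexEntries` are hypothetical helpers, the stemmer only strips a trailing "s", and the tokenizer is simplified to splitting on anything that is not a letter or digit.

```javascript
// Hypothetical toy stemmer: strips one trailing "s". MongoDB uses
// language-specific suffix stemming; this only mimics the idea.
function toyStem(term) {
  return term.replace(/s$/, "");
}

// Build one entry per unique stemmed term for each indexed field.
function indexEntries(doc, fields) {
  const entries = [];
  for (const field of fields) {
    const value = doc[field];
    if (typeof value !== "string") continue; // sketch handles strings only
    const stems = new Set(
      value
        .toLowerCase()
        .split(/[^\p{L}\p{N}]+/u) // simplified tokenizer for this sketch
        .filter((term) => term.length > 0)
        .map(toyStem)
    );
    for (const stem of stems) entries.push({ field, term: stem });
  }
  return entries;
}

const doc = { title: "Cats chase cats", body: "Dogs chase cats" };
console.log(indexEntries(doc, ["title", "body"]).length); // 5
```

In this sketch, "Cats" and "cats" in the title collapse to a single stem, so the title contributes two entries and the body three.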

MongoDB supports text search for various languages. Text indexes use simple language-specific suffix stemming. Text indexes also drop language-specific stop words, such as "the", "an", "a", and "and" in English. For a list of the supported languages, see Text Search Languages on Self-Managed Deployments.
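Stop-word removal can be pictured as a simple filter over the tokenized terms. The sketch below uses only the four English words named above as a sample; `dropStopWords` and the sample set are hypothetical, and the server's actual stop-word lists are not reproduced here.

```javascript
// Sample of English stop words taken from the examples above.
// Assumption: this is an illustrative subset, not the server's list.
const ENGLISH_STOP_WORDS_SAMPLE = new Set(["the", "an", "a", "and"]);

// Keep only the terms that are not stop words (compared case-insensitively).
function dropStopWords(terms) {
  return terms.filter(
    (term) => !ENGLISH_STOP_WORDS_SAMPLE.has(term.toLowerCase())
  );
}

console.log(dropStopWords(["The", "cat", "and", "a", "dog"])); // [ 'cat', 'dog' ]
```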

To specify a language for the text index, see Specify the Default Language for a Text Index on Self-Managed Deployments.
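As a minimal sketch, a default language is set through the `default_language` option of `createIndex()`. The wrapper function, the `quotes` collection, and the `content` field are hypothetical names chosen for illustration.

```javascript
// Creates a text index whose default language is Spanish.
// `collection` is any object exposing createIndex(), for example
// db.quotes in mongosh; "quotes" and "content" are hypothetical names.
function createSpanishTextIndex(collection) {
  return collection.createIndex(
    { content: "text" },
    { default_language: "spanish" }
  );
}

// In mongosh: createSpanishTextIndex(db.quotes)
```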

Text indexes are always sparse. When you create a text index, MongoDB ignores the sparse option.

If an existing or newly inserted document lacks a text index field (or the field is null or an empty array), MongoDB does not add a text index entry for the document.
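The rule above can be expressed as a small predicate. This is a sketch for a single indexed field only: `contributesToTextIndex` is a hypothetical helper, and it models just the three cases named above (missing field, null, empty array), not other situations such as non-string values.

```javascript
// Returns false when the document would get no text index entry for
// `field` under the rule above: the field is missing, null, or an
// empty array. Other cases are not modeled in this sketch.
function contributesToTextIndex(doc, field) {
  const value = doc[field];
  if (value === undefined || value === null) return false;
  if (Array.isArray(value) && value.length === 0) return false;
  return true;
}

console.log(contributesToTextIndex({ body: "coffee shop" }, "body")); // true
console.log(contributesToTextIndex({ title: "no body" }, "body"));    // false
console.log(contributesToTextIndex({ body: null }, "body"));          // false
console.log(contributesToTextIndex({ body: [] }, "body"));            // false
```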

To learn about text index restrictions, see Text Index Versions on Self-Managed Deployments.
