Token Filters

A token filter performs operations such as the following:

  • Stemming, which reduces related words, such as "talking", "talked", and "talks" to their root word "talk".

  • Redaction, the removal of sensitive information from public documents.

Token filters always require a type field, and some take additional options as well. They have the following syntax:

"tokenFilters": [
  {
    "type": "<token-filter-type>",
    "<additional-option>": <value>
  }
]

The following sample index definitions and queries use the sample collection named minutes. If you add the minutes collection to a database in your Atlas cluster, you can create the following sample indexes from the Atlas Search Visual Editor or JSON Editor in the Atlas UI and run the sample queries against this collection. To create these indexes, after you select your preferred configuration method in the Atlas UI, select the database and collection, and refine your index to add custom analyzers that use token filters.

Note

When you add a custom analyzer using the Visual Editor in the Atlas UI, the Atlas UI displays the following details about the analyzer in the Custom Analyzers section.

Name
Label that identifies the custom analyzer.
Used In
Fields that use the custom analyzer. Value is None if custom analyzer isn't used to analyze any fields.
Character Filters
Atlas Search character filters configured in the custom analyzer.
Tokenizer
Atlas Search tokenizer configured in the custom analyzer.
Token Filters
Atlas Search token filters configured in the custom analyzer.
Actions

Clickable icons that indicate the actions that you can perform on the custom analyzer.

  • Click to edit the custom analyzer.

  • Click to delete the custom analyzer.

The asciiFolding token filter converts alphabetic, numeric, and symbolic Unicode characters that are not in the Basic Latin Unicode block to their ASCII equivalents, if available.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be asciiFolding.
originalTokens
string
no

String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following:

  • include - include the original tokens with the converted tokens in the output of the token filter. We recommend this value if you want to support queries on both the original tokens as well as the converted forms.

  • omit - omit the original tokens and include only the converted tokens in the output of the token filter. Use this value if you want to query only on the converted forms of the original tokens.

Default: omit

The following index definition indexes the page_updated_by.first_name field in the minutes collection using a custom analyzer named asciiConverter. The custom analyzer specifies the following:

  1. Apply the standard tokenizer to create tokens based on word break rules.

  2. Apply the asciiFolding token filter to convert the field values to their ASCII equivalent.
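
The index definition itself is not reproduced in this section. A minimal sketch that is consistent with the two steps above might look like the following, written as a JavaScript object so that it can be printed as JSON. The analyzer name and field path come from the text; the dynamic: false setting and the variable name are assumptions made for illustration.

```javascript
// Sketch of an index definition for the steps above (not the original).
// Assumption: static mappings ("dynamic: false") on the nested field.
const asciiFoldingIndex = {
  mappings: {
    dynamic: false,
    fields: {
      page_updated_by: {
        type: "document",
        fields: {
          // Index first_name with the custom analyzer defined below.
          first_name: { type: "string", analyzer: "asciiConverter" }
        }
      }
    }
  },
  analyzers: [
    {
      name: "asciiConverter",
      tokenizer: { type: "standard" },           // step 1
      tokenFilters: [{ type: "asciiFolding" }]   // step 2
    }
  ]
};

// Print as JSON, for example to paste into the JSON Editor.
console.log(JSON.stringify(asciiFoldingIndex, null, 2));
```

Because originalTokens is not set in this sketch, the documented default (omit) would apply and only the converted tokens would be indexed.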

The following query searches the first_name field in the minutes collection for names using their ASCII equivalent.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "Sian",
        "path": "page_updated_by.first_name"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.last_name": 1,
      "page_updated_by.first_name": 1
    }
  }
])

[
  {
    _id: 1,
    page_updated_by: { last_name: 'AUERBACH', first_name: 'Siân' }
  }
]

Atlas Search returns the document with _id: 1 because it creates the following token (searchable term) for the page_updated_by.first_name field in the document, which it then matches to the query term Sian:

Field Name
Output Tokens
page_updated_by.first_name
Sian

The daitchMokotoffSoundex token filter creates tokens for words that sound the same based on the Daitch-Mokotoff Soundex phonetic algorithm. This filter can generate multiple encodings for each input, where each encoded token is a six-digit number.

Note

Don't use the daitchMokotoffSoundex token filter in:

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be daitchMokotoffSoundex.
originalTokens
string
no

String that specifies whether to include or omit the original tokens in the output of the token filter. Value can be one of the following:

  • include - include the original tokens with the encoded tokens in the output of the token filter. We recommend this value if you want to support queries on both the original tokens and the encoded forms.

  • omit - omit the original tokens and include only the encoded tokens in the output of the token filter. Use this value if you want to only query on the encoded forms of the original tokens.

Default: include

The following index definition indexes the page_updated_by.last_name field in the minutes collection using a custom analyzer named dmsAnalyzer. The custom analyzer specifies the following:

  1. Apply the standard tokenizer to create tokens based on word break rules.

  2. Apply the daitchMokotoffSoundex token filter to encode the tokens for words that sound the same.

The following query searches for terms that sound similar to AUERBACH in the page_updated_by.last_name field of the minutes collection.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "AUERBACH",
        "path": "page_updated_by.last_name"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.last_name": 1
    }
  }
])

[
  { "_id" : 1, "page_updated_by" : { "last_name" : "AUERBACH" } },
  { "_id" : 2, "page_updated_by" : { "last_name" : "OHRBACH" } }
]

Atlas Search returns the documents with _id: 1 and _id: 2 because the terms in both documents are phonetically similar and are encoded using the same six-digit numbers (097400 and 097500). The following table shows the tokens (searchable terms and six-digit encodings) that Atlas Search creates for the documents in the results:

Document ID
Output Tokens
"_id": 1
AUERBACH, 097400, 097500
"_id": 2
OHRBACH, 097400, 097500

The edgeGram token filter tokenizes input from the left side, or "edge", of a text input into n-grams of configured sizes.

Note

Typically, token filters operate like a pipeline: each input token yields at most one output token, which is then passed to the next token filter. The edgeGram token filter, by contrast, is a graph-producing filter that yields multiple output tokens from a single input token.

Because synonym and autocomplete field type mapping definitions only work when used with non-graph-producing token filters, you can't use the edgeGram token filter in synonym or autocomplete field type mapping definitions.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be edgeGram.
minGram
integer
yes
Number that specifies the minimum length of generated n-grams. Value must be less than or equal to maxGram.
maxGram
integer
yes
Number that specifies the maximum length of generated n-grams. Value must be greater than or equal to minGram.
termNotInBounds
string
no

String that specifies whether to index tokens shorter than minGram or longer than maxGram. Accepted values are:

  • include

  • omit

If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed.

Default: omit

The following index definition indexes the title field in the minutes collection using a custom analyzer named titleAutocomplete. The custom analyzer specifies the following:

  1. Apply the standard tokenizer to create tokens based on word break rules.

  2. Apply the following filters on the tokens:

    • icuFolding token filter to apply character foldings to the tokens.

    • edgeGram token filter to create 4 to 7 character long tokens from the left side.
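
To make the effect of minGram, maxGram, and termNotInBounds concrete, the following is a rough emulation of edge n-gram generation for individual tokens. It is a sketch, not Atlas Search's implementation: the function name edgeGrams is hypothetical, lengths are counted in code points, and the input tokens are assumed to be already tokenized and folded.

```javascript
// Rough emulation of edge n-gram generation for one token (illustration only).
function edgeGrams(token, minGram, maxGram, termNotInBounds = "omit") {
  const chars = Array.from(token); // split into code points
  const grams = [];
  // Emit prefixes of length minGram..maxGram, bounded by the token length.
  for (let n = minGram; n <= Math.min(maxGram, chars.length); n++) {
    grams.push(chars.slice(0, n).join(""));
  }
  // With "include", a token shorter than minGram or longer than maxGram
  // is also kept as-is.
  const outOfBounds = chars.length < minGram || chars.length > maxGram;
  if (termNotInBounds === "include" && outOfBounds) {
    grams.push(token);
  }
  return grams;
}

// Assumed input: tokens for the title "The team's weekly meeting"
// after tokenizing and folding.
const titleTokens = ["the", "team's", "weekly", "meeting"];
console.log(titleTokens.flatMap((t) => edgeGrams(t, 4, 7)));
// team, team', team's, week, weekl, weekly, meet, meeti, meetin, meeting
```

The three-character token the produces no output with the default omit setting, which is why it does not appear among the indexed tokens.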

The following query searches the title field of the minutes collection for terms that begin with mee, followed by any number of other characters.

db.minutes.aggregate([
  {
    "$search": {
      "wildcard": {
        "query": "mee*",
        "path": "title",
        "allowAnalyzedField": true
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

[
  { _id: 1, title: "The team's weekly meeting" },
  { _id: 3, title: 'The regular board meeting' }
]

Atlas Search returns documents with _id: 1 and _id: 3 because the documents contain the term meeting, which matches the query criteria. Specifically, Atlas Search creates the following 4 to 7 character tokens (searchable terms) for the documents in the results, which it then matches to the query term mee*:

Document ID
Output Tokens
"_id": 1
team, team', team's, week, weekl, weekly, meet, meeti, meetin, meeting
"_id": 3
regu, regul, regula, regular, boar, board, meet, meeti, meetin, meeting

The englishPossessive token filter removes possessives (trailing 's) from words.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be englishPossessive.

The following index definition indexes the title field in the minutes collection using a custom analyzer named englishPossessiveStemmer. The custom analyzer specifies the following:

  1. Apply the standard tokenizer to create tokens (search terms) based on word break rules.

  2. Apply the englishPossessive token filter to remove possessives (trailing 's) from the tokens.
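
As a rough illustration of step 2, the sketch below strips a trailing 's from each token. The function name stripPossessive is hypothetical, and the actual filter may cover more cases than this regular expression does.

```javascript
// Rough emulation: drop a trailing 's (straight or curly apostrophe).
// Illustration only; not Atlas Search's implementation.
function stripPossessive(token) {
  return token.replace(/['\u2019][sS]$/, "");
}

// Assumed input: tokens for the title "The team's weekly meeting".
const possessiveInput = ["The", "team's", "weekly", "meeting"];
console.log(possessiveInput.map(stripPossessive));
// The, team, weekly, meeting
```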

The following query searches the title field in the minutes collection for the term team.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "team",
        "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

[
  {
    _id: 1,
    title: "The team's weekly meeting"
  },
  {
    _id: 2,
    title: 'The check-in with sales team'
  }
]

Atlas Search returns results that contain the term team in the title field. Atlas Search returns the document with _id: 1 because Atlas Search transforms team's in the title field to the token team during analysis. Specifically, Atlas Search creates the following tokens (searchable terms) for the documents in the results, which it then matches to the query term:

Document ID
Output Tokens
"_id": 1
The, team, weekly, meeting
"_id": 2
The, check, in, with, sales, team

The flattenGraph token filter transforms a token filter graph into a flat form suitable for indexing. If you use the wordDelimiterGraph token filter, use this filter after the wordDelimiterGraph token filter.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be flattenGraph.

The following index definition indexes the message field in the minutes collection using a custom analyzer called wordDelimiterGraphFlatten. The custom analyzer specifies the following:

  1. Apply the whitespace tokenizer to create tokens based on occurrences of whitespace between words.

  2. Apply the following filters to the tokens:

    • wordDelimiterGraph token filter to split tokens based on sub-words, generate tokens for the original words, and also protect the word SIGN_IN from being split.

    • flattenGraph token filter to flatten the tokens to a flat form.

The following query searches the message field in the minutes collection for the term sign.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "sign",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

[
  {
    _id: 3,
    message: 'try to sign-in'
  }
]

Atlas Search returns the document with _id: 3 in the results for the query term sign even though the document contains the hyphenated term sign-in in the message field. The wordDelimiterGraph token filter creates a token filter graph and the flattenGraph token filter transforms the token filter graph into a flat form suitable for indexing. Specifically, Atlas Search creates the following tokens (searchable terms) for the document in the results, which it then matches to the query term sign:

Document ID
Output Tokens
_id: 3
try, to, sign-in, sign, in

The icuFolding token filter applies character folding from Unicode Technical Report #30 such as accent removal, case folding, canonical duplicates folding, and many others detailed in the report.

It has the following attribute:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be icuFolding.

The following index definition indexes the text.sv_FI field in the minutes collection using a custom analyzer named diacriticFolder. The custom analyzer specifies the following:

  1. Apply the keyword tokenizer to tokenize all the terms in the string field as a single term.

  2. Use the icuFolding token filter to apply foldings such as accent removal, case folding, canonical duplicates folding, and so on.
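
The following sketch approximates two of these foldings (accent removal and case folding) with JavaScript's built-in Unicode normalization. It is only an approximation for illustration: UTR #30 defines many more foldings, and the function name foldApprox is hypothetical. The sample strings are the text.sv_FI values of the two documents that the query in this example returns.

```javascript
// Approximate accent removal and case folding (illustration only).
function foldApprox(text) {
  return text
    .normalize("NFD")        // separate base letters from combining marks
    .replace(/\p{M}/gu, "")  // drop the combining marks
    .toLowerCase();          // simple case folding
}

const svTexts = [
  "Den här sidan behandlar avdelningsmöten",
  "Först talade chefen för försäljningsavdelningen"
];

for (const text of svTexts) {
  // The keyword tokenizer keeps the whole string as a single token.
  const token = foldApprox(text);
  // A wildcard query *avdelning* corresponds to a substring check here.
  console.log(token, token.includes("avdelning"));
}
```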

The following query uses the wildcard operator to search the text.sv_FI field in the minutes collection for all terms that contain the term avdelning, preceded and followed by any number of other characters.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "wildcard": {
        "query": "*avdelning*",
        "path": "text.sv_FI",
        "allowAnalyzedField": true
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.sv_FI": 1
    }
  }
])

[
  {
    _id: 1,
    text: { sv_FI: 'Den här sidan behandlar avdelningsmöten' }
  },
  {
    _id: 2,
    text: { sv_FI: 'Först talade chefen för försäljningsavdelningen' }
  }
]

Atlas Search returns the documents with _id: 1 and _id: 2 because both contain the query term avdelning: followed by other characters in the document with _id: 1, and both preceded and followed by other characters in the document with _id: 2. Specifically, Atlas Search creates the following tokens for the documents in the results, which it then matches to the query term *avdelning*:

Document ID
Output Tokens
_id: 1
den har sidan behandlar avdelningsmoten
_id: 2
forst talade chefen for forsaljningsavdelningen

The icuNormalizer token filter normalizes tokens using a standard Unicode Normalization Mode.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be icuNormalizer.
normalizationForm
string
no

Normalization form to apply. Accepted values are:

  • nfd (Canonical Decomposition)

  • nfc (Canonical Decomposition, followed by Canonical Composition)

  • nfkd (Compatibility Decomposition)

  • nfkc (Compatibility Decomposition, followed by Canonical Composition)

To learn more about the supported normalization forms, see Section 1.2: Normalization Forms, UTR#15.

Default: nfc

The following index definition indexes the message field in the minutes collection using a custom analyzer named textNormalizer. The custom analyzer specifies the following:

  1. Use the whitespace tokenizer to create tokens based on occurrences of whitespace between words.

  2. Use the icuNormalizer token filter to normalize tokens by Compatibility Decomposition, followed by Canonical Composition.

The following query searches the message field in the minutes collection for the term 1.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "1",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

[ { _id: 2, message: 'do not forget to SIGN-IN. See ① for details.' } ]

Atlas Search returns the document with _id: 2 in the results for the query term 1 even though the document contains the circled number ① in the message field, because the icuNormalizer token filter creates the token 1 for this character using the nfkc normalization form. The following table shows the tokens (searchable terms) that Atlas Search creates for the document in the results using the nfkc normalization form and, by comparison, the tokens it creates for the other normalization forms.

Normalization Form
Output Tokens
Matches 1?
nfd
do, not, forget, to, SIGN-IN., See, ①, for, details.
no
nfc
do, not, forget, to, SIGN-IN., See, ①, for, details.
no
nfkd
do, not, forget, to, SIGN-IN., See, 1, for, details.
yes
nfkc
do, not, forget, to, SIGN-IN., See, 1, for, details.
yes
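
The difference between the canonical and compatibility forms can be checked with JavaScript's built-in String.prototype.normalize, which implements the same four Unicode normalization forms. This is a small sketch of the normalization step only, not of the Atlas Search filter itself; the variable names are arbitrary.

```javascript
// Normalize the circled digit one under each Unicode normalization form.
const circledOne = "\u2460"; // the character ①
const byForm = {};
for (const form of ["NFD", "NFC", "NFKD", "NFKC"]) {
  byForm[form] = circledOne.normalize(form);
}
console.log(byForm);
// NFD and NFC keep ①; NFKD and NFKC produce "1".
```

Only the compatibility forms map ① to the plain digit, which is why a query for 1 can match only when the analyzer uses nfkd or nfkc.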

The kStemming token filter combines algorithmic stemming with a built-in dictionary for the English language to stem words. It expects lowercase text and doesn't modify uppercase text.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be kStemming.

The following index definition indexes the text.en_US field in the minutes collection using a custom analyzer named kStemmer. The custom analyzer specifies the following:

  1. Apply the standard tokenizer to create tokens based on word break rules.

  2. Apply the following filters on the tokens:

    • lowercase token filter to convert the tokens to lowercase.

    • kStemming token filter to stem words using a combination of algorithmic stemming and a built-in dictionary for the English language.

The following query searches the text.en_US field in the minutes collection for the term Meeting.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "Meeting",
        "path": "text.en_US"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.en_US": 1
    }
  }
])

[
  {
    _id: 1,
    text: {
      en_US: '<head> This page deals with department meetings. </head>'
    }
  }
]

Atlas Search returns the document with _id: 1, which contains the plural term meetings in lowercase. Atlas Search matches the query term to the document because the lowercase token filter normalizes token text to lowercase and the kStemming token filter lets Atlas Search match the plural meetings in the text.en_US field of the document to the singular query term. Atlas Search also analyzes the query term using the index analyzer (or if specified, using the searchAnalyzer). Specifically, Atlas Search creates the following tokens (searchable terms) for the document in the results, which it then uses to match to the query term:

head, this, page, deal, with, department, meeting, head

The length token filter removes tokens that are too short or too long.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be length.
min
integer
no

Number that specifies the minimum length of a token. Value must be less than or equal to max.

Default: 0

max
integer
no

Number that specifies the maximum length of a token. Value must be greater than or equal to min.

Default: 255

The following index definition indexes the text.sv_FI field in the minutes collection using a custom analyzer named longOnly. The custom analyzer specifies the following:

  1. Use the standard tokenizer to create tokens based on word break rules.

  2. Apply the following filters on the tokens:

    • icuFolding token filter to apply character foldings.

    • length token filter to index only tokens that are at least 20 UTF-16 code units long after tokenizing.
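
As a rough sketch of the length step, the function below keeps only tokens whose length falls within the configured bounds. The function name lengthFilter is hypothetical, the defaults mirror the documented min and max defaults, and treating both bounds as inclusive is an assumption. The input tokens are assumed to be the folded tokens of the text.sv_FI value 'Först talade chefen för försäljningsavdelningen'.

```javascript
// Rough emulation of a length filter (illustration only).
// JavaScript string length counts UTF-16 code units.
function lengthFilter(tokens, min = 0, max = 255) {
  return tokens.filter((t) => t.length >= min && t.length <= max);
}

// Assumed input: tokens after the standard tokenizer and folding.
const foldedTokens = ["forst", "talade", "chefen", "for", "forsaljningsavdelningen"];
console.log(lengthFilter(foldedTokens, 20));
// only forsaljningsavdelningen (23 characters) is kept
```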

The following query searches the text.sv_FI field in the minutes collection for the term forsaljningsavdelningen.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "forsaljningsavdelningen",
        "path": "text.sv_FI"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.sv_FI": 1
    }
  }
])

[
  {
    _id: 2,
    text: {
      sv_FI: 'Först talade chefen för försäljningsavdelningen'
    }
  }
]

Atlas Search returns the document with _id: 2, which contains the term försäljningsavdelningen. Atlas Search matches the document to the query term because the term has more than 20 characters. Additionally, although the query term forsaljningsavdelningen doesn't include the diacritic characters, Atlas Search matches the query term to the document by folding the diacritics in the original term in the document. Specifically, Atlas Search creates the following tokens (searchable terms) for the document with _id: 2.

forsaljningsavdelningen

Atlas Search won't return any results for a search for any other term in the text.sv_FI field in the collection because all other terms in the field have fewer than 20 characters.

The lowercase token filter normalizes token text to lowercase.

It has the following attribute:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be lowercase.

The nGram token filter tokenizes input into n-grams of configured sizes. You can't use the nGram token filter in synonym or autocomplete mapping definitions.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be nGram.
minGram
integer
yes
Number that specifies the minimum length of generated n-grams. Value must be less than or equal to maxGram.
maxGram
integer
yes
Number that specifies the maximum length of generated n-grams. Value must be greater than or equal to minGram.
termNotInBounds
string
no

String that specifies whether to index tokens shorter than minGram or longer than maxGram. Accepted values are:

  • include

  • omit

If include is specified, tokens shorter than minGram or longer than maxGram are indexed as-is. If omit is specified, those tokens are not indexed.

Default: omit

The following index definition indexes the title field in the minutes collection using a custom analyzer named titleAutocomplete. The custom analyzer specifies the following:

  1. Apply the standard tokenizer to create tokens based on the word break rules.

  2. Apply a series of token filters on the tokens:

    • englishPossessive to remove possessives (trailing 's) from words.

    • nGram to tokenize words into 4 to 7 characters in length.
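
For a concrete view of what the nGram step produces, the following is a rough emulation for a single token. It is a sketch under assumptions rather than Atlas Search's implementation: the function name nGrams is hypothetical, lengths are counted in code points, and the output order (by start position, then by length) is an assumption.

```javascript
// Rough emulation of n-gram generation for one token (illustration only).
function nGrams(token, minGram, maxGram, termNotInBounds = "omit") {
  const chars = Array.from(token); // split into code points
  const grams = [];
  // For every start position, emit substrings of length minGram..maxGram.
  for (let start = 0; start < chars.length; start++) {
    for (let n = minGram; n <= maxGram && start + n <= chars.length; n++) {
      grams.push(chars.slice(start, start + n).join(""));
    }
  }
  // With "include", an out-of-bounds token is also kept as-is.
  const outOfBounds = chars.length < minGram || chars.length > maxGram;
  if (termNotInBounds === "include" && outOfBounds) {
    grams.push(token);
  }
  return grams;
}

console.log(nGrams("meeting", 4, 7));
// meet, meeti, meetin, meeting, eeti, eetin, eeting, etin, eting, ting
```

A wildcard query such as meet* can then match the tokens meet, meeti, meetin, and meeting.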

The following query uses the wildcard operator to search the title field in the minutes collection for the term meet followed by any number of other characters after the term.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "wildcard": {
        "query": "meet*",
        "path": "title",
        "allowAnalyzedField": true
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

[
  { _id: 1, title: "The team's weekly meeting" },
  { _id: 3, title: 'The regular board meeting' }
]

Atlas Search returns the documents with _id: 1 and _id: 3 because the documents contain the term meeting, which Atlas Search matches to the query criteria meet* by creating the following tokens (searchable terms).

Document ID
Output Tokens
_id: 1
team, week, weekl, weekly, eekl, eekly, ekly, meet, meeti, meetin, meeting, eeti, eetin, eeting, etin, eting, ting
_id: 3
regu, regul, regula, regular, egul, egula, egular, gula, gular, ular, boar, board, oard, meet, meeti, meetin, meeting, eeti, eetin, eeting, etin, eting, ting

Note

Atlas Search doesn't index terms shorter than 4 characters (such as the) or longer than 7 characters as-is because the termNotInBounds parameter is set to omit by default. If you set the termNotInBounds parameter to include, Atlas Search also creates a token for the term the.

The porterStemming token filter uses the Porter stemming algorithm to remove the common morphological and inflectional suffixes from words in English. It expects lowercase text and doesn't work as expected for uppercase text.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be porterStemming.

The following index definition indexes the title field in the minutes collection using a custom analyzer named porterStemmer. The custom analyzer specifies the following:

  1. Apply the standard tokenizer to create tokens based on word break rules.

  2. Apply the following token filters on the tokens:

    • lowercase token filter to convert the words to lowercase.

    • porterStemming token filter to remove the common morphological and inflectional suffixes from the words.

The following query searches the title field in the minutes collection for the term Meet.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "Meet",
        "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

[
  {
    _id: 1,
    title: "The team's weekly meeting"
  },
  {
    _id: 3,
    title: 'The regular board meeting'
  }
]

Atlas Search returns the documents with _id: 1 and _id: 3 because the lowercase token filter normalizes token text to lowercase and then the porterStemming token filter stems the morphological suffix from the meeting token to create the meet token, which Atlas Search matches to the query term Meet. Specifically, Atlas Search creates the following tokens (searchable terms) for the documents in the results, which it then matches to the query term Meet:

Document ID
Output Tokens
_id: 1
the, team', weekli, meet
_id: 3
the, regular, board, meet

The regex token filter applies a regular expression to each token, replacing matches with a specified string.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter. Value must be regex.
pattern
string
yes
Regular expression pattern to apply to each token.
replacement
string
yes

Replacement string to substitute wherever a matching pattern occurs.

Note

If you specify an empty string ("") to ignore or delete a token, Atlas Search creates a token with an empty string instead. To delete tokens with empty strings, use the stopword token filter after the regex token filter. For example:

"analyzers": [
  {
    "name": "custom.analyzer.name",
    "charFilters": [],
    "tokenizer": {
      "type": "whitespace"
    },
    "tokenFilters": [
      {
        "matches": "all",
        "pattern": "^(?!\\$)\\w+",
        "replacement": "",
        "type": "regex"
      },
      {
        "type": "stopword",
        "tokens": [""]
      }
    ]
  }
]
matches
string
yes

Acceptable values are:

  • all

  • first

If matches is set to all, replace all matching patterns. Otherwise, replace only the first matching pattern.
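
The note above about empty replacement strings can be illustrated with a small emulation of the three analyzer steps in plain JavaScript. This is a sketch, not Atlas Search itself: the sample input "$total is 42" is hypothetical, and JavaScript's regular expression engine is assumed to behave like the one Atlas Search uses for this particular pattern.

```javascript
// Same pattern as in the note; the g flag stands in for "matches": "all".
const pattern = /^(?!\$)\w+/g;

// Step 1: whitespace tokenizer (hypothetical sample input).
const wsTokens = "$total is 42".split(/\s+/);

// Step 2: regex filter with an empty replacement string.
// Tokens that start with "$" don't match and stay unchanged.
const replaced = wsTokens.map((t) => t.replace(pattern, ""));

// Step 3: stopword filter configured with "tokens": [""] drops empty tokens.
const kept = replaced.filter((t) => t !== "");

console.log(replaced); // "$total" plus two empty strings
console.log(kept);     // only "$total" remains
```

Without the third step, the two empty strings would remain in the token stream as tokens.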

The following index definition indexes the page_updated_by.email field in the minutes collection using a custom analyzer named emailRedact. The custom analyzer specifies the following:

  1. Apply the keyword tokenizer to index all words in the field value as a single term.

  2. Apply the following token filters on the tokens:

    • lowercase token filter to turn uppercase characters in the tokens to lowercase.

    • regex token filter to find strings that look like email addresses in the tokens and replace them with the word redacted.

The following query searches the page_updated_by.email field in the minutes collection using the wildcard operator for the term example.com preceded by any number of other characters.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "wildcard": {
        "query": "*example.com",
        "path": "page_updated_by.email",
        "allowAnalyzedField": true
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])

Atlas Search doesn't return any results for the query even though the page_updated_by.email field contains example.com in the email addresses. During indexing, the regex token filter replaces strings that match the regular expression in the custom analyzer with the word redacted, so no indexed token matches the query term.

The reverse token filter reverses each string token.

It has the following attribute:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter. Value must be reverse.

The following index definition indexes the page_updated_by.email field in the minutes collection using a custom analyzer named keywordReverse. The custom analyzer specifies the following:

  • Apply the keyword tokenizer to tokenize entire strings as single terms.

  • Apply the reverse token filter to reverse the string tokens.
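
The idea behind these two steps can be sketched in plain JavaScript: once both the indexed strings and the query are reversed, a leading wildcard becomes a trailing one, so the match turns into a prefix check. This is an illustration only; the function name reverseToken is hypothetical, and the sample addresses follow the example.com pattern that the query in this example targets.

```javascript
// Reverse a token by code point (illustration only).
function reverseToken(token) {
  return Array.from(token).reverse().join("");
}

// Keyword tokenizer: each whole email address is a single token.
const emails = ["auerbach@example.com", "ohrback@example.com"];
const indexedTokens = emails.map(reverseToken);

// Reversing the query moves the wildcard from the front to the end.
const reversedQuery = reverseToken("*@example.com"); // "moc.elpmaxe@*"
const prefix = reversedQuery.slice(0, -1);            // strip the trailing *

console.log(indexedTokens.filter((t) => t.startsWith(prefix)));
```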

The following query searches the page_updated_by.email field in the minutes collection using the wildcard operator to match any characters preceding the characters @example.com in reverse order. The reverse token filter can speed up leading wildcard queries.

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "wildcard": {
        "query": "*@example.com",
        "path": "page_updated_by.email",
        "allowAnalyzedField": true
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])

For the preceding query, Atlas Search applies the custom analyzer to the wildcard query to transform the query as follows:

moc.elpmaxe@*

Atlas Search then runs the query against the indexed tokens, which are also reversed. The query returns the following documents:

[
  { _id: 1, page_updated_by: { email: 'auerbach@example.com' } },
  { _id: 2, page_updated_by: { email: 'ohrback@example.com' } },
  { _id: 3, page_updated_by: { email: 'lewinsky@example.com' } },
  { _id: 4, page_updated_by: { email: 'levinski@example.com' } }
]

Specifically, Atlas Search creates the following tokens (searchable terms) for the documents in the results, which it then matches to the query term moc.elpmaxe@*:

Document ID
Output Tokens
_id: 1
moc.elpmaxe@hcabreua
_id: 2
moc.elpmaxe@kcabrho
_id: 3
moc.elpmaxe@yksniwel
_id: 4
moc.elpmaxe@iksnivel

The shingle token filter constructs shingles (token n-grams) from a series of tokens. You can't use the shingle token filter in synonym or autocomplete mapping definitions.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be shingle.
minShingleSize
integer
yes
Minimum number of tokens per shingle. Must be less than or equal to maxShingleSize.
maxShingleSize
integer
yes
Maximum number of tokens per shingle. Must be greater than or equal to minShingleSize.

The following index definition example on the page_updated_by.email field in the minutes collection uses two custom analyzers, emailAutocompleteIndex and emailAutocompleteSearch, to implement autocomplete-like functionality. Atlas Search uses the emailAutocompleteIndex analyzer during index creation to:

  • Replace @ characters in a field with AT

  • Create tokens with the whitespace tokenizer

  • Shingle tokens

  • Create edgeGram of those shingled tokens

Atlas Search uses the emailAutocompleteSearch analyzer during a search to:

The following query searches for an email address in the page_updated_by.email field of the minutes collection:

db.minutes.aggregate([
  {
    "$search": {
      "index": "default",
      "text": {
        "query": "auerbach@ex",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])

[ { _id: 1, page_updated_by: { email: 'auerbach@example.com' } } ]

Atlas Search creates search tokens using the emailAutocompleteSearch analyzer, which it then matches to the index tokens that it created using the emailAutocompleteIndex analyzer. The following table shows the search and index tokens (up to 15 characters) that Atlas Search creates:

Search Tokens
Index Tokens
auerbachATexamp
au, aue, auer, auerb, auerba, auerbac, auerbach, auerbachA, auerbachAT, auerbachATe, auerbachATex, auerbachATexa, auerbachATexam, auerbachATexamp

The snowballStemming token filter stems tokens using a Snowball-generated stemmer.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be snowballStemming.
stemmerName
string
yes

The following values are valid:

  • arabic

  • armenian

  • basque

  • catalan

  • danish

  • dutch

  • english

  • estonian

  • finnish

  • french

  • german

  • german2 (Alternative German language stemmer. Handles the umlaut by expanding ü to ue in most contexts.)

  • hungarian

  • irish

  • italian

  • kp (Kraaij-Pohlmann stemmer, an alternative stemmer for Dutch.)

  • lithuanian

  • lovins (The first-ever published "Lovins JB" stemming algorithm.)

  • norwegian

  • porter (The original Porter English stemming algorithm.)

  • portuguese

  • romanian

  • russian

  • spanish

  • swedish

  • turkish

The following index definition indexes the text.fr_CA field in the minutes collection using a custom analyzer named frenchStemmer. The custom analyzer specifies the following:

  1. Apply the standard tokenizer to create tokens based on word break rules.

  2. Apply the following token filters on the tokens:

    • lowercase token filter to convert the tokens to lowercase.

    • french variant of the snowballStemming token filter to stem words.
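
The numbered steps above could be expressed as an index definition similar to the following sketch. It is written as a JavaScript object so that it can be inspected locally; the variable name and the exact mapping layout are assumptions.

```javascript
// Hypothetical index definition sketch for the frenchStemmer analyzer.
const frenchStemmerDefinition = {
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "fr_CA": { "type": "string", "analyzer": "frenchStemmer" }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "frenchStemmer",
      "charFilters": [],
      "tokenizer": { "type": "standard" },
      "tokenFilters": [
        // Lowercase first, then stem with the french Snowball stemmer.
        { "type": "lowercase" },
        { "type": "snowballStemming", "stemmerName": "french" }
      ]
    }
  ]
};

console.log(JSON.stringify(frenchStemmerDefinition, null, 2));
```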

The following query searches the text.fr_CA field in the minutes collection for the term réunion.

db.minutes.aggregate([
{
"$search": {
"text": {
"query": "réunion",
"path": "text.fr_CA"
}
}
},
{
"$project": {
"_id": 1,
"text.fr_CA": 1
}
}
])
[
{
_id: 1,
text: { fr_CA: 'Cette page traite des réunions de département' }
}
]

Atlas Search returns the document with _id: 1 in the results. Atlas Search matches the query term to the document because it creates the following tokens for the document, which it then uses to match the query term réunion:

Document ID
Output Tokens
_id: 1
cet, pag, trait, de, réunion, de, départ

The spanishPluralStemming token filter stems Spanish plural words. It expects lowercase text.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be spanishPluralStemming.

The following index definition indexes the text.es_MX field in the minutes collection using a custom analyzer named spanishPluralStemmer. The custom analyzer specifies the following:

  1. Apply the standard tokenizer to create tokens based on word break rules.

  2. Apply the following token filters on the tokens:

    • lowercase token filter to convert Spanish terms to lowercase.

    • spanishPluralStemming token filter to stem plural Spanish words in the tokens into their singular form.
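
As a sketch, the steps above might translate into an index definition like the following. It is written as a JavaScript object so that it can be inspected locally; the variable name and the mapping layout are assumptions.

```javascript
// Hypothetical index definition sketch for the spanishPluralStemmer analyzer.
const spanishPluralStemmerDefinition = {
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "es_MX": { "type": "string", "analyzer": "spanishPluralStemmer" }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "spanishPluralStemmer",
      "charFilters": [],
      "tokenizer": { "type": "standard" },
      "tokenFilters": [
        // The stemmer expects lowercase text, so lowercase comes first.
        { "type": "lowercase" },
        { "type": "spanishPluralStemming" }
      ]
    }
  ]
};

console.log(JSON.stringify(spanishPluralStemmerDefinition, null, 2));
```

The filter order matters here because the spanishPluralStemming token filter expects lowercase input.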

The following query searches the text.es_MX field in the minutes collection for the Spanish term punto.

db.minutes.aggregate([
{
"$search": {
"index": "default",
"text": {
"query": "punto",
"path": "text.es_MX"
}
}
},
{
"$project": {
"_id": 1,
"text.es_MX": 1
}
}
])
[
{
_id: 4,
text : {
es_MX: 'La página ha sido actualizada con los puntos de la agenda.',
}
}
]

Atlas Search returns the document with _id: 4 because the text.es_MX field in the document contains the plural term puntos. Atlas Search matches this document for the query term punto because Atlas Search analyzes puntos as punto by stemming the plural (s) from the term. Specifically, Atlas Search creates the following tokens (searchable terms) for the document in the results, which it then uses to match to the query term:

Document ID
Output Tokens
_id: 4
la, pagina, ha, sido, actualizada, con, los, punto, de, la, agenda

The stempel token filter uses Lucene's default Polish stemmer table to stem words in the Polish language. It expects lowercase text.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be stempel.

The following index definition indexes the text.pl_PL field in the minutes collection using a custom analyzer named stempelStemmer. The custom analyzer specifies the following:

  1. Apply the standard tokenizer to create tokens based on word break rules.

  2. Apply the following filters on the tokens:

    • lowercase token filter to convert the words to lowercase.

    • stempel token filter to stem the Polish words.
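
The steps above could be sketched as the following index definition. It is written as a JavaScript object so that it can be inspected locally; the variable name and the mapping layout are assumptions.

```javascript
// Hypothetical index definition sketch for the stempelStemmer analyzer.
const stempelStemmerDefinition = {
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "pl_PL": { "type": "string", "analyzer": "stempelStemmer" }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "stempelStemmer",
      "charFilters": [],
      "tokenizer": { "type": "standard" },
      "tokenFilters": [
        // The stemmer expects lowercase text, so lowercase comes first.
        { "type": "lowercase" },
        { "type": "stempel" }
      ]
    }
  ]
};

console.log(JSON.stringify(stempelStemmerDefinition, null, 2));
```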

The following query searches the text.pl_PL field in the minutes collection for the Polish term punkt.

db.minutes.aggregate([
{
"$search": {
"index": "default",
"text": {
"query": "punkt",
"path": "text.pl_PL"
}
}
},
{
"$project": {
"_id": 1,
"text.pl_PL": 1
}
}
])
[
{
_id: 4,
text: {
pl_PL: 'Strona została zaktualizowana o punkty porządku obrad.'
}
}
]

Atlas Search returns the document with _id: 4 because the text.pl_PL field in the document contains the plural term punkty. Atlas Search matches this document for the query term punkt because Atlas Search analyzes punkty as punkt by stemming the plural (y) from the term. Specifically, Atlas Search creates the following tokens (searchable terms) for the document in the results, which it then matches to the query term:

Document ID
Output Tokens
_id: 4
strona, zostać, zaktualizować, o, punkt, porządek, obrada

The stopword token filter removes tokens that correspond to the specified stop words. This token filter doesn't analyze the specified stop words.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be stopword.
tokens
array of strings
yes
List that contains the stop words that correspond to the tokens to remove. Value must be one or more stop words.
ignoreCase
boolean
no

Flag that indicates whether to ignore the case of stop words when filtering the tokens to remove. The value can be one of the following:

  • true - ignore case and remove all tokens that match the specified stop words

  • false - be case-sensitive and remove only tokens that exactly match the specified case

Default: true

The following index definition indexes the text.en_US field in the minutes collection using a custom analyzer named stopwordRemover. The custom analyzer specifies the following:

  1. Apply the whitespace tokenizer to create tokens based on occurrences of whitespace between words.

  2. Apply the stopword token filter to remove the tokens that match the defined stop words is, the, and at. The token filter is case-insensitive and removes all tokens that match the specified stop words.
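
A sketch of an index definition for these steps follows. It is written as a JavaScript object so that it can be inspected locally. It maps text.en_US, the path that the query in this example searches; the variable name and the mapping layout are assumptions. The sketch omits ignoreCase because the attribute defaults to true.

```javascript
// Hypothetical index definition sketch for the stopwordRemover analyzer.
const stopwordRemoverDefinition = {
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "en_US": { "type": "string", "analyzer": "stopwordRemover" }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "stopwordRemover",
      "charFilters": [],
      "tokenizer": { "type": "whitespace" },
      "tokenFilters": [
        // ignoreCase is left at its default (true).
        { "type": "stopword", "tokens": ["is", "the", "at"] }
      ]
    }
  ]
};

console.log(JSON.stringify(stopwordRemoverDefinition, null, 2));
```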

The following query searches for the phrase head of the sales in the text.en_US field in the minutes collection.

db.minutes.aggregate([
{
"$search": {
"phrase": {
"query": "head of the sales",
"path": "text.en_US"
}
}
},
{
"$project": {
"_id": 1,
"text.en_US": 1
}
}
])
[
{
_id: 2,
text: { en_US: 'The head of the sales department spoke first.' }
}
]

Atlas Search returns the document with _id: 2 because the en_US field contains the query term. Atlas Search doesn't create tokens for the stop word the in the document during analysis. It can still match the document because, for string fields, it also analyzes the query term using the index analyzer (or, if specified, the searchAnalyzer), which removes the stop word from the query term as well. Specifically, Atlas Search creates the following tokens for the document in the results:

Document ID
Output Tokens
_id: 2
head, of, sales, department, spoke, first.
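
The token list above can be reproduced with a small simulation. This is an illustrative sketch, not Atlas Search's implementation: the function name removeStopWords is hypothetical, and splitting on whitespace stands in for the whitespace tokenizer.

```javascript
// Illustrative simulation of whitespace tokenizing plus stop word removal.
function removeStopWords(tokens, stopWords, ignoreCase = true) {
  const normalize = (t) => (ignoreCase ? t.toLowerCase() : t);
  const stopSet = new Set(stopWords.map(normalize));
  // Keep only the tokens that do not match a stop word.
  return tokens.filter((t) => !stopSet.has(normalize(t)));
}

const sentenceTokens = "The head of the sales department spoke first.".split(/\s+/);

console.log(removeStopWords(sentenceTokens, ["is", "the", "at"]));
// [ 'head', 'of', 'sales', 'department', 'spoke', 'first.' ]

console.log(removeStopWords(sentenceTokens, ["is", "the", "at"], false));
// [ 'The', 'head', 'of', 'sales', 'department', 'spoke', 'first.' ]
```

The second call shows the case-sensitive behavior: the capitalized The survives because it doesn't exactly match the stop word the.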

The trim token filter trims leading and trailing whitespace from tokens.

It has the following attribute:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be trim.

The following index definition indexes the text.en_US field in the minutes collection using a custom analyzer named tokenTrimmer. The custom analyzer specifies the following:

  • Apply the htmlStrip character filter to remove all HTML tags from the text except the a tag.

  • Apply the keyword tokenizer to create a single token for the entire string.

  • Apply the trim token filter to remove leading and trailing whitespace in the tokens.
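
The steps above could be sketched as the following index definition. It is written as a JavaScript object so that it can be inspected locally; the variable name and the mapping layout are assumptions.

```javascript
// Hypothetical index definition sketch for the tokenTrimmer analyzer.
const tokenTrimmerDefinition = {
  "mappings": {
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "en_US": { "type": "string", "analyzer": "tokenTrimmer" }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "tokenTrimmer",
      // Strip all HTML tags except the a tag.
      "charFilters": [{ "type": "htmlStrip", "ignoredTags": ["a"] }],
      // Keep the entire string as a single token.
      "tokenizer": { "type": "keyword" },
      "tokenFilters": [{ "type": "trim" }]
    }
  ]
};

console.log(JSON.stringify(tokenTrimmerDefinition, null, 2));
```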

The following query searches for the phrase *department meetings* preceded and followed by any number of other characters in the text.en_US field in the minutes collection.

db.minutes.aggregate([
{
"$search": {
"wildcard": {
"query": "*department meetings*",
"path": "text.en_US",
"allowAnalyzedField": true
}
}
},
{
"$project": {
"_id": 1,
"text.en_US": 1
}
}
])
[
{
_id: 1,
text: { en_US: '<head> This page deals with department meetings. </head>' }
}
]

Atlas Search returns the document with _id: 1 because the en_US field contains the query term department meetings. Atlas Search creates the following token for the document in the results, which shows that Atlas Search removed the HTML tags, created a single token for the entire string, and removed leading and trailing whitespace in the token:

Document ID
Output Tokens
_id: 1
This page deals with department meetings.
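
The token above can be approximated with a short simulation of the three stages. This is a rough sketch only: the regular expression is a simplified stand-in for the htmlStrip character filter, and both function names are hypothetical.

```javascript
// Rough stand-in for htmlStrip: removes tags other than <a> and </a>.
function stripHtmlExceptAnchors(text) {
  return text.replace(/<\/?(?!a\b)[a-z][^>]*>/gi, "");
}

// The keyword tokenizer keeps the whole string as one token, so the
// simulated analyzer returns a single-element array after trimming.
function tokenTrimmer(text) {
  const singleToken = stripHtmlExceptAnchors(text);
  return [singleToken.trim()];
}

console.log(tokenTrimmer("<head> This page deals with department meetings. </head>"));
// [ 'This page deals with department meetings.' ]
```

Without the trim step, the token would keep the spaces that remain after the tags are removed.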

The wordDelimiterGraph token filter splits tokens into sub-tokens based on configured rules. We recommend that you don't use this token filter with the standard tokenizer because this tokenizer removes many of the intra-word delimiters that this token filter uses to determine boundaries.

It has the following attributes:

Name
Type
Required?
Description
type
string
yes
Human-readable label that identifies this token filter type. Value must be wordDelimiterGraph.
delimiterOptions
object
no

Object that contains the rules that determine how to split words into sub-words.

Default: {}

delimiterOptions.generateWordParts
boolean
no

Flag that indicates whether to split tokens based on sub-words. For example, if true, this option splits PowerShot into Power and Shot.

Default: true

delimiterOptions.generateNumberParts
boolean
no

Flag that indicates whether to split tokens based on sub-numbers. For example, if true, this option splits 100-2 into 100 and 2.

Default: true

delimiterOptions.concatenateWords
boolean
no

Flag that indicates whether to concatenate runs of sub-words. For example, if true, this option concatenates wi-fi into wifi.

Default: false

delimiterOptions.concatenateNumbers
boolean
no

Flag that indicates whether to concatenate runs of sub-numbers. For example, if true, this option concatenates 100-2 into 1002.

Default: false

delimiterOptions.concatenateAll
boolean
no

Flag that indicates whether to concatenate all runs. For example, if true, this option concatenates wi-fi-100-2 into wifi1002.

Default: false

delimiterOptions.preserveOriginal
boolean
no

Flag that indicates whether to generate tokens of the original words.

Default: true

delimiterOptions.splitOnCaseChange
boolean
no

Flag that indicates whether to split tokens based on letter-case transitions. For example, if true, this option splits camelCase into camel and Case.

Default: true

delimiterOptions.splitOnNumerics
boolean
no

Flag that indicates whether to split tokens based on letter-number transitions. For example, if true, this option splits g2g into g, 2, and g.

Default: true

delimiterOptions.stemEnglishPossessive
boolean
no

Flag that indicates whether to remove trailing possessives from each sub-word. For example, if true, this option changes who's into who.

Default: true

delimiterOptions.ignoreKeywords
boolean
no

Flag that indicates whether to skip tokens with the keyword attribute set to true.

Default: false

protectedWords
object
no

Object that contains options for protected words.

Default: {}

protectedWords.words
array
conditional
List that contains the tokens to protect from delimitation. If you specify protectedWords, you must specify this option.
protectedWords.ignoreCase
boolean
no

Flag that indicates whether to ignore case sensitivity for protected words.

Default: true

If true, apply the flattenGraph token filter after this option to make the token stream suitable for indexing.

The following index definition indexes the title field in the minutes collection using a custom analyzer named wordDelimiterGraphAnalyzer. The custom analyzer specifies the following:

  1. Apply the whitespace tokenizer to create tokens based on occurrences of whitespace between words.

  2. Apply the wordDelimiterGraph token filter for the following:

    • Don't split is, the, and at. The exclusion is case-sensitive. For example, Is and tHe are not excluded.

    • Split tokens on case changes and remove tokens that contain only alphabetical letters from the English alphabet.
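
The steps above could be sketched as the following index definition. It is written as a JavaScript object so that it can be inspected locally. The variable name and the mapping layout are assumptions, as is the use of generateWordParts: false to drop the letter-only sub-tokens.

```javascript
// Hypothetical index definition sketch for wordDelimiterGraphAnalyzer.
const wordDelimiterGraphDefinition = {
  "mappings": {
    "fields": {
      "title": { "type": "string", "analyzer": "wordDelimiterGraphAnalyzer" }
    }
  },
  "analyzers": [
    {
      "name": "wordDelimiterGraphAnalyzer",
      "charFilters": [],
      "tokenizer": { "type": "whitespace" },
      "tokenFilters": [
        {
          "type": "wordDelimiterGraph",
          // Case-sensitive protection: Is and tHe are still split.
          "protectedWords": {
            "words": ["is", "the", "at"],
            "ignoreCase": false
          },
          // Split on case changes but don't emit letter-only word parts.
          "delimiterOptions": {
            "generateWordParts": false,
            "splitOnCaseChange": true
          }
        }
      ]
    }
  ]
};

console.log(JSON.stringify(wordDelimiterGraphDefinition, null, 2));
```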

The following query searches the title field in the minutes collection for the term App2.

db.minutes.aggregate([
{
"$search": {
"index": "default",
"text": {
"query": "App2",
"path": "title"
}
}
},
{
"$project": {
"_id": 1,
"title": 1
}
}
])
[
{
_id: 4,
title: 'The daily huddle on tHe StandUpApp2'
}
]

Atlas Search returns the document with _id: 4 because the title field in the document contains App2. Atlas Search splits tokens on case changes and removes tokens created by a split that contain only alphabetical letters. It also analyzes the query term using the index analyzer (or, if specified, the searchAnalyzer) to split the word on the case change and remove the letters preceding 2. Specifically, Atlas Search creates the following tokens for the document with _id: 4 for the protectedWords and delimiterOptions options:

wordDelimiterGraph Options
Output Tokens
protectedWords
The, daily, huddle, on, t, He, Stand, Up, App, 2
delimiterOptions
The, daily, huddle, on, 2