
Character Filters


Character filters examine text one character at a time and perform filtering operations. Character filters require a type field, and some take additional options as well.

"charFilters": [
{
"type": "<filter-type>",
"<additional-option>": <value>
}
]

Atlas Search supports four types of character filters:

  • htmlStrip

  • icuNormalize

  • mapping

  • persian

The following sample index definitions and queries use the sample collection named minutes. If you add the minutes collection to a database in your Atlas cluster, you can create the following sample indexes from the Visual Editor or JSON Editor in the Atlas UI and run the sample queries against this collection. To create these indexes, select your preferred configuration method in the Atlas UI, select the database and collection, and then refine your index as shown in the examples on this page to add custom analyzers that use character filters.
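Alternatively, if you prefer to manage indexes from mongosh rather than the Atlas UI, you can pass an index definition to db.collection.createSearchIndex(). The sketch below is illustrative only; the index name and analyzer shown are placeholders, and it assumes a deployment that supports Atlas Search index management commands:

```javascript
// Hypothetical sketch: create an Atlas Search index on the minutes
// collection with a custom analyzer that uses a character filter.
// Requires mongosh connected to an Atlas cluster.
db.minutes.createSearchIndex(
  "customAnalyzerIndex",   // index name (placeholder)
  {
    "mappings": {
      "dynamic": false,
      "fields": {
        "message": { "type": "string", "analyzer": "myAnalyzer" }
      }
    },
    "analyzers": [
      {
        "name": "myAnalyzer",   // placeholder analyzer name
        "charFilters": [ { "type": "icuNormalize" } ],
        "tokenizer": { "type": "whitespace" },
        "tokenFilters": []
      }
    ]
  }
)
```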

Note

When you add a custom analyzer using the Visual Editor in the Atlas UI, the Atlas UI displays the following details about the analyzer in the Custom Analyzers section.

| UI Field | Description |
| --- | --- |
| Name | Label that identifies the custom analyzer. |
| Used In | Fields that use the custom analyzer. Value is None if the custom analyzer isn't used to analyze any fields. |
| Character Filters | Atlas Search character filters configured in the custom analyzer. |
| Tokenizer | Atlas Search tokenizer configured in the custom analyzer. |
| Token Filters | Atlas Search token filters configured in the custom analyzer. |
| Actions | Clickable icons that indicate the actions that you can perform on the custom analyzer: edit or delete the custom analyzer. |

htmlStrip

The htmlStrip character filter strips out HTML constructs.

It has the following attributes:

| Name | Type | Required? | Description |
| --- | --- | --- | --- |
| type | string | yes | Human-readable label that identifies this character filter type. Value must be htmlStrip. |
| ignoredTags | array of strings | no | List that contains the HTML tags to exclude from filtering. |

The following index definition example indexes the text.en_US field in the minutes collection using a custom analyzer named htmlStrippingAnalyzer. The custom analyzer specifies the following:

  • Strip out the HTML constructs in the text.en_US field value using the htmlStrip character filter.

  • Tokenize the words in the field based on word boundaries using the standard tokenizer.

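In the JSON Editor, an index definition along these lines would work; this is a sketch, not the exact definition, and it assumes the standard tokenizer (consistent with the output tokens shown on this page) and no ignoredTags option:

```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "en_US": {
            "type": "string",
            "analyzer": "htmlStrippingAnalyzer"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "htmlStrippingAnalyzer",
      "charFilters": [ { "type": "htmlStrip" } ],
      "tokenizer": { "type": "standard" },
      "tokenFilters": []
    }
  ]
}
```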
The following query looks for occurrences of the string head in the text.en_US field of the minutes collection.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "head",
        "path": "text.en_US"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.en_US": 1
    }
  }
])

[
  {
    _id: 2,
    text: { en_US: "The head of the sales department spoke first." }
  },
  {
    _id: 3,
    text: {
      en_US: "<body>We'll head out to the conference room by noon.</body>"
    }
  }
]

Atlas Search doesn't return the document with _id: 1 because the string head is part of the HTML tag <head>. The document with _id: 3 contains HTML tags, but the string head appears elsewhere in the text, so the document is a match. The following table shows the tokens that Atlas Search generates for the text.en_US field values in documents _id: 1, _id: 2, and _id: 3 in the minutes collection using the htmlStrippingAnalyzer.

| Document ID | Output Tokens |
| --- | --- |
| _id: 1 | This, page, deals, with, department, meetings |
| _id: 2 | The, head, of, the, sales, department, spoke, first |
| _id: 3 | We'll, head, out, to, the, conference, room, by, noon |

icuNormalize

The icuNormalize character filter normalizes text with the ICU Normalizer. It is based on Lucene's ICUNormalizer2CharFilter.

It has the following attribute:

| Name | Type | Required? | Description |
| --- | --- | --- | --- |
| type | string | yes | Human-readable label that identifies this character filter type. Value must be icuNormalize. |

The following index definition example indexes the message field in the minutes collection using a custom analyzer named normalizingAnalyzer. The custom analyzer specifies the following:

  • Normalize the text in the message field value using the icuNormalize character filter.

  • Tokenize the words in the field based on occurrences of whitespace between words using the whitespace tokenizer.

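In the JSON Editor, the index definition described above can be sketched as follows; the static field mapping shown is one reasonable layout, not necessarily the exact original definition:

```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "message": {
        "type": "string",
        "analyzer": "normalizingAnalyzer"
      }
    }
  },
  "analyzers": [
    {
      "name": "normalizingAnalyzer",
      "charFilters": [ { "type": "icuNormalize" } ],
      "tokenizer": { "type": "whitespace" },
      "tokenFilters": []
    }
  ]
}
```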
The following query searches for occurrences of the string no (for number) in the message field of the minutes collection.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "no",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1,
      "title": 1
    }
  }
])

[
  {
    _id: 4,
    title: 'The daily huddle on tHe StandUpApp2',
    message: 'write down your signature or phone №'
  }
]

Atlas Search matched the document with _id: 4 to the query term no because it normalized the numero sign (№) in the field using the icuNormalize character filter and created the token no for that typographic abbreviation of the word "number". Atlas Search generates the following tokens for the message field value in document _id: 4 using the normalizingAnalyzer:

| Document ID | Output Tokens |
| --- | --- |
| _id: 4 | write, down, your, signature, or, phone, no |

mapping

The mapping character filter applies user-specified normalization mappings to characters. It is based on Lucene's MappingCharFilter.

It has the following attributes:

| Name | Type | Required? | Description |
| --- | --- | --- | --- |
| type | string | yes | Human-readable label that identifies this character filter type. Value must be mapping. |
| mappings | object | yes | Object that contains a comma-separated list of mappings. A mapping indicates that one character or group of characters should be substituted for another, in the format <original> : <replacement>. |

The following index definition example indexes the page_updated_by.phone field in the minutes collection using a custom analyzer named mappingAnalyzer. The custom analyzer specifies the following:

  • Remove instances of the hyphen (-), dot (.), open parenthesis ((), close parenthesis ()), and space characters in the phone field using the mapping character filter.

  • Tokenize the entire input as a single token using the keyword tokenizer.

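A JSON Editor sketch of this index definition follows; the exact mappings object is an assumption that implements the character removals described above by mapping each character to an empty string:

```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "page_updated_by": {
        "type": "document",
        "fields": {
          "phone": {
            "type": "string",
            "analyzer": "mappingAnalyzer"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "mappingAnalyzer",
      "charFilters": [
        {
          "type": "mapping",
          "mappings": { "-": "", ".": "", "(": "", ")": "", " ": "" }
        }
      ],
      "tokenizer": { "type": "keyword" },
      "tokenFilters": []
    }
  ]
}
```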
The following query searches the page_updated_by.phone field for the string 1234567890.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "1234567890",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1,
      "page_updated_by.last_name": 1
    }
  }
])

[
  {
    _id: 1,
    page_updated_by: { last_name: 'AUERBACH', phone: '(123)-456-7890' }
  }
]

The Atlas Search results contain one document in which the digits of the phone string match the query string. Atlas Search matched the document even though the query doesn't include the parentheses around the area code or the hyphens between the digit groups, because Atlas Search removed these characters using the mapping character filter and created a single token for the field value. Specifically, Atlas Search generated the following token for the phone field in the document with _id: 1:

| Document ID | Output Tokens |
| --- | --- |
| _id: 1 | 1234567890 |

Atlas Search would also match the document with _id: 1 for searches for (123)-456-7890, 123-456-7890, 123.456.7890, and so on, because for fields indexed as type string, Atlas Search also analyzes search query terms using the index analyzer (or, if specified, the searchAnalyzer). The following table shows the tokens that Atlas Search creates by removing instances of the hyphen (-), dot (.), open parenthesis ((), close parenthesis ()), and space characters in the query term:

| Query Term | Output Tokens |
| --- | --- |
| (123)-456-7890 | 1234567890 |
| 123-456-7890 | 1234567890 |
| 123.456.7890 | 1234567890 |

Tip

See also: Additional Sample Index Definitions and Queries

persian

The persian character filter replaces instances of the zero-width non-joiner with the space character. This character filter is based on Lucene's PersianCharFilter.

It has the following attribute:

| Name | Type | Required? | Description |
| --- | --- | --- | --- |
| type | string | yes | Human-readable label that identifies this character filter type. Value must be persian. |

The following index definition example indexes the text.fa_IR field in the minutes collection using a custom analyzer named persianCharacterIndex. The custom analyzer specifies the following:

  • Apply the persian character filter to replace non-printing characters in the field value with the space character.

  • Use the whitespace tokenizer to create tokens based on occurrences of whitespace between words.

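A JSON Editor sketch of this index definition follows; the static field mapping layout is an assumption:

```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "text": {
        "type": "document",
        "fields": {
          "fa_IR": {
            "type": "string",
            "analyzer": "persianCharacterIndex"
          }
        }
      }
    }
  },
  "analyzers": [
    {
      "name": "persianCharacterIndex",
      "charFilters": [ { "type": "persian" } ],
      "tokenizer": { "type": "whitespace" },
      "tokenFilters": []
    }
  ]
}
```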
The following query searches the text.fa_IR field for the term صحبت.

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "صحبت",
        "path": "text.fa_IR"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "text.fa_IR": 1,
      "page_updated_by.last_name": 1
    }
  }
])

[
  {
    _id: 2,
    page_updated_by: { last_name: 'OHRBACH' },
    text: { fa_IR: 'ابتدا رئیس بخش فروش صحبت کرد' }
  }
]

Atlas Search returns the document with _id: 2, which contains the query term. Atlas Search matches the query term to the document by first replacing instances of the zero-width non-joiner with the space character and then creating individual tokens for each word in the field value based on occurrences of whitespace between words. Specifically, Atlas Search generates the following tokens for the document with _id: 2:

| Document ID | Output Tokens |
| --- | --- |
| _id: 2 | ابتدا, رئیس, بخش, فروش, صحبت, کرد |