The standard analyzer is the default for all Atlas Search indexes and
queries. It divides text into terms based on word boundaries, which
makes it language-neutral for most use cases. It converts all terms
to lowercase and removes punctuation. It provides grammar-based
tokenization that recognizes email addresses, acronyms,
Chinese-Japanese-Korean characters, alphanumerics, and more.
You can see the tokens that the standard analyzer creates for a
built-in sample document and query string when you create or edit an index in the Atlas UI Visual Editor.
If you select Refine Your Index, the Atlas UI displays
a section titled View text analysis of your selected index configuration
within the Index Configurations section. If you expand this section,
the Atlas UI displays the index and search tokens that the standard
analyzer generates for each sample string.
Important
Atlas Search doesn't index string fields whose analyzer tokens exceed 32766 bytes in size. If you use the keyword analyzer, which emits the entire field value as a single token, string fields that exceed 32766 bytes aren't indexed.
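As a rough pre-check before indexing, the size of a string can be compared against this limit. The sketch below assumes the limit is measured on the UTF-8 encoding of the token; the helper names utf8ByteLength and exceedsTokenLimit are hypothetical and not part of any Atlas Search API.

```javascript
// Token size limit in bytes, taken from the note above.
const MAX_TOKEN_BYTES = 32766;

// Hypothetical helper: size of a string in bytes once encoded as UTF-8
// (assumption: the limit applies to the UTF-8 encoded token).
function utf8ByteLength(value) {
  return new TextEncoder().encode(value).length;
}

// Hypothetical helper: true when the whole string, treated as one token
// (as the keyword analyzer would treat it), is larger than the limit.
function exceedsTokenLimit(value) {
  return utf8ByteLength(value) > MAX_TOKEN_BYTES;
}

console.log(exceedsTokenLimit("Action Jackson"));        // false
// 20000 two-byte characters are 40000 bytes, which is over the limit.
console.log(exceedsTokenLimit("\u00e9".repeat(20000)));  // true
```

Counting bytes rather than characters matters here because non-ASCII text can exceed the limit with far fewer than 32766 characters.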
Example
The following example index definition specifies an index on
the title field in the sample_mflix.movies
collection using the standard analyzer.
To follow along with this example, load the sample data on your cluster
and navigate to the Create a Search Index page in the Atlas UI following the steps
in the Create an Atlas Search Index tutorial.
Then, select the movies collection as your data source, and follow the example procedure
to create an index in the Visual Editor or JSON editor.
Click Refine Your Index to configure your index.
In the Index Configurations section, toggle Dynamic Mapping to off.
In the Field Mappings section, click Add Field Mapping to open the Add Field Mapping window.
Click Customized Configuration.
Select title from the Field Name dropdown.
Click the Data Type dropdown and select String if it isn't already selected.
Expand String Properties and make the following changes:
Index Analyzer: Select lucene.standard from the dropdown if it isn't already selected.
Search Analyzer: Select lucene.standard from the dropdown if it isn't already selected.
Index Options: Use the default offsets.
Store: Use the default true.
Ignore Above: Keep the default setting.
Norms: Use the default include.
Click Add.
Click Save Changes.
Click Create Search Index.
Replace the default index definition with the following index definition.
{
  "mappings": {
    "fields": {
      "title": {
        "type": "string",
        "analyzer": "lucene.standard"
      }
    }
  }
}
Click Next.
Click Create Search Index.
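If you need the same kind of definition for a different field, it can be generated rather than typed by hand. The following is a minimal sketch; standardIndexDefinition is a hypothetical helper name, and it produces only the shape of definition used in this example.

```javascript
// Hypothetical helper: builds an index definition that applies the
// lucene.standard analyzer to a single string field.
function standardIndexDefinition(fieldName) {
  return {
    mappings: {
      fields: {
        [fieldName]: { type: "string", analyzer: "lucene.standard" },
      },
    },
  };
}

// Print the definition for the title field as JSON for the JSON editor.
console.log(JSON.stringify(standardIndexDefinition("title"), null, 2));
```

The printed JSON can be pasted into the JSON editor. In mongosh versions that provide db.collection.createSearchIndex(), the same object could presumably be passed as the definition argument, but check the method reference for your version first.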
The following query searches the title field for the term action
and limits the output to two results.
db.movies.aggregate([
  { "$search": { "text": { "query": "action", "path": "title" } } },
  { "$limit": 2 },
  { "$project": { "_id": 0, "title": 1 } }
])
[ { title: 'Action Jackson' }, { title: 'Class Action' } ]
Atlas Search returned these documents because it matched the query term
action to the token action for the documents, which Atlas Search
created by doing the following for the text in the title field
using the lucene.standard analyzer:
Convert the text to lowercase.
Split the text based on word boundaries and create separate tokens.
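The two steps above can be roughly approximated in a few lines of JavaScript. This is only a sketch for simple titles: the function name approximateStandardTokens is hypothetical, and the real analyzer applies more elaborate grammar-based rules, so results can differ for inputs such as email addresses or Chinese-Japanese-Korean text.

```javascript
// Hypothetical approximation of the two steps: lowercase the text, then
// keep each run of letters and digits as a separate token.
// This is NOT the analyzer's actual algorithm.
function approximateStandardTokens(text) {
  const matches = text.toLowerCase().match(/[\p{L}\p{N}]+/gu);
  return matches === null ? [] : matches;
}

console.log(approximateStandardTokens("Action Jackson")); // [ 'action', 'jackson' ]
console.log(approximateStandardTokens("Class Action"));   // [ 'class', 'action' ]
```

Both titles yield the token action, which is why the lowercase query term matches them.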
The following table shows the tokens (searchable terms) that Atlas Search creates using the Standard Analyzer and, by contrast, the tokens that Atlas Search creates for the Keyword Analyzer and Whitespace Analyzer for the documents in the results:
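| Title | Standard Analyzer Tokens | Keyword Analyzer Tokens | Whitespace Analyzer Tokens |
|---|---|---|---|
| Action Jackson | action, jackson | Action Jackson | Action, Jackson |
| Class Action | class, action | Class Action | Class, Action |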
If you index the field using the:
Keyword Analyzer, Atlas Search wouldn't match the documents in the results for the query term action because the keyword analyzer matches only documents in which the search term matches the entire contents of the field (Action Jackson and Class Action) exactly.
Whitespace Analyzer, Atlas Search wouldn't match the documents in the results for the query term action because the whitespace analyzer tokenizes the title field value in its original case (Action) and the query term has the lowercase action, which doesn't match the whitespace analyzer token.
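The contrast with the other two analyzers can be illustrated with a simplified model. The functions keywordTokens, whitespaceTokens, and matchesTerm below are hypothetical stand-ins for the behavior described above, not Atlas Search code.

```javascript
// Keyword-style: the entire field value becomes one token, unchanged.
function keywordTokens(text) {
  return [text];
}

// Whitespace-style: split on runs of whitespace and keep the original case.
function whitespaceTokens(text) {
  return text.split(/\s+/).filter((token) => token.length > 0);
}

// A term matches only if it is exactly equal to one of the tokens.
function matchesTerm(tokens, term) {
  return tokens.includes(term);
}

const title = "Action Jackson";
console.log(matchesTerm(keywordTokens(title), "action"));    // false
console.log(matchesTerm(whitespaceTokens(title), "action")); // false
console.log(matchesTerm(whitespaceTokens(title), "Action")); // true
```

In this model, the lowercase term action matches neither token set. Only a term with the original casing matches a whitespace token, and only the full string Action Jackson would match the keyword token.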