
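Tokenizers

A custom analyzer's tokenizer determines how MongoDB Search splits text into discrete chunks for indexing. Every tokenizer requires a type field, and some accept additional options.

Syntax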

"tokenizer": {
  "type": "<tokenizer-type>",
  "<additional-option>": "<value>"
}

Tokenizer Types

MongoDB Search supports the following types of tokenizers:

edgeGram
keyword
nGram
regexCaptureGroup
regexSplit
standard
uaxUrlEmail
whitespace
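
The syntax above, combined with the attribute tables later on this page, can be sketched as a small check that a tokenizer definition carries the options its type requires. This is only an illustration: `validateTokenizer` and `REQUIRED_OPTIONS` are hypothetical names, not part of any MongoDB API, and MongoDB Search performs its own validation when you create an index.

```javascript
// Required options per tokenizer type, transcribed from the attribute
// tables on this page. Hypothetical helper, not a MongoDB API.
const REQUIRED_OPTIONS = new Map([
  ["edgeGram", ["minGram", "maxGram"]],
  ["keyword", []],
  ["nGram", ["minGram", "maxGram"]],
  ["regexCaptureGroup", ["pattern", "group"]],
  ["regexSplit", ["pattern"]],
  ["standard", []],
  ["uaxUrlEmail", []],
  ["whitespace", []],
]);

// Returns a list of problems; an empty list means nothing required is missing.
function validateTokenizer(tokenizer) {
  const required = REQUIRED_OPTIONS.get(tokenizer.type);
  if (required === undefined) {
    return [`unknown tokenizer type: ${tokenizer.type}`];
  }
  return required
    .filter((option) => !(option in tokenizer))
    .map((option) => `${tokenizer.type} requires "${option}"`);
}

console.log(validateTokenizer({ type: "edgeGram", minGram: 2, maxGram: 7 }));
console.log(validateTokenizer({ type: "regexSplit" }));
```

The first call reports no problems; the second reports that `pattern` is missing.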

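The following sample index definitions and queries use a sample collection named minutes. To follow along with these examples, load the minutes collection on your cluster and navigate to the Create a Search Index page in the Atlas UI by following the steps in the Create a MongoDB Search Index tutorial. Then select the minutes collection as your data source, and follow the example procedure to create an index from the Atlas UI or by using mongosh.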


edgeGram

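The edgeGram tokenizer tokenizes input from the left side, or "edge", of a text input into n-grams of given sizes. You can't use a custom analyzer with the edgeGram tokenizer in the analyzer field for synonym or autocomplete field mapping definitions.

Note

The edgeGram tokenizer yields multiple output tokens per word and across words in the input text, which produces token graphs.

Autocomplete field type mapping definitions and analyzers with synonym mappings only work with tokenizers that don't produce graphs. For that reason, you can't use a custom analyzer with the edgeGram tokenizer in the analyzer field for autocomplete field type mapping definitions or for analyzers with synonym mappings.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `edgeGram`.
`minGram`	integer	yes	Number of characters to include in the shortest token created.
`maxGram`	integer	yes	Number of characters to include in the longest token created.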

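Example

The following index definition indexes the message field in the minutes collection using a custom analyzer named edgegramExample. It uses the edgeGram tokenizer to create tokens (searchable terms) between 2 and 7 characters long, starting from the first character on the left side of the message field value.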

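In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type edgegramExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select edgeGram from the dropdown and type the value for the following fields:

Field	Value
minGram	`2`
maxGram	`7`

Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select edgegramExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following: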

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "edgegramExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "edgegramExample",
      "tokenFilters": [],
      "tokenizer": {
        "maxGram": 7,
        "minGram": 2,
        "type": "edgeGram"
      }
    }
  ]
}

db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "dynamic": true,
      "fields": {
        "message": {
          "analyzer": "edgegramExample",
          "type": "string"
        }
      }
    },
    "analyzers": [
      {
        "charFilters": [],
        "name": "edgegramExample",
        "tokenFilters": [],
        "tokenizer": {
          "maxGram": 7,
          "minGram": 2,
          "type": "edgeGram"
        }
      }
    ]
  }
)

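The following query searches the message field in the minutes collection for text that begins with tr.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find: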

{
  "$search": {
    "text": {
      "query": "tr",
      "path": "message"
    }
  }
}

SCORE: 0.3150668740272522 _id:  "1"
 message: "try to siGn-In"
 page_updated_by: Object
   last_name: "AUERBACH"
   first_name: "Siân"
   email: "auerbach@example.com"
   phone: "(123)-456-7890"
 text: Object
   en_US: "<head> This page deals with department meetings.</head>"
   sv_FI: "Den här sidan behandlar avdelningsmöten"
   fr_CA: "Cette page traite des réunions de département"
SCORE: 0.3150668740272522 _id:  "3"
 message: "try to sign-in"
 page_updated_by: Object
   last_name: "LEWINSKY"
   first_name: "Brièle"
   email: "lewinsky@example.com"
   phone: "(123).456.9870"
 text: Object
   en_US: "<body>We'll head out to the conference room by noon.</body>"

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "tr",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

{ _id: 1, message: 'try to siGn-In' },
{ _id: 3, message: 'try to sign-in' }

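MongoDB Search returns the documents with _id: 1 and _id: 3 in the results because it used the edgeGram tokenizer to create a token with the value tr for those documents, which matches the search term. If you index the message field using the standard tokenizer, MongoDB Search doesn't return any results for the search term tr.

The following table shows the tokens that the edgeGram tokenizer and, by comparison, the standard tokenizer create for the documents in the results:

Tokenizer	Token Outputs
`standard`	`try`, `to`, `sign`, `in`
`edgeGram`	`tr`, `try`, `try{SPACE}`, `try t`, `try to`, `try to{SPACE}`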
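
To make the edgeGram row of the table concrete, here is a minimal sketch that reproduces those tokens in plain JavaScript. It approximates the behavior described on this page and is not MongoDB's implementation; `edgeGrams` is a hypothetical helper name.

```javascript
// Emits every prefix of `text` whose length is between minGram and maxGram.
// Array.from splits by code point so that non-ASCII characters count once.
function edgeGrams(text, minGram, maxGram) {
  const chars = Array.from(text);
  const longest = Math.min(maxGram, chars.length);
  const tokens = [];
  for (let size = minGram; size <= longest; size++) {
    tokens.push(chars.slice(0, size).join(""));
  }
  return tokens;
}

console.log(edgeGrams("try to sign-in", 2, 7));
// [ 'tr', 'try', 'try ', 'try t', 'try to', 'try to ' ]
```

Because every token starts at the left edge of the field value, a query for tr can match, while a query for a fragment from the middle of the value, such as sign, cannot.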

keyword

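The keyword tokenizer tokenizes the entire input as a single token. MongoDB Search doesn't index string fields that exceed 32766 characters using the keyword tokenizer.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `keyword`.

Example

The following index definition indexes the message field in the minutes collection using a custom analyzer named keywordExample. It uses the keyword tokenizer to create a single token (searchable term) from the entire field value.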

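In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type keywordExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select keyword from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select keywordExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following: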

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "keywordExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "keywordExample",
      "tokenFilters": [],
      "tokenizer": {
        "type": "keyword"
      }
    }
  ]
}

db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "dynamic": true,
      "fields": {
        "message": {
          "analyzer": "keywordExample",
          "type": "string"
        }
      }
    },
    "analyzers": [
      {
        "charFilters": [],
        "name": "keywordExample",
        "tokenFilters": [],
        "tokenizer": {
          "type": "keyword"
        }
      }
    ]
  }
)

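The following query searches the message field in the minutes collection for the phrase try to sign-in.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find: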

{
  "$search": {
    "text": {
      "query": "try to sign-in",
      "path": "message"
    }
  }
}

SCORE: 0.5472603440284729 _id:  "3"
 message: "try to sign-in"
 page_updated_by: Object
   last_name: "LEWINSKY"
   first_name: "Brièle"
   email: "lewinsky@example.com"
   phone: "(123).456.9870"
 text: Object
   en_US: "<body>We'll head out to the conference room by noon.</body>"

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "try to sign-in",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

{ _id: 3, message: 'try to sign-in' }

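MongoDB Search returns the document with _id: 3 in the results because it used the keyword tokenizer to create a token with the value try to sign-in for that document, which matches the search term. If you index the message field using the standard tokenizer, MongoDB Search returns the documents with _id: 1, _id: 2, and _id: 3 for the search term try to sign-in, because each of those documents contains some of the tokens that the standard tokenizer creates.

The following table shows the tokens that the keyword tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer	Token Outputs
`standard`	`try`, `to`, `sign`, `in`
`keyword`	`try to sign-in`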

nGram

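The nGram tokenizer tokenizes input into text chunks, or "n-grams", of given sizes. You can't use a custom analyzer with the nGram tokenizer in the analyzer field for synonym or autocomplete field mapping definitions.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `nGram`.
`minGram`	integer	yes	Number of characters to include in the shortest token created.
`maxGram`	integer	yes	Number of characters to include in the longest token created.

Example

The following index definition indexes the title field in the minutes collection using a custom analyzer named ngramExample. It uses the nGram tokenizer to create tokens (searchable terms) between 4 and 6 characters long from the title field.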

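In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type ngramExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select nGram from the dropdown and type the value for the following fields:

Field	Value
minGram	`4`
maxGram	`6`

Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the title field.
Select title from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select ngramExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following: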

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "title": {
        "analyzer": "ngramExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "ngramExample",
      "tokenFilters": [],
      "tokenizer": {
        "maxGram": 6,
        "minGram": 4,
        "type": "nGram"
      }
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "mappings": {
    "dynamic": true,
    "fields": {
      "title": {
        "analyzer": "ngramExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "ngramExample",
      "tokenFilters": [],
      "tokenizer": {
        "maxGram": 6,
        "minGram": 4,
        "type": "nGram"
      }
    }
  ]
})

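The following query searches the title field in the minutes collection for the term week.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find: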

{
  "$search": {
    "index": "default",
    "text": {
      "query": "week",
      "path": "title"
    }
  }
}

SCORE: 0.5895273089408875 _id:  "1"
   message: "try to siGn-In"
   page_updated_by: Object
     last_name: "AUERBACH"
     first_name: "Siân"
     email: "auerbach@example.com"
     phone: "(123)-456-7890"
   text: Object
     en_US: "<head> This page deals with department meetings.</head>"
     sv_FI: "Den här sidan behandlar avdelningsmöten"
     fr_CA: "Cette page traite des réunions de département"

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "week",
        "path": "title"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "title": 1
    }
  }
])

{ _id: 1, title: "The team's weekly meeting" }

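MongoDB Search returns the document with _id: 1 in the results because it used the nGram tokenizer to create a token with the value week for that document, which matches the search term. If you index the title field using the standard or edgeGram tokenizer, MongoDB Search doesn't return any results for the search term week.

The following table shows the tokens that the nGram tokenizer and, by comparison, the standard and edgeGram tokenizers create for the document with _id: 1:

Tokenizer	Token Outputs
`standard`	`The`, `team's`, `weekly`, `meeting`
`edgeGram`	`The{SPACE}`, `The t`, `The te`
`nGram`	`The{SPACE}`, `The t`, `The te`, `he t`, ... , `week`, `weekl`, `weekly`, `eekl`, ..., `eetin`, `eeting`, `etin`, `eting`, `ting`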
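
The nGram row can be reproduced with a short sliding-window sketch. As with the other sketches on this page, this is an approximation for illustration rather than MongoDB's implementation, and `nGrams` is a hypothetical helper name.

```javascript
// Emits every substring of `text` whose length is between minGram and
// maxGram, moving the start position one character at a time.
function nGrams(text, minGram, maxGram) {
  const chars = Array.from(text);
  const tokens = [];
  for (let start = 0; start < chars.length; start++) {
    for (let size = minGram; size <= maxGram && start + size <= chars.length; size++) {
      tokens.push(chars.slice(start, start + size).join(""));
    }
  }
  return tokens;
}

const tokens = nGrams("The team's weekly meeting", 4, 6);
console.log(tokens.slice(0, 4));        // [ 'The ', 'The t', 'The te', 'he t' ]
console.log(tokens.includes("week"));   // true
console.log(tokens[tokens.length - 1]); // ting
```

Unlike edgeGram, the window starts at every position, which is why a term from the middle of the title, such as week, becomes a token.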

regexCaptureGroup

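The regexCaptureGroup tokenizer matches a Java regular expression pattern to extract tokens.

Tip

To learn more about the Java regular expression syntax, see the Pattern class in the Java documentation.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `regexCaptureGroup`.
`pattern`	string	yes	Regular expression to match against.
`group`	integer	yes	Index of the character group within the matching expression to extract into tokens. Use `0` to extract all character groups.

Example

The following index definition indexes the page_updated_by.phone field in the minutes collection using a custom analyzer named phoneNumberExtractor. It uses the following:

mapping character filter, which removes the parentheses around the first three digits and replaces all spaces and periods with dashes
regexCaptureGroup tokenizer, which creates a single token from the first US-formatted phone number present in the text input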

In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type phoneNumberExtractor in the Analyzer Name field.
Expand Character Filters and click Add character filter.
Select mapping from the dropdown and click Add mapping.
Enter the following characters in the Original field, one at a time, and leave the corresponding Replacement field empty.

Original	Replacement
`-`
`.`
`(`
`)`

Click Add character filter.
Expand Tokenizer if it's collapsed.
Select regexCaptureGroup from the dropdown and type the value for the following fields:

Field	Value
pattern	`^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$`
group	`0`

Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone field.
Select page_updated_by.phone from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select phoneNumberExtractor from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example:

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "page_updated_by": {
        "fields": {
          "phone": {
            "analyzer": "phoneNumberExtractor",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [
        {
          "mappings": {
            " ": "-",
            "(": "",
            ")": "",
            ".": "-"
          },
          "type": "mapping"
        }
      ],
      "name": "phoneNumberExtractor",
      "tokenFilters": [],
      "tokenizer": {
        "group": 0,
        "pattern": "^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$",
        "type": "regexCaptureGroup"
      }
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "mappings": {
    "dynamic": true,
    "fields": {
      "page_updated_by": {
        "fields": {
          "phone": {
            "analyzer": "phoneNumberExtractor",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [
        {
          "mappings": {
            " ": "-",
            "(": "",
            ")": "",
            ".": "-"
          },
          "type": "mapping"
        }
      ],
      "name": "phoneNumberExtractor",
      "tokenFilters": [],
      "tokenizer": {
        "group": 0,
        "pattern": "^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$",
        "type": "regexCaptureGroup"
      }
    }
  ]
})

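The following query searches the page_updated_by.phone field in the minutes collection for the phone number 123-456-9870.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find: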

{
  "$search": {
    "index": "default",
    "text": {
      "query": "123-456-9870",
      "path": "page_updated_by.phone"
    }
  }
}

SCORE: 0.5472603440284729 _id:  "3"
 message: "try to sign-in"
 page_updated_by: Object
   last_name: "LEWINSKY"
   first_name: "Brièle"
   email: "lewinsky@example.com"
   phone: "(123).456.9870"
 text: Object
   en_US: "<body>We'll head out to the conference room by noon.</body>"

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "123-456-9870",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1
    }
  }
])

{ _id: 3, page_updated_by: { phone: '(123).456.9870' } }

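MongoDB Search returns the document with _id: 3 in the results because it used the regexCaptureGroup tokenizer to create a token with the value 123-456-9870 for that document, which matches the search term. If you index the page_updated_by.phone field using the standard tokenizer, MongoDB Search returns all of the documents for the search term 123-456-9870.

The following table shows the tokens that the regexCaptureGroup tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer	Token Outputs
`standard`	`123`, `456.9870`
`regexCaptureGroup`	`123-456-9870`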
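
The two stages of the phoneNumberExtractor analyzer in the JSON index definition can be sketched as follows. This is an approximation under stated assumptions: it uses JavaScript's regular expression engine rather than Java's (the two agree for this particular pattern), and `applyMappings` and `captureGroupTokens` are hypothetical helper names.

```javascript
// Character mappings copied from the JSON index definition on this page.
const MAPPINGS = new Map([[" ", "-"], ["(", ""], [")", ""], [".", "-"]]);

// Stage 1: character filter. Replaces each mapped character.
function applyMappings(text, mappings) {
  return Array.from(text, (ch) => (mappings.has(ch) ? mappings.get(ch) : ch)).join("");
}

// Stage 2: tokenizer. Emits the requested capture group of every match
// (group 0 is the whole match).
function captureGroupTokens(text, pattern, group) {
  const tokens = [];
  for (const match of text.matchAll(new RegExp(pattern, "g"))) {
    if (match[group]) tokens.push(match[group]);
  }
  return tokens;
}

const filtered = applyMappings("(123).456.9870", MAPPINGS);
console.log(filtered); // 123-456-9870
console.log(captureGroupTokens(filtered, "^\\b\\d{3}[-]?\\d{3}[-]?\\d{4}\\b$", 0));
// [ '123-456-9870' ]
```

The query string passes through the same analyzer, so 123-456-9870 produces the identical token and matches.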

regexSplit

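The regexSplit tokenizer splits tokens using a delimiter based on a Java regular expression.

Tip

To learn more about the Java regular expression syntax, see the Pattern class in the Java documentation.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `regexSplit`.
`pattern`	string	yes	Regular expression to match against.

Example

The following index definition indexes the page_updated_by.phone field in the minutes collection using a custom analyzer named dashDotSpaceSplitter. It uses the regexSplit tokenizer to create tokens (searchable terms) by splitting the page_updated_by.phone field value wherever one or more hyphens, periods, or spaces occur.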

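In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type dashDotSpaceSplitter in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select regexSplit from the dropdown and type the value for the following field:

Field	Value
pattern	`[-. ]+`

Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.phone field.
Select page_updated_by.phone from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select dashDotSpaceSplitter from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following: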

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "page_updated_by": {
        "fields": {
          "phone": {
            "analyzer": "dashDotSpaceSplitter",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "dashDotSpaceSplitter",
      "tokenFilters": [],
      "tokenizer": {
        "pattern": "[-. ]+",
        "type": "regexSplit"
      }
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "mappings": {
    "dynamic": true,
    "fields": {
      "page_updated_by": {
        "fields": {
          "phone": {
            "analyzer": "dashDotSpaceSplitter",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "dashDotSpaceSplitter",
      "tokenFilters": [],
      "tokenizer": {
        "pattern": "[-. ]+",
        "type": "regexSplit"
      }
    }
  ]
})

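The following query searches the page_updated_by.phone field in the minutes collection for the digits 9870.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find: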

{
  "$search": {
    "index": "default",
    "text": {
      "query": "9870",
      "path": "page_updated_by.phone"
    }
  }
}

SCORE: 0.5472603440284729 _id:  "3"
 message: "try to sign-in"
 page_updated_by: Object
   last_name: "LEWINSKY"
   first_name: "Brièle"
   email: "lewinsky@example.com"
   phone: "(123).456.9870"
 text: Object
   en_US: "<body>We'll head out to the conference room by noon.</body>"

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "9870",
        "path": "page_updated_by.phone"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.phone": 1
    }
  }
])

{ _id: 3, page_updated_by: { phone: '(123).456.9870' } }

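MongoDB Search returns the document with _id: 3 in the results because it used the regexSplit tokenizer to create a token with the value 9870 for that document, which matches the search term. If you index the page_updated_by.phone field using the standard tokenizer, MongoDB Search doesn't return any results for the search term 9870.

The following table shows the tokens that the regexSplit tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer	Token Outputs
`standard`	`123`, `456.9870`
`regexSplit`	`(123)`, `456`, `9870`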
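
The regexSplit row corresponds to an ordinary regular-expression split. The sketch below uses JavaScript's `String.prototype.split` as a stand-in for the Java-based tokenizer, which is an assumption that holds for this simple pattern; `regexSplitTokens` is a hypothetical helper name.

```javascript
// Splits on every match of the delimiter pattern and drops empty pieces.
function regexSplitTokens(text, pattern) {
  return text.split(new RegExp(pattern)).filter((token) => token.length > 0);
}

console.log(regexSplitTokens("(123).456.9870", "[-. ]+"));
// [ '(123)', '456', '9870' ]
console.log(regexSplitTokens("123-456-8907", "[-. ]+"));
// [ '123', '456', '8907' ]
```

The parentheses stay attached to the first token because they aren't part of the delimiter pattern.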

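standard

The standard tokenizer tokenizes based on the word break rules from the Unicode Text Segmentation algorithm.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `standard`.
`maxTokenLength`	integer	no	Maximum length for a single token. Tokens greater than this length are split at `maxTokenLength` into multiple tokens. Default: `255`

Example

The following index definition indexes the message field in the minutes collection using a custom analyzer named standardExample. It uses the standard tokenizer with no character filters or token filters.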

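In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type standardExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select standard from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select standardExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following: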

{
 "mappings": {
   "dynamic": true,
   "fields": {
     "message": {
       "analyzer": "standardExample",
       "type": "string"
     }
   }
 },
 "analyzers": [
   {
     "charFilters": [],
     "name": "standardExample",
     "tokenFilters": [],
     "tokenizer": {
       "type": "standard"
     }
   }
 ]
}

db.minutes.createSearchIndex(
  "default",
  {
    "mappings": {
      "dynamic": true,
      "fields": {
        "message": {
          "analyzer": "standardExample",
          "type": "string"
        }
      }
    },
    "analyzers": [
      {
        "charFilters": [],
        "name": "standardExample",
        "tokenFilters": [],
        "tokenizer": {
          "type": "standard"
        }
      }
    ]
  }
)

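The following query searches the message field in the minutes collection for the term signature.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find: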

{
  "$search": {
    "text": {
      "query": "signature",
      "path": "message"
    }
  }
}

SCORE: 0.5376965999603271 _id:  "4"
 message: "write down your signature or phone №"
 page_updated_by: Object
   last_name: "LEVINSKI"
   first_name: "François"
   email: "levinski@example.com"
   phone: "123-456-8907"
 text: Object
   en_US: "<body>This page has been updated with the items on the agenda.</body>"
   es_MX: "La página ha sido actualizada con los puntos de la agenda."
   pl_PL: "Strona została zaktualizowana o punkty porządku obrad."

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "signature",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

{ _id: 4, message: 'write down your signature or phone №' }

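MongoDB Search returns the document with _id: 4 because it used the standard tokenizer to create a token with the value signature for that document, which matches the search term. If you index the message field using the keyword tokenizer, MongoDB Search doesn't return any results for the search term signature.

The following table shows the tokens that the standard tokenizer and, by comparison, the keyword tokenizer create for the document with _id: 4:

Tokenizer	Token Outputs
`standard`	`write`, `down`, `your`, `signature`, `or`, `phone`
`keyword`	`write down your signature or phone №`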
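
For a feel of the Unicode word break rules, JavaScript's built-in `Intl.Segmenter` also segments text according to Unicode Text Segmentation. The sketch below uses it as a rough stand-in for the standard tokenizer. It assumes a runtime that ships `Intl.Segmenter` (for example, a recent Node.js release), it doesn't model `maxTokenLength`, and its results may differ from MongoDB Search in edge cases. `standardLikeTokens` is a hypothetical helper name.

```javascript
// Keeps only the word-like segments (letters, digits), dropping spaces
// and punctuation such as the hyphen.
function standardLikeTokens(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "word" });
  return Array.from(segmenter.segment(text))
    .filter((part) => part.isWordLike)
    .map((part) => part.segment);
}

console.log(standardLikeTokens("try to sign-in"));
// [ 'try', 'to', 'sign', 'in' ]
```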

uaxUrlEmail

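The uaxUrlEmail tokenizer tokenizes URLs and email addresses. Although the uaxUrlEmail tokenizer tokenizes based on the word break rules from the Unicode Text Segmentation algorithm, we recommend using it only when the indexed field value includes URLs and email addresses. For fields that don't include URLs or email addresses, use the standard tokenizer to create tokens based on word break rules.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `uaxUrlEmail`.
`maxTokenLength`	integer	no	Maximum number of characters in one token. Default: `255`

Example

Basic Example

Create a simple index using the uaxUrlEmail tokenizer.

The following index definition indexes the page_updated_by.email field in the minutes collection using a custom analyzer named basicEmailAddressAnalyzer. It uses the uaxUrlEmail tokenizer to create tokens (searchable terms) from the URLs and email addresses in the page_updated_by.email field.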

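In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type basicEmailAddressAnalyzer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select uaxUrlEmail from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email field.
Select page_updated_by.email from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select basicEmailAddressAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following: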

{
  "mappings": {
    "fields": {
      "page_updated_by": {
       "fields": {
          "email": {
            "analyzer": "basicEmailAddressAnalyzer",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "basicEmailAddressAnalyzer",
      "tokenizer": {
        "type": "uaxUrlEmail"
      }
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "mappings": {
    "fields": {
      "page_updated_by": {
        "fields": {
          "email": {
            "analyzer": "basicEmailAddressAnalyzer",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "basicEmailAddressAnalyzer",
      "tokenizer": {
        "type": "uaxUrlEmail"
      }
    }
  ]
})

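The following query searches the page_updated_by.email field in the minutes collection for the email address lewinsky@example.com.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find: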

{
  "$search": {
    "text": {
      "query": "lewinsky@example.com",
      "path": "page_updated_by.email"
    }
  }
}

SCORE: 0.5472603440284729   _id:  "3"
 message: "try to sign-in"
   page_updated_by: Object
   last_name: "LEWINSKY"
   first_name: "Brièle"
   email: "lewinsky@example.com"
   phone: "(123).456.9870"
 text: Object
   en_US: "<body>We'll head out to the conference room by noon.</body>"

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "lewinsky@example.com",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])

{ _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }

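MongoDB Search returns the document with _id: 3 in the results because it used the uaxUrlEmail tokenizer to create a token with the value lewinsky@example.com for that document, which matches the search term. If you index the page_updated_by.email field using the standard tokenizer, MongoDB Search returns all of the documents for the search term lewinsky@example.com.

The following table shows the tokens that the uaxUrlEmail tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer	Token Outputs
`standard`	`lewinsky`, `example.com`
`uaxUrlEmail`	`lewinsky@example.com`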

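Advanced Example

Create an index that uses the uaxUrlEmail tokenizer with the autocomplete edgeGram tokenization strategy.

The following index definition indexes the page_updated_by.email field in the minutes collection using a custom analyzer named emailAddressAnalyzer. It uses the following:

autocomplete type with an edgeGram tokenization strategy
uaxUrlEmail tokenizer, which creates tokens (searchable terms) from URLs and email addresses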

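In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type emailAddressAnalyzer in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select uaxUrlEmail from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the page_updated_by.email field.
Select page_updated_by.email from the Field Name dropdown and Autocomplete from the Data Type dropdown.
In the properties section for the data type, select emailAddressAnalyzer from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following: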

{
  "mappings": {
    "fields": {
      "page_updated_by": {
        "fields": {
          "email": {
            "analyzer": "emailAddressAnalyzer",
            "tokenization": "edgeGram",
            "type": "autocomplete"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "emailAddressAnalyzer",
      "tokenizer": {
        "type": "uaxUrlEmail"
      }
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "mappings": {
    "fields": {
      "page_updated_by": {
        "fields": {
          "email": {
            "analyzer": "emailAddressAnalyzer",
            "tokenization": "edgeGram",
            "type": "autocomplete"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "name": "emailAddressAnalyzer",
      "tokenizer": {
        "type": "uaxUrlEmail"
      }
    }
  ]
})

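The following query searches the page_updated_by.email field in the minutes collection for the email address lewinsky@example.com.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find: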

{
  "$search": {
    "autocomplete": {
      "query": "lewinsky@example.com",
      "path": "page_updated_by.email"
    }
  }
}

SCORE: 1.0203158855438232   _id:  "3"
 message: "try to sign-in"
   page_updated_by: Object
   last_name: "LEWINSKY"
   first_name: "Brièle"
   email: "lewinsky@example.com"
   phone: "(123).456.9870"
 text: Object
   en_US: "<body>We'll head out to the conference room by noon.</body>"

db.minutes.aggregate([
  {
    "$search": {
      "autocomplete": {
        "query": "lewinsky@example.com",
        "path": "page_updated_by.email"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "page_updated_by.email": 1
    }
  }
])

{ _id: 3, page_updated_by: { email: 'lewinsky@example.com' } }

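The following table shows the tokens that the uaxUrlEmail tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer	MongoDB Search Field Type	Token Outputs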
`standard`	`autocomplete` `edgeGram`	`le`, `lew`, `lewi`, `lewin`, `lewins`, `lewinsk`, `lewinsky`, `lewinsky@`, `lewinsky`, `ex`, `exa`, `exam`, `examp`, `exampl`, `example`, `example.`, `example.c`, `example.co`, `example.com`
`uaxUrlEmail`	`autocomplete` `edgeGram`	`le`, `lew`, `lewi`, `lewin`, `lewins`, `lewinsk`, `lewinsky`, `lewinsky@`, `lewinsky@e`, `lewinsky@ex`, `lewinsky@exa`, `lewinsky@exam`, `lewinsky@examp`, `lewinsky@exampl`

whitespace

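The whitespace tokenizer tokenizes based on occurrences of whitespace between words.

Attributes

It has the following attributes:

Name	Type	Required?	Description
`type`	string	yes	Human-readable label that identifies this tokenizer type. Value must be `whitespace`.
`maxTokenLength`	integer	no	Maximum length for a single token. Tokens greater than this length are split at `maxTokenLength` into multiple tokens. Default: `255`

Example

The following index definition indexes the message field in the minutes collection using a custom analyzer named whitespaceExample. It uses the whitespace tokenizer to create tokens (searchable terms) by splitting the message field value at any whitespace.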

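In the Custom Analyzers section, click Add Custom Analyzer.
Select the Create Your Own radio button and click Next.
Type whitespaceExample in the Analyzer Name field.
Expand Tokenizer if it's collapsed.
Select whitespace from the dropdown.
Click Add to add the custom analyzer to your index.
In the Field Mappings section, click Add Field Mapping to apply the custom analyzer on the message field.
Select message from the Field Name dropdown and String from the Data Type dropdown.
In the properties section for the data type, select whitespaceExample from the Index Analyzer and Search Analyzer dropdowns.
Click Add, then Save Changes.

Replace the default index definition with the following example: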

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "whitespaceExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "whitespaceExample",
      "tokenFilters": [],
      "tokenizer": {
        "type": "whitespace"
      }
    }
  ]
}

db.minutes.createSearchIndex("default", {
  "mappings": {
    "dynamic": true,
    "fields": {
      "message": {
        "analyzer": "whitespaceExample",
        "type": "string"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "whitespaceExample",
      "tokenFilters": [],
      "tokenizer": {
        "type": "whitespace"
      }
    }
  ]
})

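The following query searches the message field in the minutes collection for the term sign-in.

Click the Query button for your index.
Click Edit Query to edit the query.
Click on the query bar and select the database and collection.

Replace the default query with the following and click Find: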

{
  "$search": {
    "text": {
      "query": "sign-in",
      "path": "message"
    }
  }
}

SCORE: 0.6722691059112549 _id:  "3"
   message: "try to sign-in"
   page_updated_by: Object
     last_name: "LEWINSKY"
     first_name: "Brièle"
     email: "lewinsky@example.com"
     phone: "(123).456.9870"
   text: Object
     en_US: "<body>We'll head out to the conference room by noon.</body>"

db.minutes.aggregate([
  {
    "$search": {
      "text": {
        "query": "sign-in",
        "path": "message"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "message": 1
    }
  }
])

[ { _id: 3, message: 'try to sign-in' } ]

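MongoDB Search returns the document with _id: 3 in the results because it used the whitespace tokenizer to create a token with the value sign-in for that document, which matches the search term. If you index the message field using the standard tokenizer, MongoDB Search returns the documents with _id: 1, _id: 2, and _id: 3 for the search term sign-in.

The following table shows the tokens that the whitespace tokenizer and, by comparison, the standard tokenizer create for the document with _id: 3:

Tokenizer	Token Outputs
`standard`	`try`, `to`, `sign`, `in`
`whitespace`	`try`, `to`, `sign-in`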
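
A minimal sketch of whitespace tokenization, which also shows why the query matches only the document with _id: 3. It approximates the behavior described here and doesn't model `maxTokenLength`; `whitespaceTokens` is a hypothetical helper name.

```javascript
// Splits on runs of whitespace and leaves everything else, including
// hyphens and letter case, untouched.
function whitespaceTokens(text) {
  return text.split(/\s+/).filter((token) => token.length > 0);
}

console.log(whitespaceTokens("try to sign-in"));
// [ 'try', 'to', 'sign-in' ]
console.log(whitespaceTokens("try to siGn-In").includes("sign-in"));
// false
```

Because the analyzer in this example has no token filter that lowercases tokens, the message try to siGn-In in the document with _id: 1 yields the token siGn-In, which doesn't match the query term sign-in.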
