Analyzing Analyzers to Build the Right Search Index for Your App
“Why am I not getting the right search results?”
So, you’ve created your first search query. You are familiar with various Atlas Search operators. You may have even played around with score modifiers to sort your search results. Yet, typing into that big, beautiful search bar still isn’t bringing you the results you expect from your data. Well, it just might be your search index definition. Or more specifically, your analyzer.
You may know Lucene analyzers are important—but why? How do they work? How do you choose the right one? If this is you, don’t worry. In this tutorial, we will analyze analyzers—more specifically, Atlas Search indexes and the Lucene analyzers used to build them. We’ll define what they are exactly and how they work together to bring you the best results for your search queries.
Expect to explore the following questions:
- What is a search index and how is it different from a traditional MongoDB index?
- What is an analyzer? What kinds of analyzers are built into Atlas, and how do they differ in their effect on your search results?
- How can you create an Atlas Search index using different search analyzers?
By the end, cured of your search analysis paralysis, you’ll brim with the confidence and knowledge to choose the right analyzers to create the best Atlas Search index for your application.
So, what’s an index? Generally, indexes are special data structures that enable ultra-fast querying and retrieval of documents based on certain identifiers.
Every Atlas Search query requires a search index. Actually, it’s the very first line of every Atlas Search query.
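To make that concrete, here is a rough sketch of an Atlas Search aggregation pipeline written as a Python list. The index name `default`, the `sentence` field, and the query text are assumptions chosen for illustration; the point is only that the `index` option sits at the top of the `$search` stage.

```python
# A minimal sketch of an Atlas Search aggregation pipeline.
# Assumptions: an index named "default" and a string field named "sentence".
pipeline = [
    {
        "$search": {
            "index": "default",  # which search index this query uses
            "text": {            # the text operator
                "query": "best of times",
                "path": "sentence",
            },
        }
    }
]

# The index name is the first key inside the $search stage.
first_key = next(iter(pipeline[0]["$search"]))
print(first_key)
```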
Let’s explore the differences by walking through an example. Say we have a set of MongoDB documents that look like this:
If we were to search through these documents’ sentence fields for the text:
“It was the best of times, it was the worst of times.” -A Tale of Two Cities, Charles Dickens
Atlas Search would break down this text data into these seven individual terms for our inverted index:
it - was - the - best - of - times - worst
Next, Atlas Search would map these terms back to the original MongoDB documents’ _id fields as seen below. The word “it” can be found in the document with _id 4. Find “the” in documents 2, 3, 4, etc.
So essentially, an inverted index is a mapping between terms and which documents contain those terms. The inverted index contains the term and the _id of the document, along with other relevant metadata, such as the position of the term in the document.
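If it helps to see the idea in code, here is a toy inverted index in plain Python. This is only a sketch of the concept, not how Lucene stores its index. The four documents are invented for illustration (only the Dickens sentence comes from this tutorial), and the tokenizer is a simplified stand-in that lowercases and splits at non-letters.

```python
import re
from collections import defaultdict

# Hypothetical documents, invented for this sketch.
docs = [
    {"_id": 1, "sentence": "Hello world"},
    {"_id": 2, "sentence": "The quick brown fox"},
    {"_id": 3, "sentence": "Jumps over the lazy dog"},
    {"_id": 4, "sentence": "It was the best of times, it was the worst of times."},
]

def tokenize(text):
    # Simplified analysis: lowercase, then keep runs of letters.
    return re.findall(r"[^\W\d_]+", text.lower())

def build_inverted_index(documents):
    # Maps term -> {_id -> [positions of the term in that document]}.
    index = defaultdict(dict)
    for doc in documents:
        for position, term in enumerate(tokenize(doc["sentence"])):
            index[term].setdefault(doc["_id"], []).append(position)
    return index

index = build_inverted_index(docs)
print(sorted(index["the"]))  # _ids of the documents containing "the"
print(index["times"][4])     # positions of "times" inside document 4
```

Looking up a term is a single dictionary access, which is the whole appeal: no document is scanned at query time.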
You can think about the inverted index as analogous to the index you might find in the back of the book. Remember how book indexes contain words or expressions and list the pages in the book where they are found? 📖📚
Well, these inverted indexes use these terms to point to the specific documents in your database.
Imagine you are looking for Lady Macbeth’s utterance of “Out, damned spot” in Shakespeare’s Macbeth. You wouldn’t start at page one and read through the entire play, would you? I would go straight to the index to pinpoint it in Act 5, Scene 1, and even the exact page.
Inverted indexes make text searches much faster than a traditional search because you are not searching through every single document at query time. You are instead querying the search index which was mapped upon index creation. Then, following the roadmap with the _id to the exact data document(s) is fast and easy.
How does our metaphorical book decide which words or expressions to list in the back? Or for Atlas Search specifically, how do we know what terms to put in our Search indexes? Well, this is where analyzers come into play.
To make our corpus of data searchable, we transform it into terms or “tokens” through a process called “analysis” done by analyzers.
In our Charles Dickens example, we broke apart, “It was the best of times, it was the worst of times,” by removing the punctuation, lowercasing the words, and breaking the text apart at the non-letter characters to obtain our terms.
These rules are applied by the lucene.standard analyzer, which is Atlas Search’s default analyzer.
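Here is a rough Python approximation of those three rules. It follows the simplified description above; the real lucene.standard analyzer is based on Unicode text segmentation rules and handles many cases this sketch does not. The function name `standard_like` is my own.

```python
import re

def standard_like(text):
    # Approximation only: lowercase, drop punctuation, keep runs of letters.
    return re.findall(r"[^\W\d_]+", text.lower())

text = "It was the best of times, it was the worst of times."
tokens = standard_like(text)

# Deduplicate while keeping first-seen order to get the distinct terms.
unique_terms = list(dict.fromkeys(tokens))
print(unique_terms)
```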
Atlas Search offers other analyzers built-in, too.
A whitespace analyzer will keep your casing and punctuation but will split the text into tokens at only the whitespaces.
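In Python terms, a whitespace-style analysis behaves roughly like `str.split()`. This is a sketch of the behavior described above, and `whitespace_like` is a name I made up.

```python
def whitespace_like(text):
    # Split only at whitespace; case and punctuation are left untouched.
    return text.split()

tokens = whitespace_like("It was the best of times, it was the worst of times.")
print(tokens)
```

Notice that "It" and "it" stay distinct, and the comma and period remain glued to "times".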
The English analyzer takes a bit of a heavier hand when tokenizing.
It removes common stop words for English. Stop words are common words like “the,” “a,” “of,” and “and” that you find often but may make the results of your searches less meaningful. In our Dickens example, we remove “it,” “was,” “the,” and “of.” It also understands plurals and “stems” words to their most reduced form. Applying the English analyzer leaves us with only the following three tokens:
best - worst - time
Which maps as follows:
Notice you can’t find “the” or “of” with the English analyzer because those stop words were removed in the analysis process.
Interesting, huh? 🤔
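A toy version of that heavier hand might look like the sketch below. Both the stop-word set and the "stemmer" are deliberately tiny stand-ins I wrote for this example. The real lucene.english analyzer uses a longer stop-word list and a proper stemming algorithm.

```python
import re

# A small illustrative stop-word set, not the analyzer's actual list.
STOP_WORDS = {"a", "and", "it", "of", "the", "was"}

def naive_stem(word):
    # Very naive plural handling: strip one trailing "s" from longer words.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def english_like(text):
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]

tokens = english_like("It was the best of times, it was the worst of times.")
unique_terms = sorted(set(tokens))
print(unique_terms)
```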
|Analyzer|Text Processing Description|
|---|---|
|Standard|Lowercase, removes punctuation, keeps accents|
|English|Lowercase, removes punctuation and stop words, stems words to their root, handles plurals and possessives|
|Simple|Lowercase, removes punctuation, separates at non-letters|
|Whitespace|Keeps case and punctuation, separates at whitespace|
|Keyword|Keeps everything exactly intact|
|French|Similar to English, but in French =-)|
By toggling across all the different types of analyzers listed in the top bar, you will see what I call the basic golden rules of each one. We’ve discussed standard, whitespace, and English. The simple analyzer lowercases the text, removes punctuation, and separates tokens at non-letters. “Keyword” is the easiest for me to remember because it returns the entire input as a single token, so everything needs to match exactly: case, punctuation, everything. This is really helpful when you expect a specific set of options, such as checkboxes in the application UI.
With our golden rules in mind, select one of the sample texts offered and see how it is transformed differently with each analyzer. We have a basic string, an email address, some HTML, and a French sentence.
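The simple and keyword rules can be sketched in a few lines of Python. These are approximations of the golden rules above, not the Lucene implementations, and the option label is a made-up example of a checkbox value.

```python
import re

def simple_like(text):
    # Lowercase, then split at anything that is not a letter.
    return re.findall(r"[^\W\d_]+", text.lower())

def keyword_like(text):
    # The whole input becomes one token, unchanged.
    return [text]

option = "Sci-Fi & Fantasy"  # hypothetical checkbox label
simple_tokens = simple_like(option)
keyword_tokens = keyword_like(option)
print(simple_tokens)
print(keyword_tokens)
```

With the keyword-style output, only the exact string "Sci-Fi & Fantasy" would match; a lowercased or partial version would not.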
Try searching for particular terms across these text samples by using the input box. Do they produce a match?
Trying our first sample text:
“As I was walking to work, I listened to two of Mike Lynn’s podcasts, and I dropped my keys.”
Notice by the yellow highlighting how the English analyzer allows you to recognize the stems “walk” and “listen,” the singular “podcast” and “key.”
However, none of those terms will match with any other analyzer:
Parlez-vous français? Comment dit-on “stop word” en français?
Email addresses can be a challenge. But now that you understand the rules for analyzers, try looking for “mongodb” email addresses (or Gmail, Yahoo, “fill-in-the-corporate-blank.com”). I can match “mongodb” with the simple analyzer, but no other ones.
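To see why the simple analyzer behaves differently here, the sketch below runs a made-up email address through approximations of three analyzers and checks whether “mongodb” comes out as its own token. The address is hypothetical, the functions are simplified stand-ins, and a match is modeled as an exact token match, which is an assumption of this example.

```python
import re

def simple_like(text):
    # Lowercase and split at non-letters, so "@" and "." act as separators.
    return re.findall(r"[^\W\d_]+", text.lower())

def whitespace_like(text):
    # Split only at whitespace, so the address stays in one piece.
    return text.split()

def keyword_like(text):
    # The whole input is a single token.
    return [text]

sample = "Send feedback to someone@mongodb.com today"  # hypothetical address
analyzers = {
    "simple": simple_like,
    "whitespace": whitespace_like,
    "keyword": keyword_like,
}
matches = {name: "mongodb" in fn(sample) for name, fn in analyzers.items()}
print(matches)
```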
With our Analyzer Analyzer in place to help guide you, you can input your own sample text data in the input bar and hit submit ✅. Once that is done, input your search term and choose an analyzer to see if there is a result returned.
Maybe you have some logging strings or UUIDs to try?
Analyzers matter. If you aren’t getting the search results you expect, check the analyzer used in your index definition.
Armed with our deeper understanding of analyzers, we can take the next step in our search journey and create a search index in Atlas using different analyzers.
We can create the search index using the Visual Editor. When creating the Atlas Search index, we can specify which analyzer to use. By default, Atlas Search uses the lucene.standard analyzer and maps every field dynamically.
Mapping dynamically will automatically index all fields of supported types.
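For reference, a dynamically mapped index definition is tiny. Here it is sketched as a Python dictionary and printed as JSON. I have spelled out the analyzer explicitly for clarity, even though lucene.standard is already the default.

```python
import json

# Sketch of a dynamically mapped index definition.
dynamic_index = {
    "analyzer": "lucene.standard",  # the default, stated explicitly here
    "mappings": {"dynamic": True},  # index every field of a supported type
}

definition_json = json.dumps(dynamic_index, indent=2)
print(definition_json)
```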
This is great if your schema evolves often or if you are experimenting with Atlas Search—but this takes up space. Some index configuration options—like autocomplete, synonyms, multi analyzers, and embedded documents—can lead to search indexes taking up a significant portion of your disk space, even more than the dataset itself. Although this is expected behavior, you might feel it with performance, especially with larger collections. If you are only searching across a few fields, I suggest you define your index to map only for those fields.
You can also choose different analyzers for different fields—and you can even apply more than one analyzer to the same field.
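As a sketch of the second idea, a field mapping can carry a `multi` block that indexes the same field again with another analyzer. The alternate name `english` below is a label I picked for this example, and the `title` field is assumed.

```python
import json

# Sketch: index "title" with lucene.standard, plus an alternate
# lucene.english version under the (hypothetical) name "english".
multi_index = {
    "mappings": {
        "dynamic": False,
        "fields": {
            "title": {
                "type": "string",
                "analyzer": "lucene.standard",
                "multi": {
                    "english": {
                        "type": "string",
                        "analyzer": "lucene.english",
                    }
                },
            }
        },
    }
}

# In a query, the alternate analysis is selected through the path.
english_path = {"value": "title", "multi": "english"}

print(json.dumps(multi_index, indent=2))
```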
Pro tip! You can also use your own custom analyzer—but we’ll save custom analyzers for a different day.
Click Refine to customize our index definition.
I’ll turn off dynamic mapping and click Add Field to map the title field to the standard analyzer. Then, I’ll add the fullplot field and map it with the English analyzer. CREATE!
And now, after just a few clicks, I have a search index named ‘default’ which has stored in it the tokenized results of the lucene.standard analyzer on the title field and the tokenized results of the lucene.english analyzer on the fullplot field.
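If you prefer a JSON definition over clicking, the index described above would look roughly like the sketch below, written as a Python dictionary and printed as JSON. Treat it as an approximation of what the Visual Editor produces rather than its exact output.

```python
import json

# Sketch of a static index definition: only two fields are indexed,
# each with its own analyzer.
static_index = {
    "mappings": {
        "dynamic": False,
        "fields": {
            "title": {"type": "string", "analyzer": "lucene.standard"},
            "fullplot": {"type": "string", "analyzer": "lucene.english"},
        },
    }
}

definition_json = json.dumps(static_index, indent=2)
print(definition_json)
```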
It’s just that simple.
And just like that, now I can use this index that took a minute to create to search these fields in my movies collection! 🎥🍿
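A query against those two fields could be assembled like this. The helper `build_search_pipeline`, the sample query term, and the projected fields are assumptions made for this sketch.

```python
def build_search_pipeline(query, paths, index="default", limit=5):
    # Builds an aggregation pipeline that searches the given paths
    # using one named search index.
    return [
        {"$search": {"index": index, "text": {"query": query, "path": paths}}},
        {"$limit": limit},
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "score": {"$meta": "searchScore"},  # relevance score
            }
        },
    ]

pipeline = build_search_pipeline("walking", ["title", "fullplot"])
print(pipeline[0])
```

Passing this list to a driver’s aggregate call, such as PyMongo’s `collection.aggregate(pipeline)` on the movies collection, would run the search.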
So, when configuring your search index:
- Think about your data first. Knowing your data, how will you be querying it? What do you want your tokens to be?
- Then, choose your analyzer accordingly.
- Specify the best analyzer for your use case in your Atlas Search index definition.
- Specify that index when writing your search query.
You can create many different search indexes for your use case, but remember that you can only use one search index per search query.
So, now that we have analyzed the analyzers, you know why picking the right analyzer matters. You can create the most efficient Atlas Search index for accurate results and optimal performance. So go forth, search-warrior! Type in your application’s search box with confidence, not crossed fingers.