Getting started with MongoDB Atlas Full-Text Search

Tim Frietas

#search

Atlas Full-Text Search is now available in beta, making it easier than ever to get score-ranked results for text-based data. Let's look at what is special about Full-Text Search compared with the existing text search options in MongoDB and then dive into using it.

MongoDB + Lucene = 🔍💚

Atlas Full-Text Search is built on Apache Lucene. Lucene is a powerful text search library and the most popular open-source language-focused search project in the world. The “language-focused” part is really important — Lucene by default is optimized for many different languages (many of which we support today, with more coming soon), meaning it is optimized for natural language search.

There are special rulesets for each language, called analyzers, that make your search results more accurate. Take the lucene.english analyzer, for example, which removes common words (called “stop words”) such as “and”, “the” and “or”, and understands pluralization. This means that in practice, if you’re looking for that one Elton John song and your best recollection is “Benny Jet”, you’ll still get the right search result for “Benny and the Jets”. Pretty neat!
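To make the idea concrete, here is a toy sketch of what an analyzer does. It is not Lucene's algorithm: the function name toyAnalyze, the three-word stop list, and the "strip one trailing s" rule are simplifying assumptions for illustration only.

```javascript
// Toy analyzer: a rough stand-in for the idea behind lucene.english,
// not Lucene's actual rules or stop-word list.
const STOP_WORDS = new Set(["and", "the", "or"]);

function toyAnalyze(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)                 // split on anything that is not a letter or digit
    .filter((t) => t.length > 0)         // drop empty strings left by the split
    .filter((t) => !STOP_WORDS.has(t))   // remove stop words
    .map((t) => t.replace(/s$/, ""));    // naive plural handling: strip one trailing "s"
}

const indexed = toyAnalyze("Benny and the Jets"); // ["benny", "jet"]
const searched = toyAnalyze("Benny Jet");         // ["benny", "jet"]
```

Because the title and the half-remembered query reduce to the same terms, a term lookup can match them even though the original strings differ.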

Lucene is built on the concept of inverted indexes, which map each term to the documents and fields it appears in, so looking up a term is super fast, much faster than in a traditional database. With built-in query operators, default score ranking, and traditional MongoDB-native functionality, Atlas Full-Text Search gives you the flexibility of MongoDB combined with the speed of Lucene.
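As a rough illustration of an inverted index, the sketch below maps each term to the documents and fields that contain it. The function buildInvertedIndex and the sample documents are made up for this example; Lucene's real index structures are far more sophisticated.

```javascript
// Toy inverted index: term -> list of { id, field } entries.
function buildInvertedIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const [field, value] of Object.entries(doc)) {
      if (field === "_id" || typeof value !== "string") continue;
      // Use a Set so a term is recorded once per field, even if repeated.
      const terms = new Set(value.toLowerCase().split(/\W+/).filter(Boolean));
      for (const term of terms) {
        if (!index.has(term)) index.set(term, []);
        index.get(term).push({ id: doc._id, field: field });
      }
    }
  }
  return index;
}

// Made-up sample documents, not taken from the Atlas sample dataset.
const sampleDocs = [
  { _id: 1, name: "Garden flat", description: "Quiet flat with a garden" },
  { _id: 2, name: "City loft", description: "Loft near the river" },
];

const index = buildInvertedIndex(sampleDocs);
const gardenPostings = index.get("garden");
// [{ id: 1, field: "name" }, { id: 1, field: "description" }]
```

Finding every place "garden" occurs is then a single map lookup rather than a scan over all documents.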

MongoDB Full Text Search Overview

There's a lot more to how we've blended these two worlds of search together but if you're like us, the real question is how do you put Full-Text Search to work? Let’s show you.

Creating a Cluster

To use Atlas FTS at this time, you need to run an M30 cluster or higher and choose the 4.2 release candidate or higher.

(This is a temporary measure to ensure high performance during the initial beta period: check back in the future as we plan to expand Atlas FTS to more cluster sizes.)

Want to try out Full-Text Search on a cluster of your own? Use credit activation code MONGODB4DOT2 for $200 of Atlas credit. For information on redeeming Atlas credit, see Atlas Billing.

First, navigate to cloud.mongodb.com and sign in. (Set up an account if you haven’t already.)

Once logged in, create a new cluster by clicking the Build a New Cluster button at the top right of your screen.

When the process reaches the Cluster Tier selection, be sure to choose a cluster size of M30 or higher. For our walkthrough, we'll just choose the default M30 settings:

M30 Cluster Sizing

Next, be sure to expand the Additional Settings option and choose MongoDB 4.2 Beta, like so:

M30 Additional Settings - Select 4.2beta

Once complete, name your cluster — I’ll name mine MyCluster — and click the green Create Cluster button. It may take several minutes for your cluster to be created.

Importing Sample Data

Once your cluster has been completely created, you should see something like this:

A Cluster View

We are now ready to pull in some sample data to create our first FTS indexes and run our first query. Thanks to a recently introduced feature, Atlas comes with the ability to import some basic data sets we can use to get started.

Import Sample Data menu item

Choose the extra option button to the right of collections [...] and select Load Sample Dataset.

On the next screen, click the green button to Load Sample Dataset. This will take a few minutes as data is loaded into your cluster. Once you see the green banner that says “sample dataset successfully loaded” you are good to go. Click the Collections button and highlight the listingsAndReviews collection under the first database, sample_airbnb. There, you will find a new tab titled Full Text Search.

Locating the Full Text Search tab

Navigate to the Full Text Search tab and click the big green Create an FTS Index button. This will bring you to a screen, like the one below, where you can choose to build a dynamic index (the default option, which automatically and continually indexes all text fields in your target collection) or a static index. For the purposes of this tutorial we are going to create a default dynamic index, so click the big green button that says Create Index.

Creating a Full Text Search

Your index will take a few moments to build and upon completion you should see a screen showing your index status as active:

The Full Text Search Index

Congratulations! You have built your first Atlas FTS index and it is ready to go. Now let’s use it.

Our First FTS Query

You may want to take a moment to explore the contents of the documents in the sample_airbnb.listingsAndReviews collection. Use the Find tab in Atlas for your collection or once logged into the shell, execute the following commands:

use sample_airbnb
db.listingsAndReviews.findOne()

You’ll notice it is a fairly complex document with many fields and arrays. Don’t worry: FTS will make it easy to query the text data in your collection.

For this next step you will want to connect to your collection using the mongo shell. If this is your first time launching and connecting to an Atlas cluster you’ll need to perform additional steps like creating a user and whitelisting the IP address of the system you’d like to query from. See Connect via mongo Shell for all the details you need to get started.

Now that you are connected, you can run your first query.

Let’s say we’re looking to go on a vacation in Portugal. Queries in FTS are handled by $searchBeta, a new aggregation pipeline stage, so our very first query to find properties in Portugal looks like this:

db.listingsAndReviews.aggregate([  
 {
   $searchBeta: {
       "search": {
           "path": "address.country",
           "query": "Portugal"
       }
   }
 },
 {
   $project: {
       _id : 0,
       address: 1,
       'score': { $meta: 'searchScore' },
   }
 }
]).pretty()

There are a few things to note here:

  • $searchBeta has a range of operators which we will come to, but for now we are using just one, search, which is composed of a path and a query.
    • path supports dot notation to access subfields to make it easy to access embedded fields. Full information can be found in the Path Construction document.
  • Because each document in this collection is large, we used the $project stage to limit the total amount of information in each result to what is currently relevant to our query, in this case the address field.
  • We also retrieved the score for each result by using the $meta operator with a new keyword, searchScore.
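If dot notation is new to you, the following sketch shows the idea with a plain JavaScript object. The helper resolvePath is hypothetical and not part of MongoDB; it only mimics how a path such as address.country reaches an embedded field.

```javascript
// Hypothetical helper (not a MongoDB API): follows a dot-notation path
// through nested objects, returning undefined if any step is missing.
function resolvePath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

// A trimmed-down listing shaped like the sample document's address field.
const listing = {
  address: { market: "Porto", country: "Portugal", country_code: "PT" },
};

const country = resolvePath(listing, "address.country"); // "Portugal"
const missing = resolvePath(listing, "host.name");       // undefined
```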

Here is an example of one of the documents resulting from that query.

{
    "address" : {
        "street" : "Porto, Porto, Portugal",
        "suburb" : "",
        "government_area" : "Cedofeita, Ildefonso, Sé, Miragaia, Nicolau, Vitória",
        "market" : "Porto",
        "country" : "Portugal",
        "country_code" : "PT",
        "location" : {
            "type" : "Point",
            "coordinates" : [
                -8.61308,
                41.1413
            ],
            "is_location_exact" : false
        }
    },
    "score" : 1.1643565893173218
}

If you examine the rest of the results you may have also noticed that for now, they all have the same score ("score" : 1.1643565893173218). Why? Well, FTS and Lucene only have one point of data to score against at the moment (address.country: "Portugal") which is exactly the same across all results. Don’t worry, we will soon see how scores change as queries grow more complex.

Let’s expand our query further and see what happens.

Using compound queries

Let’s say we’ve decided on the lovely town of Porto, Portugal as our destination. Let’s also assume we will be joined by some friends, who would really like for the house to have a garden. Our friends also have a flexible itinerary, meaning we may want to exclude properties with strict cancellation policies.

The compound operator lets us combine these requirements to focus in further and find what we are looking for. A query might look like this:

db.listingsAndReviews.aggregate([  
 {
   $searchBeta: {
       "compound": {
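           "must": [
               // An array, so both clauses apply; repeating the "search" key
               // inside one object would keep only the last clause.
               {
                   "search": {
                       "path": "address.country",
                       "query": "Portugal"
                   }
               },
               {
                   "search": {
                       "path": "address.market",
                       "query": "Porto"
                   }
               }
           ],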
           "should": {
               "search": {
                   "path": "description",
                   "query": "garden"
               }
           },
           "mustNot": {
               "search": {
                   "path": "cancellation_policy",
                   "query": "strict",
                   "phrase": {"prefix": true}
               }
           }
       }
   }
 },
 {
   $project: {
      _id : 0,
       address: 1,
       description: 1,
       cancellation_policy: 1,
       price: 1,
       'score': { $meta: 'searchScore' },
   }
 }
]).pretty()


There’s a bit more going on here, so let’s break it down:

  • We have introduced a new operator, compound, which allows you to chain several searches together to return results that satisfy the overall query. You can read more about compound and the different options it supports here.
  • In this example we are limiting our query to:
    • Properties that must be in Porto, Portugal.
    • Properties that should have the word garden in the description field.
    • Properties that do not have a strict cancellation policy.
      • If we examine the documents closely, we can see the possible states for cancellation policies aren’t straightforward; for example, there are many types of strict cancellation policies and we want to avoid them all. Because all of them happen to start with the string strict, we can specify that the cancellation policy mustNot have the word strict as the first word using the phrase operator and the option to set "prefix": true. You can learn more about phrase options in the search operator documentation.

After we run the query, you’ll notice that we have a list of ranked results. The scores now differ: our top result has "score" : 3.2279162406921387, and the next best matches have similar or identical relevance scores (for example, "score" : 2.997941732406616).

What’s happening behind the scenes is that all of these properties met the must and mustNot clauses: they are all in Porto and don’t have strict cancellation policies. However, the first one has the string garden in the description field, making it more relevant to our search and boosting the overall score for that entry, making it the best match.
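The division of labor between the three clauses can be sketched in plain JavaScript. This is a toy model under loud assumptions: the helpers contains, startsWith, and toyCompound are invented for illustration, the listings are made up, and the score is a simple count rather than anything Lucene computes.

```javascript
// Toy clause builders: each returns a function that tests one document.
const contains = (field, word) => (doc) =>
  String(doc[field] || "").toLowerCase().includes(word);
const startsWith = (field, word) => (doc) =>
  String(doc[field] || "").toLowerCase().startsWith(word);

// must and mustNot decide whether a document matches at all;
// should never excludes a document, it only raises the score.
function toyCompound(docs, clauses) {
  const must = clauses.must || [];
  const should = clauses.should || [];
  const mustNot = clauses.mustNot || [];
  return docs
    .filter((doc) => must.every((clause) => clause(doc)))
    .filter((doc) => !mustNot.some((clause) => clause(doc)))
    .map((doc) =>
      Object.assign({}, doc, {
        score: 1 + should.filter((clause) => clause(doc)).length,
      })
    )
    .sort((a, b) => b.score - a.score);
}

// Made-up listings for illustration only.
const listings = [
  { name: "A", market: "Porto", description: "Bright flat", cancellation_policy: "moderate" },
  { name: "B", market: "Porto", description: "House with garden", cancellation_policy: "flexible" },
  { name: "C", market: "Porto", description: "Garden studio", cancellation_policy: "strict_14_with_grace_period" },
  { name: "D", market: "Lisbon", description: "Garden villa", cancellation_policy: "flexible" },
];

const ranked = toyCompound(listings, {
  must: [contains("market", "porto")],
  should: [contains("description", "garden")],
  mustNot: [startsWith("cancellation_policy", "strict")],
});
// ranked names: ["B", "A"]; B scores higher because it mentions a garden.
```

Listing C mentions a garden but is dropped by mustNot, and listing D is dropped by must, which mirrors why only Porto properties without strict policies appear in the real results.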

How scores are constructed, what the numbers mean, and the factors that go into scoring are quite complex and the subject of a future blog post, but the takeaway is that results that best satisfy the conditions set out in the query are given the highest relevance score, and our Porto vacation home with a garden has come out on top.

Adding additional MQL stages: Filtering by price

$searchBeta can also be combined with other MQL aggregation stages to further refine results. For example, let’s say we wanted to narrow our options to properties that cost between 80 and 120 euros per night. We could simply add a $match stage to narrow down our list to the best results in that price range:

db.listingsAndReviews.aggregate([  
 {
   $searchBeta: {
       "compound": {
           "must": [
               // An array, so both the country and market clauses apply.
               {
                   "search": {
                       "path": "address.country",
                       "query": "Portugal"
                   }
               },
               {
                   "search": {
                       "path": "address.market",
                       "query": "Porto"
                   }
               }
           ],
           "should": {
               "search": {
                   "path": "description",
                   "query": "garden"
               }
           },
           "mustNot": {
               "search": {
                   "path": "cancellation_policy",
                   "query": "strict",
                   "phrase": {"prefix": true}
               }
           }
       }
   }
 },
 {
   $project: {
       _id : 0,
       address: 1,
       description: 1,
       cancellation_policy: 1,
       price: 1,
       'score': { $meta: 'searchScore' },
   }
 },
 {
   $match: { price: { $gte: 80, $lte: 120 } }
 }
]).pretty()

We now have a list of options, ranked by score, but within our specified price range.
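Since a pipeline is just an array of stages, one way to reuse a search with different price ranges is to build the array in code. The sketch below is an assumption-laden convenience, not an Atlas feature: withPriceRange is a hypothetical helper, and basePipeline is a shortened stand-in for the $searchBeta and $project stages shown above.

```javascript
// Hypothetical helper, not part of MongoDB: returns a new pipeline with a
// price-range $match stage (and optionally a $limit stage) appended.
function withPriceRange(pipeline, min, max, limit) {
  const stages = pipeline.concat([
    { $match: { price: { $gte: min, $lte: max } } },
  ]);
  return limit ? stages.concat([{ $limit: limit }]) : stages;
}

// Shortened stand-in for the search and projection stages.
const basePipeline = [
  { $searchBeta: { search: { path: "address.market", query: "Porto" } } },
  { $project: { _id: 0, address: 1, price: 1, score: { $meta: "searchScore" } } },
];

const pricedPipeline = withPriceRange(basePipeline, 80, 120, 5);
// In the mongo shell: db.listingsAndReviews.aggregate(pricedPipeline).pretty()
```

The helper leaves basePipeline untouched, so the same search can be paired with several price ranges.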

Conclusion

Atlas Full-Text Search can help give you results based on score relevance by taking advantage of the power of Lucene. It also allows you to combine $searchBeta with other operators to get the most out of your queries.

Give it a try, and if you have feedback or feature requests specific to the FTS beta, feel free to let us know.