Using Atlas Search from Java
Dear fellow developer, welcome!
Atlas Search is a full-text search engine embedded in MongoDB Atlas that gives you a seamless, scalable experience for building relevance-based app features. Built on Apache Lucene, Atlas Search eliminates the need to run a separate search system alongside your database. The gateway to Atlas Search is the $search aggregation pipeline stage.
The $search stage, as one of the newest members of the MongoDB aggregation pipeline stage family, has gained native, convenient support in various language drivers. Driver support helps developers build concise and readable code. This article delves into using the Atlas Search support built into the MongoDB Java driver: we’ll see how to use the driver, how to handle $search features that don’t yet have native driver convenience methods or that were released after the driver was, and get a glimpse into Atlas Search relevancy scoring. Let’s get started!
Full-text search is a deceptively sophisticated set of concepts and technologies. From the user perspective, it’s simple: good ol’ ?q=query on your web application’s URL and relevant documents are returned, magically. There’s a lot behind the classic magnifying glass search box, from analyzers, synonyms, fuzzy operators, and facets to autocomplete, relevancy tuning, and beyond. We know it’s a lot to digest. Atlas Search works hard to make things easier for developers, so rest assured you’re in the most comfortable place to begin your journey into the joys and power of full-text search. We admittedly gloss over details in this article so that you get up and running with something immediately graspable and useful to you, fellow Java developers. By following along with the basic example provided here, you’ll have the framework to experiment and learn more about the details we have elided.
Once in the “Create Index” wizard, use the Visual Editor, pick the sample_mflix.movies collection, leave the index name as “default”, and finally, click “Create Search Index.”
It’ll take a few minutes for the search index to be built, after which an e-mail notification will be sent. You can also track the indexing status in the UI.
Here’s what the Search section should now look like for you:
Voila, now you’ve got the movie data indexed into Atlas Search and can perform sophisticated full text queries against it. Go ahead and give it a try using the handy Search Tester, by clicking the “Query” button. Try typing in some of your favorite movie titles or actor names, or even words that would appear in the plot or genre.
Behind the scenes of the Search Tester lurks the $search pipeline stage. Clicking “Edit $search Query” exposes the full $search stage in all its JSON glory, allowing you to experiment with the syntax and behavior.
This is our first glimpse into the $search syntax. The handy “copy” button (at the top right of the code editor side panel) copies the code to your clipboard so you can paste it into your favorite MongoDB aggregation pipeline tools, like Compass or the Atlas UI aggregation tool (shown below). There’s also an “aggregation pipeline” link there that takes you directly to the aggregation tool for the current collection.
At this point, your environment is set up and your collection is Atlas search-able. Now it’s time to do some coding!
Let’s first take a moment to reflect on and appreciate what’s happened behind the scenes of our wizard clicks up to this point:
- A managed, scalable, reliable MongoDB cluster has spun up.
- Many sample data collections were ingested, including the movies collection used here.
- A triple-replicated, flexible, full-text index has been configured and built from existing content and stays in sync with database changes.
Through the Atlas UI and other tools, we are now able to query our movies collection in all the usual MongoDB ways, of course, and also through a proven and performant full-text index with relevancy-ranked results. It’s now up to us, fellow developers, to take it across the finish line and build the applications that let the most useful or interesting documents percolate to the top. And in this case, we’re on a mission to build Java code to search our Atlas Search index.
Let’s answer this question from our movies data:
What romantic, drama movies have featured Keanu Reeves?
Yes, we could answer this particular question knowing the precise case and spelling of each field value in a direct lookup fashion, using this aggregation pipeline:
Let’s suppose we have a UI that allows the user to select one or more genres to filter, and a text box to type in a free form query (see the resources at the end for a site like this). If the user had typed “keanu reeves”, all lowercase, the above $match would not find any movies. Doing known, exact value matching is an important and necessary capability, to be sure, yet when presenting free form query interfaces to humans, we need to allow for typos, case insensitivity, voice transcription mistakes, and other inexact, fuzzy queries.
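To make that gap concrete, here is a minimal plain-Java sketch (no driver involved) that contrasts an exact comparison with a slightly more forgiving one. It only illustrates why exact-value matching misses “keanu reeves”; it is not how Atlas Search works internally, and the class name, method names, and cast list are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ExactVersusLenientMatch {

    /** Exact comparison, analogous in spirit to an equality match on an array field. */
    static boolean exactMatch(List<String> cast, String query) {
        return cast.contains(query);
    }

    /** Lenient comparison: normalize case and whitespace on both sides first. */
    static boolean lenientMatch(List<String> cast, String query) {
        String wanted = normalize(query);
        for (String member : cast) {
            if (normalize(member).equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    /** Trims, lowercases, and collapses runs of whitespace into single spaces. */
    static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Hypothetical cast list, invented for this example.
        List<String> cast = Arrays.asList("Keanu Reeves", "Example Costar");
        System.out.println(exactMatch(cast, "Keanu Reeves"));    // true
        System.out.println(exactMatch(cast, "keanu reeves"));    // false
        System.out.println(lenientMatch(cast, "keanu  reeves")); // true
    }
}
```

Even this small amount of normalization is something we would otherwise have to hand-roll for every field; a search index applies it (and much more) at indexing and query time.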
Using the Atlas Search index we’ve already set up, we can now easily handle a variety of full text queries. We’ll stick with this example throughout so you can compare and contrast doing standard $match queries to doing sophisticated $search queries.
Ultimately, regardless of the coding language, environment, or driver that we use, a BSON representation of our aggregation pipeline request is handled by the server. The Aggregation view in the Atlas UI, and very similarly in Compass (our useful MongoDB client-side UI for querying and analyzing MongoDB data), can help guide you through the syntax, with links directly to the pertinent documentation.
Rather than incrementally building up to our final example, here’s the complete aggregation pipeline so you have it available as we adapt it to Java code. This aggregation pipeline performs a search query, filtering results to movies that are categorized under both the Drama and Romance genres and that have “keanu reeves” in the cast field, and returns only a few fields of the 10 highest-ranked documents.
At this point, go ahead and copy the above JSON aggregation pipeline and paste it into Atlas UI or Compass. There’s a nifty feature (the "</> TEXT" mode toggle) where you can paste in the entire JSON just copied. Here’s what the results should look like for you:
As we adapt the three-stage aggregation pipeline to Java, we’ll explain things in more detail.
We spend the time here emphasizing this JSON-like structure because it will help us in our Java coding. It’ll serve us well to also be able to work with this syntax in ad hoc tools like Compass in order to experiment with various combinations of options and stages to arrive at what serves our applications best, and be able to translate that aggregation pipeline to Java code. It’s also the most commonly documented query language/syntax for MongoDB and Atlas Search; it’s valuable to be savvy with it.
The MongoDB Java driver release that added convenience methods for the Atlas $search stage came out in July of last year (2022), some time after Atlas Search was made generally available. In the meantime, Java developers weren’t out of luck, as direct BSON Document API calls to construct a $search stage work fine. Code examples from that time frame used new Document("$search",...). This article showcases a more comfortable way for us Java developers to use the $search stage, with clearly named and strongly typed parameters to guide you. Your IDE’s method and parameter autocompletion will save time and lead to more readable and reliable code.
You’ll need a modern version of Java, something like:
Now grab the code from our repository using git clone and go to the working directory:
Once you clone that code, copy the connection string from the Atlas UI (the “Connect” button on the Database page). You’ll use this connection string in a moment to run the code connecting to your cluster.
Now open a command-line prompt to the directory where you placed the code, and run:
Be sure to fill in the appropriate username and password in the connection string. If you don’t already have Gradle installed, the gradlew command should install it the first time it is executed. At this point, you should get a few pages of output on your console. If the process hangs for a few seconds and then times out with an error message, check your Atlas network permissions and the connection string you specified in the ATLAS_URI setting, including the username and password.
The run command from Gradle is a convenient way to run the Java FirstSearchExample. It can be run in other ways as well, such as through an IDE. Just be sure to set the ATLAS_URI environment variable for the environment running the code.
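If you run the code from an IDE or a script, a small up-front check on the environment variable can save a confusing timeout later. The following is a minimal sketch, not code from the example repository: the class and method names are hypothetical, and it assumes that an unfilled connection string still contains angle-bracket placeholders such as <password>.

```java
import java.util.Map;

final class AtlasUriCheck {

    /**
     * Returns the ATLAS_URI value from the given environment map,
     * or throws with a hint about what looks wrong.
     */
    static String requireAtlasUri(Map<String, String> env) {
        String uri = env.get("ATLAS_URI");
        if (uri == null || uri.trim().isEmpty()) {
            throw new IllegalStateException("ATLAS_URI is not set");
        }
        if (!uri.startsWith("mongodb://") && !uri.startsWith("mongodb+srv://")) {
            throw new IllegalStateException(
                    "ATLAS_URI does not look like a MongoDB connection string");
        }
        // Assumption: an unfilled template still contains angle-bracket placeholders.
        if (uri.contains("<") || uri.contains(">")) {
            throw new IllegalStateException("ATLAS_URI still contains a <placeholder>");
        }
        return uri;
    }

    public static void main(String[] args) {
        try {
            requireAtlasUri(System.getenv());
            // Deliberately not printing the URI itself, since it contains credentials.
            System.out.println("ATLAS_URI looks plausible");
        } catch (IllegalStateException e) {
            System.err.println("Configuration problem: " + e.getMessage());
        }
    }
}
```

This check only catches obvious configuration slips; network access rules and wrong credentials still surface when the driver actually connects.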
Ideally, at this point, the code ran successfully, performing the search query that we have been describing, printing out these results:
So there are four movies that match our criteria — our initial mission has been accomplished.
Let’s now go through our project and code, pointing out the important pieces you will be using in your own project. First, our build.gradle file specifies that our project depends on the MongoDB Java driver, down to the specific version of the driver. There’s also a convenient application plugin so that we can use the run target as we just did.
Let’s walk through this code, section by section, in a slightly backward order. First, we open a connection to our collection, pulling the connection string from the ATLAS_URI environment variable:
Our ultimate goal is to call collection.aggregate() with our list of pipeline stages: search, project, and limit. There are driver convenience methods in com.mongodb.client.model.Aggregates for each of these.
The $project and $limit stages are both specified fully inline above. We’ll define searchStage in a moment. The project stage uses metaSearchScore, a Java driver convenience method, to map the Atlas Search computed score (more on this below) to a pseudo-field named score. Additionally, Atlas Search can provide score explanations, which are costly to generate, so only use them for debugging and experimentation. Score explanation details must be requested as an option on the search stage for them to be available for projection here. There is no convenience method for projecting score explanations, so we use the generic meta() method to provide the pseudo-field name and the key of the meta value Atlas Search returns for each document. The Java code above generates the following aggregation pipeline, which we previously wrote by hand; it is repeated here so you can compare the Java code with the corresponding generated pipeline pieces.
The searchStage consists of a search operator and an additional option. We want the relevancy scoring explanation details of each document generated and returned, which is enabled by the scoreDetails setting, a feature that was developed and released after this Java driver version. Thankfully, the Java driver team built in pass-through capabilities for setting arbitrary options beyond the built-in ones, which future-proofs it. SearchOptions.searchOptions().option() allows us to set the scoreDetails option on the $search stage to true. To reiterate the note above, generating score details is a performance hit on Lucene, so enable this setting only for debugging or experimentation, never in performance-sensitive environments.
That code builds this structure:
We’ve left a couple of variables to fill in:
What are filters versus other compound operator clauses?
- filter: clauses to narrow the query scope, not affecting the resultant relevancy score
- must: required query clauses, affecting relevancy scores
- should: optional query clauses, affecting relevancy scores
- mustNot: clauses that must not match
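The four clause types can be illustrated with a toy model in plain Java. This is only a sketch of the semantics listed above, under simplifying assumptions: documents are plain strings, scores are simply summed, and options such as minimumShouldMatch are ignored. It is not how Lucene computes scores, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CompoundClauseModel {

    /** A clause yields a positive score when the document matches, 0 otherwise. */
    interface Clause {
        double score(String doc);
    }

    final List<Clause> filter = new ArrayList<>();
    final List<Clause> must = new ArrayList<>();
    final List<Clause> should = new ArrayList<>();
    final List<Clause> mustNot = new ArrayList<>();

    /** Toy clause: matches when the document text contains the word. */
    static Clause contains(String word, double weight) {
        return doc -> doc.contains(word) ? weight : 0.0;
    }

    /** Returns the combined score, or -1 when the document is excluded. */
    double score(String doc) {
        for (Clause c : filter) {
            if (c.score(doc) <= 0) return -1;   // required, but contributes no score
        }
        for (Clause c : mustNot) {
            if (c.score(doc) > 0) return -1;    // any match excludes the document
        }
        double total = 0;
        for (Clause c : must) {
            double s = c.score(doc);
            if (s <= 0) return -1;              // required
            total += s;                         // and scored
        }
        for (Clause c : should) {
            total += c.score(doc);              // optional, scored only when it matches
        }
        return total;
    }

    /** A made-up query used by the demo below. */
    static CompoundClauseModel sample() {
        CompoundClauseModel query = new CompoundClauseModel();
        query.filter.add(contains("drama", 1.0));
        query.must.add(contains("keanu", 2.0));
        query.should.add(contains("romance", 0.5));
        query.mustNot.add(contains("horror", 1.0));
        return query;
    }

    public static void main(String[] args) {
        CompoundClauseModel query = sample();
        System.out.println(query.score("drama romance keanu")); // 2.5
        System.out.println(query.score("drama keanu"));         // 2.0
        System.out.println(query.score("comedy keanu"));        // -1.0 (fails filter)
        System.out.println(query.score("drama horror keanu"));  // -1.0 (hits mustNot)
    }
}
```

Note that the filter clause’s weight never shows up in a score: it only decides whether a document is considered at all.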
Our (non-scoring) filter is a single search operator clause that combines required criteria for genres Drama and Romance:
And that code builds this query operator structure:
Notice how we nested the genres clauses inside the filter array, which takes a list of SearchOperator instances. SearchOperator is a Java driver class with convenience builder methods for some, but not all, of the available Atlas Search search operators. You can see we used SearchOperator.text() to build up the genres clauses.
Last but not least is the primary (scoring!) phrase search operator clause to search for “keanu reeves” within the cast field. Alas, this is one search operator that currently does not have built-in SearchOperator support. Again, kudos to the Java driver development team for building in a pass-through for arbitrary BSON objects, provided we know the correct JSON syntax. Using SearchOperator.of(), we create an arbitrary operator out of a BSON document. Note: This is why it was emphasized early on to become savvy with the JSON structure of the aggregation pipeline syntax.
So now we’ve built the aggregation pipeline. To show the results (shown earlier), we simply iterate through the aggregation results:
The results are ordered in descending score order. Score is a numeric factor based on the relationship between the query and each document. In this case, the only scoring component of our query was a phrase query of “keanu reeves”. Curiously, our results have documents with different scores! Why is that? If we covered everything, this article would never end, so addressing the scoring differences in full is beyond the scope of this article, but we’ll explain a bit below as bonus material.
You’re now an Atlas Search-savvy Java developer — well done! You’re well on your way to enhancing your applications with the power of full-text search. With just the steps and code presented here, even without additional configuration and deeper search understanding, the power of search is available to you.
This is only the beginning. And it is important, as we refine our application to meet our users’ demanding relevancy needs, to continue the Atlas Search learning journey.
We’ll leave delving into those details as an exercise to the reader, noting that you can learn a lot about how queries are interpreted/analyzed by studying the explain() output.
Search relevancy is a scientific art. Without getting into mathematical equations and detailed descriptions of information retrieval research, let’s focus on the concrete scoring situation presented in our application here. The scoring component of our query is a phrase query of “keanu reeves” on the cast field. We do a phrase query rather than a text query so that we search for those two words contiguously, rather than “keanu OR reeves” (“keanu” is a rare term, of course, but there are many “reeves”).
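The difference between the two behaviours can be sketched in a few lines of plain Java. This is a toy model that assumes simple lowercase whitespace tokenization; it imitates the matching logic only (no scoring), and the class name, method names, and the non-Keanu actor name are invented for the illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class PhraseVersusAnyTerm {

    /** Lowercases the text and splits it on whitespace. */
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    /** "text"-like behaviour in this toy model: any one query term appears. */
    static boolean matchesAnyTerm(String value, String query) {
        List<String> valueTokens = tokens(value);
        for (String term : tokens(query)) {
            if (valueTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    /** "phrase"-like behaviour: all query terms appear contiguously and in order. */
    static boolean matchesPhrase(String value, String query) {
        return Collections.indexOfSubList(tokens(value), tokens(query)) >= 0;
    }

    public static void main(String[] args) {
        String query = "keanu reeves";
        System.out.println(matchesAnyTerm("Keanu Reeves", query)); // true
        System.out.println(matchesPhrase("Keanu Reeves", query));  // true
        // "Dana Reeves" is a made-up name sharing only the second term.
        System.out.println(matchesAnyTerm("Dana Reeves", query));  // true
        System.out.println(matchesPhrase("Dana Reeves", query));   // false
    }
}
```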
Scoring takes into account the field length (the number of terms/words in the content), among other factors. Underneath, during indexing, each value of the cast field is run through an analysis process that tokenizes the text. Tokenization is the process of splitting the content into searchable units, called terms. A “term” could be a word, a fragment of a word, or the exact text, depending on the analyzer settings. Take a look at the cast field values in the returned movies. Using the default lucene.standard analyzer, the tokens emitted split at whitespace and other word boundaries, such as the dash character.
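As a rough illustration, the sketch below approximates that behaviour in plain Java by lowercasing and splitting wherever a character is not a letter or digit. This is an assumption-laden simplification, not the actual lucene.standard implementation (which follows Unicode word-boundary rules), and the cast lists are invented rather than taken from the movies collection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class RoughStandardTokenizer {

    /** Lowercases the text and splits it wherever a character is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    /** Total number of terms across all values of a multi-valued field such as cast. */
    static int fieldLength(List<String> values) {
        int count = 0;
        for (String value : values) {
            count += tokenize(value).size();
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical cast lists; only the hyphen differs between them.
        List<String> plain = Arrays.asList("Keanu Reeves", "Anna Example");
        List<String> hyphenated = Arrays.asList("Keanu Reeves", "Anna-Maria Example");
        System.out.println(tokenize("Anna-Maria Example")); // [anna, maria, example]
        System.out.println(fieldLength(plain));              // 4
        System.out.println(fieldLength(hyphenated));         // 5
    }
}
```

In this simplified model, a single hyphenated name is enough to make one document’s cast field longer than another’s, which is the kind of difference that feeds into length-sensitive scoring.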
Now do you see how the field length (number of terms) varies between the documents? If you’re curious about the even gnarlier details of how Lucene performs the scoring for our query, uncomment the printScoreDetails code in our results output loop.
Don’t worry if this section is a bit too much to take in right now. Stay tuned — we’ve got some scoring explanation content coming shortly.
We could quickly fix the ordering so that it at least isn’t biased by the absence of hyphenated actor names. Moving the queryClause into the filters section, rather than the must section, so that there are no scoring clauses, only filtering ones, will leave all documents with equal ranking.
There are many useful Atlas Search resources available, several linked inline above; we encourage you to click through those to delve deeper. These three quick steps will have you up and searching:
And finally, we’ll leave you with the slick demonstration of Atlas Search on the movies collection (though note that it fuzzily searches all searchable text fields, not just the cast field, and does so with OR logic querying, which is different from the phrase query only on the cast field we performed here).
Sep 26, 2023