It's no secret that the amount of data generated globally is growing explosively. In fact, it's estimated that by the end of 2025, there will be 181 zettabytes of data in existence (for context, one zettabyte is equal to one sextillion bytes). Of those 181 zettabytes, 80% is anticipated to be unstructured data.
(Source: K21Academy.com, 2023)
However, this growth also highlights a significant issue for data users of all kinds. Unstructured data, specifically unstructured text data, is difficult to store and query effectively in traditional relational databases. Further, traditional queries, which only search for exact text matches, aren't very helpful because users often know only the topics or key ideas they are looking for in the text data.
The solution to this problem is full-text search. Full-text search enables users to access their unstructured text data in a way that is both intuitive and efficient while this data is stored optimally in nonrelational (NoSQL) databases. Read on to learn what full-text search is, how it works, and examples of how it's used.
Unlike traditional search methods that rely on exact word or phrase matches, a full-text search examines the entire contents of every document within the query's scope and returns the documents that are relevant. Relevance can take into account topic, phrasing, citations, or other attributes of the text.
There are a variety of ways to conduct full-text searches. Each type has its own advantages and, much like the tools in a toolbox, is designed to address specific needs. Here are some of the most common types:
Simple full-text search: Simple full-text searches are very basic in that users enter keywords or phrases to find documents containing those specific terms.
Boolean full-text search: This type of search uses Boolean operators (e.g., AND, OR, NOT) to either combine or exclude specific keywords in the search query. This not only provides more control of the search results but also helps users narrow down broad topics to the specific information they are seeking.
Fuzzy search: Fuzzy search allows the user to find text that is a “likely” match, meaning that documents containing misspellings or typos of the desired term can still be treated as matches.
Wildcard search: Wildcard searches include non-alphanumeric characters (e.g., ?, *) representing unknown portions of words. This allows the user to search for variations of words (e.g., part, parted, parting) or partial matches (e.g., summertime, summer vacation, summer).
Phrase search: This search seeks an exact phrase where the words of the phrase queried appear within the document in the order specified.
Proximity search: Proximity searches identify and retrieve documents containing specific terms within a set number of words, phrases, or paragraphs from each other.
Range search: Range searches look for terms within a numerical or alphabetical range specified by the user.
Faceted search: This type of search helps refine results using predefined categories and specific attributes (i.e., facets) of the topic.
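To make a few of these search types concrete, here is a minimal Python sketch that runs simple, Boolean, wildcard, fuzzy, and phrase searches over a small in-memory list. The menu items and function names are hypothetical, and the standard-library `fnmatch` and `difflib` modules stand in for what a real search engine does with its index.

```python
import difflib
import fnmatch

# Hypothetical menu items used only for illustration.
MENU = [
    "Pasta with meatballs",
    "Linguine carbonara",
    "Summer salad",
    "Summertime lemonade",
    "Pasta salad",
]

def tokens(text):
    """Lowercase a string and split it into words."""
    return text.lower().split()

def simple_search(term):
    """Simple search: items containing the exact keyword."""
    return [item for item in MENU if term.lower() in tokens(item)]

def boolean_search(must=(), must_not=()):
    """Boolean search: all `must` terms present (AND), no `must_not` term present (NOT)."""
    results = []
    for item in MENU:
        words = set(tokens(item))
        if all(t in words for t in must) and not any(t in words for t in must_not):
            results.append(item)
    return results

def wildcard_search(pattern):
    """Wildcard search: `*` matches any run of characters, `?` exactly one."""
    return [item for item in MENU
            if any(fnmatch.fnmatchcase(word, pattern) for word in tokens(item))]

def fuzzy_search(term, cutoff=0.8):
    """Fuzzy search: items with a word whose similarity to the term is at least `cutoff`."""
    return [item for item in MENU
            if difflib.get_close_matches(term, tokens(item), n=1, cutoff=cutoff)]

def phrase_search(phrase):
    """Phrase search: the words must appear consecutively and in order."""
    wanted = tokens(phrase)
    hits = []
    for item in MENU:
        words = tokens(item)
        if any(words[i:i + len(wanted)] == wanted
               for i in range(len(words) - len(wanted) + 1)):
            hits.append(item)
    return hits
```

For example, `boolean_search(must=["pasta"], must_not=["salad"])` keeps only the meatball dish, while `fuzzy_search("psta")` still finds both pasta items despite the typo.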
Full-text queries are used within full-text searches to define the specific terms, parameters, etc. required by the user. Further, full-text queries enable the discovery of additional content of which the user may be unaware via multiple methods, including the following:
Natural language processing (NLP): Full-text searching often incorporates NLP techniques to understand the context, semantics, and relationships between words in full-text queries and the text in documents. This provides accurate and contextually relevant results, even though the user may not know specific terms or phrases they should include in their full-text search queries.
Synonym expansion: Full-text search engines often employ synonym expansion capabilities. This means that the full-text search engine is able to identify alternative words or phrases that have the same meaning (i.e., synonyms) as those included in the user's full-text search query. With the search terms expanded in this way, more relevant information is gathered for the user even though these synonyms were not part of the initial full-text query.
Ontologies and taxonomies: Using ontologies or taxonomies assists in the grouping of terms into hierarchies based on term relationships. Using these hierarchies, full-text searching is enhanced in that both broader and narrower terms relevant to the user's query can be returned. This provides an accurate and more comprehensive set of results.
Fuzzy matching: Fuzzy matching algorithms enable the database engine to find approximate matches for the search terms in a query. This means that content containing misspelled words, overlooked typos, or other language variations, which matches the intent of the query but would be missed by traditional searches, is identified and collected for the user.
Relevance ranking: In relevance ranking, sophisticated algorithms are employed to consider such factors as frequency of term usage and term proximity within documents to help identify documents that may contain unexpected but highly relevant information relating to the user's query.
The combination of these techniques enhances the ability of full-text search systems to uncover relevant information, making them powerful tools for unstructured text data exploration and discovery — even when users may not have a full understanding of the breadth and depth of the topic they are investigating.
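As a rough illustration of two of these techniques working together, the sketch below expands a query with synonyms and then ranks documents by how often the expanded terms occur. The synonym table, the documents, and the scoring rule (a plain term count) are all assumptions made for the example; production engines use more refined scoring.

```python
from collections import Counter

# Hypothetical synonym table; a real system would load this from configuration.
SYNONYMS = {"pasta": {"linguine", "spaghetti", "penne"}}

# Hypothetical documents keyed by id.
DOCS = {
    1: "pasta with meatballs and extra pasta sauce",
    2: "linguine carbonara",
    3: "grilled chicken salad",
}

def expand(query_terms):
    """Return the query terms plus any configured synonyms."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded

def rank(query_terms):
    """Score each document by how often the expanded terms occur in it."""
    terms = expand(query_terms)
    scores = {}
    for doc_id, text in DOCS.items():
        counts = Counter(text.split())
        score = sum(counts[t] for t in terms)
        if score:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Here a query for "pasta" also surfaces the linguine dish, which never mentions the word, while the document that uses "pasta" twice is ranked first.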
Full-text search involves reviewing large numbers of documents and vast amounts of text. Web search services often use full-text search to retrieve relevant results from the internet, whether that is web page content, online PDFs, or other material. Given the volume of text data involved, a technique to handle the search volume is required. It's called full-text indexing. A full-text search index is a specialized data structure that enables the fast, efficient searching of large volumes of textual data.
To create a full-text search index, each text field of each record in a dataset (e.g., each document) is analyzed. First, diacritics (marks placed above or below letters, such as é, à, and ç in French) are removed. Then, based on the language the text is written in, the algorithms remove filler words (often called stop words) and keep only the stem of each remaining term. This way, “to eat” and “eating” are classified as the same root word: “eat.”
Next, all text is converted to either all uppercase or lowercase text. Additional steps may also be included depending upon the specific analyzer that is being used to develop the full-text index.
Finally, the index is created, storing references to where each term (e.g., word, phrase) can be found within the document where it resides.
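The analysis steps above can be sketched in a few lines of Python. This is only an illustration: the stop-word list is tiny, the `stem` function is a naive suffix stripper standing in for a real language-aware stemmer, and the text is lowercased early so that the stop-word lookup is case-insensitive.

```python
import unicodedata

# Tiny illustrative stop-word list; real analyzers ship language-specific lists.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "with"}

def strip_diacritics(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def stem(word):
    """Very naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Turn raw text into a list of normalized index terms."""
    text = strip_diacritics(text).lower()
    words = [w.strip(".,;:!?\"'()") for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]
```

With this sketch, `analyze("To eat")` and `analyze("eating")` both produce the single term `eat`, so either form of the word leads to the same index entry.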
Mapping terms to documents: The main function of full-text indexes is to map terms (e.g., words, phrases, numbers) back to the documents where they reside. During the creation of the index, document content is reviewed and a link between terms and their respective documents is created.
Enhanced query speed: Once the full-text index is created, it allows for fast lookup and retrieval of documents relevant to a user's search query. Instead of scanning through all the content of every document or all web pages, the search engine can quickly identify the documents that contain the specified terms by consulting the index.
Optimization: Beyond faster queries, further optimizations are applied to improve indexing speed and storage efficiency. Data caching, data compression, and other data structure optimizations are often employed to create best-in-class full-text search systems.
There are various types of full-text indexes to choose from. Search requirements, data type and volume, and query complexity are key points of consideration for the user when selecting the full-text index method. In addition, some users may choose to employ more than one indexing method to optimize performance and address data storage concerns. Two common types of full-text index are inverted indexes and B-tree indexes.
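A minimal sketch of the term-to-document mapping might look like the following. The documents and function names are hypothetical, and the index is a plain dictionary of sets rather than the compressed structures a real engine would use. The `scan` function is included only as the slow baseline that the index is meant to replace.

```python
from collections import defaultdict

# Hypothetical documents keyed by id.
DOCS = {
    1: "pasta with meatballs",
    2: "linguine carbonara",
    3: "pasta salad",
}

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

def scan(docs, query):
    """Baseline without an index: read every document for every query."""
    terms = query.lower().split()
    return {doc_id for doc_id, text in docs.items()
            if all(t in text.lower().split() for t in terms)}

INDEX = build_index(DOCS)
```

Both `search(INDEX, "pasta")` and `scan(DOCS, "pasta")` return the same documents, but `search` only touches the entries for the query terms instead of reading every document.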
Inverted index: Inverted indexes are the most commonly used. These indexes store the mapping of terms to the documents in which they're contained. They enable rapid lookups during searches (i.e., search the index rather than all the documents) and optimize the search process.
Within inverted indexes, some of the additional functions occurring behind the scenes include:
Compression: Data compression techniques are applied to reduce index data storage requirements.
Positioning: Additional information related to where the selected term(s) appear in a document is included, enabling proximity and phrase queries.
Frequency: Alongside each term-to-document mapping, the index records how many times the term appears in that document. This is useful for relevance ranking and for analyzing term usage.
N-grams: Text is broken down into N-grams (i.e., contiguous sequences of N characters or words). For example, with words as the unit and N = 2, the phrase "The slow tortoise beat the lazy hare" is broken down into "The slow", "slow tortoise", "tortoise beat", "beat the", "the lazy", and "lazy hare"; with N = 3, it yields "The slow tortoise", "slow tortoise beat", and so on. N-gram indexing enables partial matching and wildcard queries.
B-trees and B+ trees: B-trees and B+ trees are often used when full-text search is integrated into a relational database. Specifically, they are often used for range queries (e.g., a date range, a range of currency values).
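The sketch below illustrates, under simplifying assumptions, how positions, frequencies, and N-grams can be derived from text, and how a sorted structure answers range queries. All function names are hypothetical, and a sorted Python list searched with `bisect` stands in for a B-tree, which a real database would use for the same purpose.

```python
import bisect
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} so phrase queries can check word order."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return ids of documents where the phrase words appear consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Word i of the phrase must sit exactly i positions after the first.
            if all(start + i in index[t].get(doc_id, [])
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

def term_frequency(index, term, doc_id):
    """Number of times a term occurs in one document."""
    return len(index.get(term, {}).get(doc_id, []))

def word_ngrams(text, n=2):
    """Contiguous sequences of n words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(word, n=3):
    """Contiguous sequences of n characters, useful for partial matching."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def range_query(sorted_keys, low, high):
    """Keys in [low, high], found by binary search on a sorted list."""
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return sorted_keys[start:end]
```

Storing positions is what makes phrase and proximity queries possible, since the index alone can confirm that "lazy" is immediately followed by "hare" without rereading the document.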
Full-text search can have many different uses. For example, an inverted index could be used to look for a dish on a restaurant menu or for a specific feature in the description of an item on an e-commerce website. In addition to searching for particular keywords, a full-text search can be augmented with search features such as fuzzy matching and synonyms. In our example, a search for a word such as “pasta” would return not only items such as “Pasta with meatballs” but could also return items such as “Linguine Carbonara” through a synonym, or an item misspelled as “Psta” through fuzzy search.
Apache Lucene, the open-source search library, works this way: it uses an inverted index to locate the restaurant menu items, and the index acts as an extensive glossary that points to every matching document.
You can find a fully functional demo of a similar full-text search for menu items at https://www.atlassearchrestaurants.com/.
To implement our restaurant menu full-text search example in a SQL database, you must create a full-text index on each column you want to search. In MySQL, this is done with the FULLTEXT keyword.
Then, you will be able to query the database using MATCH and AGAINST.
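-- Create a full-text index on the item column.
ALTER TABLE menus ADD FULLTEXT(item);
-- Return the rows whose item text matches the search term.
SELECT * FROM menus WHERE MATCH(item) AGAINST('pasta');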
While this index will increase the search speed for your queries, it does not provide you with all the additional capabilities that you might expect. To use features such as fuzzy search, typo tolerance, or synonyms, you will need to add a core search engine such as Apache Lucene on top of your database.
Implementing a full-text search engine in MongoDB Atlas takes only a few clicks. Go to any cluster and select the “Search” tab. From there, click on “Create Search Index” to launch the process.
Once the index is created, you can use the $search operator to perform full-text searches.
db.menus.aggregate([
  {
    $search: {
      text: {
        query: "pasta",
        path: "item"
      }
    }
  }
]);
This aggregation is the simplest query you can run with MongoDB Atlas Search. Richer queries, including typo tolerance, search term highlighting, and synonym search, can also be built. Behind the scenes, Atlas Search uses Apache Lucene, so you don't have to add the engine yourself.
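As one example of a richer query, the sketch below builds a typo-tolerant version of the same search as a Python list of pipeline stages. The function name and its defaults are hypothetical; the `fuzzy` option with `maxEdits` and the `searchScore` metadata are part of the Atlas Search `text` operator. Actually running the pipeline assumes a driver such as PyMongo and a cluster with a search index on the collection.

```python
def fuzzy_menu_search(query, path="item", max_edits=1, limit=5):
    """Build an aggregation pipeline for a typo-tolerant Atlas Search text query."""
    return [
        {
            "$search": {
                "text": {
                    "query": query,
                    "path": path,
                    # Allow up to `max_edits` single-character edits per term.
                    "fuzzy": {"maxEdits": max_edits},
                }
            }
        },
        {"$limit": limit},
        # Return the matched field and the relevance score computed by Atlas Search.
        {"$project": {"_id": 0, path: 1, "score": {"$meta": "searchScore"}}},
    ]
```

With PyMongo, the result could be passed to `aggregate` on the menus collection, so a search for "psta" would still be able to match items containing "pasta".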
If you don't have a MongoDB Atlas account, you can sign up for one for free right now. Once you have your account set up, you'll be able to try out Atlas Search in the demo at the Atlas Search Restaurant Finder, or you can learn how to implement it using our tutorial on how to build a movie search application.
Before implementing a full-text search solution, it's important to consider the necessary features, architectural complexity, and costs related to your full-text searches. Here, we will use examples from MongoDB Atlas Search to illustrate each consideration.
Adding a full-text index to your database will help optimize your text search and potentially minimize storage requirements. Still, you might need additional features, such as auto-complete suggestions, synonym search, or custom scoring for relevant results. Some examples from MongoDB Atlas Search include:
Rich querying capabilities: Using a wide range of operators, Atlas Search can do more than just search for text. It can also search for geo points and dates.
Fuzzy search: Users sometimes make mistakes as they type. With Atlas Search typo-tolerance, you can deliver accurate results, even with a typo or a spelling mistake.
Synonyms: Your data might use different wording from what your users are searching for. You can use synonyms to define lists of equivalent words to deliver more relevant results to your users.
Custom scoring: If you have promoted content or content that is more relevant based on different variables (for example, at different times of the year), you can define that in a custom scoring function. This score will help push prioritized results to the top of the search results.
Autocomplete: Provide your users with suggestions to make their experience more seamless as they type.
Highlights: As the search results come back from your database, have them automatically highlight the searched words to help your users find more context on the results.
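Two of the features listed above, autocomplete and highlights, can be sketched as aggregation pipelines written as Python data structures. The function names and defaults are hypothetical. The sketch assumes the searched field has been indexed appropriately (the `autocomplete` operator needs the field mapped with the autocomplete type in the search index), and running either pipeline requires a driver and an Atlas cluster.

```python
def autocomplete_pipeline(prefix, path="item", limit=5):
    """Suggest values for `path` that match what the user has typed so far."""
    return [
        {"$search": {"autocomplete": {"query": prefix, "path": path}}},
        {"$limit": limit},
        {"$project": {"_id": 0, path: 1}},
    ]

def highlight_pipeline(query, path="item"):
    """Search `path` and ask Atlas Search to return the matching snippets."""
    return [
        {
            "$search": {
                "text": {"query": query, "path": path},
                # Request highlight metadata for the same field.
                "highlight": {"path": path},
            }
        },
        {
            "$project": {
                "_id": 0,
                path: 1,
                # Snippets marking where the query terms were found.
                "highlights": {"$meta": "searchHighlights"},
            }
        },
    ]
```

Because both are ordinary aggregation pipelines, they can be combined with other stages such as `$limit` or `$project` in the same way as any other query.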
Interested in learning more? A complete list of MongoDB Atlas Search features is available.
Every extra component adds complexity to your application. To provide full-text search capabilities, you will typically need an additional layer that takes care of the indexing and returns the results.
With MongoDB Atlas Search, everything is integrated into your database. Software developers don’t need to worry about where to query — they can access data with a regular aggregation pipeline, just as they would with traditional data.
By removing that additional layer, software development is simplified and the associated overhead of implementing and maintaining different components in the architecture is avoided.
Whether a solution is built in-house or uses a third-party tool, additional costs are to be expected. On one hand, developing a solution in-house may incur high costs in terms of development time, mistakes, and overall hours. Conversely, even an open source solution comes at a price in terms of integration, maintenance, etc. This is why many software development teams start with an off-the-shelf solution that requires minimal effort to implement and maintain. The fixed cost of a third-party solution, in combination with a set deliverable timeframe, can make the most sense. Make sure to consider all these factors when evaluating the right full-text search solution for you.
Using a solution such as MongoDB Atlas Search reduces costs by removing underlying infrastructure maintenance and associated training. It also makes it easier to ramp up development teams as most are already familiar with using MongoDB to query their data.
Are you ready to discover how to ramp up your full-text searches? Here are some additional resources to help you learn more.