
An Introduction to Search Indexes

One of the most popular search engines, Google, owns 92% of the search engine market, and 84% of internet users report using Google more than three times per day. This translates to roughly 99,000 searches per second. However, given the sheer amount of data on the internet, how is it possible to know which websites contain what information and how relevant it is to the user's search criteria?

Similarly, when an analyst queries a gargantuan retail database (e.g., Amazon, Kroger, Walmart), waiting hours or days for millions of rows to be searched isn't an option. So, how are these queries completed in seconds?

In both cases, search indexes are used to minimize query processing time and maximize the relevance of search results. In this article, we'll discuss what search indexes are and how they work, and we'll offer tips to improve your search indexing skills.



What is a search index?

A search index file is a compilation of source data that has been analyzed and placed in a searchable order. Just like the index in the back of a book, content has been reviewed and cataloged in a logical order to minimize search time and maximize information relevance to the user.

What does a search index do?

Most internet users, when looking for information on a specific topic, will use internet search engines (e.g., Google, Bing) to find relevant websites and information. However, rather than searching the entire internet for the answer, the search engine algorithm reviews search indexes that have already been created and that detail the content of popular and relevant websites. The algorithm not only scans these indexes to determine which content is relevant to the user's query but also ranks the relevance or "value" of the information returned in the search results.
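
To make the ranking idea concrete, here is a minimal Python sketch. It is not how any real search engine works: it assumes a tiny pre-built index (the `INDEX` dictionary, the page URLs, and the `rank` function are all hypothetical) and scores each page simply by how often the query terms appear on it. Production engines weigh many more signals.

```python
from collections import Counter

# Hypothetical pre-built index: term -> {page URL: occurrences of the term on that page}.
INDEX = {
    "search": {"a.example/intro": 3, "b.example/faq": 1},
    "index": {"a.example/intro": 2, "c.example/db": 4},
    "database": {"c.example/db": 5},
}


def rank(query: str, index: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Score each page by summing the counts of the query terms it contains."""
    scores = Counter()
    for term in query.lower().split():
        # Only the postings for this term are read; no page content is scanned.
        for page, count in index.get(term, {}).items():
            scores[page] += count
    # Highest score first; ties are broken alphabetically so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for page, score in rank("search index", INDEX):
        print(page, score)
```

Note that the query never touches the pages themselves, only the index entries for the two query terms.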

In the database world, when a database administrator (DBA) reviews database user requirements and usage logs, they may find key activities, query patterns, and data search pathways that are most commonly applied. The DBA can then create indexes that anticipate these activities and answer queries more quickly and completely. This is possible for structured data, text files, and a number of other file types.

How do search indexes work?

A search index can be created for virtually any type of information — from websites to databases and even documents. The process of creating a search index includes several steps which may vary depending on the type of data source(s).


Steps to create search engine indexing

When most people think of a search index, they think of search engines and the internet, and specifically of a Google search. This makes sense because internet search engines must contend with the billions of pages that site owners publish and somehow find the results their users are looking for. Without an index, however, a search engine would need hours or days to produce relevant results, given the vast expanse of web pages, links, searchable data, and documents.

Here is a summary of the steps required to create the necessary search engine indexing to take a web search from days to seconds.

  • Crawling: Web crawlers are bots tasked with gathering information from websites, documents, databases, etc., which is the first step in creating a search index. Often, these web crawlers will start with a group of previously searched sites and then follow hyperlinks to additional sites, and then follow embedded hyperlinks in those pages further. This cycle can go on indefinitely.

  • Parsing: Data collected by web crawlers is analyzed to remove unnecessary information, such as HTML tags, and extract relevant information. Some of the ways relevant information is extracted include:

    • Tokenization: In this process, relevant information (e.g., keywords, phrases, symbols) is broken up into pieces known as tokens and punctuation marks are discarded.
    • Stemming: When extracting relevant information from big data sources, stripping the suffixes of words (i.e., exposing the "stem" of the word) helps the search engine algorithm correctly identify desired keywords or terms. Stemming is also used extensively in natural language processing (NLP) for AI information extraction from big data.
    • Lemmatization: In this process, the inflected forms of a word are mapped to their common dictionary form, called a "lemma." Unlike stemming, the search algorithm uses dictionary lookups to find the lemma, so it can handle irregular forms (e.g., mapping ran to run).
    • Stochastic models: Rather than applying fixed suffix-stripping rules, stochastic models learn from examples how root words relate to their inflected forms (e.g., cook, cooking, cooked) and use those probabilities to infer which words belong together.

  • Indexing: Once the parsing process is complete, an index is created, mapping included words, keywords, phrases, and terms to their source site, database, or document. Usually, such additional information as metadata, location within the source, and frequency of use is also included.
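
The crawling, parsing, and indexing steps above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a production design: the `PAGES` dictionary stands in for the web, `stem` is a deliberately naive suffix stripper, and lemmatization is omitted because it requires a dictionary. All names are hypothetical.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (HTML body, outgoing links).
PAGES = {
    "site/a": ("<h1>Cooking tips</h1><p>She cooked rice.</p>", ["site/b"]),
    "site/b": ("<p>Indexes make searching fast!</p>", ["site/a", "site/c"]),
    "site/c": ("<p>Rice cooks quickly.</p>", []),
}


def crawl(seed: str) -> list[str]:
    """Crawling: breadth-first traversal that follows links from a seed URL."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in PAGES[url][1]:
            if link not in seen:  # never revisit a page
                seen.add(link)
                queue.append(link)
    return order


def tokenize(html: str) -> list[str]:
    """Parsing: strip HTML tags, lowercase the text, and drop punctuation."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z]+", text.lower())


def stem(token: str) -> str:
    """Naive suffix stripping; real stemmers apply many more rules."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(seed: str) -> dict[str, dict[str, int]]:
    """Indexing: map each stem to the pages containing it, with a frequency."""
    index = defaultdict(lambda: defaultdict(int))
    for url in crawl(seed):
        for token in tokenize(PAGES[url][0]):
            index[stem(token)][url] += 1
    return {term: dict(postings) for term, postings in index.items()}


if __name__ == "__main__":
    print(build_index("site/a")["cook"])
```

Because "cooking," "cooked," and "cooks" all reduce to the same stem, a search for any of them finds both pages through a single index entry.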

Database search indexing

There are a number of indexing strategies to consider when creating search indexes for databases. These strategies include:

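  • ESR (Equality, Sort, Range) Rule: This is a guide to ordering the fields of a multi-field index so that it supports anticipated queries: fields matched by equality come first, then fields used for sorting, then fields filtered by range.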
  • Query support indexes: An index supports a query when the index contains all the fields scanned by the query, resulting in greatly increased query performance.
  • Result-sorting query indexes: By choosing the order of the fields in an index and the sort direction of each field, the database can return results already sorted from the index instead of sorting them in a separate step.
  • Selectivity indexing: Selectivity is the ability of a query to narrow down results using the index. The more selective the index, the more of the work of fulfilling a query it can do and the fewer records the database must examine, which speeds up the process.
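
As a rough illustration of the ESR rule, the Python sketch below orders the fields of a multi-field index as equality fields, then sort fields, then range fields. The function name `esr_index_keys` and the field names in the example are hypothetical; the sketch only builds a key specification and does not talk to a database.

```python
ASCENDING, DESCENDING = 1, -1


def esr_index_keys(
    equality: list[str],
    sort: list[tuple[str, int]],
    range_fields: list[str],
) -> list[tuple[str, int]]:
    """Order index fields as Equality, then Sort, then Range."""
    keys = [(field, ASCENDING) for field in equality]  # E: exact-match fields
    keys += sort                                       # S: keep the requested directions
    placed = {field for field, _ in keys}
    # R: range-filtered fields go last; skip any field already placed.
    keys += [(field, ASCENDING) for field in range_fields if field not in placed]
    return keys


if __name__ == "__main__":
    # Query: status == "paid", total > 1000, newest invoices first.
    print(esr_index_keys(["status"], [("issued_date", DESCENDING)], ["total"]))
```

The result is a list of (field, direction) pairs, which is the shape PyMongo's `create_index` accepts for a multi-field index, so it could be passed along once a real collection is available.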

Database search index example

Imagine you have a collection of invoices and anticipate that you'll want to query them by issued date to determine your quarterly sales totals. To help your database find these invoices faster, you create an index in which all documents are ordered by the issued date field. This way, the database engine only needs to go through the documents issued in the last quarter and will stop scanning once it reaches issued dates belonging to the current quarter.

Illustration of querying quarterly sales by invoice issued date with and without an index.
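
The difference between the two approaches can be sketched in Python. This is a simplified model rather than how a database engine is implemented: the invoices, the `DATE_INDEX` list, and both functions are hypothetical, and a sorted list with binary search stands in for the index structure.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical invoices: (invoice id, issued date, total amount).
INVOICES = [
    (1, date(2024, 8, 14), 1200.0),
    (2, date(2024, 2, 3), 300.0),
    (3, date(2024, 9, 30), 950.0),
    (4, date(2024, 10, 2), 75.0),
    (5, date(2024, 7, 1), 410.0),
]

# The "index": (issued date, position in INVOICES), kept sorted by date.
DATE_INDEX = sorted((issued, pos) for pos, (_, issued, _) in enumerate(INVOICES))


def total_without_index(start: date, end: date) -> float:
    """Full scan: every invoice is inspected, whatever its date."""
    return sum(total for _, issued, total in INVOICES if start <= issued < end)


def total_with_index(start: date, end: date) -> float:
    """Index scan: jump to the first matching date, stop at the first date >= end."""
    total = 0.0
    i = bisect_left(DATE_INDEX, (start,))  # binary search for the start of the range
    while i < len(DATE_INDEX) and DATE_INDEX[i][0] < end:
        total += INVOICES[DATE_INDEX[i][1]][2]
        i += 1
    return total


if __name__ == "__main__":
    q3 = (date(2024, 7, 1), date(2024, 10, 1))
    print(total_without_index(*q3), total_with_index(*q3))
```

Both functions return the same total, but the indexed version reads only the invoices inside the quarter, which matters once the collection holds millions of documents.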

In addition, you might want to highlight the most significant orders in this same collection of invoices. Adding an index on the total invoice amount helps here: the database engine can easily find the largest invoices and return those results quickly.

Illustration of querying top quarterly invoice amounts with and without a search index.
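
A similar sketch shows why an index on the invoice amount makes "largest invoices" queries cheap. Again, this is only a model with hypothetical names (`AMOUNTS`, `AMOUNT_INDEX`, `add_invoice`, `top_invoices`): a list kept sorted by amount plays the role of the index.

```python
from bisect import insort
from itertools import islice

# Hypothetical invoices: invoice id -> total amount.
AMOUNTS = {101: 1200.0, 102: 300.0, 103: 950.0, 104: 75.0, 105: 4100.0}

# The "index": (amount, invoice id) pairs kept sorted by amount, ascending.
AMOUNT_INDEX = sorted((amount, inv) for inv, amount in AMOUNTS.items())


def add_invoice(inv: int, amount: float) -> None:
    """Each write must also update the index, which is the cost of keeping it."""
    AMOUNTS[inv] = amount
    insort(AMOUNT_INDEX, (amount, inv))


def top_invoices(n: int) -> list[int]:
    """Read the n largest entries from the end of the index; nothing else is touched."""
    return [inv for _, inv in islice(reversed(AMOUNT_INDEX), n)]


if __name__ == "__main__":
    print(top_invoices(2))
    add_invoice(106, 2000.0)
    print(top_invoices(2))
```

The `add_invoice` function also hints at the trade-off discussed later in this article: every index speeds up some reads at the price of extra work on writes.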

Tips to improve your search indexing

Whether you want to increase the visibility and value of your web content or enhance the relevance and speed of your database queries, using search indexing appropriately is key. However, the right index strategy will vary depending upon your content and individual goals.

For search engines, the most basic principle is that your content must be perceived as relevant and valuable by search engine algorithms. To achieve this, keep the following tips in mind:

  • Quality, quality, quality: While it's the least technical tip, maintaining quality, up-to-date content is key. Given recent changes to the way search engines review and rank websites, it's even more critical that website content not only match users' keyword searches but also be deemed valuable by the search engine algorithm. Further, when content is updated, it encourages web crawlers to revisit your site, which is also beneficial in the ranking process.

  • Accessibility: Make sure that your website has a clean structure and easy navigation, and avoid complex JavaScript frameworks, content hidden behind forms, and large quantities of Flash content. These all hinder web crawlers and negatively impact load speed. Remember that search engines consider website load speed when ranking results on the search engine results page (SERP).

  • Linking: Following links is one of the key ways search engines move from site to site as they index. So, it stands to reason that links to and from popular sites will help your website move up in SERP order. There are three basic types of links to consider:

    • Internal linking: These links build a hierarchical structure within your website, helping search engines to understand and navigate your content easily. They also help heighten visibility of key pages and improve overall content evaluation.
    • External linking: External links connect to other websites and are often used to provide additional, specialized information or site content sources. While they can be useful, remember that these links will be followed by web crawlers and readers alike, so external links to competitors or to similar content that may rank more highly than yours are not beneficial. Try to keep external links to educational, government, or neutral institutional sites, where possible.
    • Backlinks: Backlinks are links from other websites to your website. The more highly regarded or ranked a website is that links to your site, the greater your own content will be valued and ranked by search engines. One way to create backlinks to your site is to write guides or educational articles with loads of "For further reference" links to helpful (noncompetitive) resources designed to gain the interest and external links of highly ranked sites. For example, if your site content relates to online courses teaching adults how to code, you might consider writing articles discussing key strategies for adult learners switching to a tech career, attracting the attention of highly rated neutral sites such as libraries, community resource agencies, or organizations such as AARP (American Association of Retired Persons). There are even agencies that will bring your article to the attention of your "target audience sites" for a small fee. If successful, the backlinks from these organizations' websites will enhance your site's search engine ranking and broaden your reach.

  • Smart content labeling and mapping: Taking some extra time to lay out the web-crawler welcome mat will be worth it. Consider enhancing the following elements of your website:

    • XML sitemaps: A sitemap is a file that lists your website's URLs. By creating and submitting an XML sitemap to search engines, you are helping search engines understand the structure and content of your site while creating an easy way for search engines to index your web pages.
    • Metadata optimization: Using keywords that best represent your content in title tags, meta descriptions, and header tags helps search engines understand the relevance and value of your web pages.
    • Robots.txt file: A robots.txt file communicates to search engine crawlers which portions of your website they may crawl. You can also use it to indicate specific pages or content that shouldn't be accessed by bots.
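
To show what these two files look like in practice, here is a small Python sketch that uses only the standard library. The URLs, the `ROBOTS_TXT` rules, and the helper names `build_sitemap` and `is_crawlable` are made up for illustration; a real site would generate its sitemap from its actual pages and serve robots.txt from the site root.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: list[str]) -> str:
    """Return a minimal XML sitemap that lists the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        # Each page becomes <url><loc>...</loc></url>.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical robots.txt: allow everything except the /drafts/ section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Sitemap: https://www.example.com/sitemap.xml
"""


def is_crawlable(url: str, user_agent: str = "*") -> bool:
    """Check a URL against the rules above, as a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(build_sitemap(["https://www.example.com/", "https://www.example.com/blog/"]))
    print(is_crawlable("https://www.example.com/blog/post"))
    print(is_crawlable("https://www.example.com/drafts/post"))
```

Running a check like `is_crawlable` against your own rules before publishing them is a cheap way to confirm you are not accidentally blocking pages you want crawled.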

Database indexing tips

Database indexing can be complex and strategies may vary depending on the size, type, and structure of the specific database being indexed. Consider the tips below to help speed your database index design along and enhance your database performance.

  • Be intentional: While indexing can certainly enhance query and database performance, be mindful of the fact that indexes take up storage space and require maintenance. Think carefully about the types of indexing that work best with your database type, as well as how much maintenance time you're willing to devote to your indexing.

  • Make sure indexes fit in RAM: When your index fits in RAM, the system can avoid reading the index from disk, resulting in faster processing.

  • Columns or fields for indexing: Columns in relational databases, or fields in other types of databases such as MongoDB, that are frequently used in queries are a great place to start looking for indexing opportunities, as indexing them will often improve query performance.

  • Covering indexes: These indexes include all columns or fields required by a query. Since the database can retrieve all necessary data from the index, there is no need to access the table or collection directly, which improves speed and performance.

  • Composite indexes: Consider creating an index including multiple columns or fields that are often used together in queries. This will enhance query speeds.

  • Maintenance, maintenance, maintenance: While not thrilling, maintaining a regular monitoring and maintenance schedule for your indexes is key. Be sure to regularly monitor and update statistics so your query optimizer has the information to function well. Identify redundant or underutilized indexes and either modify or remove them to keep database performance at peak levels.
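
The covering and composite index tips can be illustrated with another small Python model. Everything here is hypothetical (the `ORDERS` data, the `READS` counter, and the function names), and a sorted list of tuples stands in for a composite index on customer, status, and total.

```python
from bisect import bisect_left, bisect_right

# Hypothetical collection; a document's position in the list stands in for its location.
ORDERS = [
    {"customer": "acme", "status": "paid", "total": 120.0, "notes": "rush"},
    {"customer": "acme", "status": "open", "total": 80.0, "notes": ""},
    {"customer": "zenith", "status": "paid", "total": 40.0, "notes": "gift"},
    {"customer": "acme", "status": "paid", "total": 65.0, "notes": ""},
]
READS = {"documents": 0}  # counts how often the collection itself is accessed

# Composite index on (customer, status, total); the last item is the document position.
INDEX = sorted((o["customer"], o["status"], o["total"], pos) for pos, o in enumerate(ORDERS))


def fetch(pos: int) -> dict:
    """Simulate the extra lookup needed when the index alone is not enough."""
    READS["documents"] += 1
    return ORDERS[pos]


def matching_entries(customer: str, status: str) -> list[tuple]:
    """Binary-search the index for all entries with this customer and status."""
    lo = bisect_left(INDEX, (customer, status))
    hi = bisect_right(INDEX, (customer, status, float("inf")))
    return INDEX[lo:hi]


def covered_totals(customer: str, status: str) -> list[float]:
    """Covered query: the totals come straight from the index entries."""
    return [entry[2] for entry in matching_entries(customer, status)]


def notes_for(customer: str, status: str) -> list[str]:
    """Not covered: 'notes' is not in the index, so each match costs a document read."""
    return [fetch(entry[3])["notes"] for entry in matching_entries(customer, status)]


if __name__ == "__main__":
    print(covered_totals("acme", "paid"), READS)
    print(notes_for("acme", "paid"), READS)
```

The first query is answered without a single document read, while the second needs one read per match. That gap is what a covering index removes, at the cost of a larger index to store and maintain.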

FAQs

What is a search index?

A search index is a compilation of source data that has been analyzed and placed in a searchable order.

What does a search index do?

Search indexing allows search engines and databases to quickly identify relevant content, data, web pages, etc. without searching every data source available. Search indexes, which contain relevant, summarized, and ordered information about each site, table, or document, are what enable search engines and databases to return relevant results in seconds instead of hours or days.

Three steps in creating search engine indexes

The three key steps in creating a search engine index are crawling, parsing, and indexing. Web crawlers gather the relevant data, which is then parsed and indexed.

Database indexing strategies

Some examples of database indexing strategies include:

  • ESR (Equality, Sort, Range) Rule.
  • Query support indexing.
  • Result-sorting indexing.
  • Selectivity indexing.

Search engine indexing tips

Key areas to enhance on your website for better search engine indexing include:

  • Content quality.
  • Site accessibility.
  • Linking.
  • Smart content labeling and mapping.