
Big Data: An In-Depth Introductory Guide

Big data databases rapidly ingest, prepare, and store large amounts of diverse data. They are responsible for converting unstructured and semi-structured data into a format that analytics tools can use. Because of these distinctive requirements, NoSQL (non-relational) databases, such as MongoDB, are a powerful choice for storing big data.

The process of storing big data in a database

What is big data?

Data that is huge in Volume (size), Variety, and Velocity (speed) is known as big data. In this article, we will explore what big data is and how it’s transforming businesses to help them increase revenue and improve their business strategies and processes.

Picture this: You watch a video on YouTube, like it, and share it with a few friends. You then purchase groceries and medicine online, and search for cool places to vacation. You open Netflix and watch your favorite web series. You pay your parents’ phone and electricity bills, and update their details on a health portal to apply for insurance. A friend calls and asks you to like their content on Instagram, so you log into your account and comment on a few of their photos.

Then, you book your flight to your parents’ place for next weekend.

With all these transactions, you keep generating data and sharing personal information about yourself and people you are related to—your parents, your friends, your favorite series, your favorite travel destinations, and more.


Image showing a network of activities including online purchases, searches, and videos


As you keep transacting in various ways, the magnitude and variety of data grows at a very fast rate. And that’s just your data! Imagine the amount of data each of the 4.66 billion active internet users worldwide produces daily! You can generate data in various ways—from the fitness app that you use, doctor visits you schedule, or videos you watch, to the Instagram posts you like, grocery purchases you make online, games you play, vacations you book—and every transaction that you make (or cancel) generates data. More often than not, that data is analyzed by businesses to better understand their users and present them with customized content.

Big data is used in almost all major industries to streamline operations and reduce overall costs.

For example, big data in healthcare is becoming increasingly important—early detection of diseases, discovery of new drugs, and customized treatment plans for patients are all examples of big data applications in healthcare.

It’s a complex and massive undertaking to capture and analyze so much data (for example, data about thousands of patients). To perform big data analytics, data scientists require big data tools, as traditional tools and databases are not sufficient.

Types of big data

Structured, unstructured, and semi-structured data are all types of big data. Most of today’s big data is unstructured, including videos, photos, webpages, and multimedia content. Each type of big data requires a different set of big data tools for storage and processing:

Structured data

Structured data is stored in an organized and fixed manner in the form of tables and columns.

Relational databases are well-suited to store structured data. Developers use the Structured Query Language (SQL) to process and retrieve structured data.

Here is an example of structured data, with order details of a few customers:

OrderID   | CustomerID   | BillAmount | BillDate
ORD334567 | CUST00001234 | $250       | 17-04-2021 17:00:56
ORD334568 | CUST00009856 | $300       | 17-04-2021 17:00:56
ORD334569 | CUST00001234 | $100       | 17-04-2021 17:01:57

The Order table has a reference to the CustomerID field, which refers to the customer details stored in another table called Customer.
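
To make this concrete, here is a minimal sketch of retrieving structured order data with SQL. It uses Python's built-in sqlite3 module and an in-memory database; the table and column names mirror the example above, while the database itself is purely illustrative.

# A minimal sketch of querying structured order data with SQL.
# The table and column names mirror the example above; the in-memory
# SQLite database is an illustrative assumption.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID TEXT, CustomerID TEXT, BillAmount INTEGER, BillDate TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        ("ORD334567", "CUST00001234", 250, "17-04-2021 17:00:56"),
        ("ORD334568", "CUST00009856", 300, "17-04-2021 17:00:56"),
        ("ORD334569", "CUST00001234", 100, "17-04-2021 17:01:57"),
    ],
)

# Retrieve all orders placed by a single customer.
for row in conn.execute(
    "SELECT OrderID, BillAmount FROM Orders WHERE CustomerID = ?", ("CUST00001234",)
):
    print(row)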

Semi-structured data

Semi-structured data is structured but not rigid. It’s not in the form of tables and columns. Some examples are data from mobile applications, emails, logs, and IoT devices. JSON and XML are common formats for semi-structured data:

{
  "customerID": "CUST00001234",
  "name": "Ben Kinsley",
  "address": {
    "street": "piccadilly",
    "zip": "W1J9LL",
    "city": "London",
    "state": "England"
  },
  "orders": [{
    "orderid": "ORD334567",
    "billamount": "$250",
    "billdate": "17-04-2021 17:00:56"
  }, {
    "orderid": "ORD334569",
    "billamount": "$100",
    "billdate": "17-04-2021 17:01:57"
  }]
}

The data here has a more natural structure and is easier to traverse. MongoDB is a good example of a database designed to store semi-structured data.
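
As a rough illustration, the sketch below stores the customer document above in MongoDB and queries a nested field with PyMongo. The connection string, database, and collection names are placeholders, not part of the original example.

# A minimal sketch of storing and querying the semi-structured customer
# document above with MongoDB. The connection string and the database and
# collection names are illustrative placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
customers = client["shop"]["customers"]

customers.insert_one({
    "customerID": "CUST00001234",
    "name": "Ben Kinsley",
    "address": {"street": "piccadilly", "zip": "W1J9LL", "city": "London", "state": "England"},
    "orders": [
        {"orderid": "ORD334567", "billamount": "$250", "billdate": "17-04-2021 17:00:56"},
        {"orderid": "ORD334569", "billamount": "$100", "billdate": "17-04-2021 17:01:57"},
    ],
})

# Query on a nested field -- no joins or fixed schema required.
doc = customers.find_one({"orders.orderid": "ORD334569"})
print(doc["name"])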

Multi-structured/unstructured data

Multi-structured data is raw and comes in varied formats: sensor data, web logs, social media data, audio files, videos and images, documents, text files, binary data, and more. Because this data has no particular structure, it is categorized as unstructured data.


It’s difficult to store and process unstructured data because of its varied formats. However, non-relational databases, such as MongoDB Atlas, can easily store and process various formats of big data.
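
For example, binary files such as images can be stored in MongoDB through GridFS, which splits large files into chunks. The sketch below is a minimal illustration using PyMongo's gridfs module; the file name, database name, and connection details are assumptions.

# A minimal sketch of storing an unstructured binary file (e.g., an image)
# in MongoDB using GridFS, which splits large files into chunks.
# The file name and connection details are illustrative assumptions.
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
fs = gridfs.GridFS(client["media"])

with open("product_photo.jpg", "rb") as f:
    file_id = fs.put(f, filename="product_photo.jpg", metadata={"contentType": "image/jpeg"})

# Read the file back by its id.
stored = fs.get(file_id)
print(stored.filename, len(stored.read()), "bytes")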

The three Vs of big data

Big data has three distinguishing characteristics: Volume, Velocity, and Variety. These are known as the three Vs of big data.

Volume

Data isn’t “big” unless it comes in truly massive quantities. Just one cross-country airline trip can generate 240 terabytes of flight data. IoT sensors on a single factory shop floor can produce thousands of simultaneous data feeds every day. Other common examples of big data are Twitter data feeds, webpage clickstreams, and mobile apps.

Velocity

The tremendous volume of big data means it has to be processed at lightning-fast speed to yield insights in useful timeframes. Accordingly, stock-trading software is designed to log market changes within microseconds. Internet-enabled games serve millions of users simultaneously, each of them generating several actions every second. And IoT devices stream enormous quantities of event data in real time.

Variety

Big data comes in many forms, such as text, audio, video, geospatial, and 3D, none of which can be addressed by highly formatted traditional relational databases. These older systems were designed for smaller volumes of structured data and to run on just a single server, imposing real limitations on speed and capacity. Modern big data databases such as MongoDB are engineered to readily accommodate the need for variety—not just multiple data types, but a wide range of enabling infrastructure, including scale-out storage architecture and concurrent processing environments.

Nowadays, more Vs are making it to the definition of big data, the most prominent ones being:

  • Veracity—the accuracy of big data.
  • Value—the business value gained by analyzing the big data.
  • Variability—the different data types and changes in the big data over time.

Image showing the three Vs of big data: volume, velocity, and variety

History of big data

Big data has come a long way since the term was coined in 1980 by sociologist Charles Tilly.

Many researchers and experts anticipated an information explosion in the 21st century. In the late 1990s, analysts and researchers started talking more about what big data is and mentioning it in their research papers.

In 2001, Douglas Laney, an industry analyst at Gartner, introduced the three Vs in the definition of big data—volume, velocity, and variety.

The year 2006 was another milestone with the development of Hadoop, the distributed storage and processing system. Since then, there have been constant improvements in the big data tools for analytics. MongoDB Atlas, MongoDB’s cloud database service, was released in 2016, allowing users to run applications in over 80 regions on AWS, Azure, and Google Cloud.

By 2022, we’ve already generated more than 79 zettabytes of data, and by 2025 that number is estimated to be about 181 zettabytes (1 zettabyte = 1 trillion gigabytes).

Big data analytics has become quite advanced today, with at least 53% of companies using big data to generate insights, save costs, and increase revenues. There are many players in the market and modern databases are evolving to get much better insights from big data.

The Evolution of Big Data

Why is big data important?

Big data is used for gaining practical insights for process and revenue improvements. Big data analysis can aid in:

  • Cost optimization: Through big data analytics, companies are able to improve their business strategies, boost productivity by handling disasters before they occur, and focus more on the business rather than worrying about operational aspects, thus reducing overall cost.

  • Innovative products and services: Through big data technologies, businesses can better understand customer preferences and shape their marketing strategies accordingly. This enables them to develop better products and services in the future.

  • Better, quicker decision-making: With the help of big data tools like Spark and Hadoop, NoSQL databases like MongoDB Atlas, visualization tools like MongoDB Charts, and others, analysts can derive insights faster, which supports quicker business decision-making.

How big data works

To better understand what big data is, we should know how big data works. Here is a simple big data example:

Defining business goal(s)

A clothing company wants to expand its business by acquiring new users.

Data collection and integration

To do this:

  • They need the help of social media sites like Facebook, Instagram, and My Business to understand user behavior—the posts users like, their engagement on particular pages, and so on.

  • They create a website and track events on their website, including the number of clicks and minutes a user spends on a page.

  • For the customers who browsed a particular section (like women’s ethnic wear), the company wants to send customized emails giving them offers and discounts.

  • For queries and support, the company has chatbots and customer support available.


All of this information cannot be collected from a single source; each channel stores its data separately. The data collected from the various sources should be combined in one place to get a unified view. Such a place is commonly referred to as a data lake or data warehouse. The process of collecting and combining data from various sources is called data integration.
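
A minimal sketch of this integration step, assuming PyMongo and two invented sources (website clickstream and support chat), might look like the following: each record is tagged with its source before landing in a unified collection.

# A minimal sketch of the data integration idea: records arriving from
# different sources (website clickstream, support chat) are combined into
# one unified store. The sample data and collection names are invented
# for illustration.
from datetime import datetime, timezone
from pymongo import MongoClient

web_events = [{"user": "u123", "page": "/womens-ethnic-wear", "clicks": 4}]
chat_logs = [{"user": "u123", "message": "Is there a discount on kurtas?"}]

client = MongoClient("mongodb://localhost:27017")
unified = client["datalake"]["customer_activity"]

for source, records in [("web", web_events), ("chat", chat_logs)]:
    for record in records:
        # Tag each record with its source and ingestion time so the
        # unified view preserves lineage.
        unified.insert_one({**record, "source": source, "ingested_at": datetime.now(timezone.utc)})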

Data management

Next, the company has to store all of this data in a reliable, highly available environment where it can be easily retrieved for business use. The company learns that most businesses prefer cloud-based storage, since the infrastructure is easier to manage. One such cloud-based data storage solution is MongoDB Atlas, which offers flexibility and scalability, among other features, and is compatible with major cloud providers like AWS and Azure. Data can be easily updated and governed with big data cloud storage.

The process of storing the integrated data, so that it can be retrieved by applications as required, is called data management.
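
As an illustration, connecting an application to a managed cloud database such as MongoDB Atlas usually takes little more than a connection string. The sketch below assumes PyMongo and a placeholder connection string supplied through an environment variable named ATLAS_URI.

# A minimal sketch of connecting an application to a managed cloud database
# (MongoDB Atlas here). The connection string is a placeholder; in practice
# it comes from the Atlas UI, and credentials belong in configuration.
import os
from pymongo import MongoClient

# e.g. "mongodb+srv://<user>:<password>@<cluster>.mongodb.net"
client = MongoClient(os.environ["ATLAS_URI"])
db = client["retail"]

# Verify the connection by listing what is stored.
print(db.list_collection_names())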

Data analysis

Once the company knows its big data is being managed well, the next step is to figure out how to put the data to use for maximum insight. The process of big data analytics involves transforming data, building machine learning and deep learning models, and visualizing data to extract insights and communicate them to stakeholders. This step is known as data analysis.
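
Continuing the earlier integration sketch, a simple aggregation pipeline could summarize the unified activity data, for example to find which product sections attract the most clicks. The collection and field names follow that illustrative sketch and are not prescriptive.

# A minimal sketch of the analysis step: an aggregation that summarizes the
# unified activity data to find which product pages attract the most clicks.
# Collection and field names follow the earlier illustrative sketch.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
unified = client["datalake"]["customer_activity"]

pipeline = [
    {"$match": {"source": "web"}},                                    # only clickstream events
    {"$group": {"_id": "$page", "total_clicks": {"$sum": "$clicks"}}},
    {"$sort": {"total_clicks": -1}},
    {"$limit": 5},                                                    # top five pages
]

for row in unified.aggregate(pipeline):
    print(row["_id"], row["total_clicks"])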

Let’s summarize how big data works:

Company big data example | Mapping to big data process | Name of the big data analytics stage | Big data tools
Company wants to acquire new customers | Define business goals | Problem definition and understanding user needs: Why do we want to go for big data analytics? | Interviews, research data, web logs, demographics, mobile data, emails
Company finds out multiple ways to ingest data | Know where data can be sourced from and consolidate | Data collection, ingestion, and integration from IoT, social media, cloud, etc. | Kafka, NiFi, Kinesis, MongoDB Atlas Data Lake
Company finds out about cloud storage | Store big data, keep data updated | Data management | AWS, MS Master Data Services, Talend, MongoDB Atlas, Google Cloud
Company hires data analysts and data scientists to get insights | Analyze big data | Data visualization and analysis | Spark, SAS, MongoDB Charts, R, Python, Power BI



This enables companies to make data-driven decisions to create intelligent organizations. Big data is the key to building a competitive, highly performant environment which can benefit businesses and customers alike.

Image showing the continuous process of data analysis which leads to actionable insights.

MongoDB can help at each stage of big data analytics with its host of tools like MongoDB Atlas, MongoDB Atlas Data Lake, and MongoDB Charts.

MongoDB Atlas is a fully managed cloud-based database service. Atlas takes care of complete database management, including security, reliability, and optimal performance, so that developers can focus on building the application logic.

Big data challenges

Collecting, storing, and processing big data comes with its own set of challenges:

  • Big data is growing exponentially, and existing data management solutions have to be constantly updated to cope with the three Vs.
  • Organizations do not have enough skilled data professionals who can understand and work with big data and big data tools.

Learn more about the top seven big data challenges.

What are some examples of big data in practice?

Some examples of big data are fraud detection, personalized content recommendations, and predictive analytics.

Before we get into domain-specific big data examples, let’s first understand what big data is commonly used for.

What is big data used for?

Big data can address a range of business activities from customer experience to analytics. Here are some examples:

  • Compliance and fraud protection: Big data lets you identify usage patterns associated with fraud and parse through large quantities of information much faster, speeding up and simplifying regulatory reporting.

  • Machine learning: Big data is a key enabler for algorithms that teach machines and software how to learn from their own experience, so they can perform faster, achieve higher precision, and discover new and unexpected insights.

  • Product development: Companies analyze and model a range of big data inputs to forecast customer demand and make predictions as to what kinds of new products and attributes are most likely to suit them.

  • Predictive maintenance: Using sophisticated algorithms, manufacturers assess IoT sensor inputs and other large datasets to track machine performance and uncover clues to imminent problems. The goal is determining the ideal intervals for preventive maintenance to optimize equipment operation and maximize uptime.

  • Improving productivity and minimizing costs: To hone their edge in low-margin competitive markets, manufacturers utilize big data to improve quality and output while minimizing scrap. Government agencies can employ social media to identify and monitor outbreaks of infectious diseases. Retailers routinely fine-tune campaigns, inventory SKUs, and price points by monitoring web click rates that reveal otherwise hidden changes in consumer behavior.

Big data examples

Enterprises and consumers alike are producing data at a rapid rate. This data feeds streaming and batch processing applications, predictive modeling, dynamic querying, machine learning and AI applications, and more.

We touched upon big data applications in healthcare, marketing, and customer experience.

Other common big data examples are:

  • Fraud detection and prevention: By identifying suspicious transactions and activities, financial institutions can detect and stop fraud. Real-time tracking and machine learning algorithms help detect and prevent cyber theft, insurance scams, identity theft, and many other types of online fraud.

  • Recommendation systems: Services like Netflix and Amazon Prime Video have become a primary source of at-home entertainment. They recommend titles similar to ones a user, or users with similar tastes, previously watched and liked. Amazon product recommendations work on the same principle.

Check out nine more real-world big data examples and use cases.

Best database for big data

Managing big data comes with a set of specifications. Storage solutions for big data should be able to process and store large amounts of data, converting it to a format that can be used for analytics. NoSQL, or non-relational, databases are designed for handling large volumes of data while being able to scale horizontally. In this section, we’ll take a look at some of the best big data databases.

Apache HBase

HBase is a column-oriented big data database that runs on top of the Hadoop Distributed File System (HDFS). HBase is a top-level Apache project, and its main advantages include fast lookups and random read/write access for large tables.

Apache Cassandra

Another Apache top-level project—Cassandra—is a wide-column store, designed to process large amounts of data. Cassandra provides great read-and-write performance and reliability, while also being able to scale horizontally.

MongoDB Atlas

MongoDB is the leading NoSQL document-oriented database. The document model is a great fit for unstructured data, allowing users to easily combine and organize data from multiple sources. MongoDB Atlas is a developer data platform built on the MongoDB database.

MongoDB Atlas takes big data management to the next level by providing a set of integrated data services for analytics, search, visualization, and more.

How does big data work in MongoDB Atlas?

As we saw earlier, MongoDB has a document-based structure, which is a more natural way to store unstructured data. Its flexible schema accepts data in any form and volume—so you don't have to worry about storage as the amount of data increases.

MongoDB Atlas is a developer data platform that provides a secure, highly available, fully managed cloud database along with data services like MongoDB Atlas Data Lake and MongoDB Charts. Data Lake allows you to gain fast insights by analyzing data from multiple MongoDB databases and AWS S3 together. Charts is the best way to create visualizations from your MongoDB data, with powerful sharing and embedding capabilities.

Learn more about MongoDB Atlas.

FAQs

What is an example of big data?

Big data is used in almost every business domain, like healthcare, logistics, retail, and manufacturing. For example, big data in healthcare finds much use in new drug discovery, disease research, early detection of diseases, personalized patient care, and efforts towards fewer doctor visits.

What are big data tools?

Big data tools are used to collect, transform, and analyze big data since traditional tools and relational databases are no longer enough to handle it. Some of the top big data tools are:

  • Apache Spark: Spark is an open-source framework mainly popular for processing streaming data. It can process large amounts of real-time data very quickly because of in-memory calculations.

  • MongoDB: MongoDB is a NoSQL database with a flexible schema. It stores huge amounts of data in a naturally traversable format, making it a good choice for storing, querying, and analyzing big data. MongoDB Atlas provides the database as a fully managed service, with features like data encryption, security, advanced analytics, and data lakes.

  • Apache Hadoop: The framework that changed the way big data analytics was viewed, Hadoop is still widely used for faster batch processing of data.

  • Kafka: Kafka is an open-source framework that can handle huge volumes of events. It offers high throughput and strong fault tolerance, and is used for stream processing, event sourcing, and building activity-tracking pipelines (see the short producer sketch after this list).

  • R: R is a popular big data statistical tool that can perform advanced statistical analytics. R provides advanced graphs and charting features for easy visualization of data.
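
Here is the producer sketch referenced in the Kafka entry above. It publishes user-activity events from Python and assumes the third-party kafka-python package, a locally running broker, and an invented topic name.

# A minimal sketch of publishing events to Kafka from Python, assuming the
# third-party kafka-python package and a broker on localhost; the topic
# name "user-activity" and the event fields are invented for illustration.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Each user action becomes one event on the "user-activity" topic.
producer.send("user-activity", {"user": "u123", "action": "click", "page": "/offers"})
producer.flush()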

What is big data and how is it used?

Big data refers to data that is huge in Volume, Variety, and Velocity. It contains both structured and unstructured data, which can mean anything, including customer order details, video files, audio messages, documents, social media engagements, and patient and healthcare data. Big data is used by businesses to improve their internal operations as well as products and services, in order to serve consumers better. Big data is used in healthcare for research, early detection of diseases, keeping track of patient health, and so on.

Where is big data stored?

Traditional approaches to storing data in relational databases, data silos, and data centers are no longer sufficient due to the size and variety of today’s data. Nowadays, cloud-based systems, data lakes, and data warehouses are becoming popular options to store, integrate, and process big data. MongoDB Atlas is a good example of a database as a service. Atlas is compatible with major cloud providers and offers high security, flexibility, data availability, and other important features to easily store and manage big data.

How is big data collected?

Big data is collected from different offline and online channels. It can be generated by:

  • Interviews, documents, surveys, audio, videos, and social media posts.
  • IoT devices and sensors.
  • Network logs, server logs, and web logs.
  • Web scraping and search engine results.
  • Virtual assistants like Alexa, Cortana, and Siri.
  • Mobile apps, real-time data from streaming apps like Netflix and YouTube.
  • Online transactions and purchases.
  • Location data from vehicles, human movement, and satellites.

What do you mean by big data?

The data that is big in volume, contains a lot of variety, and comes with high velocity constitutes big data. Big data should also have high veracity and provide value for businesses.

  • Volume—Big data is huge in size. Businesses and consumers will generate about 180 zettabytes of data by 2025, which is more than double the amount of data (64.2 zettabytes) generated in 2020.

  • Velocity—Big data comes at high speed, like real-time data that needs immediate analysis and action. An ATM transaction is one common example of this. Every transaction should be immediately reflected in the user’s account as well as the ATM system to keep track of cash availability. Each transaction also needs to be checked for authenticity right then and there.

  • Variety—Data comes in all types of formats: unstructured, structured, or semi-structured. For structured data, like a purchase order made by a customer, relational databases are sufficient. However, unstructured and semi-structured data, which are more common forms of big data, require specialized big data tools to store and process.

  • Value—Big data results in big data analytics, which leads to insights and action. This provides business value and helps in increasing overall revenue and growth of the business.

  • Veracity—Veracity covers not just the quality of the data, but also the trustworthiness of its source. Social media content is one example: user profiles, sentiments, and trends are hard to verify and can change very fast.

Who is using big data?

Almost all industries use big data in some way. This includes:

  • Big data in healthcare: Electronic Medical Records (EMRs) help in tracking patient and hospital records, detecting diseases in early stages, discovering new drugs, supporting biomedical research, and monitoring health through IoT devices.

  • Big data in banking and finance: Big data is used for fraud detection and prevention, identifying loyal customers, and providing better security.

  • Big data in marketing and retail: To understand customer behavior, support customer segmentation, recommend products and services, and provide targeted marketing, retailers and other marketers often turn to big data analysis.

Big data also has applications in manufacturing, logistics, insurance, education, entertainment, and many other sectors.

What is big data, in simple terms?

Big data is the raw data obtained from multiple sources to get business insights. This data is huge in volume, comes in a variety of forms (like videos or images), and arrives at high velocity (like streaming data).