For organizations of all sizes, data management has shifted from an important competency to a critical differentiator that can determine market winners and has-beens. Fortune 1000 companies and government bodies are starting to benefit from the innovations of the web pioneers. These organizations are defining new initiatives and reevaluating existing strategies to examine how they can transform their businesses using Big Data. In the process, they are learning that Big Data is not a single technology, technique or initiative. Rather, it is a trend across many areas of business and technology.
Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently. Said differently, the volume, velocity or variety of data is too great.
But today, new technologies make it possible to realize value from Big Data. For example, retailers can track user web clicks to identify behavioral trends that improve campaigns, pricing and stocking. Utilities can capture household energy usage levels to predict outages and to encourage more efficient energy consumption. Governments and even Google can detect and track the emergence of disease outbreaks via social media signals. Oil and gas companies can take the output of sensors in their drilling equipment to make more efficient and safer drilling decisions.
"Big Data" describes data sets so large and complex they are impractical to manage with traditional software tools.
Specifically, Big Data relates to data creation, storage, retrieval and analysis that is remarkable in terms of volume, velocity, and variety:
With Big Data databases, enterprises can save money, grow revenue, and achieve many other business objectives, in any vertical.
The Big Data landscape is dominated by two classes of technology: systems that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored; and systems that provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data. These classes of technology are complementary and frequently deployed together.
Operational and analytical workloads for Big Data present opposing requirements and systems have evolved to address their particular demands separately and in very different ways. Each has driven the creation of new technology architectures. Operational systems, such as the NoSQL databases, focus on servicing highly concurrent requests while exhibiting low latency for responses operating on highly selective access criteria. Analytical systems, on the other hand, tend to focus on high throughput; queries can be very complex and touch most if not all of the data in the system at any time. Both systems tend to operate over many servers operating in a cluster, managing tens or hundreds of terabytes of data across billions of records.
For operational Big Data workloads, NoSQL Big Data systems such as document databases have emerged to address a broad set of applications, and other architectures, such as key-value stores, column family stores, and graph databases are optimized for more specific applications. NoSQL technologies, which were developed to address the shortcomings of relational databases in the modern computing environment, are faster and scale much more quickly and inexpensively than relational databases.
Critically, NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational Big Data workloads much easier to manage, and cheaper and faster to implement.
In addition to user interactions with data, most operational systems need to provide some degree of real-time intelligence about the active data in the system. For example, in a multi-user game or financial application, aggregates for user activities or instrument performance are displayed to users to inform their next actions. Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.
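The kind of real-time aggregate described above can be illustrated with a minimal sketch. It is not tied to any particular database: the class name `RunningStats` and the instrument name `"ACME"` are hypothetical, and the aggregate is kept in memory purely to show how a per-key average can be updated as each event arrives instead of being recomputed from all stored data.

```python
from collections import defaultdict


class RunningStats:
    """Keeps a per-key count and total so averages are available at any time.

    Illustrative only: a real operational system would persist this state
    and handle concurrent writers.
    """

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def record(self, key, value):
        # Update the aggregate incrementally as each event arrives.
        self._count[key] += 1
        self._total[key] += value

    def count(self, key):
        return self._count.get(key, 0)

    def average(self, key):
        # Return None for keys that have no events yet.
        n = self._count.get(key, 0)
        return self._total[key] / n if n else None


stats = RunningStats()
stats.record("ACME", 10.0)  # hypothetical instrument and prices
stats.record("ACME", 20.0)
print(stats.average("ACME"))
```

Each update costs constant time, which is what lets an aggregate like this be shown to users alongside the operational workload.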
Analytical Big Data workloads, on the other hand, tend to be addressed by MPP database systems and MapReduce. These technologies are also a reaction to the limitations of traditional relational databases and their lack of ability to scale beyond the resources of a single server. Furthermore, MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL.
As applications gain traction and their users generate increasing volumes of data, there are a number of retrospective analytical workloads that provide real value to the business. Where these workloads involve algorithms that are more sophisticated than simple aggregation, MapReduce has emerged as the first choice for Big Data analytics. Some NoSQL systems provide native MapReduce functionality that allows analytics to be performed on operational data in place. Alternatively, data can be copied from NoSQL systems into analytical systems such as Hadoop for MapReduce.
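To make the MapReduce pattern concrete, here is a small single-process sketch in Python. It is deliberately simple (an average per user) and is not the API of Hadoop or of any NoSQL system. The record fields and function names are assumptions chosen for illustration. In a real cluster the map and reduce phases would run in parallel on many servers.

```python
from collections import defaultdict

# Hypothetical event records; the field names are illustrative only.
events = [
    {"user": "ann", "ms": 120},
    {"user": "ann", "ms": 100},
    {"user": "bob", "ms": 300},
]


def map_phase(record):
    """Map: emit (key, value) pairs for one input record."""
    yield record["user"], record["ms"]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values) / len(values)


def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


result = map_reduce(events)
print(result)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across servers, which is what makes the pattern suitable for queries that touch most or all of the data.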
| | Operational | Analytical |
|---|---|---|
| Latency | 1 ms - 100 ms | 1 min - 100 min |
| Concurrency | 1000 - 100,000 | 1 - 10 |
| Access Pattern | Writes and Reads | Reads |
| End User | Customer | Data Scientist |
| Technology | NoSQL | MapReduce, MPP Database |
Cloud computing refers to a broad set of computing and software products that are sold as a service, managed by a third-party provider and delivered over a network. Infrastructure-as-a-Service (IaaS) is a flavor of cloud computing in which on-demand processing, storage or network resources are provided to the customer. IaaS is sold on demand with limited or no upfront investment for the end user, and consumption scales readily to accommodate spikes in usage. Customers pay only for the capacity that is actually used (like a utility), as opposed to self-hosting, where the user pays for system capacity whether or not it is used.
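The difference between paying for used capacity and paying for provisioned capacity can be shown with simple arithmetic. The sketch below uses made-up numbers and assumes the same price per unit-hour for both models, which is a simplification: real on-demand and self-hosted unit costs differ, so this shows only the shape of the calculation and is not a pricing comparison.

```python
def on_demand_cost(hourly_usage, cents_per_unit_hour):
    """Pay only for the units actually used in each hour."""
    return sum(hourly_usage) * cents_per_unit_hour


def self_hosted_cost(hourly_usage, cents_per_unit_hour):
    """Pay for peak capacity in every hour, whether or not it is used."""
    return max(hourly_usage) * len(hourly_usage) * cents_per_unit_hour


# Hypothetical day: 20 quiet hours at 2 units, then a 4-hour spike at 10 units.
usage = [2] * 20 + [10] * 4
rate = 5  # assumed price in cents per unit-hour, identical for both models

print(on_demand_cost(usage, rate))
print(self_hosted_cost(usage, rate))
```

The spikier the usage, the larger the gap between the two figures. With perfectly flat usage they are equal under this assumption.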
As compared to self-hosting, IaaS is:
Overall, cloud computing provides better agility and scalability, together with lower costs and faster time to market. However, it does require that applications be engineered to take advantage of this new infrastructure; applications built for the cloud need to be able to scale by adding more servers, for example, instead of adding capacity to existing servers.
On the storage layer, traditional relational databases were not designed to take advantage of horizontal scaling. A class of new database architectures, dubbed NoSQL databases, is designed to take advantage of the cloud computing environment. NoSQL databases can natively handle load by spreading data among many servers. Part of the reason they can do this is that related data is typically stored together, instead of in separate tables. This document data model, used in MongoDB and other NoSQL databases, makes them a natural fit for the cloud computing environment.
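As an illustration of storing related data together, the sketch below contrasts two relational-style row sets with a single nested document. Plain Python dictionaries stand in for database records. The `customers` and `orders` data and the `to_document` helper are hypothetical and are not part of any database API.

```python
# Relational style: related data lives in separate tables linked by an ID.
customers = [{"id": 1, "name": "Ada"}]
orders = [
    {"id": 10, "customer_id": 1, "item": "disk"},
    {"id": 11, "customer_id": 1, "item": "cable"},
]


def to_document(customer, all_orders):
    """Embed a customer's orders inside one self-contained document."""
    return {
        "_id": customer["id"],
        "name": customer["name"],
        "orders": [
            {"item": order["item"]}
            for order in all_orders
            if order["customer_id"] == customer["id"]
        ],
    }


doc = to_document(customers[0], orders)
print(doc)
```

Because the document carries everything needed to answer a query about this customer, it can be placed on any one server without a cross-server join, which is what makes spreading data across many servers straightforward.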
In fact, MongoDB is built for the cloud. Its native scale-out architecture, enabled by "sharding," aligns well with the horizontal scaling and agility afforded by cloud computing. Sharding automatically distributes data evenly across multi-node clusters and balances queries across them. In addition, MongoDB automatically manages sets of redundant servers, called "replica sets," to maintain availability and data integrity even if individual cloud instances are taken offline. To ensure high availability, for instance, users can spin up multiple members of a replica set as individual cloud instances across different availability zones and/or data centers. MongoDB has also partnered with a number of leading cloud computing providers, including Amazon Web Services, Microsoft and SoftLayer.
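The general idea of distributing data across the nodes of a cluster can be sketched with a simple hash-based placement function. This is a conceptual illustration only and does not reproduce how MongoDB implements sharding or balancing. The function names and key format are assumptions.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards):
    """Place every key on exactly one shard."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


keys = [f"user-{i}" for i in range(1000)]  # hypothetical record keys
shards = distribute(keys, 4)
print([len(shard) for shard in shards])
```

A stable hash sends the same key to the same shard on every lookup, so reads can be routed without scanning the cluster. One known limitation of this naive modulo scheme is that changing the number of shards remaps most keys, which is one reason production systems use more elaborate partitioning and rebalancing.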
To learn more about partner offers, please visit our partners page.
New technologies like NoSQL, MPP databases, and Hadoop have emerged to address Big Data challenges and to enable new types of products and services to be delivered by the business.
One of the most common ways companies are leveraging the capabilities of both systems is by integrating a NoSQL database such as MongoDB with Hadoop. The connection is easily made through existing APIs and allows analysts and data scientists to perform complex, retrospective queries for Big Data analysis and insights while maintaining the efficiency and ease of use of a NoSQL database.
NoSQL, MPP databases and Hadoop are complementary: NoSQL systems should be used to capture Big Data and provide operational intelligence to users, and MPP databases and Hadoop should be used to provide analytical insight for analysts and data scientists. Together, NoSQL, MPP databases and Hadoop enable businesses to capitalize on Big Data.
While many Big Data technologies are mature enough to be used for mission-critical, production use cases, the field is still nascent in some regards. Accordingly, the way forward is not always clear. As organizations develop Big Data strategies, there are a number of dimensions to consider when selecting technology partners, including:
1. Online vs. Offline Big Data
2. Software Licensing Models
3. Community
4. Developer Appeal
5. Agility
6. General Purpose vs. Niche Solutions
Big Data can take both online and offline forms. Online Big Data refers to data that is created, ingested, transformed, managed and/or analyzed in real time to support operational applications and their users. Big Data is born online. Latency for these applications must be very low and availability must be high in order to meet SLAs and user expectations for modern application performance. This includes a vast array of applications, from social networking news feeds and analytics to real-time ad servers and complex CRM applications. Examples of online Big Data databases include MongoDB and other NoSQL databases.
Offline Big Data encompasses applications that ingest, transform, manage and/or analyze Big Data in a batch context. They typically do not create new data. For these applications, response time can be slow (up to hours or days), which is often acceptable for this type of use case. Since they usually produce a static (vs. operational) output, such as a report or dashboard, they can even go offline temporarily without impacting the overall goal or end product. Examples of offline Big Data applications include Hadoop-based workloads; modern data warehouses; extract, transform, load (ETL) applications; and business intelligence tools.
Organizations evaluating which Big Data technologies to adopt should consider how they intend to use their data. Those looking to build applications that support real-time, operational use cases will need an operational data store like MongoDB. For those that need a place to conduct long-running analysis offline, perhaps to inform decision-making processes, offline solutions like Hadoop can be an effective tool. Organizations pursuing both use cases can do so in tandem, and they will sometimes find integrations between online and offline Big Data technologies. For instance, MongoDB provides integration with Hadoop.
There are three general types of licenses for Big Data software technologies:
For many Fortune 1000 companies, regulations and internal policies around data privacy limit their ability to leverage cloud-based solutions. As a result, most Big Data initiatives are driven with technologies deployed on-premises. Most of the Big Data pioneers are web companies that developed powerful software and hardware, which they open-sourced to the larger community. Accordingly, most of the software used for Big Data projects is open-source.
In these early days of Big Data, there is an opportunity to learn from others. Organizations should consider how many other initiatives are being pursued using the same technologies and with similar objectives. To understand a given technology’s adoption, organizations should consider the following:
The market for Big Data talent is tight. The nation’s top engineers and data scientists often flock to companies like Google and Facebook, which are known havens for the brightest minds and places where one will be exposed to leading edge technology. If enterprises want to compete for this talent, they have to offer more than money.
By offering developers the opportunity to work on tough problems, and by using a technology that has strong developer interest, a vibrant community, and an auspicious long-term future, organizations can attract the brightest minds. They can also increase the pool of candidates by choosing technologies that are easy to learn and use — which are often the ones that appeal most to developers. Furthermore, technologies that have strong developer appeal tend to make for more productive teams who feel they are empowered by their tools rather than encumbered by poorly-designed, legacy technology. Productive developer teams reduce time to market for new initiatives and reduce development costs, as well.
Organizations should use Big Data products that enable them to be agile. They will benefit from technologies that get out of the way and allow teams to focus on what they can do with their data, rather than how to deploy new applications and infrastructure. This will make it easy to explore a variety of paths and hypotheses for extracting value from the data and to iterate quickly in response to changing business needs.
In this context, agility comprises three primary components:
MongoDB’s ease of use, dynamic data model and open-source licensing model make it the most agile online Big Data solution available.
Organizations are constantly trying to standardize on fewer technologies to reduce complexity, to improve their competency in the selected tools and to make their vendor relationships more productive. Organizations should consider whether adopting a Big Data technology helps them address a single initiative or many initiatives. If the technology is general purpose, the expertise, infrastructure, skills, integrations and other investments of the initial project can be amortized across many projects. Organizations may find that a niche technology may be a better fit for a single project, but that a more general purpose tool is the better option for the organization as a whole.
Big Data means new opportunities for organizations to create business value, and to extract it. The MongoDB NoSQL database can underpin many Big Data systems, not only as a real-time, operational data store but in offline capacities as well. With MongoDB, organizations are serving more data, more users and more insight with greater ease, and creating more value worldwide. Read about MongoDB's big data use cases to learn more.
Selecting the right big data technology for your application and goals is important. MongoDB, Inc. offers products and services that get you to production faster with less risk and effort. Learn more or contact us.