What is MongoDB?
MongoDB is a NoSQL database designed for how we build and run applications today using modern development techniques, programming models, and computing resources. As a result, it empowers businesses to be more agile and scalable, create new applications, improve customer experience, and accelerate time to market while reducing costs.
How We Build Applications
- New and Complex Data Types. Rich data structures with dynamic attributes, mixed structure, text, media, arrays and other complex types are common in today's applications.
- Modern Programming Languages. Object-oriented programming languages interact with data in structures that are dramatically different from the way data is stored in a relational database.
- Faster Development. Software engineering teams now embrace short, iterative development cycles.
How We Run Applications
- New Scalability for Big Data. Operational and analytical workloads challenge traditional capabilities on one or more dimensions of scale, availability, performance and cost effectiveness.
- Fast, Real-time Performance. Users expect consistent, interactive experiences from applications across many types of interfaces.
- New Computing Environments. The infrastructure requirements for applications can easily exceed the resources of a single computer, and cloud infrastructure now provides massive, elastic, cost-effective computing capacity on a metered cost model.
MongoDB Feature Overview
MongoDB embraces these new realities through key innovations.
- Document Data Model. Data is stored in a structure that maps to objects in modern programming languages and is easy for developers to understand.
- Rich Query Model. MongoDB is fit for a wide variety of applications. It provides rich index and query support, including secondary, geospatial and text search indexes, the Aggregation Framework and native MapReduce.
- Idiomatic Drivers. Developers interact with the database through native libraries that are integrated with their respective environments and code repositories, making MongoDB simple and natural to use.
- Horizontal Scalability. As the data volume and throughput grow, developers can take advantage of commodity hardware and cloud infrastructure to increase the capacity of the MongoDB system.
- High Availability. Multiple copies of data are maintained with native replication. Automatic failover to secondary nodes, racks and data centers makes it possible to achieve enterprise-grade uptime without custom code and complicated tuning.
- In-Memory Performance. Data is read from and written to RAM while also persisted to disk for durability, providing fast performance and eliminating the need for a separate caching layer.
- Flexibility. From the document data model, to multi-datacenter deployments, to tunable consistency, to operation-level availability options, MongoDB provides tremendous flexibility to the development and operations teams, and for these reasons it is well suited to a wide variety of applications across many industries.
MongoDB Data Model
DATA AS DOCUMENTS
MongoDB stores data as documents in a binary representation called BSON (Binary JSON). Documents that tend to share a similar structure are organized as collections. It may be helpful to think of collections as analogous to tables in a relational database, documents as similar to rows, and fields as similar to columns.
For example, consider the data model for a blogging application. In a relational database the data model would comprise multiple tables. To simplify the example, assume there are tables for Categories, Tags, Users, Comments and Articles. In MongoDB the data could be modeled as two collections, one for users and the other for articles. In each article document there might be multiple comments, multiple tags, and multiple categories, each expressed as an embedded array.
MongoDB documents tend to have all data for a given record in a single document, whereas in a relational database information for a given record is usually spread across many tables.
MongoDB documents can vary in structure. For example, all documents that describe users might contain the user id and the last date they logged into the system, but only some of these documents might contain the user’s identity for one or more third-party applications. Fields can vary from document to document; there is no need to declare the structure of documents to the system – documents are self-describing. If a new field needs to be added to a document then the field can be created without affecting all other documents in the system, without updating a central system catalog, and without taking the system offline.
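As a rough illustration of the blogging example, the sketch below uses plain Python dictionaries to stand in for documents. Every field name other than `_id` is a hypothetical choice made for this example and does not come from any real schema; nothing here talks to a MongoDB server.

```python
# Hypothetical article document: comments, tags and categories are embedded
# arrays instead of rows in separate tables.
article = {
    "_id": 1,
    "title": "Intro to documents",
    "author_id": 42,
    "tags": ["mongodb", "nosql"],
    "categories": ["databases"],
    "comments": [
        {"user_id": 7, "text": "Nice overview."},
        {"user_id": 9, "text": "Thanks!"},
    ],
}

# Hypothetical user documents: both carry an id and a last-login date, but
# only the second one carries a third-party identity.
users = [
    {"_id": 42, "last_login": "2014-01-15"},
    {"_id": 7, "last_login": "2014-01-16", "twitter": "@example"},
]

# Adding a new field to one document does not touch any other document,
# and no structure has to be declared first.
users[0]["github"] = "example-user"

fields_per_user = [sorted(u.keys()) for u in users]
```

After this runs, `fields_per_user` holds a different field list for each user, which is the sense in which documents are self-describing.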
MongoDB Query Model
MongoDB supports many types of queries. A query may return a document or a subset of specific fields within the document:
- Key-value queries return results based on any field in the document, often the primary key.
- Range queries return results based on values defined as inequalities (e.g., greater than, less than or equal to, between).
- Geospatial queries return results based on proximity criteria, intersection and inclusion as specified by a point, line, circle or polygon.
- Text Search queries return results in relevance order based on text arguments using Boolean operators (e.g., AND, OR, NOT).
- Aggregation Framework queries return aggregations of values returned by the query (e.g., count, min, max, average, similar to a SQL GROUP BY statement).
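The query types above can be sketched over an in-memory list of dictionaries. This is only a conceptual model in plain Python: the helpers `find` and `average_by` and the `products` data are assumptions made for the example and are not part of any MongoDB driver.

```python
from collections import defaultdict

# Hypothetical product documents.
products = [
    {"_id": 1, "category": "book", "price": 12},
    {"_id": 2, "category": "book", "price": 30},
    {"_id": 3, "category": "game", "price": 45},
]


def find(docs, predicate, fields=None):
    """Return matching documents, optionally projected to a subset of fields."""
    results = []
    for doc in docs:
        if predicate(doc):
            results.append({k: doc[k] for k in fields} if fields else dict(doc))
    return results


def average_by(docs, group_field, value_field):
    """Average value_field per distinct group_field, like a SQL GROUP BY."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum, count]
    for doc in docs:
        entry = totals[doc[group_field]]
        entry[0] += doc[value_field]
        entry[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Key-value query on the primary key.
by_id = find(products, lambda d: d["_id"] == 2)

# Range query (greater than 10, less than or equal to 30) returning only _id.
mid_priced = find(products, lambda d: 10 < d["price"] <= 30, fields=["_id"])

# Aggregation: average price per category.
avg_price = average_by(products, "category", "price")
```

A real deployment would answer these through indexes rather than a full scan; the sketch only shows what each query type returns.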
As in most database management systems, indexes are a crucial mechanism for optimizing system performance in MongoDB. While indexes can improve the performance of some operations by orders of magnitude, they have associated costs in the form of slower writes, disk usage, and memory usage. MongoDB includes support for many types of indexes on any field in the document:
- Unique Indexes. When an index is specified as unique, MongoDB will reject the insert of a new document, or the update of an existing document, if it would duplicate an existing value for the indexed field. By default, indexes are not unique. If a compound index is specified as unique, the combination of values must be unique.
- Compound Indexes. It can be useful to create compound indexes for queries that specify multiple predicates. For example, consider an application that stores data about customers. The application may need to find customers based on last name, first name, and state of residence. With a compound index on last name, first name, and state of residence, queries could efficiently locate people with all three of these values specified. An additional benefit of a compound index is that any leading field within the index can be used, so fewer indexes on single fields may be necessary: this compound index would also optimize queries looking for customers by last name.
- Array Indexes. For fields that contain an array, each array value is stored as a separate index entry. For example, documents that describe recipes might include a field for ingredients. If there is an index on the ingredients field, each ingredient is indexed and queries on the ingredients field can be optimized by this index. There is no special syntax required for creating array indexes; if the field contains an array, it will be indexed as an array index.
- TTL Indexes. In some cases data should expire out of the system automatically. Time to Live (TTL) indexes allow the user to specify a period of time after which the data will automatically be deleted from the database. A common use of TTL indexes is applications that maintain a rolling window of history (e.g., most recent 100 days) for user actions such as clickstreams.
- Geospatial Indexes. MongoDB provides geospatial indexes to optimize queries related to location within a two dimensional space, such as projection systems for the earth. These indexes allow MongoDB to optimize queries for documents that contain points or a polygon that are closest to a given point or line; that are within a circle, rectangle, or polygon; or that intersect with a circle, rectangle, or polygon.
- Sparse Indexes. Sparse indexes only contain entries for documents that contain the specified field. Because the document data model of MongoDB allows for flexibility in the data model from document to document, it is common for some fields to be present only in a subset of all documents. Sparse indexes allow for smaller, more efficient indexes when fields are not present in all documents.
- Text Search Indexes. MongoDB provides a specialized index for text search that uses advanced, language-specific linguistic rules for stemming, tokenization and stop words. Queries that use the text search index will return documents in relevance order. One or more fields can be included in the text index.
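To make the array and sparse index descriptions above more concrete, here is a minimal sketch of the idea in plain Python. The function `build_index` and the recipe data are hypothetical, and the dictionary it returns only models what such an index contains, not how MongoDB stores it.

```python
from collections import defaultdict


def build_index(docs, field):
    """Map each indexed value to the _ids of the documents that contain it.

    Array values produce one entry per element (the array index idea), and
    documents without the field produce no entry (the sparse index idea).
    """
    index = defaultdict(list)
    for doc in docs:
        if field not in doc:
            continue  # sparse: no entry for documents lacking the field
        value = doc[field]
        values = value if isinstance(value, list) else [value]
        for item in values:
            index[item].append(doc["_id"])
    return dict(index)


# Hypothetical recipe documents; only one of them has a "source" field.
recipes = [
    {"_id": 1, "ingredients": ["flour", "egg"]},
    {"_id": 2, "ingredients": ["egg", "milk"], "source": "family"},
    {"_id": 3},
]

ingredient_index = build_index(recipes, "ingredients")
source_index = build_index(recipes, "source")
```

Looking up `ingredient_index["egg"]` then finds both recipes without scanning every document, while `source_index` stays small because only one document has that field.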
MongoDB Data Management
MongoDB provides horizontal scale-out for databases using a technique called sharding, which is transparent to applications. Sharding distributes data across multiple physical partitions called shards. Sharding allows MongoDB deployments to address the hardware limitations of a single server, such as bottlenecks in RAM or disk I/O, without adding complexity to the application.
Sharding is transparent to applications; whether there is one or one hundred shards, the application code for querying MongoDB is the same. Applications issue queries to a query router that dispatches the query to the appropriate shards.
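The routing idea can be sketched as follows. This is a toy model under stated assumptions: the `Router` class and `shard_for` function are hypothetical, and hashing the `_id` is just one possible way to pick a shard, chosen here for brevity rather than taken from the text above.

```python
import hashlib


def shard_for(shard_key_value, num_shards):
    """Pick a shard by hashing the key (an assumed partitioning scheme)."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class Router:
    """Toy query router: the caller never sees how many shards exist."""

    def __init__(self, num_shards):
        # Each shard is modeled as a dict from _id to document.
        self.shards = [dict() for _ in range(num_shards)]

    def insert(self, doc):
        shard = self.shards[shard_for(doc["_id"], len(self.shards))]
        shard[doc["_id"]] = doc

    def find_by_id(self, _id):
        shard = self.shards[shard_for(_id, len(self.shards))]
        return shard.get(_id)


def run_application(router):
    """Application code: identical whether there is 1 shard or 100."""
    for i in range(50):
        router.insert({"_id": i, "value": i * i})
    return [router.find_by_id(i)["value"] for i in (0, 7, 49)]


single = run_application(Router(1))
many = run_application(Router(100))
```

Both calls to `run_application` return the same values, which is the sense in which sharding is transparent to the application.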
MongoDB Consistency & Durability
MongoDB guarantees atomic updates to data at the document level. Because data for each record tends to exist in a single document, this level of granularity is sufficient for most applications. One or more fields may be updated in a single operation, including push operations and inserts into capped arrays.
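A minimal sketch of what document-level atomicity means for a multi-field update, assuming a hypothetical helper `apply_update`. It models the all-or-nothing behavior by building an updated copy in plain Python; it is not how the storage engine works internally.

```python
import copy


def apply_update(doc, set_fields=None, push=None, cap=None):
    """Return an updated copy of doc; the original is untouched on failure.

    set_fields: field -> new value
    push:       field -> item appended to that array field
    cap:        if given (a positive integer), keep only the last `cap` items
                of each pushed array, modeling a capped array
    """
    updated = copy.deepcopy(doc)
    for field, value in (set_fields or {}).items():
        updated[field] = value
    for field, item in (push or {}).items():
        array = updated.setdefault(field, [])
        if not isinstance(array, list):
            raise TypeError(f"field {field!r} is not an array")
        array.append(item)
        if cap:
            array[:] = array[-cap:]
    return updated


# Hypothetical document with a counter and a capped list of recent events.
doc = {"_id": 1, "views": 1, "recent": [1, 2, 3]}
new_doc = apply_update(doc, set_fields={"views": 2}, push={"recent": 4}, cap=3)
```

Here `new_doc` has both the new counter and the trimmed array, while `doc` still shows the old state; a reader sees one version or the other, never a mixture.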
MongoDB implements write-ahead journaling of operations to enable fast crash recovery and durability in the storage engine. Journaling helps prevent corruption and increases operational resilience. Journal commits are issued at least as often as every 100ms by default. In the case of a server crash, journal entries are recovered automatically.
MongoDB uses native replication to maintain multiple copies of data in what are called replica sets. A replica set is a fully self-healing shard that helps prevent database downtime. Replica failover is fully automated, eliminating the need for administrators to intervene manually.
The number of replicas in a MongoDB replica set is configurable, and a larger number of replicas provides increased data durability and protection against database downtime (e.g., in case of multiple machine failures, rack failures, data center failures, or network partitions). Optionally, operations can be configured to write to multiple replicas before returning to the application, thereby providing functionality that is similar to synchronous replication.
Replica sets also provide operational flexibility by providing a way to upgrade hardware and software without requiring the database to go offline.
CONFIGURABLE WRITE AVAILABILITY
MongoDB allows users to specify write availability in the system, which is called the write concern. The default write concern acknowledges writes from the application, allowing the client to catch network exceptions and duplicate key errors. Each write operation can specify the appropriate write concern, ranging from unacknowledged to acknowledgement that writes have been committed to multiple replicas, a majority of replicas, or all replicas. It is also possible to configure the write concern so that writes are only acknowledged once specific policies have been fulfilled, such as writing to at least two replicas in one data center and at least one replica in a second data center.
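The range of write concerns described above can be modeled as a count of replica acknowledgements that must arrive before the write is reported as successful. The sketch below is a simplified model in plain Python: the functions `required_acks` and `is_acknowledged` are hypothetical, and the string `"all"` is a label invented for this example rather than a documented setting.

```python
def required_acks(w, num_replicas):
    """Number of replica acknowledgements needed for write concern w.

    w may be an integer from 0 (unacknowledged) up to num_replicas,
    "majority", or "all" (a label used only in this sketch).
    """
    if w == "majority":
        return num_replicas // 2 + 1
    if w == "all":
        return num_replicas
    if isinstance(w, int) and 0 <= w <= num_replicas:
        return w
    raise ValueError(f"unsupported write concern: {w!r}")


def is_acknowledged(acks_received, w, num_replicas):
    """True once enough replicas have confirmed the write."""
    return acks_received >= required_acks(w, num_replicas)
```

With three replicas, a majority write needs two acknowledgements, so a single confirmation is not yet enough, whereas an unacknowledged write returns immediately.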
IN-MEMORY PERFORMANCE WITH ON-DISK CAPACITY
MongoDB makes extensive use of RAM to speed up database operations. Reading data from memory is measured in nanoseconds, whereas reading data from spinning disk is measured in milliseconds; reading from memory is approximately 100,000 times faster than reading data from disk. In MongoDB, all data is read and manipulated through memory-mapped files. Data that is not accessed is not loaded into RAM. While it is not required that all data fit in RAM, it should be the goal of the deployment team that indexes and all data that is frequently accessed should fit in RAM.
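The arithmetic behind the speed comparison, and the sizing goal for the frequently accessed data, can be written out as a short sketch. The latency figures (100 ns for memory, 10 ms for spinning disk) are assumed round numbers chosen to match the orders of magnitude above, and `working_set_fits` is a hypothetical helper.

```python
# Assumed illustrative latencies, both expressed in nanoseconds.
MEMORY_READ_NS = 100              # about 100 nanoseconds
DISK_READ_NS = 10 * 1_000_000     # about 10 milliseconds

# Ratio of disk latency to memory latency under these assumptions.
speedup = DISK_READ_NS // MEMORY_READ_NS

GIB = 1024 ** 3


def working_set_fits(index_bytes, hot_data_bytes, ram_bytes):
    """True if indexes plus frequently accessed data fit in RAM."""
    return index_bytes + hot_data_bytes <= ram_bytes


# Hypothetical deployment: 2 GiB of indexes, 10 GiB of hot data, 16 GiB RAM.
fits = working_set_fits(2 * GIB, 10 * GIB, 16 * GIB)
```

Under these assumed figures the ratio works out to 100,000, and the hypothetical deployment fits its indexes and hot data in memory with room to spare.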
For example, it may be the case that a fraction of the entire database is most frequently accessed by the application, such as data related to recent events or popular products. If the volume of data that is frequently accessed exceeds the capacity of a single machine, MongoDB can scale horizontally across multiple servers using automatic sharding. Because MongoDB provides in-memory performance, for most applications there is no need for a separate caching layer.