GIANT Stories at MongoDB

MongoDB Atlas Best Practices: Part 3

Scaling your MongoDB Atlas Deployment, Delivering Continuous Application Availability

MongoDB Atlas radically simplifies the operation of MongoDB. As with any hosted database as a service, there are still decisions you need to make to ensure the best performance and availability for your application. This blog series provides a set of recommendations that will serve as a solid foundation for getting the most out of the MongoDB Atlas service.

We’ll cover four main areas over this series of blog posts:

  • In part 1, we got started by preparing for our deployment, focusing specifically on schema design and application access patterns.
  • In part 2, we discussed additional considerations as you prepare for your deployment, including indexing, data migration and instance selection.
  • In this part 3 post, we'll dive into how you scale your MongoDB Atlas deployment, and achieve your required availability SLAs.
  • In the final part 4, we’ll wrap up with best practices for operational management and ensuring data security.

If you want to get a head start and learn about all of these topics now, just go ahead and download the MongoDB Atlas Best Practices guide.

Scaling a MongoDB Atlas Cluster

Horizontal Scaling with Sharding

*Figure 1: Create a sharded MongoDB Atlas cluster in just a few clicks*

MongoDB Atlas provides horizontal scale-out for databases using a technique called sharding, which is transparent to applications. MongoDB distributes data across multiple replica sets, called shards. With automatic balancing, MongoDB ensures data is equally distributed across shards as data volumes grow or as the size of the cluster increases or decreases. Sharding allows MongoDB deployments to scale beyond the limitations of a single server, such as bottlenecks in RAM or disk I/O, without adding complexity to the application.

MongoDB Atlas supports three types of sharding policy, enabling administrators to accommodate diverse query patterns (a minimal shell sketch follows the list):

  • Range-based sharding: Documents are partitioned across shards according to the shard key value. Documents with shard key values close to one another are likely to be co-located on the same shard. This approach is well suited for applications that need to optimize range-based queries.
  • Hash-based sharding: Documents are uniformly distributed according to an MD5 hash of the shard key value. Documents with shard key values close to one another are unlikely to be co-located on the same shard. This approach guarantees a uniform distribution of writes across shards – provided that the shard key has high cardinality – making it optimal for write-intensive workloads.
  • Location-aware sharding: Documents are partitioned according to a user-specified configuration that "tags" shard key ranges to physical shards residing on specific hardware.
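
To make the three policies concrete, here is a minimal mongo shell sketch. The database, collection, field, shard, and tag names are placeholders chosen for illustration, not part of the Atlas setup flow described above:

sh.enableSharding("mydb")

// Range-based: documents with nearby customerId values are co-located,
// which keeps range queries on customerId targeted to a few shards
sh.shardCollection("mydb.orders", { customerId: 1 })

// Hash-based: writes are spread uniformly across shards
sh.shardCollection("mydb.events", { deviceId: "hashed" })

// Location-aware: shard on a location field, then pin a key range to a tagged shard
sh.shardCollection("mydb.users", { country: 1, userId: 1 })
sh.addShardTag("shard0000", "EU")
sh.addTagRange("mydb.users",
  { country: "DE", userId: MinKey },
  { country: "DE", userId: MaxKey },
  "EU")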

Users should consider deploying a sharded MongoDB Atlas cluster in the following situations:

  • RAM Limitation: The size of the system's active working set plus indexes is expected to exceed the maximum amount of RAM available in the provisioned instance.
  • Disk I/O Limitation: The system will have a large amount of write activity, and the operating system will not be able to write data fast enough to meet demand, or I/O bandwidth will limit how fast the writes can be flushed to disk.
  • Storage Limitation: The data set will grow to exceed the storage capacity of a single node in the system.

Applications that meet these criteria, or that are likely to do so in the future, should be designed for sharding in advance rather than waiting until they have consumed available capacity. Applications that will eventually benefit from sharding should consider which collections they will want to shard and the corresponding shard keys when designing their data models. If a system has already reached or exceeded its capacity, it will be challenging to deploy sharding without impacting the application's performance.

Between 1 and 12 shards can be configured in MongoDB Atlas.

Sharding Best Practices

Users who choose to shard should consider the following best practices.

Select a good shard key: When selecting fields to use as a shard key, there are at least three key criteria to consider:

  1. Cardinality: Data partitioning is managed in 64 MB chunks by default. Low cardinality (e.g., a user's home country) will tend to group documents together on a small number of shards, which in turn will require frequent rebalancing of the chunks; in addition, a single country is likely to exceed the 64 MB chunk size. Instead, a shard key should exhibit high cardinality.
  2. Insert Scaling: Writes should be evenly distributed across all shards based on the shard key. If the shard key is monotonically increasing, for example, all inserts will go to the same shard even if they exhibit high cardinality, thereby creating an insert hotspot. Instead, the key should be evenly distributed.
  3. Query Isolation: Queries should be targeted to a specific shard to maximize scalability. If queries cannot be isolated to a specific shard, all shards will be queried in a pattern called scatter/gather, which is less efficient than querying a single shard.

Ensure uniform distribution of shard keys: When shard keys are not uniformly distributed for reads and writes, operations may be limited by the capacity of a single shard. When shard keys are uniformly distributed, no single shard will limit the capacity of the system.

For more on selecting a shard key, see Considerations for Selecting Shard Keys.

Avoid scatter-gather queries: In sharded systems, queries that cannot be routed to a single shard must be broadcast to multiple shards for evaluation. Because these queries involve multiple shards for each request they do not scale well as more shards are added.

Use hash-based sharding when appropriate: For applications that issue range-based queries, range-based sharding is beneficial because operations can be routed to the fewest shards necessary, usually a single shard. However, range-based sharding requires a good understanding of your data and queries, which in some cases may not be practical. Hash-based sharding ensures a uniform distribution of reads and writes, but it does not provide efficient range-based operations.

Apply best practices for bulk inserts: Pre-split data into multiple chunks so that no balancing is required during the insert process. For more information see Create Chunks in a Sharded Cluster in the MongoDB Documentation.

Add capacity before it is needed: Cluster maintenance is lower risk and simpler to manage if capacity is added before the system is over-utilized.

Continuous Availability & Data Consistency

Data Redundancy

Using native replication, MongoDB maintains multiple copies of data in what are called replica sets. Replica failover is fully automated in MongoDB, so it is not necessary to intervene manually to recover nodes in the event of a failure.

A replica set consists of multiple replica nodes. At any given time, one member acts as the primary replica and the other members act as secondary replicas. If the primary member fails for any reason (e.g., a failure of the host system), one of the secondary members is automatically elected to primary and begins to accept all writes; this is typically completed in 2 seconds or less and reads can optionally continue on the secondaries.

Sophisticated algorithms control the election process, ensuring only the most suitable secondary member is promoted to primary and reducing the risk of unnecessary failovers (also known as "false positives"). The election algorithm processes a range of parameters, including an analysis of replication histories to identify the replica set members that have applied the most recent updates from the primary, as well as heartbeat and connectivity status.

A larger number of replica nodes provides increased protection against database downtime in the case of multiple machine failures. A MongoDB Atlas replica set can be configured with 3, 5, or 7 replicas. Replica set members are deployed across availability zones so that the failure of a single data center does not interrupt service to the MongoDB Atlas cluster.

More information on replica sets can be found on the Replication MongoDB documentation page.

Write Guarantees

MongoDB allows administrators to specify the level of persistence guarantee when issuing writes to the database, which is called the write concern. The following options can be selected in the application code (a shell-level sketch follows the list):

  • Write Acknowledged: This is the default write concern. The mongod will confirm the execution of the write operation, allowing the client to catch network, duplicate key, Document Validation, and other exceptions
  • Journal Acknowledged: The mongod will confirm the write operation only after it has flushed the operation to the journal on the primary. This confirms that the write operation can survive a mongod crash and ensures that the write operation is durable on disk
  • Replica Acknowledged: It is also possible to wait for acknowledgment of writes to other replica set members. MongoDB supports writing to a specific number of replicas. This mode also ensures that the write is written to the journal on the secondaries. Because replicas can be deployed across racks within data centers and across multiple data centers, ensuring writes propagate to additional replicas can provide extremely robust durability
  • Majority: This write concern waits for the write to be applied to a majority of replica set members, and that the write is recorded in the journal on these replicas – including on the primary
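
A rough sketch of how these options look when issuing a write from the mongo shell (the orders collection and document contents are hypothetical; driver syntax is analogous):

// Default: acknowledged by the primary
db.orders.insertOne({ item: "abc", qty: 1 })

// Journal acknowledged: wait for the write to reach the primary's journal
db.orders.insertOne({ item: "abc", qty: 1 }, { writeConcern: { w: 1, j: true } })

// Replica acknowledged: wait for two replica set members to apply the write
db.orders.insertOne({ item: "abc", qty: 1 }, { writeConcern: { w: 2 } })

// Majority: wait for a majority of members, journaled
db.orders.insertOne({ item: "abc", qty: 1 }, { writeConcern: { w: "majority", j: true } })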

Read Preferences

Reading from the primary replica is the default configuration as it guarantees consistency. Updates are typically replicated to secondaries quickly, depending on network latency; however, reads on the secondaries will not normally be consistent with reads on the primary. Note that the secondaries are not idle, as they must process all writes replicated from the primary. To increase read capacity in your operational system, consider sharding. Secondary reads can be useful for analytics and ETL applications, as this approach isolates that traffic from operational workloads. You may choose to read from secondaries if your application can tolerate eventual consistency.

A very useful option is primaryPreferred, which issues reads to a secondary replica only if the primary is unavailable. This configuration allows for the continuous availability of reads during the short failover process.
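
For example, the mongo shell lets you set a read preference per connection or per query; the analytics collection below is a hypothetical placeholder:

// Read from the primary, falling back to a secondary during failover
db.getMongo().setReadPref("primaryPreferred")

// Or direct a single analytical query to a secondary
db.analytics.find({ region: "EMEA" }).readPref("secondary")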

For more on the subject of configurable reads, see the MongoDB Documentation page on replica set Read Preference.

Next Steps

That’s a wrap for part 3 of the MongoDB Atlas best practices blog series. In the final instalment, we’ll dive into best practices for operational management and ensuring data security.

Remember, if you want to get a head start and learn about all of our recommendations now:

Download MongoDB Atlas Best Practices guide


Avoiding the Dark Side of the Cloud: Platform Lock-In

Mat Keep

Business, Cloud

Date Your Cloud Provider….But Don’t Marry Them

The rise of cloud computing is indisputable – driven primarily by the promise of agility in bringing new applications to market faster, and by more closely aligning expense with actual business usage. But moving to the cloud is not without risk. Many surveys, such as the MongoDB Cloud Brief discussed later, point to the fear of platform lock-in as one of the top inhibitors to ongoing cloud adoption. Enterprises are turning to open source software to throw off the shackles of proprietary hardware and software, but they are also concerned about exposing the business to a new level of lock-in, this time from the APIs and services of the cloud providers themselves.

In this blog, we explore the drivers and inhibitors of cloud adoption, as well as which factors are driving the fear of cloud lock-in. We’ll then discuss the steps users can take to get the best of both worlds – the business velocity provided by the cloud, without the risks of locking themselves into a specific vendor.

Growing Cloud Adoption

So how quickly is the cloud growing? Recent analysis of the Infrastructure-as-a-Service (IaaS) market by IDC (1) provides some interesting statistics:

  • Spending on public cloud platforms is expected to reach $23bn by the end of 2016, representing just under 20% growth over 2015.
  • Private cloud spending is expected to reach $13bn over the same period, representing 10% growth.

If we contrast this with spending on “traditional” IT infrastructure, we see a forecast decline of 4.5% through 2016. Now is not a good time to be a peddler of premium IT hardware. By 2020, IDC expects total IaaS cloud spending to hit just under $60bn, making revenues (almost) as large as the traditional IT infrastructure sector.

Of course the cloud is a natural home for startups building their businesses. I’m old enough to remember when early seed funding was dedicated purely to financing your own Sun hardware and Oracle software licenses so that you could actually demo your new concept. The thought of doing this today is laughable.

But it’s not just startups that are driving cloud growth. Research from RightScale (2) concluded that 17% of enterprises now have over 1,000 Virtual Machines (VMs) provisioned to public cloud providers, up from 13% of enterprises in 2015. Private cloud showed even stronger growth with 31% of enterprises running more than 1,000 VMs, up from 22% in 2015.

Here at MongoDB, we’ve conducted our own research, polling over 2,000 members of the MongoDB community. This research found that 82% of respondents were strategically using or evaluating the cloud today. This, along with a multitude of other fun facts and insights, is available in our MongoDB 2016 Cloud Brief.

So What are the Top Drivers, and the Top Inhibitors of Cloud Adoption?

As the Cloud Brief shows, the number one driver for cloud adoption is agility – the need to roll out new applications faster. This desire was reinforced at a recent meeting I had in London with developers from a leading global financial institution. They complained it takes three months for hardware supporting a new project to be procured, installed, racked, and stacked. Clearly unacceptable in today’s hyper-competitive market governed by agile development, continuous integration, and elastic scaling.

This need for application agility was the top cited reason for cloud adoption across organizations of all sizes – from those with less than 50 employees to enterprises with more than 5,000. It was also the top reason for cloud adoption cited across all job titles – from the CIO through to architects, developers and DBAs.

The Cloud Brief shows another interesting statistic. The majority of respondents use more than one cloud provider. This was primarily driven by the need to take advantage of specific features offered by one provider over another, and it clearly demonstrates the need to remain flexible in your cloud choices. Hitching your wagon to one provider could put you at a serious competitive disadvantage if another cloud vendor introduces something that your rivals can take advantage of, but you can’t. What is that “something”? It could be a specific service or feature, region, instance type, pricing schedule, performance kick. The list goes on.

When our survey respondents evaluated the leading inhibitors of cloud usage, security and data privacy came out on top, followed closely by cloud vendor lock-in. We did see more bifurcation in the responses to this specific question:

  • Security was the top inhibitor in medium and large-sized enterprises. Lock-in was the second top inhibitor.
  • Lock-in was the top concern among smaller enterprises.

What did company size have to do with the differences in response?

Small organizations have increased freedom to innovate quickly and are less likely to be tied to legacy software. Maintaining maximum flexibility as they build their apps means avoiding vendor lock-in that can present restrictions on this ability to evolve.

Larger organizations are more likely to have mature contracts with software vendors and are therefore less sensitive to the loss of flexibility caused by long-term vendor agreements. Concerns over data security resonate more for large enterprises as high profile attacks and data breaches are a substantial threat to a large brand. However, lock-in was the second top concern for these larger organizations, ahead of the technical expertise needed to run on the cloud, or concerns about maintaining performance and availability SLAs for workloads running in the cloud.

So Where Does the Fear of Lock-in Come From?

As discussed in the introduction to this post, many organizations have been burnt by lock-in in the past. The use of open source software and commodity hardware has provided an escape route for many, but they have concerns that by moving to the cloud, they trade one form of lock-in for another.

What form does that lock-in take? It’s not about the hardware, operating systems, and software of the past; instead it’s about APIs, services, and data. The underlying IaaS components, made up of compute, storage, and networking, are pretty much commodity and can be exchanged between cloud providers. But as we move up the infrastructure stack, the APIs and data these services exchange become much less portable. Specifically, we need to think about security, management, continuous integration/continuous delivery (CI/CD) pipelines, container orchestration, serverless compute fabrics, content management, search, databases, data warehouses, and analytics, to name just some of the key friction points.

And it’s those services that manage our data that cause particular concern. You may have heard of the term “data gravity”. It was (presciently) coined a few years ago, but has a real resonance today. Just as an object’s gravitational pull grows with its mass, so it is with data gravity: the more data you have in a specific location, the harder that data is to move.

An article (3) in the UK’s Computing tech publication illustrates this point. Comparethemarket.com, the largest price comparison site in the UK, made the switch from managing its own on-premises infrastructure to Amazon Web Services (AWS). As a part of that move, the IT team considered the AWS DynamoDB NoSQL database service. However, concerns around exposing itself to excessive AWS control made comparethemarket eliminate DynamoDB as an option. The company has since standardized on MongoDB as the operational database for its microservices-based architecture.

There is an important take-away in all of this:

It’s fine to date your cloud provider….but don’t marry them.

MongoDB and the Cloud

We’ve just launched our shiny new MongoDB Atlas database as a service, providing all of the features of MongoDB, without the operational heavy lifting required for any application.

So isn’t this new service also presenting the risk of cloud lock-in? The answer is “no”, for two important reasons.

The first is that MongoDB Atlas is designed to run on multiple public cloud platforms – so you can spin it up on your vendor of choice. It is available on AWS today, with Azure and Google Cloud Platform coming soon. Eventually we plan to offer MongoDB Atlas across clouds, so you can stretch your MongoDB deployment across providers to take advantage of, for example, specific pricing schemes, regions, or platform features.

Secondly, MongoDB Atlas is running the same software you can download yourself from the MongoDB Download Center. This means MongoDB can run on your laptop, on your own local servers, in your chosen co-location facilities, or on your own instances on any public cloud provider.

It is quick and easy to migrate existing databases into MongoDB Atlas, and to get it back out again, as we demonstrate in this MongoDB Atlas migration blog. What is also really helpful in mitigating lock-in is that if you decide you want to bring operations out of MongoDB Atlas and back under your control, it is easy to move your databases onto your own infrastructure and manage them using the MongoDB Ops Manager and MongoDB Cloud Manager tools. The user experience across MongoDB Atlas, Cloud Manager, and Ops Manager is consistent, ensuring that disruption is minimal if you decide to switch to your own infrastructure.

*Figure 1: Consistent operational interface, wherever you run MongoDB*

The reality is that if you try to achieve this type of flexibility with any of the public cloud vendors' database services, you’ll soon hit a wall.

Next Steps

We all know the cloud is great. But it doesn’t come without risks. If you are looking to move your databases to the cloud, whether for new apps or migrations of existing on-premises apps, go ahead and check out MongoDB Atlas. It provides freedom from database cloud lock-in.

Try MongoDB Atlas


(1) http://www.informationweek.com/cloud/infrastructure-as-a-service/cloud-spending-will-top-$37-billion-in-2016-idc-reports/d/d-id/1326193

(2) http://www.rightscale.com/blog/cloud-industry-insights/cloud-computing-trends-2016-state-cloud-survey

(3) http://www.computing.co.uk/ctg/news/2411982/making-movies-how-comparethemarketcom-got-meerkat-movies-up-and-running-in-months

MongoDB Atlas Best Practices: Part 2

Preparing for your MongoDB Deployment: Indexing, Data Migration & Instance Selection

MongoDB Atlas radically simplifies the operation of MongoDB. As with any hosted database as a service, there are still decisions you need to make to ensure the best performance and availability for your application. This blog series provides a set of recommendations that will serve as a solid foundation for getting the most out of the MongoDB Atlas service.

We’ll cover four main areas over this series of blog posts:

  • In part 1, we got started by preparing for our deployment, focusing specifically on schema design and application access patterns.
  • In this part 2 post, we’ll discuss additional considerations as you prepare for your deployment, including indexing, data migration and instance selection.
  • In part 3, we’ll dive into how you scale your MongoDB Atlas deployment, and achieve your required availability SLAs.
  • In the final part 4, we’ll wrap up with best practices for operational management and ensuring data security.

If you want to get a head start and learn about all of these topics now, just go ahead and download the MongoDB Atlas Best Practices guide.

Indexing

As in most database management systems, indexes are a crucial mechanism for optimizing MongoDB query performance. While indexes will improve the performance of some operations by one or more orders of magnitude, they add overhead to updates and consume disk space and memory. Users should always create indexes to support queries, but should not maintain indexes that queries do not use. This is particularly important for deployments with insert-heavy workloads, or with writes that modify indexed values.

To understand the effectiveness of the existing indexes being used, an $indexStats aggregation stage can be used to determine how frequently each index is used. This information can also be accessed through MongoDB Compass.
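
For example, a minimal aggregation against a hypothetical orders collection reports how often each of its indexes has been used since the last server restart:

db.orders.aggregate([
  { $indexStats: {} },
  { $project: { name: 1, "accesses.ops": 1, "accesses.since": 1 } }
])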

Query Optimization

Queries are automatically optimized by MongoDB to make evaluation of the query as efficient as possible. Evaluation normally includes the selection of data based on predicates, and the sorting of data based on the sort criteria provided. The query optimizer selects the best indexes to use by periodically running alternate query plans and selecting the index with the best performance for each query type. The results of this empirical test are stored as a cached query plan and periodically updated.

MongoDB provides an explain plan capability that shows information about how a query will be, or was, resolved, including:

  • The number of documents returned
  • The number of documents read
  • Which indexes were used
  • Whether the query was covered, meaning no documents needed to be read to return results
  • Whether an in-memory sort was performed, which indicates an index would be beneficial
  • The number of index entries scanned
  • How long the query took to resolve in milliseconds (when using the executionStats mode)
  • Which alternative query plans were rejected (when using the allPlansExecution mode)

The explain plan will show 0 milliseconds if the query was resolved in less than 1 ms, which is typical in well-tuned systems. When the explain plan is called, prior cached query plans are abandoned, and the process of testing multiple indexes is repeated to ensure the best possible plan is used. The query plan can be calculated and returned without first having to run the query. This enables DBAs to review which plan will be used to execute the query, without having to wait for the query to run to completion. The feedback from explain() will help you understand whether your query is performing optimally.
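
As a minimal sketch, the following shell query (against a hypothetical orders collection) returns the winning plan together with execution statistics such as the number of keys and documents examined:

db.orders.find({ status: "shipped" })
         .sort({ orderDate: -1 })
         .explain("executionStats")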

*Figure 1: MongoDB Compass visual explain plan*

MongoDB Compass also provides rich query plan visualizations to help engineering teams quickly assess and optimize query execution.

Profiling

MongoDB provides a profiling capability called Database Profiler, which logs fine-grained information about database operations. The profiler can be enabled to log information for all events or only those events whose duration exceeds a configurable threshold (whose default is 100 ms). Profiling data is stored in a capped collection where it can easily be searched for relevant events. It may be easier to query this collection than parsing the log files.
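
Where your deployment permits it, the profiler can be enabled from the shell; the 50 ms threshold below is illustrative:

// Log operations on the current database that take longer than 50 ms
db.setProfilingLevel(1, 50)

// Review the five most recent entries in the profiler's capped collection
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()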

Primary and Secondary Indexes

A unique index on the _id attribute is created for all documents. MongoDB will automatically create the _id field and assign a unique value if a value is not specified when the document is inserted. All user-defined indexes are secondary indexes. MongoDB includes support for many types of secondary indexes that can be declared on any field(s) in the document, including fields within arrays and sub-documents. Index options include:

  • Compound indexes
  • Geospatial indexes
  • Text search indexes
  • Unique indexes
  • Array indexes
  • TTL indexes
  • Sparse indexes
  • Partial Indexes
  • Hash indexes

You can learn more about each of these indexes from the MongoDB Architecture Guide.

Index Creation Options

Indexes and data are updated synchronously in MongoDB, thus ensuring queries on indexes never return stale or deleted data. The appropriate indexes should be determined as part of the schema design process. By default creating an index is a blocking operation in MongoDB. Because the creation of indexes can be time and resource intensive, MongoDB provides an option for creating new indexes as a background operation on both the primary and secondary members of a replica set. When the background option is enabled, the total time to create an index will be greater than if the index was created in the foreground, but it will still be possible to query the database while creating indexes.

In addition, multiple indexes can be built concurrently in the background. Refer to the Build Index on Replica Sets documentation to learn more about considerations for index creation and on-going maintenance.
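
A minimal example of a background build on a hypothetical orders collection:

// Build a compound index without blocking other operations on the database
db.orders.createIndex(
  { customerId: 1, orderDate: -1 },
  { background: true }
)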

Common Mistakes Regarding Indexes

The following tips may help to avoid some common mistakes regarding indexes:

  • Use a compound index rather than index intersection: For best performance when querying via multiple predicates, compound indexes will generally be a better option.
  • Compound indexes: Compound indexes are defined and ordered by field. So, if a compound index is defined for last name, first name and city, queries that specify last name or last name and first name will be able to use this index, but queries that try to search based on city will not be able to benefit from this index. Remove indexes that are prefixes of other indexes (see the sketch after this list).
  • Low selectivity indexes: An index should radically reduce the set of possible documents to select from. For example, an index on a field that indicates gender is not as beneficial as an index on zip code, or even better, phone number.
  • Regular expressions: Indexes are ordered by value, hence leading wildcards are inefficient and may result in full index scans. Trailing wildcards can be efficient if there are sufficient case-sensitive leading characters in the expression.
  • Negation: Inequality queries can be inefficient with respect to indexes. Like most database systems, MongoDB does not index the absence of values and negation conditions may require scanning all documents. If negation is the only condition and it is not selective (for example, querying an orders table, where 99% of the orders are complete, to identify those that have not been fulfilled), all records will need to be scanned.
  • Eliminate unnecessary indexes: Indexes are resource-intensive: they consume RAM, and as fields are updated their associated indexes must be maintained, incurring additional disk I/O overhead. To understand the effectiveness of the existing indexes, an $indexStats aggregation stage can be used to determine how frequently each index is used. If there are indexes that are not used, removing them will reduce storage and speed up writes.
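
To illustrate the compound index prefix rule referenced in the list above (collection and field names are hypothetical):

db.customers.createIndex({ lastName: 1, firstName: 1, city: 1 })

// These queries can use the index, because they match a left-most prefix:
db.customers.find({ lastName: "Smith" })
db.customers.find({ lastName: "Smith", firstName: "Anna" })

// This query cannot, because city on its own is not a prefix of the index:
db.customers.find({ city: "London" })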

Working Sets

MongoDB makes extensive use of RAM to speed up database operations. In MongoDB, all data is read and manipulated through in-memory representations of the data. Reading data from memory is measured in nanoseconds and reading data from disk is measured in milliseconds, thus reading from memory is orders of magnitude faster than reading from disk.

The set of data and indexes that are accessed during normal operations is called the working set. It is best practice that the working set fits in RAM. It may be the case that the working set represents a fraction of the entire database, such as in applications where data related to recent events or popular products is accessed most commonly.

When MongoDB attempts to access data that has not been loaded in RAM, it must be read from disk. If there is free memory then the operating system can locate the data on disk and load it into memory directly. However, if there is no free memory, MongoDB must write some other data from memory to disk, and then read the requested data into memory. This process can be time consuming and significantly slower than accessing data that is already resident in memory.

Some operations may inadvertently purge a large percentage of the working set from memory, which adversely affects performance. For example, a query that scans all documents in the database, where the database is larger than available RAM on the server, will cause documents to be read into memory and may lead to portions of the working set being written out to disk. Other examples include various maintenance operations such as compacting or repairing a database and rebuilding indexes.

If your database working set size exceeds the available RAM of your system, consider provisioning an instance with larger RAM capacity (scaling up) or sharding the database across additional instances (scaling out). Scaling is an automated, on-line operation which is launched by selecting the new configuration after clicking the CONFIGURE button in MongoDB Atlas (Figure 2). For a discussion on this topic, refer to the section on Sharding Best Practices in part 3 of the blog series. It is easier to implement sharding before the system’s resources are consumed, so capacity planning is an important element in successful project delivery.

*Figure 2: Reconfiguring the MongoDB Atlas Cluster*

Data Migration

Users should assess how best to model their data for their applications rather than simply importing the flat file exports of their legacy systems. In a traditional relational database environment, data tends to be moved between systems using delimited flat files such as CSV. While it is possible to ingest data into MongoDB from CSV files, this may in fact only be the first step in a data migration process. It is typically the case that MongoDB's document data model provides advantages and alternatives that do not exist in a relational data model.

There are many options for migrating data from flat files into rich JSON documents, including mongoimport, custom scripts, ETL tools, or the application itself, which can read from the existing RDBMS and then write a JSON version of each record to MongoDB.
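
As one hedged example, a CSV export can be loaded with mongoimport before being reshaped into richer documents; the host, credentials, database, and file names below are placeholders:

mongoimport --host <atlas-host>:27017 --ssl \
  --username appUser --password <password> --authenticationDatabase admin \
  --db mydb --collection products \
  --type csv --headerline --file products.csv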

Other tools such as mongodump and mongorestore, or MongoDB Atlas backups are useful for moving data between different MongoDB systems. The use of mongodump and mongorestore to migrate an application and its data to MongoDB Atlas is described in the post – Migrating Data to MongoDB Atlas.

MongoDB Atlas Instance Selection

The following recommendations are only intended to provide high-level guidance for hardware for a MongoDB deployment. The specific configuration of your hardware will be dependent on your data, queries, performance SLA, and availability requirements.

Memory

As with most databases, MongoDB performs best when the working set (indexes and most frequently accessed data) fits in RAM. Sufficient RAM is the most important factor for instance selection; other optimizations may not significantly improve the performance of the system if there is insufficient RAM. When selecting which MongoDB Atlas instance size to use, opt for one that has sufficient RAM to hold the full working data set (or the appropriate subset if sharding).

If your working set exceeds the available RAM, consider using a larger instance type or adding additional shards to your system.

Storage

Using faster storage can increase database performance and latency consistency. Each node must be configured with sufficient storage for the full data set, or for the subset to be stored in a single shard. The storage speed and size can be set when picking the MongoDB Atlas instance during cluster creation or reconfiguration.

*Figure 3: Select instance size and storage size & speed*

Data volumes can optionally be encrypted which increases security at the expense of reduced performance.

CPU

MongoDB Atlas instances are multi-threaded and can take advantage of many CPU cores. Specifically, the total number of active threads (i.e., concurrent operations) relative to the number of CPUs can impact performance:

  • Throughput increases as the number of concurrent active operations increases up to and beyond the number of CPUs
  • Throughput eventually decreases as the number of concurrent active operations exceeds the number of CPUs by some threshold amount

The threshold amount depends on your application. You can determine the optimum number of concurrent active operations for your application by experimenting and measuring throughput.

The larger MongoDB Atlas instances include more virtual CPUs and so should be considered for highly concurrent workloads.

Next Steps

That’s a wrap for part 2 of the MongoDB Atlas best practices blog series. In Part 3, we’ll dive into scaling your MongoDB Atlas cluster, and achieving continuous availability.

Download MongoDB Atlas Best Practices Guide


MongoDB Atlas Best Practices: Part 1

Preparing for your MongoDB Deployment: Schema Design & Access Patterns

MongoDB Atlas radically simplifies the operation of MongoDB. As with any hosted database as a service there are still decisions you need to take to ensure the best performance and availability for your application. This blog series provides a number of recommendations that will serve as a solid foundation for getting the most out of the MongoDB Atlas service.

We’ll cover four main areas over this series of blog posts:

  • In this part 1 post, we’ll get started with preparing for your deployment, focusing specifically on schema design and application access patterns.
  • In part 2, we’ll discuss additional considerations as you prepare for your deployment, including indexing, data migration, and instance selection.
  • In part 3, we’ll dive into how you scale your MongoDB Atlas deployment, and achieve your required availability SLAs.
  • In the final part 4, we’ll wrap up with best practices for operational management and ensuring data security.

If you want to get a head start and learn about all of these topics now, just go ahead and download the MongoDB Atlas Best Practices guide.

So What is MongoDB Atlas?

MongoDB Atlas provides all of the features of MongoDB, without the operational heavy lifting required for any new application. MongoDB Atlas is available on-demand through a pay-as-you-go model and billed on an hourly basis, letting you focus on your code and your customers.

It’s easy to get started – use a simple GUI to select the appropriate instance size, geographic region, and features you need. MongoDB Atlas provides:

  • Security features to protect access to your data
  • Built in replication for always-on availability, tolerating complete data center failure
  • Backups and point in time recovery to protect against data corruption
  • Fine-grained monitoring to help you know when to scale. Additional instances can be provisioned with the push of a button
  • Automated patching and one-click upgrades for new major versions of the database, enabling you to take advantage of the latest and greatest MongoDB features
  • A choice of cloud providers, regions, and billing options

MongoDB Atlas is versatile. It’s great for everything from a quick Proof of Concept, to test/QA environments, to complete production clusters. If you decide you want to bring operations back under your control, it is easy to move your databases onto your own infrastructure and manage them using MongoDB Ops Manager or MongoDB Cloud Manager. The user experience across MongoDB Atlas, Cloud Manager, and Ops Manager is consistent, ensuring that disruption is minimal if you decide to migrate to your own infrastructure.

So now that you know what MongoDB Atlas is, let’s get started preparing for our deployment.

Schema Design

Developers and data architects should work together to develop the right data model, and they should invest time in this exercise early in the project. The requirements of the application should drive the data model, updates, and queries of your MongoDB system. Given MongoDB's dynamic schema, developers and data architects can continue to iterate on the data model throughout the development and deployment processes to optimize performance and storage efficiency, as well as support the addition of new application features. All of this can be done without expensive schema migrations.

Document Model

MongoDB stores data as documents in a binary representation called BSON. The BSON encoding extends the popular JSON representation to include additional types such as int, long, and date. BSON documents contain one or more fields, and each field contains a value of a specific data type, including arrays, sub-documents and binary data. It may be helpful to think of documents as roughly equivalent to rows in a relational database, and fields as roughly equivalent to columns. However, MongoDB documents tend to have all related data for a given record or object in a single document, whereas in a relational database that data is usually normalized across rows in many tables. For example, data that belongs to a parent-child relationship in two RDBMS tables can frequently be collapsed (embedded) into a single document in MongoDB. For operational applications, the document model makes JOINs redundant in many cases.

Where possible, store all data for a record in a single document. MongoDB provides ACID compliance at the document level. When data for a record is stored in a single document the entire record can be retrieved in a single seek operation, which is very efficient. In some cases it may not be practical to store all data in a single document, or it may negatively impact other operations. Make the trade-offs that are best for your application.
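
For example, a hypothetical customer document might embed the addresses and recent orders that a relational schema would normalize into separate tables:

{
  "customerId": 12345,
  "name": { "first": "Anna", "last": "Smith" },
  "addresses": [
    { "type": "home", "city": "London", "postcode": "SW1A 1AA" },
    { "type": "work", "city": "Bristol", "postcode": "BS1 4DJ" }
  ],
  "recentOrders": [
    { "orderId": 1001, "total": 59.99, "placedAt": ISODate("2016-10-01T10:00:00Z") }
  ]
}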

Rather than storing a large array of items in an indexed field, storing groups of values across multiple fields results in more efficient updates.

Collections

Collections are groupings of documents. Typically all documents in a collection have similar or related purposes for an application. It may be helpful to think of collections as being analogous to tables in a relational database.

Dynamic Schema & Document Validation

MongoDB documents can vary in structure. For example, documents that describe users might all contain the user id and the last date they logged into the system, but only some of these documents might contain the user's shipping address, and perhaps some of those contain multiple shipping addresses. MongoDB does not require that all documents conform to the same structure. Furthermore, there is no need to declare the structure of documents to the system – documents are self-describing.

DBAs and developers have the option to define Document Validation rules for a collection – enabling them to enforce checks on selected parts of a document's structure, data types, data ranges, and the presence of mandatory fields. As a result, DBAs can apply data governance standards, while developers maintain the benefits of a flexible document model. These are covered in the blog post Document Validation: Adding Just the Right Amount of Control Over Your Documents.
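
As a minimal sketch using the query-expression validator syntax (the contacts collection and its fields are hypothetical):

db.createCollection("contacts", {
  validator: { $and: [
    { email: { $exists: true } },                  // email is mandatory
    { status: { $in: ["Active", "Inactive"] } }    // status limited to known values
  ]},
  validationLevel: "strict",   // apply the rules to all inserts and updates
  validationAction: "error"    // reject documents that fail validation
})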

Indexes

MongoDB uses B-tree indexes to optimize queries. Indexes are defined on a collection’s document fields. MongoDB includes support for many indexes, including compound, geospatial, TTL, text search, sparse, partial, unique, and others. For more information see the section on indexing in the 2nd instalment of this blog series.

Transactions

Atomicity of updates may influence the schema for your application. MongoDB guarantees ACID compliant updates to data at the document level. It is not possible to update multiple documents in a single atomic operation, however the ability to embed related data into MongoDB documents eliminates this requirement in many cases. For use cases that do require multiple documents to be updated atomically, it is possible to implement Two Phase Commit logic in the application.

Visualizing your Schema: MongoDB Compass

The MongoDB Compass GUI allows users to understand the structure of existing data in the database and perform ad hoc queries against it – all with zero knowledge of MongoDB's query language. Typical users could include architects building a new MongoDB project or a DBA who has inherited a database from an engineering team and must now maintain it in production. In either case, they need to understand what kind of data is present, define which indexes might be appropriate, and identify whether Document Validation rules should be added to enforce a consistent document structure.

*Figure 1: View schema & interactively build and execute database queries with MongoDB Compass*

Without MongoDB Compass, users wishing to understand the shape of their data would have to connect to the MongoDB shell and write queries to reverse engineer the document structure, field names, and data types. Similarly, anyone wanting to run custom queries on the data would need to understand MongoDB's query language.

MongoDB Compass can be used for free during development and it is also available for production use with MongoDB Professional or MongoDB Enterprise Advanced subscriptions.

Application Access Patterns

Schema design has a huge influence on database performance. How the application accesses the data can also have a major impact.

Searching on indexed attributes is typically the single most important pattern as it avoids collection scans. Taking it a step further, using covered queries avoids the need to access the collection data altogether. Covered queries return results from the indexes directly without accessing documents and are therefore very efficient. For a query to be covered, all the fields included in the query must be present in an index, and all the fields returned by the query must also be present in that index. To determine whether a query is a covered query, use the explain() method. If the explain() output displays true for the indexOnly field, the query is covered by an index, and MongoDB queries only that index to match the query and return the results.
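
As a minimal sketch against a hypothetical users collection, the query below can be answered from the index alone because both the filter and the projection use only indexed fields (note that _id must be excluded unless it is part of the index):

db.users.createIndex({ username: 1, status: 1 })

// Covered query: no documents need to be fetched
db.users.find(
  { username: "asmith" },
  { _id: 0, username: 1, status: 1 }
).explain("executionStats")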

Rather than retrieving the entire document in your application, updating fields, and then saving the document back to the database, issue the update to the specific fields that changed. This reduces network usage and database overhead.
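
For example, to change one field and increment a counter on a hypothetical user document:

// Only the listed fields are modified; the rest of the document is left untouched
db.users.updateOne(
  { username: "asmith" },
  { $set: { "address.city": "Bristol" }, $inc: { loginCount: 1 } }
)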

Document Size

The maximum BSON document size in MongoDB is 16 MB. Users should avoid certain application patterns that would allow documents to grow unbounded. For example, in an e-commerce application it would be difficult to estimate how many reviews each product might receive from customers. Furthermore, it is typically the case that only a subset of reviews is displayed to a user, such as the most popular or the most recent reviews. Rather than modeling the product and customer reviews as a single document it would be better to model each review or groups of reviews as a separate document with a reference to the product document; while also storing the key reviews in the product document for fast access.

In practice most documents are a few kilobytes or less. Consider documents more like rows in a table than the tables themselves. Rather than maintaining lists of records in a single document, make each record its own document. For large media items, such as video or images, consider using GridFS, a convention implemented by all the drivers that automatically stores the binary data across many smaller documents.

Field names are repeated across documents and consume space – RAM in particular. By using smaller field names your data will consume less space, which allows for a larger number of documents to fit in RAM.

Data Lifecycle Management

MongoDB provides features to facilitate the management of data lifecycles, including Time to Live indexes, and capped collections.

Time to Live (TTL)

If documents in a collection should only persist for a pre-defined period of time, the TTL feature can be used to automatically delete documents of a certain age rather than scheduling a process to check the age of all documents and run a series of deletes. For example, if user sessions should only exist for one hour, the TTL can be set to 3600 seconds for a date field called lastActivity that exists in documents used to track user sessions and their last interaction with the system. A background thread will automatically check all these documents and delete those that have been idle for more than 3600 seconds. Another example use case for TTL is a price quote that should automatically expire after a period of time.
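
Continuing the session example, a minimal TTL index sketch (the sessions collection name is assumed for illustration):

// Documents are removed roughly 3600 seconds after their lastActivity timestamp
db.sessions.createIndex(
  { lastActivity: 1 },
  { expireAfterSeconds: 3600 }
)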

Capped Collections

In some cases a rolling window of data should be maintained in the system based on data size. Capped collections are fixed-size collections that support high-throughput inserts and reads based on insertion order. A capped collection behaves like a circular buffer: data is inserted into the collection, that insertion order is preserved, and when the total size reaches the threshold of the capped collection, the oldest documents are deleted to make room for the newest documents. For example, store log information from a high-volume system in a capped collection to quickly retrieve the most recent log entries.
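
A minimal sketch of creating and reading a capped collection for log entries (the collection name and size are illustrative):

// A 1 GB circular buffer; the oldest entries are evicted as new ones arrive
db.createCollection("applicationLog", { capped: true, size: 1024 * 1024 * 1024 })

// Insertion order is preserved, so the most recent entries are easy to read back
db.applicationLog.find().sort({ $natural: -1 }).limit(10)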

Dropping a Collection

It is very efficient to drop a collection in MongoDB. If your data lifecycle management requires periodically deleting large volumes of documents, it may be best to model those documents as a single collection. Dropping a collection is much more efficient than removing all documents or a large subset of a collection, just as dropping a table is more efficient than deleting all the rows in a table in a relational database.

Disk space is automatically reclaimed after a collection is dropped.

Next Steps

That’s a wrap for part 1 of the MongoDB Atlas best practices blog series. In Part 2, we’ll continue along the path of preparing for our first deployment by discussing indexing and data migration.

Download MongoDB Atlas Best Practice Guide


Develop & Deploy a Node.js App to AWS Elastic Beanstalk & MongoDB Atlas

This post is part of our Road to re:Invent series. In the weeks leading up to AWS re:Invent in Las Vegas this November, we'll be posting about a number of topics related to running MongoDB in the public cloud. This post provides an introduction to Amazon Kinesis: its architecture, what it provides, and how it's typically used. It goes on to step through how to implement an application where data is ingested by Amazon Kinesis before being processed and then stored in MongoDB Atlas.

This is part of a series of posts which examine how to use MongoDB Atlas with a number of complementary technologies and frameworks.

Introduction to Amazon Kinesis

The role of Amazon Kinesis is to get large volumes of streaming data into AWS where it can then be processed, analyzed, and moved between AWS services. The service is designed to ingest and store terabytes of data every hour, from multiple sources. Kinesis provides high availability, including synchronous replication within an AWS region. It also transparently handles scalability, adding and removing resources as needed.

Once the data is inside AWS, it can be processed or analyzed immediately, as well as being stored using other AWS services (such as S3) for later use. By storing the data in MongoDB, it can be used both to drive real-time, operational decisions as well as for deeper analysis.

As the number, variety, and velocity of data sources grow, new architectures and technologies are needed. Technologies like Amazon Kinesis and Apache Kafka are focused on ingesting the massive flow of data from multiple fire hoses and then routing it to the systems that need it – optionally filtering, aggregating, and analyzing en-route.

AWS Kinesis Architecture

Figure 1: AWS Kinesis Architecture

Typical data sources include:

  • IoT assets and devices (e.g., sensor readings)
  • On-line purchases from an ecommerce store
  • Log files
  • Video game activity
  • Social media posts
  • Financial market data feeds

Rather than leave this data to fester in text files, Kinesis can ingest the data, allowing it to be processed to find patterns, detect exceptions, drive operational actions, and provide aggregations to be displayed through dashboards.

There are actually 3 services which make up Amazon Kinesis:

  • Amazon Kinesis Firehose is the simplest way to load massive volumes of streaming data into AWS. The capacity of your Firehose is adjusted automatically to keep pace with the stream throughput. It can optionally compress and encrypt the data before it's stored.
  • Amazon Kinesis Streams are similar to the Firehose service but give you more control, allowing for:
    • Multi-stage processing
    • Custom stream partitioning rules
    • Reliable storage of the stream data until it has been processed.
  • Amazon Kinesis Analytics is the simplest way to process the data once it has been ingested by either Kinesis Firehose or Streams. The user provides SQL queries which are then applied to analyze the data; the results can then be displayed, stored, or sent to another Kinesis stream for further processing.

This post focuses on Amazon Kinesis Streams, in particular, implementing a consumer that ingests the data, enriches it, and then stores it in MongoDB.

Accessing Kinesis Streams – the Libraries

There are multiple ways to read (consume) and write (produce) data with Kinesis Streams:

  • Amazon Kinesis Streams API
  • Amazon Kinesis Producer Library (KPL)
    • An easy to use and highly configurable Java library that helps you put data into an Amazon Kinesis stream. The KPL presents a simple, asynchronous, high throughput, and reliable interface.
  • Amazon Kinesis Agent
    • The agent continuously monitors a set of files and sends new entries to your Stream or Firehose.
  • Amazon Kinesis Client Library (KCL)
    • A Java library that helps you easily build Amazon Kinesis Applications for reading and processing data from an Amazon Kinesis stream. KCL handles issues such as adapting to changes in stream volume, load-balancing streaming data, coordinating distributed services, providing fault-tolerance, and processing data.
  • Amazon Kinesis Client Library MultiLangDaemon
    • The MultiLangDaemon acts as a proxy, allowing non-Java applications to use the Kinesis Client Library.
  • Amazon Kinesis Connector Library
    • A library that helps you easily integrate Amazon Kinesis with other AWS services and third-party tools.
  • Amazon Kinesis Storm Spout
    • A library that helps you easily integrate Amazon Kinesis Streams with Apache Storm.

The example application in this post uses the Kinesis Agent and the Kinesis Client Library MultiLangDaemon (with Node.js).

Role of MongoDB Atlas

MongoDB is a distributed database delivering a flexible schema for rapid application development, rich queries, idiomatic drivers, and built in redundancy and scale-out. This makes it the go-to database for anyone looking to build modern applications.

MongoDB Atlas is a hosted database service for MongoDB. It provides all of the features of MongoDB, without the operational heavy lifting required for any new application. MongoDB Atlas is available on demand through a pay-as-you-go model and billed on an hourly basis, letting you focus on what you do best.

It’s easy to get started – use a simple GUI to select the instance size, region, and features you need. MongoDB Atlas provides:

  • Security features to protect access to your data
  • Built in replication for always-on availability, tolerating complete data center failure
  • Backups and point in time recovery to protect against data corruption
  • Fine-grained monitoring to let you know when to scale. Additional instances can be provisioned with the push of a button
  • Automated patching and one-click upgrades for new major versions of the database, enabling you to take advantage of the latest and greatest MongoDB features
  • A choice of regions and billing options

Like Amazon Kinesis, MongoDB Atlas is a natural fit for users looking to simplify their development and operations work, letting them focus on what makes their application unique rather than commodity (albeit essential) plumbing. Also like Kinesis, you only pay for MongoDB Atlas when you're using it with no upfront costs and no charges after you terminate your cluster.

Example Application

The rest of this post focuses on building a system to process log data. There are 2 sources of log data:

  1. A simple client that acts as a Kinesis Streams producer, generating sensor readings and writing them to a stream
  2. Amazon Kinesis Agent monitoring a SYSLOG file and sending each log event to a stream

In both cases, the data is consumed from the stream using the same consumer, which adds some metadata to each entry and then stores it in MongoDB Atlas.

Create Kinesis IAM Policy in AWS

From the IAM section of the AWS console use the wizard to create a new policy. The policy should grant permission to perform specific actions on a particular stream (in this case "ClusterDBStream") and the results should look similar to this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1476360711000",
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStream",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords",
                "kinesis:PutRecord",
                "kinesis:PutRecords",
                "kinesis:CreateStream"
            ],
            "Resource": [
                "arn:aws:kinesis:eu-west-1:658153047537:stream/ClusterDBStream"
            ]
        },
        {
            "Sid": "Stmt1476360824000",
            "Effect": "Allow",
            "Action": [
                "dynamodb:CreateTable",
                "dynamodb:DeleteItem",
                "dynamodb:DescribeTable",
                "dynamodb:GetItem",
                "dynamodb:PutItem",
                "dynamodb:Scan",
                "dynamodb:UpdateItem"
            ],
            "Resource": [
                "arn:aws:dynamodb:eu-west-1:658153047537:table/ClusterDBStream"
            ]
        },
        {
            "Sid": "Stmt1476360951000",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Next, create a new user and associate it with the new policy. Important: Take a note of the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

Create MongoDB Atlas Cluster

Register with MongoDB Atlas and use the simple GUI to select the instance size, region, and features you need (Figure 2).

Figure 2: Create MongoDB Atlas Cluster

Create a user with read and write privileges for just the database that will be used for your application, as shown in Figure 3.

Figure 3: Creating an Application user in MongoDB Atlas

You must also add the IP address of your application server to the IP Whitelist in the MongoDB Atlas security tab (Figure 4). Note that if multiple application servers will be accessing MongoDB Atlas then an IP address range can be specified in CIDR format (IP address/number of significant bits, e.g., 203.0.113.0/24).

Figure 4: Add App Server IP Address(es) to MongoDB Atlas

If your application server(s) are running in AWS, then an alternative to IP Whitelisting is to configure a VPC (Virtual Private Cloud) Peering relationship between your MongoDB Atlas group and the VPC containing your AWS resources. This removes the requirement to add and remove IP addresses as AWS reschedules functions, and is especially useful when using highly dynamic services such as AWS Lambda.

Click the "Connect" button and make a note of the URI that should be used when connecting to the database (note that you will substitute the user name and password with ones that you've just created).

App Part 1 – Kinesis/Atlas Consumer

The code and configuration files in Parts 1 & 2 are based on the sample consumer and producer included with the client library for Node.js (MultiLangDaemon).

Install the Node.js client library:

git clone https://github.com/awslabs/amazon-kinesis-client-nodejs.git
cd amazon-kinesis-client-nodejs
npm install

Install the MongoDB Node.js Driver:

npm install --save mongodb

Move to the consumer sample folder:

cd samples/basic_sample/consumer/

Create a configuration file ("logging_consumer.properties"), taking care to set the correct stream and application names and AWS region:

# The script that abides by the multi-language protocol. This script will
# be executed by the MultiLangDaemon, which will communicate with this script
# over STDIN and STDOUT according to the multi-language protocol.
executableName = node logging_consumer_app.js

# The name of an Amazon Kinesis stream to process.
streamName = ClusterDBStream

# Used by the KCL as the name of this application. Will be used as the name
# of an Amazon DynamoDB table which will store the lease and checkpoint
# information for workers with this application name
applicationName = ClusterDBStream

# Users can change the credentials provider the KCL will use to retrieve credentials.
# The DefaultAWSCredentialsProviderChain checks several other providers, which is
# described here:
# http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html
AWSCredentialsProvider = DefaultAWSCredentialsProviderChain

# Appended to the user agent of the KCL. Does not impact the functionality of the
# KCL in any other way.
processingLanguage = nodejs/0.10

# Valid options are TRIM_HORIZON or LATEST.
# See http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html#API_GetShardIterator_RequestSyntax
initialPositionInStream = TRIM_HORIZON

# The following properties are also available for configuring the KCL Worker that is created
# by the MultiLangDaemon.

# The KCL defaults to us-east-1
regionName = eu-west-1

The code for working with MongoDB can be abstracted to a helper file ("db.js"):

var MongoClient = require('mongodb').MongoClient;
var assert = require('assert');
var logger = require('../../util/logger');
var util = require('util');

function DB() {
    this.db = "empty";
    this.log = logger().getLogger('mongoMange-DB');
}

DB.prototype.connect = function(uri, callback) {
    this.log.info(util.format('About to connect to DB'));
    if (this.db != "empty") {
        callback();
        this.log.info('Already connected to database.');
    } else {
        var _this = this;
        MongoClient.connect(uri, function(err, database) {
            if (err) {
                _this.log.info(util.format('Error connecting to DB: %s', err.message));
                callback(err);
            } else {
                _this.db = database;
                _this.log.info(util.format('Connected to database.'));
                callback();
            }
        })
    }
}

DB.prototype.close = function(callback) {
    this.log.info('Closing database');
    this.db.close();
    this.log.info('Closed database');
    callback();
}

DB.prototype.addDocument = function(coll, doc, callback) {
    var collection = this.db.collection(coll);
    var _this = this;
    collection.insertOne(doc, function(err, result) {
        if (err) {
            _this.log.info(util.format('Error inserting document: %s', err.message));
            callback(err.message);
        } else {
            _this.log.info(util.format('Inserted document into %s collection.', coll));
            callback();
        }
    });
};

module.exports = DB;

Create the application Node.js file ("logging_consumer_app.js"), making sure to replace the database user and host details in "mongodbConnectString" with your own:

'use strict';

var fs = require('fs');
var path = require('path');
var util = require('util');
var kcl = require('../../..');
var logger = require('../../util/logger');
var DB = require('./db.js');

var mongodbConnectString = 'mongodb://kinesis-user:??????@cluster0-shard-00-00-qfovx.mongodb.net:27017,cluster0-shard-00-01-qfovx.mongodb.net:27017,cluster0-shard-00-02-qfovx.mongodb.net:27017/clusterdb?ssl=true&replicaSet=Cluster0-shard-0&authSource=admin'
var mongodbCollection = 'logdata'
var database = new DB;

function recordProcessor() {
  var log = logger().getLogger('recordProcessor');
  var shardId;

  return {

    initialize: function(initializeInput, completeCallback) {
      shardId = initializeInput.shardId;

      // WARNING – the connection string may contain the password and so consider removing logging for any production system
      log.info(util.format('About to connect to %s.', mongodbConnectString));
      database.connect(mongodbConnectString, function(err) {
        log.info(util.format('Back from connecting to %s', mongodbConnectString));
        if (err) {
          log.info(util.format('Error connecting to %s: %s', mongodbConnectString, err.message));
        }
        completeCallback();
      })
    },

    processRecords: function(processRecordsInput, completeCallback) {
      log.info('In processRecords');

      if (!processRecordsInput || !processRecordsInput.records) {
        completeCallback();
        return;
      }
      var records = processRecordsInput.records;
      var record, data, sequenceNumber, partitionKey, objectToStore;
      for (var i = 0 ; i < records.length ; ++i) {
        record = records[i];
        data = new Buffer(record.data, 'base64').toString();
        sequenceNumber = record.sequenceNumber;
        partitionKey = record.partitionKey;
        log.info(util.format('ShardID: %s, Record: %s, SequenceNumber: %s, PartitionKey: %s', shardId, data, sequenceNumber, partitionKey));
        objectToStore = {};
        try {
          objectToStore = JSON.parse(data);
        }
        catch(err) {
          // Looks like it wasn't JSON so store the raw string
          objectToStore.payload = data;
        }
        objectToStore.metaData = {};
        objectToStore.metaData.mongoLabel = "Added by MongoMange";
        objectToStore.metaData.timeAdded = new Date();
        database.addDocument(mongodbCollection, objectToStore, function(err) {})
      }
      if (!sequenceNumber) {
        completeCallback();
        return;
      }
      // If checkpointing, completeCallback should only be called once checkpoint is complete.
      processRecordsInput.checkpointer.checkpoint(sequenceNumber, function(err, sequenceNumber) {
        log.info(util.format('Checkpoint successful. ShardID: %s, SequenceNumber: %s', shardId, sequenceNumber));
        completeCallback();
      });
    },

    shutdown: function(shutdownInput, completeCallback) {
      // Checkpoint should only be performed when shutdown reason is TERMINATE.
      if (shutdownInput.reason !== 'TERMINATE') {
        completeCallback();
        return;
      }
      // Whenever checkpointing, completeCallback should only be invoked once checkpoint is complete.
      database.close(function(){
        shutdownInput.checkpointer.checkpoint(function(err) {
          completeCallback();
        });        
      });
    }
  };
}

kcl(recordProcessor()).run();

Note that this code adds some metadata to the received object before writing it to MongoDB. At this point, it is also possible to filter objects based on any of their fields.
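
For example, to store only the sensor readings above a certain threshold, a guard can be placed around the lines that decorate and store each record. The sketch below is illustrative only; the 500000 threshold and the test on the "reading" field are assumptions, not part of the original sample:

// Inside processRecords, in place of the lines that add the metadata and
// call addDocument: skip any sensor reading at or below the threshold
if (objectToStore.reading === undefined || objectToStore.reading > 500000) {
  objectToStore.metaData = {};
  objectToStore.metaData.mongoLabel = "Added by MongoMange";
  objectToStore.metaData.timeAdded = new Date();
  database.addDocument(mongodbCollection, objectToStore, function(err) {});
}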

Note also that this Node.js code logs a lot of information to the "application log" file (including the database password!); this is for debugging and would be removed from a real application.

The simplest way to have the application use the user credentials (noted when creating the user in AWS IAM) is to export them from the shell where the application will be launched:

export AWS_ACCESS_KEY_ID=????????????????????
export AWS_SECRET_ACCESS_KEY=????????????????????????????????????????

Finally, launch the consumer application:

../../../bin/kcl-bootstrap --java /usr/bin/java -e -p ./logging_consumer.properties

Check the "application.log" file for any errors.

App Part 2 – Kinesis Producer

As with the consumer, move to the producer sample folder and export the credentials for the user created in AWS IAM:

cd amazon-kinesis-client-nodejs/samples/basic_sample/producer

export AWS_ACCESS_KEY_ID=????????????????????
export AWS_SECRET_ACCESS_KEY=????????????????????????????????????????

Create the configuration file ("config.js") and ensure that the correct AWS region and stream are specified:

'use strict';

var config = module.exports = {
  kinesis : {
    region : 'eu-west-1'
  },

  loggingProducer : {
    stream : 'ClusterDBStream',
    shards : 2,
    waitBetweenDescribeCallsInSeconds : 5
  }
};

Create the producer code ("logging_producer.js"):

'use strict';

var util = require('util');
var logger = require('../../util/logger');

function loggingProducer(kinesis, config) {
  var log = logger().getLogger('loggingProducer');

  function _createStreamIfNotCreated(callback) {
    var params = {
      ShardCount : config.shards,
      StreamName : config.stream
    };

    kinesis.createStream(params, function(err, data) {
      if (err) {
        if (err.code !== 'ResourceInUseException') {
          callback(err);
          return;
        }
        else {
          log.info(util.format('%s stream is already created. Re-using it.', config.stream));
        }
      }
      else {
        log.info(util.format("%s stream doesn't exist. Created a new stream with that name ..", config.stream));
      }

      // Poll to make sure stream is in ACTIVE state before start pushing data.
      _waitForStreamToBecomeActive(callback);
    });
  }

  function _waitForStreamToBecomeActive(callback) {
    kinesis.describeStream({StreamName : config.stream}, function(err, data) {
      if (!err) {
        log.info(util.format('Current status of the stream is %s.', data.StreamDescription.StreamStatus));
        if (data.StreamDescription.StreamStatus === 'ACTIVE') {
          callback(null);
        }
        else {
          setTimeout(function() {
            _waitForStreamToBecomeActive(callback);
          }, 1000 * config.waitBetweenDescribeCallsInSeconds);
        }
      }
    });
  }

  function _writeToKinesis() {
    var currTime = new Date().getMilliseconds();
    var sensor = 'sensor-' + Math.floor(Math.random() * 100000);
    var reading = Math.floor(Math.random() * 1000000);

    var record = JSON.stringify({
      program: "logging_producer",
      time : currTime,
      sensor : sensor,
      reading : reading
    });

    var recordParams = {
      Data : record,
      PartitionKey : sensor,
      StreamName : config.stream
    };

    kinesis.putRecord(recordParams, function(err, data) {
      if (err) {
        log.error(err);
      }
      else {
        log.info('Successfully sent data to Kinesis.');
      }
    });
  }

  return {
    run: function() {
      _createStreamIfNotCreated(function(err) {
        if (err) {
          log.error(util.format('Error creating stream: %s', err));
          return;
        }
        var count = 0;
        while (count < 10) {
          // Pass the function reference (not the result of calling it) so that
          // each of the 10 writes is scheduled, staggered one second apart
          setTimeout(_writeToKinesis, 1000 * (count + 1));
          count++;
        }
      });
    }
  };
}

module.exports = loggingProducer;

The producer is launched from "logging_producer_app.js":

'use strict';

var AWS = require('aws-sdk');
var config = require('./config');
var producer = require('./logging_producer');

var kinesis = new AWS.Kinesis({region : config.kinesis.region});
producer(kinesis, config.loggingProducer).run();

Run the producer:

node logging_producer_app.js

Check the consumer and producer "application.log" files for errors.

At this point, data should have been written to MongoDB Atlas. Using the connection string provided after clicking the "Connect" button in MongoDB Atlas, connect to the database and confirm that the documents have been added:

mongo "mongodb://cluster0-shard-00-00-qfovx.mongodb.net:27017,cluster0-shard-00-01-qfovx.mongodb.net:27017,cluster0-shard-00-02-qfovx.mongodb.net:27017/admin?replicaSet=Cluster0-shard-0" --ssl --username kinesis-user --password ?????? 

use clusterdb
db.logdata.find()

{ "_id" : ObjectId("5804d1d0aa1f330731204597"), "program" : "logging_producer", "time" : 702, "sensor" : "sensor-81057", "reading" : 639075, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.142Z") } }
{ "_id" : ObjectId("5804d1d0aa1f330731204598"), "program" : "logging_producer", "time" : 695, "sensor" : "sensor-805", "reading" : 981144, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.142Z") } }
{ "_id" : ObjectId("5804d1d0aa1f330731204599"), "program" : "logging_producer", "time" : 699, "sensor" : "sensor-2581", "reading" : 752020, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.143Z") } }
{ "_id" : ObjectId("5804d1d0aa1f33073120459a"), "program" : "logging_producer", "time" : 700, "sensor" : "sensor-56194", "reading" : 455700, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.144Z") } }
{ "_id" : ObjectId("5804d1d0aa1f33073120459b"), "program" : "logging_producer", "time" : 706, "sensor" : "sensor-32956", "reading" : 113233, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.144Z") } }
{ "_id" : ObjectId("5804d1d0aa1f33073120459c"), "program" : "logging_producer", "time" : 707, "sensor" : "sensor-96487", "reading" : 179047, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.144Z") } }
{ "_id" : ObjectId("5804d1d0aa1f33073120459d"), "program" : "logging_producer", "time" : 697, "sensor" : "sensor-37595", "reading" : 935647, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:44.144Z") } }
{ "_id" : ObjectId("5804d1d15f0fbb074446ad6d"), "program" : "logging_producer", "time" : 704, "sensor" : "sensor-92660", "reading" : 756624, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:45.263Z") } }
{ "_id" : ObjectId("5804d1d15f0fbb074446ad6e"), "program" : "logging_producer", "time" : 701, "sensor" : "sensor-95222", "reading" : 850749, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:45.263Z") } }
{ "_id" : ObjectId("5804d1d15f0fbb074446ad6f"), "program" : "logging_producer", "time" : 704, "sensor" : "sensor-1790", "reading" : 271359, "metaData" : { "mongoLabel" : "Added by MongoMange", "timeAdded" : ISODate("2016-10-17T13:27:45.266Z") } }

App Part 3 – Capturing Live Logs Using Amazon Kinesis Agent

Using the same consumer, the next step is to stream real log data. Fortunately, this doesn't require any additional code as the Kinesis Agent can be used to monitor files and add every new entry to a Kinesis Stream (or Firehose).

Install the Kinesis Agent:

sudo yum install -y aws-kinesis-agent

and edit the configuration file to use the correct AWS region, user credentials, and stream in "/etc/aws-kinesis/agent.json":

{
  "cloudwatch.emitMetrics": true,
  "kinesis.endpoint": "kinesis.eu-west-1.amazonaws.com",
  "cloudwatch.endpoint": "monitoring.eu-west-1.amazonaws.com",
  "awsAccessKeyId": "????????????????????",
  "awsSecretAccessKey": "????????????????????????????????????????", 
  "flows": [
    {
      "filePattern": "/var/log/messages*",
      "kinesisStream": "ClusterDBStream",
      "dataProcessingOptions": [{
        "optionName": "LOGTOJSON",
        "logFormat": "SYSLOG"
      }]
    }
  ]
}

"/var/log/messages" is a SYSLOG file and so a "dataProcessingOptions" field is included in the configuration to automatically convert each log into a JSON document before writing it to the Kinesis Stream.

The agent will not run as root and so the permissions for "/var/log/messages" need to be made more permissive:

sudo chmod og+r /var/log/messages

The agent can now be started:

sudo service aws-kinesis-agent start

Monitor the agent's log file to see what it's doing:

sudo tail -f /var/log/aws-kinesis-agent/aws-kinesis-agent.log

If there aren't enough logs being generated on the machine then extra ones can be injected manually for testing:

logger -i This is a test log

This will create a log with the "program" field set to your username (in this case, "ec2-user"). Check that the logs get added to MongoDB Atlas:

mongo "mongodb://cluster0-shard-00-00-qfovx.mongodb.net:27017,cluster0-shard-00-01-qfovx.mongodb.net:27017,cluster0-shard-00-02-qfovx.mongodb.net:27017/admin?replicaSet=Cluster0-shard-0" --ssl --username kinesis-user --password ?????? 

use clusterdb
db.logdata.findOne({program: "ec2-user"})

{
  "_id" : ObjectId("5804c9ed5f0fbb074446ad5f"),
  "timestamp" : "Oct 17 12:53:48",
  "hostname" : "ip-172-31-40-154",
  "program" : "ec2-user",
  "processid" : "6377",
  "message" : "This is a test log",
  "metaData" : {
    "mongoLabel" : "Added by MongoMange",
    "timeAdded" : ISODate("2016-10-17T12:54:05.456Z")
  }
}

Checking the Data with MongoDB Compass

To visually navigate through the MongoDB schema and data, download and install MongoDB Compass. Use your MongoDB Atlas credentials to connect Compass to your MongoDB database (the hostname should refer to the primary node in your replica set or a "mongos" process if your MongoDB cluster is sharded).

Navigate through the structure of the data in the "clusterdb" database (Figure 5) and view the JSON documents.

Figure 5: Explore Schema Using MongoDB Compass

Clicking on a value builds a query and then clicking "Apply" filters the results (Figure 6).

Figure 6: View Filtered Documents in MongoDB Compass

Add Document Validation Rules

One of MongoDB’s primary attractions for developers is that it gives them the ability to start application development without first needing to define a formal schema. Operations teams appreciate the fact that they don’t need to perform a time-consuming schema upgrade operation every time the developers need to store a different attribute.

This is well suited to the application built in this post as logs from different sources are likely to include different attributes. There are however some attributes that we always expect to be there (e.g., the metadata that the application is adding). For applications reading the documents from this collection to be able to rely on those fields being present, the documents should be validated before they are written to the database. Prior to MongoDB 3.2, those checks had to be implemented in the application but they can now be performed by the database itself.

Executing a single command from the "mongo" shell adds the document checks:

db.runCommand({
   collMod: "logdata",
   validator: { 
      $and: [
        {program: {$type: "string"}},
        {"metaData.mongoLabel": {$type: "string"}},
        {"metaData.timeAdded": {$type: "date"}}
    ]}})

The above command adds multiple checks:

  • The "program" field exists and contains a string
  • There's a sub-document called "metaData" containing at least 2 fields:
    • "mongoLabel" which must be a string
    • "timeAdded" which must be a date

Test that the rules are correctly applied when attempting to write to the database:

db.logdata.insert(
  {
  "program" : "dummy_entry",
  "time" : 666,
  "sensor" : "sensor-6666",
  "reading" : 66666,
  "metaData" : {
    "mongoLabel" : "Test Data",
    "timeAdded" : ISODate("2016-10-17T13:27:44.142Z")
  }
})
WriteResult({ "nInserted" : 1 })

db.logdata.insert(
  {
  "program" : "dummy_entry",
  "time" : 666,
  "sensor" : "sensor-6666",
  "reading" : 66666,
  "metaData" : {
    "mongoLabel" : "Test Data",
    "timeAdded" : "Just now"
  }
})

WriteResult({
  "nInserted" : 0,
  "writeError" : {
    "code" : 121,
    "errmsg" : "Document failed validation"
  }
})

Cleaning Up (IMPORTANT!)

Remember that you will continue to be charged for the services even when you're no longer actively using them. If you no longer need to use the services then clean up:

  • From the MongoDB Atlas GUI, select your Cluster, click on the ellipses and select "Terminate".
  • From the AWS management console select the Kinesis service, then Kinesis Streams, and then delete your stream.
  • From the AWS management console select the DynamoDB service, then tables, and then delete your table.

Using MongoDB Atlas with Other Frameworks and Services

We have detailed walkthroughs for using MongoDB Atlas with several programming languages and frameworks, as well as generic instructions that can be used with others. They can all be found in Using MongoDB Atlas From Your Favorite Language or Framework.


MongoDB Atlas, the cloud database service for MongoDB, is the easiest way to deploy and run MongoDB, allowing you to get started in minutes. Click here to learn more.

The MongoDB team will be at AWS re:Invent this November in Las Vegas and our CTO Eliot Horowitz will be speaking Thursday (12/1) at 11 am PST. If you’re attending re:Invent, be sure to attend the session & visit us at booth #1344!


Learn more about AWS re:Invent

MongoDB Atlas as The Data Store for Apostrophe

Apostrophe is a Content Management System (CMS) designed for building content-driven web sites. It is built upon MongoDB and Node.js because of their ease of use.

This post explains why MongoDB Atlas is an ideal choice for Apostrophe and then goes on to show how to configure Apostrophe to use it.

Why MongoDB Atlas is the Ideal Database for Apostrophe

MongoDB delivers flexible schemas, rich queries, an idiomatic Node.js driver, and simple to use high availability and scaling. This makes it the go-to database for anyone looking to build applications on Node.js.

MongoDB Atlas provides all of the features of MongoDB, without the operational heavy lifting required for any new application. MongoDB Atlas is available on demand through a pay-as-you-go model and billed on an hourly basis, letting you focus on what you do best.

It’s easy to get started – use a simple GUI to select the instance size, region, and features you need. MongoDB Atlas provides:

  • Security features to protect access to your data
  • Built in replication for always-on availability, tolerating complete data center failure
  • Backups and point in time recovery to protect against data corruption
  • Fine-grained monitoring to let you know when to scale. Additional instances can be provisioned with the push of a button
  • Automated patching and one-click upgrades for new major versions of the database, enabling you to take advantage of the latest and greatest MongoDB features
  • A choice of cloud providers, regions, and billing options

Like Apostrophe, MongoDB Atlas is a natural fit for users looking to simplify their development and operations work, letting them focus on what makes their application unique rather than commodity (albeit essential) plumbing.

Installing Apostrophe and Setting it up to Use MongoDB Atlas

Before starting with Apostrophe, you should launch your MongoDB cluster using MongoDB Atlas and then (optionally) create a user with read and write privileges for just the database that will be used for this project, as shown in Figure 1. You must also add the IP address of your application server to the IP Whitelist in the MongoDB Atlas security tab.

Figure 1: Creating an Apostrophe user in MongoDB Atlas

If it isn't already installed on your system, download and install Node.js:

$ curl https://nodejs.org/dist/v4.4.7/node-v4.4.7-linux-x64.tar.xz -o node.tar.xz
$ tar xf node.tar.xz

You should then add the bin sub-folder to the PATH in your .bash_profile, install ImageMagick (used by Apostrophe to handle image files), clone the Apostrophe Sandbox project, and then install its dependencies:

$ sudo yum install ImageMagick
$ mkdir Sites
$ cd Sites/
$ git clone https://github.com/punkave/apostrophe-sandbox
$ cd apostrophe-sandbox
$ npm install

Before starting Apostrophe you need to configure it with details on how to connect to your specific MongoDB Atlas cluster. This is done by copying the example configuration file to data/local.js:

$ mkdir data
$ cp local.example.js data/local.js

You should then edit the data/local.js file and set the uri parameter using the specific connection information provided for your MongoDB Atlas group:

db: {
    uri: 'mongodb://apostrophe_user:my_password@cluster0-shard-00-00-qfovx.mongodb.net:27017,cluster0-shard-00-01-qfovx.mongodb.net:27017,cluster0-shard-00-02-qfovx.mongodb.net:27017/clusterdb?ssl=true&authSource=admin'
  }

The URI contains these components:

  • apostrophe_user is the name of the user you created in the MongoDB Atlas UI
  • my_password is the password you chose when creating the user in MongoDB Atlas
  • cluster0-shard-00-00-qfovx.mongodb.net, cluster0-shard-00-01-qfovx.mongodb.net, & cluster0-shard-00-02-qfovx.mongodb.net are the hostnames of the instances in your MongoDB Atlas replica set (click on the "CONNECT" button in the MongoDB Atlas UI if you don't have these)
  • 27017 is the standard MongoDB port number
  • clusterdb is the name of the database (schema) that Apostrophe will use (note that this must match the project name used when installing Apostrophe as well as the database you granted the user access to)
  • To enforce security, MongoDB Atlas mandates that the ssl option is used
  • admin is the database that's being used to store the credentials for apostrophe_user
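
Before starting Apostrophe, it can be worth sanity-checking this URI with a few lines of Node.js. This is just a sketch and assumes the mongodb driver package is available where you run it (npm install mongodb in a scratch folder if it isn't); substitute your own URI:

var MongoClient = require('mongodb').MongoClient;

// Use exactly the same URI that will go into data/local.js
var uri = 'mongodb://apostrophe_user:my_password@cluster0-shard-00-00-qfovx.mongodb.net:27017,cluster0-shard-00-01-qfovx.mongodb.net:27017,cluster0-shard-00-02-qfovx.mongodb.net:27017/clusterdb?ssl=true&authSource=admin';

MongoClient.connect(uri, function(err, db) {
  if (err) throw err;
  console.log('Successfully connected to MongoDB Atlas');
  db.close();
});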

Clients connect to Apostrophe through port 3000 and so you must open that port in your firewall.

You can then create the database and start Apostrophe:

$ node app apostrophe:reset
$ node app

Testing the Application

Browse to the application at http://address-of-app-server:3000 as shown in Figure 2 and then login using the username admin and the password demo.

Figure 2: Apostrophe Running on MongoDB Atlas

Now, go ahead and add some content (Figure 3).

Figure 3: Edit Apostrophe Home Page with Data Stored in MongoDB Atlas

Upload some images as shown in Figure 4.

Figure 4: Upload Images to Apostrophe on MongoDB Atlas

Optionally, to confirm that MongoDB Atlas really is being used by Apostrophe, you can connect using the MongoDB shell:

$ sudo yum install mongodb-org-shell
$ mongo mongodb://cluster0-shard-00-00-qfovx.mongodb.net:27017,cluster0-shard-00-01-qfovx.mongodb.net:27017,cluster0-shard-00-02-qfovx.mongodb.net:27017/admin?replicaSet=Cluster0-shard-0 --ssl --username billy --password XXXXXX

Cluster0-shard-0:PRIMARY> show dbs
admin      0.000GB
clusterdb  0.001GB
local      0.000GB
Cluster0-shard-0:PRIMARY> use clusterdb
switched to db clusterdb
Cluster0-shard-0:PRIMARY> show collections
aposCache
aposPages
aposRedirects
aposVersions
aposVideos
sessions
Cluster0-shard-0:PRIMARY> db.aposPages.findOne()
{
    "_id" : "4444444444444",
    "slug" : "/",
    "path" : "home",
    "title" : "Pokémon - Gotta catch 'em all",
    "level" : 0,
    "type" : "home",
    "published" : true,
    "homepageText" : {
        "slug" : "/:homepageText",
        "items" : [
            {
                "type" : "richText",
                "content" : "<h3>Pokémon – Gotta catch 'em all</h3>\n\n<p>At first it felt like a bit of fun, a good way to encourage my son to get off the sofa and walk around the area a bit. Noone warned me what came next.</p>\n\n<p> </p>\n\n<p>For a while, things were OK – the Pokémon Go servers only stayed up for a few minutes at a time and so enjoyed some harmless entertainment chasing critters around the neighbourhood before the rest of the World woke up and brought things down. </p>\n\n<p> </p>\n\n<p> </p>"
            }
        ],
        "type" : "area"
    },
    "sortTitle" : "pokémon gotta catch em all",
    "highSearchText" : "pokémon gotta catch em all pokemon",
    "highSearchWords" : [
        "pokémon",
        "gotta",
        "catch",
        "em",
        "all",
        "pokemon"
    ],
    "lowSearchText" : "pokémon gotta catch em all pokemon pokémon gotta catch em all at first it felt like a bit of fun a good way to encourage my son to get off the sofa and walk around the area a bit noone warned me what came next for a while things were ok the pokémon go servers only stayed up for a few minutes at a time and so enjoyed some harmless entertainment chasing critters around the neighbourhood before the rest of the world woke up and brought things down",
    "searchSummary" : "\nPokémon – Gotta catch 'em all\n\n\nAt first it felt like a bit of fun, a good way to encourage my son to get off the sofa and walk around the area a bit. Noone warned me what came next.\n\n\n \n\n\nFor a while, things were OK – the Pokémon Go servers only stayed up for a few minutes at a time and so enjoyed some harmless entertainment chasing critters around the neighbourhood before the rest of the World woke up and brought things down. \n\n\n \n\n\n ",
    "seoDescription" : "",
    "tags" : [
        "pokemon"
    ],
    "orphan" : false,
    "pagePermissions" : [ ]
}

To visually navigate through the schema and data created by Apostrophe, download and install MongoDB Compass. Use your MongoDB Atlas credentials to connect Compass to your MongoDB database – Figure 5.

Figure 5: Connect MongoDB Compass to MongoDB Atlas

Navigate through the structure of the data in the clusterdb database (Figure 6) and view the JSON documents (Figure 7).

Figure 6: Explore Apostrophe Schema Using MongoDB Compass

Figure 7: View Apostrophe Documents in MongoDB Compass

What Next?

While MongoDB Atlas radically simplifies the operation of MongoDB there are still some decisions to take to ensure the best performance and reliability for your application. The MongoDB Atlas Best Practices white paper provides guidance on best practices for deploying, managing, and optimizing the performance of your database with MongoDB Atlas.

The guide outlines considerations for achieving performance at scale with MongoDB Atlas across a number of key dimensions, including instance size selection, application patterns, schema design and indexing, and disk I/O. While this guide is broad in scope, it is not exhaustive. Following the recommendations in the guide will provide a solid foundation for ensuring optimal application performance.

Download Atlas Best Practice Guide


Atlas Driver Compatibility and Testing for Node.js

Jay Gordon

Cloud

Jay Gordon is a Technical Account Manager with MongoDB and is available via our chat to discuss MongoDB Cloud Products at https://cloud.mongodb.com.

MongoDB was excited to announce the introduction of our Database as a Service, Atlas. Atlas lets you put aside your concerns with building your database’s infrastructure and focus on building your application.

One of the more important parts of developing your application is using the driver that fits with your version of MongoDB. Today we’re going to build a tiny Node.js application to validate that our driver works, that we’re able to authenticate, and that we’re ready to start working with our data.

https://webassets.mongodb.com/_com_assets/blog/tblr/67.media.tumblr.com--e0ab80d70d90539c61985a87d471ef59--tumblr_ob3gcedWXl1sdaytmo1_1280.png

Let’s get started by building a very basic MongoDB Atlas cluster. For this exercise we just require an M10 class cluster, which is our most modestly priced option and well suited to low-end use. Navigate your browser to https://cloud.mongodb.com and sign up for an account.

https://webassets.mongodb.com/_com_assets/blog/tblr/66.media.tumblr.com--be4003190da49bab1b43a83d19b429ce--tumblr_ob3gcedWXl1sdaytmo4_1280.png

As you can see, we are building a new cluster that’s using MongoDB 3.2 along with the WiredTiger storage engine. These are our defaults for you to work with as you determine the needs for your application from a development standpoint. Let’s check our compatibility with Node.js and this version of MongoDB by navigating to our Node.js driver page.

https://webassets.mongodb.com/_com_assets/blog/tblr/65.media.tumblr.com--6d3dbe4d9b2cacd92aacee2a19979f40--tumblr_ob3gcedWXl1sdaytmo2_1280.png

Now let’s validate the local environment. I used npm to install the MongoDB driver package, so we’ll query npm to see what version of the driver is installed:

bash-3.2$ npm ls
/Users/jaygordon
└─┬ mongodb@2.2.2  
  ├── es6-promise@3.0.2
  ├─┬ mongodb-core@2.0.5
  │ ├── bson@0.5.2
  │ └─┬ require_optional@1.0.0
  │   ├── resolve-from@2.0.0
  │   └── semver@5.3.0
  └─┬ readable-stream@1.0.31
    ├── core-util-is@1.0.2
    ├── inherits@2.0.1
    ├── isarray@0.0.1
    └── string_decoder@0.10.31

Once our cluster is ready, let’s prep our script. We’ll need our connection string, and we’ll want to ensure that the environment we’re connecting from is whitelisted. I like using this trick to validate my IP address from the command line, but you can just as easily go to http://icanhazip.com and get the same information:

bash-3.2$ curl icanhazip.com
192.168.0.1 (example IP) 

https://webassets.mongodb.com/_com_assets/blog/tblr/66.media.tumblr.com--e2a9e3c2b18f2f56b26cda7f74620851--tumblr_ob3gcedWXl1sdaytmo3_1280.png

When you created your Atlas cluster, you selected a username and a password. Let’s get the connection string, enter those credentials into it, and then begin prepping our connection test app.

https://webassets.mongodb.com/_com_assets/blog/tblr/66.media.tumblr.com--fd0125a095b1330f6d1dcce716c5df53--tumblr_ob3gcedWXl1sdaytmo6_1280.png

So not only does Atlas handle your infrastructure, scaling and backups… but we even provide you with a guide on how to integrate your application for your specific programming language.

Here is a simple connection test, saved as app.js:

var MongoClient = require('mongodb').MongoClient
    , format = require('util').format;
MongoClient.connect('mongodb://', function (err, db) {
    if (err) {
        throw err;
    } else {
        console.log("successfully connected to the database");
    }
    db.close();
});

We can insert our connection string along with our username and password to look something like this:

var MongoClient = require('mongodb').MongoClient
    , format = require('util').format;
MongoClient.connect('mongodb://jay:YOURPASSWORD@nodejs-demo-shard-00-00-cbei2.mongodb.net:27017,nodejs-demo-shard-00-01-cbei2.mongodb.net:27017,nodejs-demo-shard-00-02-cbei2.mongodb.net:27017/admin?ssl=true&replicaSet=nodejs-demo-shard-0', function (err, db) {
    if (err) {
        throw err;
    } else {
        console.log("successfully connected to the database");
    }
    db.close();
});
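
To avoid hard-coding the password in the source file, the connection string can instead be read from an environment variable. Here is a small variation on the same test; the ATLAS_URI variable name is just an assumption, so export it in your shell (export ATLAS_URI="mongodb://jay:YOURPASSWORD@...") before running the script:

var MongoClient = require('mongodb').MongoClient
    , format = require('util').format;

// The full connection string (including credentials) comes from the
// ATLAS_URI environment variable rather than being hard-coded
MongoClient.connect(process.env.ATLAS_URI, function (err, db) {
    if (err) {
        throw err;
    } else {
        console.log("successfully connected to the database");
    }
    db.close();
});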

We’ve reached a point now where we can begin testing, and it’s pretty easy to do so:

https://webassets.mongodb.com/_com_assets/blog/tblr/67.media.tumblr.com--6cd39a42be211c781ea6fe8ae102b3f5--tumblr_ob3gcedWXl1sdaytmo5_1280.png

MongoDB Atlas gives you the tools to work quickly and effectively. Atlas makes it easy to build something GIANT without having to manage the systems hosting your data.

Migrating Data to MongoDB Atlas

Editor's Note: Migrating data to MongoDB Atlas is easier than ever with the Live Import tool. Simply select the destination cluster you'd like to import data to, and follow the instructions to seamlessly pull in an existing MongoDB deployment. Learn more here.

MongoDB Atlas was announced at this year's MongoDB World. It's great not just for new applications, but also for your existing MongoDB databases running on other platforms. This post will focus on how you migrate your data and applications over to MongoDB Atlas.

What is MongoDB Atlas?

MongoDB Atlas provides all of the features of MongoDB, without the operational heavy lifting required for any new application. MongoDB Atlas is available on demand through a pay-as-you-go model and billed on an hourly basis, letting you focus on what you do best.

It’s easy to get started – use a simple GUI to select the instance size, region, and features you need.

MongoDB Atlas provides:

  • Security features to protect access to your data
  • Built in replication for always-on availability, tolerating complete data center failure
  • Backups and point in time recovery to protect against data corruption
  • Fine-grained monitoring to let you know when to scale. Additional instances can be provisioned with the push of a button
  • Automated patching and one-click upgrades for new major versions of the database, enabling you to take advantage of the latest and greatest MongoDB features
  • A choice of cloud providers, regions, and billing options

But what if you already have application data held in your own on-prem or cloud-based MongoDB database – is it possible to safely migrate that data to MongoDB Atlas? What if your data is held in a 3rd party hosted MongoDB service such as Compose or mLab? Conversely, is it possible to build your application against MongoDB Atlas and then move the data to a MongoDB database running on another platform in the future?

The answer to all of those questions is "yes". In the future you should expect this to be a highly automated process but right now it involves some manual steps – the purpose of this blog post is to describe the process.

Moving Your Application Data to MongoDB Atlas

The procedure is very straightforward, but if you can't tolerate losing any of your updates then it does involve stopping application writes for a period. That means it's vital that you prepare in advance in order to minimize the impact.

Pre-Migration Checklist

  1. How long will writes need to be stopped? Perform a dry-run of the mongodump & mongorestore steps but without stopping application writes to answer this.
  2. When will the stopping of writes have the smallest impact?
  3. What can you change in the application to minimize the impact, e.g. provide a read-only version of the service when it isn't possible to write to the database?
  4. Will you warn users of planned maintenance ahead of time?
  5. Do you have sufficient storage space to store the dumped data on the machine where you plan to run mongodump?
  6. Once the data has been migrated to MongoDB Atlas, the application will need to switch its database connections to the new address; identify how this will be done (one approach is sketched after this list).
  7. List the IP Addresses of all the machines that will need to connect to MongoDB Atlas – this includes your application nodes as well as the machine where mongorestore will be run. These will need to be added to your MongoDB Atlas group's whitelist.
  8. Decide on what MongoDB Atlas instance size to use and, if necessary, how many shards will be needed.
  9. Decide on which region to use, e.g., co-locating the MongoDB Atlas instances with your cloud-based application servers.
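
For item 6, one common approach is to read the connection string from configuration rather than hard-coding it, so that the cutover to MongoDB Atlas is just a configuration change. A minimal Node.js sketch (the MONGODB_URI variable name is an assumption; use whatever configuration mechanism your application already has):

var MongoClient = require('mongodb').MongoClient;

// The application reads its connection string from configuration, so pointing
// it at MongoDB Atlas only requires changing MONGODB_URI and restarting
MongoClient.connect(process.env.MONGODB_URI, function(err, db) {
    if (err) throw err;
    // ... application code ...
    db.close();
});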

Execute the Migration

  1. Create the MongoDB Atlas cluster.
  2. Add the required IP Addresses to the whitelist in your group's security tab.
  3. Stop database writes to your existing database, either in your application logic or by locking writes on the original MongoDB deployment (note that db.fsyncLock() locks the whole mongod instance):
    laptop> mongo --host=ec2-52-208-185-213.eu-west-1.compute.amazonaws.com \
     --eval "db.fsyncLock()"
  4. Back up the data from the existing database (writes the data to a directory named dump):
    laptop> mongodump --host=ec2-52-208-185-213.eu-west-1.compute.amazonaws.com \
    --port=27017
  5. Write the data to MongoDB Atlas (using the connection information provided in the Web UI):
    mongorestore --ssl --host cluster0-shard-00-00-qfovx.mongodb.net \
     --port 27017 -u billy -p XXX dump
  6. Switch the application's database connections over to your MongoDB Atlas instance (see the verification check sketched below).
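
Before performing step 6 and switching the application over, it's worth confirming that the restore completed successfully, for example by comparing document counts between the original deployment and MongoDB Atlas from the "mongo" shell. A quick sketch (the database and collection names are placeholders for your own):

// Run against both the original deployment and the MongoDB Atlas cluster
// and check that the counts match for each collection
use yourdb
db.yourcollection.count()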

Want more help? We offer a MongoDB Atlas Migration service to help you properly configure MongoDB Atlas and develop a migration plan. This is especially helpful if you need to minimize downtime for your application, if you have a complex sharded deployment, or if you want to revise your deployment architecture as part of the migration. Contact us to learn more about the MongoDB Atlas Migration service.

Moving Your Application Data Out of MongoDB Atlas

To migrate data out, you can download a MongoDB Atlas backup and then copy the contents to the receiving MongoDB cluster; the documentation describes how to load the data into the receiving replica set. The backup can be either a periodic snapshot or a point-in-time view of the MongoDB Atlas database. If you can't tolerate lost writes, they must be stopped by the application (fsyncLock is not available in MongoDB Atlas).

Getting the Best Out of MongoDB Atlas

While MongoDB Atlas radically simplifies the operation of MongoDB there are still some decisions to take to ensure the best performance and reliability for your application. The MongoDB Atlas Best Practices white paper provides guidance on best practices for deploying, managing, and optimizing the performance of your database with MongoDB Atlas.

The guide outlines considerations for achieving performance at scale with MongoDB Atlas across a number of key dimensions, including instance size selection, application patterns, schema design and indexing, and disk I/O. While this guide is broad in scope, it is not exhaustive. Following the recommendations in the guide will provide a solid foundation for ensuring optimal application performance.

Download MongoDB Atlas white paper


Configuring KeystoneJS to Use MongoDB Atlas

KeystoneJS is an open source framework for building web applications and Content Management Systems. It's built on top of MongoDB, Express, and Node.js - key components of the ubiquitous MEAN stack.

This post explains why MongoDB Atlas is an ideal choice for KeystoneJS and then goes on to show how to configure KeystoneJS to use it.

Why are KeystoneJS and MongoDB Atlas a Good Match

The MEAN stack is extremely popular and well supported, and it’s the go-to platform when developing modern applications. For its part, MongoDB brings flexible schemas, rich queries, an idiomatic Node.js driver, and simple to use high availability and scaling.

MongoDB Atlas provides all of the features of MongoDB, without the operational heavy lifting required for any new application. MongoDB Atlas is available on demand through a pay-as-you-go model and billed on an hourly basis, letting you focus on what you do best.

It’s easy to get started – use a simple GUI to select the instance size, region, and features you need. MongoDB Atlas provides:

  • Security features to protect access to your data
  • Built in replication for always-on availability, tolerating complete data center failure
  • Backups and point in time recovery to protect against data corruption
  • Fine-grained monitoring to let you know when to scale. Additional instances can be provisioned with the push of a button
  • Automated patching and one-click upgrades for new major versions of the database, enabling you to take advantage of the latest and greatest MongoDB features
  • A choice of cloud providers, regions, and billing options

Like KeystoneJS, MongoDB Atlas is a natural fit for users looking to simplify their development and operations work, letting them focus on what makes their application unique rather than commodity (albeit essential) plumbing.

Installing KeystoneJS and Configuring it to Use MongoDB Atlas

Before starting with KeystoneJS, you should launch your MongoDB cluster using MongoDB Atlas and then (optionally) create a user with read and write privileges for just the database that will be used for this project, as shown in Figure 1. You must also add the IP address of your application server to the IP Whitelist in the MongoDB Atlas security tab.

Figure 1: Creating KeystoneJS user in MongoDB Atlas

If it isn't already installed on your system, download and install Node.js:

$ curl https://nodejs.org/dist/v4.4.7/node-v4.4.7-linux-x64.tar.xz -o node.tar.xz
$ tar xf node.tar.xz

You should then add the bin sub-folder to the PATH in your .bash_profile and then install KeystoneJS:

$ npm install -g generator-keystone
$ mkdir keystone-project
$ cd keystone-project
$ npm install -g yo
...
Welcome to KeystoneJS.

? What is the name of your project? ClusterDB
? Would you like to use Jade, Nunjucks, Twig or Handlebars for templates? [jade 
| nunjucks | twig | hbs] jade
? Which CSS pre-processor would you like? [less | sass | stylus] less
? Would you like to include a Blog? Yes
? Would you like to include an Image Gallery? Yes
? Would you like to include a Contact Form? Yes
? What would you like to call the User model? User
? Enter an email address for the first Admin user: a@clusterdb.com
? Enter a password for the first Admin user:
 Please use a temporary password as it will be saved in plain text and change it
 after the first login. admin
...
To start your new website, run "cd clusterdb" then "node keystone".

$ cd clusterdb

Before starting KeystoneJS you need to configure it with details on how to connect to your specific MongoDB Atlas cluster. This is done by updating the MONGO_URI value within the .env file:

MONGO_URI=mongodb://keystonejs_user:my_password@cluster0-shard-00-00-qfovx.mongodb.net:27017,cluster0-shard-00-01-qfovx.mongodb.net:27017,cluster0-shard-00-02-qfovx.mongodb.net:27017/clusterdb?ssl=true&authSource=admin

The URI contains these components:

  • keystonejs_user is the name of the user you created in the MongoDB Atlas UI
  • my_password is the password you chose when creating the user in MongoDB Atlas
  • cluster0-shard-00-00-qfovx.mongodb.net, cluster0-shard-00-01-qfovx.mongodb.net, & cluster0-shard-00-02-qfovx.mongodb.net are the hostnames of the instances in your MongoDB Atlas replica set (click on the "CONNECT" button in the MongoDB Atlas UI if you don't have these)
  • 27017 is the standard MongoDB port number
  • clusterdb is the name of the database (schema) that KeystoneJS will use (note that this must match the project name used when installing KeystoneJS as well as the database you granted the user access to)
  • To enforce security, MongoDB Atlas mandates that the ssl option is used
  • admin is the database that's being used to store the credentials for keystonejs_user

Clients connect to KeystoneJS through port 3000 and so you must open that port in your firewall.

You can then start KeystoneJS:

$ node keystone

Testing the Configuration

Browse to the application at http://address-of-app-server:3000 as shown in Figure 2.

Figure 2: KeystoneJS Running on MongoDB Atlas

Sign in using the credentials shown and then confirm that you can upload some images to a gallery and create a new page as shown in Figure 3.

Figure 3: Create a Page in KeystoneJS with Data Stored in MongoDB Atlas

After saving the page, confirm that you can browse to the newly created post (Figure 4).

Figure 4: View KeystoneJS Post with Data Read from MongoDB Atlas

Optionally, to confirm that MongoDB Atlas really is being used by KeystoneJS, you can connect using the MongoDB shell:

$ sudo yum install mongodb-org-shell
$ mongo mongodb://cluster0-shard-00-00-qfovx.mongodb.net:27017,cluster0-shard-00-01-qfovx.mongodb.net:27017,cluster0-shard-00-02-qfovx.mongodb.net:27017/admin?replicaSet=Cluster0-shard-0 --ssl --username billy --password XXXXXX

Cluster0-shard-0:PRIMARY> show dbs
admin      0.000GB
clusterdb  0.000GB
local      0.000GB
Cluster0-shard-0:PRIMARY> use clusterdb
switched to db clusterdb
Cluster0-shard-0:PRIMARY> show collections
Cannot use 'commands' readMode, degrading to 'legacy' mode
app_updates
galleries
postcategories
posts
users

Cluster0-shard-0:PRIMARY> db.users.findOne()
{
    "_id" : ObjectId("5790829fa585c9cf10080d40"),
    "isAdmin" : true,
    "password" : "$2a$10$RmXv35cYu7V8fY.ZV/hJy.fFo7zjj.EwBsaTErMdVtG8MAhybJJUi",
    "email" : "a@clusterdb.com",
    "name" : {
        "last" : "User",
        "first" : "Admin"
    },
    "__v" : 0
}

To visually navigate through the schema and data created by KeystoneJS, download and install MongoDB Compass. The same credentials can be used to connect Compass to your MongoDB database – Figure 5.

Figure 5: Connect MongoDB Compass to MongoDB Atlas Database

Navigate through the structure of the data in the clusterdb database (Figure 6) and view the JSON documents (Figure 7).

Figure 6: Explore KeystoneJS Schema Using MongoDB Compass

Figure 7: View Documents Stored by KeystoneJS Using MongoDB Atlas

Next Steps

While MongoDB Atlas radically simplifies the operation of MongoDB there are still some decisions to take to ensure the best performance and reliability for your application. The MongoDB Atlas Best Practices white paper provides guidance on best practices for deploying, managing, and optimizing the performance of your database with MongoDB Atlas.

The guide outlines considerations for achieving performance at scale with MongoDB Atlas across a number of key dimensions, including instance size selection, application patterns, schema design and indexing, and disk I/O. While this guide is broad in scope, it is not exhaustive. Following the recommendations in the guide will provide a solid foundation for ensuring optimal application performance.

Download Atlas Best Practice Guide

Atlas on Day One, Importing Data

MongoDB

Cloud

Update: We recently released a live migration tool for MongoDB Atlas called mongomirror. Learn more about mongomirror in our documentation.

MongoDB Atlas means that, as the end user, you no longer need to concern yourself with the day-to-day system administration of your MongoDB cluster. Like many databases, Atlas exists to ensure your data is always available with little overhead to your organization.

On day one you may be wondering how to import your existing data and take a test drive of Atlas. There are numerous ways to copy your data over from one MongoDB service to another; today we’ll focus on a simple export and import using mongodump and mongorestore.

mongodump

The mongodump binary is a utility for creating a binary export of the contents of a database. mongodump can export data from either mongod or mongos instances.

Exporting your data with mongodump can be done with a single command that writes the data out on the system where you run the command.

In today’s case, let’s assume we are working with a standalone mongod that we’ve been testing on our local laptop for a while.

MongoDB shell version: 3.2.7
connecting to: test
> show databases
local  0.000GB
test   0.070GB
> show collections
testData

We’re going to export the test database that contains our testData collection. My local computer has enough disk space to handle this export, but when working with large datasets you may want to check the available disk space first.

bash-3.2$ df -h
Filesystem      Size   Used  Avail Capacity  iused    ifree %iused  Mounted on
/dev/disk1     465Gi   96Gi  369Gi    21% 25216209 96623405   21%   /

Indeed we have the space, so let’s go ahead and export this database:

bash-3.2$ mongodump -d test
2016-06-13T10:43:52.147-0400    writing test.testData to
2016-06-13T10:43:55.147-0400    [##########..............]  test.testData  1326267/2900790  (45.7%)
2016-06-13T10:43:58.147-0400    [######################..]  test.testData  2666589/2900790  (91.9%)
2016-06-13T10:43:58.670-0400    [########################]  test.testData  2900790/2900790  (100.0%)
2016-06-13T10:43:58.670-0400    done dumping test.testData (2900790 documents)

We are now left with two files which contain both the binary document data in BSON format along with a json file containing metadata about your collection:

bash-3.2$ cd dump/test/
bash-3.2$ ls -al
total 186976
drwxr-xr-x  4 jaygordon  staff       136 Jun 13 10:43 .
drwxr-xr-x  3 jaygordon  staff       102 Jun 13 10:43 ..
-rw-r--r--  1 jaygordon  staff  95726070 Jun 13 10:43 testData.bson
-rw-r--r--  1 jaygordon  staff        85 Jun 13 10:43 testData.metadata.json
bash-3.2$ cat testData.metadata.json
{"options":{},"indexes":[{"v":1,"key":{"_id":1},"name":"_id_","ns":"test.testData"}]}

Important note: Since MongoDB Atlas will be managing your users for you from here on in, make sure to remove any files called system.users.bson and system.users.metadata.json to prevent any issues with your import.

mongorestore

The mongorestore program writes data from a binary database dump created by mongodump to a MongoDB instance.

With mongorestore we should only need to create our Atlas cluster and then ensure we are whitelisted to connect. Here’s a quick one-line command to confirm what IP you are currently using (including DHCP/NAT networks) according to the rest of the world. (The IP below is just an example.)

bash-3.2$ curl icanhazip.com
1.2.3.4

Now that we know our IP, we can add it to the collection of IPs we use for our cluster: go to the Security tab and click “ADD IP ADDRESS:”

https://webassets.mongodb.com/_com_assets/blog/tblr/67.media.tumblr.com--0d2f0d76c05d8f89db2bbc3c0a9d8e9a--tumblr_o9wa3o86Ok1sdaytmo1_1280.png

Now let’s validate that we can connect to our Atlas cluster from the laptop containing our export. Go to your Atlas cluster deployment page and find your connection string:

https://webassets.mongodb.com/_com_assets/blog/tblr/67.media.tumblr.com--57f1878e5d68172aae3dd7312b254412--tumblr_o9wa3o86Ok1sdaytmo2_1280.png

Click on Connect:

https://webassets.mongodb.com/_com_assets/blog/tblr/67.media.tumblr.com--0569ebb31fb06ca3e942da70e223bba2--tumblr_o9wa3o86Ok1sdaytmo4_1280.png

We have our info, let’s see if it works:

bash-3.2$ mongo mongodb://cluster0-shard-00-00-cbei2.mongodb.net:27017,cluster0-shard-00-01-cbei2.mongodb.net:27017,cluster0-shard-00-02-cbei2.mongodb.net:27017/admin?replicaSet=Cluster0-shard-0 --ssl --username jay --password
MongoDB shell version: 3.2.7
Enter password:
connecting to: mongodb://cluster0-shard-00-00-cbei2.mongodb.net:27017,cluster0-shard-00-01-cbei2.mongodb.net:27017,cluster0-shard-00-02-cbei2.mongodb.net:27017/admin?replicaSet=Cluster0-shard-0
2016-06-13T11:34:53.235-0400 I NETWORK  [thread1] Starting new replica set monitor for Cluster0-shard-0/cluster0-shard-00-00-cbei2.mongodb.net:27017,cluster0-shard-00-01-cbei2.mongodb.net:27017,cluster0-shard-00-02-cbei2.mongodb.net:27017
2016-06-13T11:34:53.235-0400 I NETWORK  [ReplicaSetMonitorWatcher] starting
Cluster0-shard-0:PRIMARY>

Great, we are ready to import into Atlas!

Let’s make sure we have a user ready for the admin database:

https://webassets.mongodb.com/_com_assets/blog/tblr/67.media.tumblr.com--c22d63580ace89123092e3f7403ebe72--tumblr_o9wa3o86Ok1sdaytmo3_1280.png

We’ll modify our connection string so our restore command should look something like this (note, the --host option has a different format than before):

bash-3.2$ mongorestore --ssl --host Cluster0-shard-0/cluster0-shard-00-00-cbei2.mongodb.net:27017,cluster0-shard-00-01-cbei2.mongodb.net:27017,cluster0-shard-00-02-cbei2.mongodb.net:27017 --authenticationDatabase admin \
 --dir=dump/test -u jay --password $PASSWORD
2016-06-13T11:46:00.071-0400    building a list of collections to restore from dump/test dir
2016-06-13T11:46:00.081-0400    reading metadata for test.testData from dump/test/testData.metadata.json
2016-06-13T11:46:00.099-0400    restoring test.testData from dump/test/testData.bson

The restore will continue until it gets to 100% and will notify you when it’s done:

2016-06-13T11:48:36.073-0400    [#######################.]  test.testData  88.3 MB/91.3 MB  (96.8%)
2016-06-13T11:48:39.075-0400    [#######################.]  test.testData  90.3 MB/91.3 MB  (98.9%)
2016-06-13T11:48:40.701-0400    [########################]  test.testData  91.3 MB/91.3 MB  (100.0%)
2016-06-13T11:48:40.701-0400    restoring indexes for collection test.testData from metadata
2016-06-13T11:48:40.710-0400    finished restoring test.testData (2900790 documents)
2016-06-13T11:48:40.710-0400    done

Great, let’s log into Atlas and verify our data made it into our cluster:

Cluster0-shard-0:PRIMARY> use test
switched to db test
Cluster0-shard-0:PRIMARY> show databases
admin  0.000GB
local  0.098GB
test   0.070GB
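
As a final check, the document count in the restored collection should match the 2,900,790 documents that mongodump and mongorestore reported:

Cluster0-shard-0:PRIMARY> db.testData.count()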

Now you’re ready to start using your application along with MongoDB Atlas! Start building something GIANT today!

Jay Gordon is a Technical Account Manager with MongoDB and is available via our chat to discuss MongoDB Cloud Products at https://cloud.mongodb.com.