GIANT Stories at MongoDB

Mitigating the "fat-finger delete" with Queryable Backups

When a user accidentally deletes data, sometimes the only way to retrieve that data is through a full database restore. In MongoDB Ops Manager and MongoDB Atlas, it is easy to make a client connection to a backup and perform read-only queries. No need to restore the backup either!

How to Integrate Azure Functions with MongoDB

MongoDB Atlas, the database as a service provided by the creators of MongoDB, is available on all the major cloud providers including Microsoft Azure. In this tutorial, I’ll cover the integration of Microsoft Azure Functions with MongoDB Atlas from a developer standpoint.

Every database on the planet is a time-series database

If you can store a timestamp and a value, you have a time-series data store. Every database on the planet can support a time-series workload, so what sets MongoDB apart? A flexible document model, transactional consistency, horizontal scaling, native datetime support, real-time analytics and visualizations, and millions of users worldwide, to name a few.

Working with MongoDB Transactions with C# and the .NET Framework

MongoDB .NET Driver versions 2.7 and above support MongoDB Transactions. While MongoDB has always had transactional consistency at the document level, broader cross-collection transactional support is now available with MongoDB version 4.0. In this blog post we will take a first look at using MongoDB Transactions within your C# code.

Implementing an end-to-end IoT solution in MongoDB: From sensor to cloud

Robert Walters

IoT, Technical

Many companies across the world have chosen MongoDB as the data platform for their IoT workloads. MongoDB makes it easy to store a variety of heterogeneous sensor data in a natural, intuitive way and blend it with enterprise data, allowing you to integrate IoT apps across your organization. The referenced article walks you through setting up your own temperature sampling solution.

Time Series Data and MongoDB: Part 3 – Querying, Analyzing, and Presenting Time-Series Data

Robert Walters

Technical

This blog series seeks to provide best practices as you build out your time series application on MongoDB. In this post, part three, we will cover how to query, analyze and present time-series data stored in MongoDB.

Time Series Data and MongoDB: Part 2 – Schema Design Best Practices

Robert Walters

Technical

In the previous blog post, “Time Series Data and MongoDB: Part 1 – An Introduction,” we introduced the concept of time-series data followed by some discovery questions you can use to help gather requirements for your time-series application. Answers to these questions help guide the schema and MongoDB database configuration needed to support a high-volume production application deployment. In this blog post we will focus on how two different schema designs can impact memory and disk utilization under read, write, update, and delete operations.

At the end of the analysis you may find that the best schema design for your application is a combination of the designs discussed here. By following the recommendations we lay out below, you will have a good starting point to develop the optimal schema design for your app and to appropriately size your environment.

Designing a time-series schema

Let’s start by saying that there is no one canonical schema design that fits all application scenarios. There will always be trade-offs to consider regardless of the schema you develop. Ideally you want the best balance of memory and disk utilization that yields the best read and write performance for your application requirements, and that enables you to support both data ingest and analysis of time-series data streams.

In this blog post we will look at various schema design configurations: first, storing one document per data sample, and then bucketing the data using one document per time range and one document per fixed number of samples. Storing more than one data sample per document is known as bucketing. Bucketing is implemented at the application level and requires nothing to be configured specifically in MongoDB. With MongoDB’s flexible data model you can optimally bucket your data to yield the best performance and granularity for your application’s requirements.

This flexibility also allows your data model to adapt to new requirements over time – such as capturing data from new hardware sensors that were not part of the original application design. These new sensors provide different metadata and properties than the sensors you used in the original design. With all this flexibility you may think that MongoDB databases are the wild west, where anything goes and you can quickly end up with a database full of disorganized data. MongoDB provides as much control as you need via schema validation, which lets you enforce things like the presence of mandatory fields and ranges of acceptable values, to name a few.
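As a rough illustration, here is a minimal PyMongo sketch of schema validation. The connection string, database name, collection name, and field rules are all illustrative assumptions and not part of any sample application in this series:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client.iotdb  # hypothetical database name

# Require every reading to carry a device id, a numeric value, and a timestamp.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["deviceid", "val", "time"],
        "properties": {
            "deviceid": {"bsonType": "int"},
            "val": {"bsonType": ["double", "int"], "minimum": 0},
            "time": {"bsonType": "date"},
        },
    }
}

db.create_collection("sensor_readings", validator=validator)

Documents that violate the rules are rejected on insert, while the document model otherwise stays as flexible as you need it to be.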

To help illustrate how schema design and bucketing affects performance, consider the scenario where we want to store and analyze historical stock price data. Our sample stock price generator application creates sample data every second for a given number of stocks that it tracks. One second is the smallest time interval of data collected for each stock ticker in this example. If you would like to generate sample data in your own environment, the StockGen tool is available on GitHub. It is important to note that although the sample data in this document uses stock ticks as an example, you can apply these same design concepts to any time-series scenario like temperature and humidity readings from IoT sensors.

The StockGen tool stores the same generated data in two different collections, StockDocPerSecond and StockDocPerMinute, which use the following schemas:

Scenario 1: One document per data point

{
  "_id" : ObjectId("5b4690e047f49a04be523cbd"),
  "p" : 56.56,
  "symbol" : "MDB",
  "d" : ISODate("2018-06-30T00:00:01Z")
},
{
  "_id" : ObjectId("5b4690e047f49a04be523cbe"),
  "p" : 56.58,
  "symbol" : "MDB",
  "d" : ISODate("2018-06-30T00:00:02Z")
},
...
Figure 1: Sample documents representing one document per second granularity

Scenario 2: Time-based bucketing of one document per minute

{
  "_id" : ObjectId("5b5279d1e303d394db6ea0f8"),
  "p" : {
    "0" : 56.56,
    "1" : 56.56,
    "2" : 56.58,
    ...
    "59" : 57.02
  },
  "symbol" : "MDB",
  "d" : ISODate("2018-06-30T00:00:00Z")
},
{
  "_id" : ObjectId("5b5279d1e303d394db6ea134"),
  "p" : {
    "0" : 69.47,
    "1" : 69.47,
    "2" : 68.46,
    ...
    "59" : 69.45
  },
  "symbol" : "TSLA",
  "d" : ISODate("2018-06-30T00:01:00Z")
},
...
Figure 2: Sample documents representing one minute granularity

Note that the field “p” contains a subdocument with the values for each second of the minute.

Schema design comparisons

Let’s compare and contrast the database metrics of storage size and memory impact based on four weeks of data generated by the StockGen tool. Measuring these metrics is useful when assessing database performance.

Effects on Data Storage

In our application the smallest level of time granularity is a second. Storing one document per second as described in Scenario 1 is the most comfortable model concept for those coming from a relational database background. That is because we are using one document per data point, which is similar to a row per data point in a tabular schema. This design will produce the largest number of documents and collection size per unit of time as seen in Figures 3 and 4.

Document count per day comparing per second vs per minute schema design
Figure 3: Document count over time, comparing per second vs per minute schema design

Comparison between data size and storage size for each scenario
Figure 4: Comparison between data size and storage size for each scenario

Figure 4 shows two sizes per collection. The first value in the series is the size of the collection as stored on disk, while the second value is the size of the data in the database. These numbers differ because MongoDB’s WiredTiger storage engine supports compression of data at rest. Logically the PerSecond collection is 605 MB, but on disk it consumes around 190 MB of storage space.

Effects on memory utilization

A large number of documents will not only increase data storage consumption but increase index size as well. An index was created on each collection and covered the symbol and date fields. Unlike some key-value databases that position themselves as time-series databases, MongoDB provides secondary indexes giving you flexible access to your data and allowing you to optimize query performance of your application.
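For reference, that index can be created with a single call per collection. The PyMongo sketch below assumes a local connection string and database name, and uses the collection and field names shown earlier; the exact key order used for the benchmark is not spelled out in the post:

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client.stockdb  # assumed database name

# Compound index covering the symbol and date fields on both collections.
db.StockDocPerSecond.create_index([("symbol", ASCENDING), ("d", ASCENDING)])
db.StockDocPerMinute.create_index([("symbol", ASCENDING), ("d", ASCENDING)])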

Index Size (MB) comparison between PerSecond and PerMinute
Figure 5: Index Size (MB) comparison between PerSecond and PerMinute

The size of the index defined in each of the two collections is shown in Figure 5. Optimal MongoDB performance occurs when indexes and the most recently used documents fit into the memory allocated by the WiredTiger cache (we call this the “working set”). In our example we generated data for just 5 stocks over the course of 4 weeks. Given this small test case, our data already generated an index that is 103 MB in size for the PerSecond scenario. Keep in mind that there are some optimizations, such as index prefix compression, that help reduce the memory footprint of an index. However, even with these kinds of optimizations, proper schema design is important to prevent runaway index sizes. Given the trajectory of growth, any change to the application requirements, like tracking more than just 5 stocks or more than 4 weeks of prices in our sample scenario, will put much more pressure on memory and eventually require indexes to page out to disk. When this happens your performance will be degraded. To mitigate this situation, consider scaling horizontally.

Scale horizontally

As your data grows in size you may end up scaling horizontally when you reach the physical limits of the server hosting the primary mongod in your MongoDB replica set.

By horizontally scaling via MongoDB sharding, performance can be improved since the indexes and data are spread over multiple MongoDB nodes. Queries are no longer directed at a specific primary node. Rather, they are processed by an intermediate service called a query router (mongos), which sends the query to the specific nodes that contain the data satisfying it. Note that this is completely transparent to the application – MongoDB handles all of the routing for you.
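As a rough sketch of what that setup looks like from a driver, the PyMongo commands below enable sharding on a database and shard the per-minute collection. The mongos hostname, database name, and shard key are assumptions, not recommendations from the original benchmark:

from pymongo import MongoClient

# Connect to a mongos query router; the hostname is a placeholder.
client = MongoClient("mongodb://mongos.example.net:27017")

# Enable sharding on the database, then shard the collection on an
# illustrative compound shard key of symbol plus date.
client.admin.command("enableSharding", "stockdb")
client.admin.command(
    "shardCollection",
    "stockdb.StockDocPerMinute",
    key={"symbol": 1, "d": 1},
)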

Scenario 3: Size-based bucketing

The key takeaway when comparing the previous scenarios is that bucketing data has significant advantages. Time-based bucketing as described in Scenario 2 buckets an entire minute's worth of data into a single document. In applications such as IoT, however, sensor data may be generated at irregular intervals and some sensors may provide more data than others. In these scenarios, time-based bucketing may not be the optimal approach to schema design. An alternative strategy is size-based bucketing. With size-based bucketing we design our schema around one document per a fixed number of emitted sensor events, or per day, whichever comes first.

To see size-based bucketing in action, consider the scenario where you are storing sensor data and limiting the bucket size to 200 events per document, or a single day (whichever comes first). Note: The 200 limit is an arbitrary number and can be changed as needed, without application changes or schema migrations.

{
  _id: ObjectId(),
  deviceid: 1234,
  sensorid: 3,
  nsamples: 5,
  day: ISODate("2018-08-29"),
  first: 1535530412,
  last: 1535530432,
  samples: [
    { val: 50, time: 1535530412 },
    { val: 55, time: 1535530415 },
    { val: 56, time: 1535530420 },
    { val: 55, time: 1535530430 },
    { val: 56, time: 1535530432 }
  ]
}
Figure 6: Size-based bucketing for sparse data

An example size-based bucket is shown in figure 6. In this design, trying to limit inserts per document to an arbitrary number or a specific time period may seem difficult; however, it is easy to do using an upsert, as shown in the following code example:

sample = {val: 59, time: 1535530450}
day = ISODate("2018-08-29")
db.iot.updateOne({deviceid: 1234, sensorid: 3, nsamples: {$lt: 200}, day: day},
                 {$push: {samples: sample},
                  $min: {first: sample.time},
                  $max: {last: sample.time},
                  $inc: {nsamples: 1}},
                 {upsert: true})
Figure 7: Sample code to add to the size-based bucket

As new sensor data comes in it is simply appended to the document until the number of samples hits 200; then a new document is created because of our upsert: true option.

The optimal index in this scenario would be on {deviceid:1, sensorid:1, day:1, nsamples:1}. When we are updating data, the day is an exact match, which is very efficient. When querying, we can specify a date or a date range on a single field, which is also efficient, as is filtering by first and last using UNIX timestamps. Note that we are storing times as integer values: these are UNIX timestamps, which take only 32 bits of storage versus 64 bits for an ISODate. While the query performance difference over ISODate is not significant, storing UNIX timestamps can matter if you plan on ingesting terabytes of data and do not need granularity finer than a second.
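Expressed in PyMongo (the connection string and database name are assumptions), creating that index could look like this:

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client.iotdb  # hypothetical database name

# Index matching the upsert filter: device, sensor, day, and sample count.
db.iot.create_index([
    ("deviceid", ASCENDING),
    ("sensorid", ASCENDING),
    ("day", ASCENDING),
    ("nsamples", ASCENDING),
])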

Bucketing data in a fixed size will yield very similar database storage and index improvements as seen when bucketing per time in scenario 2. It is one of the most efficient ways to store sparse IoT data in MongoDB.

What to do with old data

Should we store all data in perpetuity? Is data older than a certain time useful to your organization? How accessible should older data be? Can it be simply restored from a backup when you need it, or does it need to be online and accessible to users in real time as an active archive for historical analysis? As we covered in part 1 of this blog series, these are some of the questions that should be asked prior to going live.

There are multiple approaches to handling old data and depending on your specific requirements some may be more applicable than others. Choose the one that best fits your requirements.

Pre-aggregation

Does your application really need a single data point for every event generated years ago? In most cases the resource cost of keeping this granularity of data around outweighs the benefit of being able to query down to this level at any time. In most cases data can be pre-aggregated and stored for fast querying. In our stock example, we may want to only store the closing price for each day as a value. In most architectures, pre-aggregated values are stored in a separate collection since typically queries for historical data are different than real-time queries. Usually with historical data, queries are looking for trends over time versus individual real-time events. By storing this data in different collections you can increase performance by creating more efficient indexes as opposed to creating more indexes on top of real-time data.
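One way to build such a pre-aggregated collection is with an aggregation pipeline. The PyMongo sketch below keeps only the last price per symbol per day; the connection string, database name, and output collection name (StockDailyClose) are assumptions for illustration:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client.stockdb  # assumed database name

# Sort the raw ticks, take the last price per symbol per day, and write the
# result into a separate pre-aggregated collection (name is hypothetical).
pipeline = [
    {"$sort": {"symbol": 1, "d": 1}},
    {"$group": {
        "_id": {
            "symbol": "$symbol",
            "day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$d"}},
        },
        "close": {"$last": "$p"},
    }},
    {"$out": "StockDailyClose"},
]
db.StockDocPerSecond.aggregate(pipeline, allowDiskUse=True)

The pre-aggregated collection can then carry its own, smaller indexes tuned for historical trend queries.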

Offline archival strategies

When data is archived, what is the SLA associated with retrieval of the data? Is restoring a backup of the data acceptable, or does the data need to be online and ready to be queried at any given time? Answers to these questions will help drive your archive design. If you do not need real-time access to archival data you may want to consider backing up the data and removing it from the live database. Production databases can be backed up using MongoDB Ops Manager or, if you are using the MongoDB Atlas service, with its fully managed backup solution.

Removing documents using the remove statement

Once data is copied to an archival repository via a database backup or an ETL process, data can be removed from a MongoDB collection via the remove statement as follows:

db.StockDocPerSecond.remove( { "d": { $lt: ISODate("2018-03-01") } } )

In this example all documents that have a date before March 1st, 2018 defined on the “d” field will be removed from the StockDocPerSecond collection.

You may need to set up an automation script to run every so often to clean out these records. Alternatively, you can avoid creating automation scripts in this scenario by defining a time to live (TTL) index.

Removing documents using a TTL Index

A TTL index is similar to a regular index except you define a time interval to automatically remove documents from a collection. In the case of our example, we could create a TTL index that automatically deletes data that is older than 1 week.

db.StockDocPerSecond.createIndex( { "d": 1 }, { expireAfterSeconds: 604800 } )

Although TTL indexes are convenient, keep in mind that the check happens every minute or so and the interval cannot be configured. If you need more control so that deletions won’t happen during specific times of the day you may want to schedule a batch job that performs the deletion in lieu of using a TTL index.

Removing documents by dropping the collection

It is important to note that using the remove command or TTL indexes will cause high disk I/O. On a database that may already be under high load, this may not be desirable. The most efficient and fastest way to remove records from the live database is to drop the collection. If you can design your application such that each collection represents a block of time, then when you need to archive or remove data all you need to do is drop the collection. This may require some smarts within your application code to know which collections should be queried, but the benefit may outweigh the extra complexity. When you issue a remove, MongoDB also has to remove data from all affected indexes, and this could take a while depending on the size of the data and indexes.
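A hedged sketch of what that can look like, assuming a hypothetical naming convention of one collection per month (this convention is not part of the StockGen sample):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client.stockdb  # assumed database name

# With one collection per month, archiving a month of data is a single,
# low-I/O drop instead of a large, index-heavy remove.
month_to_archive = "StockDocPerSecond_201803"  # hypothetical collection name
db[month_to_archive].drop()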

Online archival strategies

If archival data still needs to be accessed in real time, consider how frequently these queries occur and if storing only pre-aggregated results can be sufficient.

Sharding archival data

One strategy for archiving data while keeping it accessible in real time is to use zoned sharding to partition the data. Sharding not only helps with horizontally scaling the data out across multiple nodes; you can also tag shard ranges so partitions of data are pinned to specific shards. A cost-saving measure could be to have the archival data live on shards running lower-cost disks, periodically adjusting the time ranges defined in the zones themselves. These ranges would cause the balancer to automatically move the data between these storage layers, providing you with tiered, multi-temperature storage. Review our tutorial for creating tiered storage patterns with zoned sharding for more information.
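To make this concrete, the PyMongo sketch below runs the zone commands against a mongos. The shard names, zone names, database, collection, and date range are all assumptions, and it presumes the collection is sharded on the date field so that these ranges apply:

from datetime import datetime
from pymongo import MongoClient

# Connect to a mongos query router; the hostname is a placeholder.
client = MongoClient("mongodb://mongos.example.net:27017")

# Tag one shard for recent data on fast disks and one for archival data.
client.admin.command("addShardToZone", "shard0000", zone="recent")
client.admin.command("addShardToZone", "shard0001", zone="archive")

# Pin an older date range of the per-minute collection to the archive zone;
# adjusting these ranges over time lets the balancer migrate aging data.
client.admin.command(
    "updateZoneKeyRange",
    "stockdb.StockDocPerMinute",
    min={"d": datetime(2018, 1, 1)},
    max={"d": datetime(2018, 6, 1)},
    zone="archive",
)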

Accessing archived data via queryable backups

If your archive data is not accessed that frequently and query performance does not need to meet any strict latency SLAs, consider backing the data up and using the Queryable Backups feature of MongoDB Atlas or MongoDB Ops Manager. Queryable Backups allow you to connect to your backup and issue read-only queries against it, without having to first restore the backup.

Querying data from the data lake

MongoDB is an inexpensive solution not only for long-term archival but for your data lake as well. Companies that have made investments in technologies like Apache Spark can leverage the MongoDB Spark Connector. This connector materializes MongoDB data as DataFrames and Datasets for use with Spark's machine learning, graph, streaming, and SQL APIs.
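A minimal PySpark sketch of that integration is shown below. It assumes the mongo-spark-connector package has been supplied to Spark (for example via spark-submit --packages) and uses a placeholder URI plus the hypothetical StockDailyClose collection from the pre-aggregation example above:

from pyspark.sql import SparkSession

# Start a session pointed at a MongoDB collection; the URI is a placeholder
# and the mongo-spark-connector package must be on the Spark classpath.
spark = (
    SparkSession.builder
    .appName("mongodb-data-lake")
    .config("spark.mongodb.input.uri",
            "mongodb://localhost:27017/stockdb.StockDailyClose")
    .getOrCreate()
)

# Materialize the collection as a DataFrame and inspect it.
df = spark.read.format("mongo").load()
df.printSchema()
df.show(5)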

Key Takeaways

Once an application is live in production and is multiple terabytes in size, any major change can be very expensive from a resource standpoint. Consider the scenario where you have 6 TB of IoT sensor data and are accumulating new data at a rate of 50,000 inserts per second. Performance of reads is starting to become an issue and you realize that you have not properly scaled out the database. Unless you are willing to take application downtime, a change of schema in this configuration – e.g., moving from raw data storage to bucketed storage – may require building out shims, temporary staging areas and all sorts of transient solutions to move the application to the new schema. The moral of the story is to plan for growth and properly design the best time-series schema that fits your application’s SLAs and requirements.

This article analyzed two different schema designs for storing time-series data from stock prices. Is the schema that won in the end for this stock price database the one that will be the best in your scenario? Maybe. Due to the nature of time-series data and the typical rapid ingestion of data, the answer may in fact be to leverage a combination of collections that target read-heavy and write-heavy use cases. The good news is that with MongoDB’s flexible schema, it is easy to make changes. In fact, you can run two different versions of the app writing two different schemas to the same collection. However, don’t wait until your query performance starts suffering to figure out an optimal design, as migrating TBs of existing documents into a new schema can take time and resources, and delay future releases of your application. You should undertake real-world testing before committing to a final design. Quoting a famous proverb, “Measure twice and cut once.”

In the next blog post, “Querying, Analyzing, and Presenting Time-Series Data with MongoDB,” we will look at how to effectively get value from the time-series data stored in MongoDB.

Key Tips:

  • The MMAPV1 storage engine is deprecated, so use the default WiredTiger storage engine. Note that if you read older schema design best practices from a few years ago, they were often built on the older MMAPV1 technology.
  • Understand what the data access requirements are from your time-series application.
  • Schema design impacts resources. “Measure twice and cut once” with respect to schema design and indexes.
  • Test schema patterns with real data and a real application if possible.
  • Bucketing data reduces index size and thus massively reduces hardware requirements.
  • Time-series applications traditionally capture very large amounts of data, so only create indexes where they will be useful to the app’s query patterns.
  • Consider more than one collection: one focused on write heavy inserts and recent data queries and another collection with bucketed data focused on historical queries on pre-aggregated data.
  • When the size of your indexes exceeds the amount of memory on the server hosting MongoDB, consider horizontally scaling out to spread the index and load over multiple servers.
  • Determine at what point data expires, and what action to take, such as archival or deletion.

Time Series Data and MongoDB: Part 1 – An Introduction

Robert Walters

Technical

Time-series data is increasingly at the heart of modern applications - think IoT, stock trading, clickstreams, social media, and more. With the move from batch to real-time systems, the efficient capture and analysis of time-series data can enable organizations to better detect and respond to events ahead of their competitors, or to improve operational efficiency to reduce cost and risk. Working with time-series data is often different from working with regular application data, and there are best practices you should observe. This blog series seeks to provide these best practices as you build out your time-series application on MongoDB:

  1. Introduce the concept of time-series data and describe some of the challenges associated with this type of data
  2. Show how to query, analyze, and present time-series data
  3. Provide discovery questions that will help you gather the technical requirements needed for successfully delivering a time-series application

What is time-series data?

While not all data is time-series in nature, a growing percentage of it can be classified as time-series – fueled by technologies that allow us to exploit streams of data in real time rather than in batches. In every industry and in every company there exists the need to query, analyze and report on time-series data. Consider a stock day trader constantly looking at feeds of stock prices over time and running algorithms to analyze trends and identify opportunities. They are looking at data over a time interval, e.g. hourly or daily ranges. A connected car company might obtain telemetry such as engine performance and energy consumption to improve component design, and monitor wear rates so they can schedule vehicle servicing before problems occur. They are also looking at data over a time interval.

Why is time-series data challenging?

Time-series data can include data that is captured at constant time intervals – like a device measurement per second – or at irregular time intervals like those generated from alerts and auditing event use cases. Time-series data is also often tagged with attributes like the device type and location of the event, and each device may provide a variable amount of additional metadata. The data model flexibility needed to meet such diverse and rapidly changing ingestion and storage requirements makes it difficult for traditional relational (tabular) database systems with a rigid schema to effectively handle time-series data. There is also the issue of scalability. With a high frequency of readings generated by multiple sensors or events, time-series applications can generate vast streams of data that need to be ingested and analyzed. So platforms that allow data to be scaled out and distributed across many nodes are much better suited to this type of use case than scale-up, monolithic tabular databases.

Time-series data can come from different sources, with each generating different attributes that need to be stored and analyzed. Each stage of the data lifecycle places different demands on a database – from ingestion through to consumption and archival.

  • During data ingestion, the database is primarily performing write intensive operations, comprising mainly inserts with occasional updates. Consumers of data may want to be alerted in real time when an anomaly is detected in the data stream during ingestion, such as a value exceeding a certain threshold.
  • As more data is ingested, consumers may want to query it for specific insights and to uncover trends. At this stage of the data lifecycle, the workload is read-heavy rather than write-heavy, but the database will still need to maintain high write rates as data is concurrently ingested and queried.
  • Consumers may want to query historical data and perform predictive analytics leveraging machine learning algorithms to anticipate future behavior or identify trends. This will impose additional read load on the database.
  • In the end, depending on the application’s requirements, the data captured may have a shelf life and may need to be archived or deleted after a certain period of time.

As you can see, working with time-series data is not simply about storing the data; it requires a wide range of data platform capabilities, including handling simultaneous read and write demands, advanced querying, and archival, to name a few.

Who is using MongoDB for time-series data?

MongoDB provides all the capabilities needed to meet the demands of highly performing time-series applications. One company that took advantage of MongoDB’s time-series capabilities is quantitative investment manager Man AHL.

Man AHL’s Arctic application leverages MongoDB to store high frequency financial services market data (about 250M ticks per second). The hedge fund manager’s quantitative researchers (“quants”) use Arctic and MongoDB to research, construct and deploy new trading models in order to understand how markets behave. With MongoDB, Man AHL realized a 40x cost saving when compared to an existing proprietary database. In addition to cost savings, they were able to increase processing performance by 25x over the previous solution. Man AHL open sourced their Arctic project on GitHub.

Bosch Group is a multinational engineering conglomerate with nearly 300,000 employees and is the world’s largest automotive components manufacturer. IoT is a strategic initiative at Bosch, and so the company selected MongoDB as the data platform layer in its IoT suite. The suite powers IoT applications both within the Bosch group and in many of its customers in industrial internet applications, such as automotive, manufacturing, smart city, precision agriculture, and more. If you want to learn more about the key challenges presented by managing diverse, rapidly changing and high volume time series data sets generated by IoT platforms, download the Bosch and MongoDB whitepaper.

Siemens is a global company focusing on the areas of electrification, automation and digitalization. Siemens developed “Monet,” a platform backed by MongoDB that provides advanced energy management services. Monet uses MongoDB for real time raw data storage, querying and analytics.

Focus on application requirements

When working with time-series data it is imperative that you invest enough time to understand how data is going to be created, queried, and expired. With this information you can optimize your schema design and deployment architecture to best meet the application’s requirements.

You should not agree to performance metrics or SLAs without capturing the application’s requirements.

As you begin your time-series project with MongoDB you should get answers to the following questions:

Write workload

  • What will the ingestion rate be? How many inserts and updates per second?
  • As the rate of inserts increases, your design may benefit from horizontal scaling via MongoDB auto-sharding, allowing you to partition and scale your data across many nodes
  • How many simultaneous client connections will there be?
  • While a single MongoDB node can handle many simultaneous connections from tens of thousands of IoT devices, you need to consider scaling those out with sharding to meet the expected client load.
  • Do you need to store all raw data points or can data be pre-aggregated? If pre-aggregated, what summary level of granularity or interval is acceptable to store? Per minute? Every 15 minutes?
  • MongoDB can store all your raw data if your application requirements justify this. However, keep in mind that reducing the data size via pre-aggregation will reduce dataset and index storage and increase query performance.
  • What is the size of data stored in each event?
  • MongoDB has an individual document size limit of 16 MB. If your application requires storing larger data within a single document, such as binary files, you may want to leverage MongoDB GridFS. Ideally, when storing high-volume time-series data it is a best practice to keep the document size small, around one disk block in size.

Read workload:

  • How many read queries per second?
  • A higher read query load may benefit from additional indexes or horizontal scaling via MongoDB auto-sharding. As with write volumes, reads can be scaled with auto-sharding, and you can also distribute read load across secondary replicas in your replica set.
  • Will clients be geographically dispersed or located in the same region?
  • You can reduce network read latency by deploying read-only secondary replicas that are geographically closer to the consumers of the data.
  • What are the common data access patterns you need to support? For example, will you retrieve data by a single value such as time, or do you need more complex queries where you look for data by a combination of attributes, such as event class, by region, by time?
  • Query performance is optimal when proper indexes are created. Knowing how data is queried and defining the proper indexes is critical to database performance. Also, being able to modify indexing strategies in real time, without disruption to the system, is an important attribute of a time-series platform.
  • What analytical libraries or tools will your consumers use?
  • If your data consumers are using tools like Hadoop or Spark, MongoDB has a MongoDB Spark Connector that integrates with these technologies. MongoDB also has drivers for Python, R, Matlab and other platforms used for analytics and data science.
  • Does your organization use BI visualization tools to create reports or analyze the data?
  • MongoDB integrates with most of the major BI reporting tools including Tableau, QlikView, Microstrategy, TIBCO, and others via the MongoDB BI Connector. MongoDB also has a native BI reporting tool called MongoDB Charts, which provides the fastest way to visualize your data in MongoDB without needing any third-party products.

Data retention and archival:

  • What is the data retention policy? Can data be deleted or archived? If so, at what age?
  • If archived, for how long and how accessible should the archive be? Does archive data need to be live or can it be restored from a backup?
  • There are various strategies to remove and archive data in MongoDB. Some of these strategies include using TTL indexes, Queryable Backups, zoned sharding (allowing you to create a tiered storage pattern), or simply creating an architecture where you just drop the collection of data when no longer needed.

Security:

  • What users and roles need to be defined, and what is the least privileged permission needed for each of these entities?
  • What are the encryption requirements? Do you need to support both in-flight (network) and at-rest (storage) encryption of time series data?
  • Do all activities against the data need to be captured in an audit log?
  • Does the application need to conform with GDPR, HIPAA, PCI, or any other regulatory framework?
  • The regulatory framework may require enabling encryption, auditing, and other security measures. MongoDB supports the security configurations necessary for these compliance requirements, including encryption at rest and in flight, auditing, and granular role-based access controls.

While not an exhaustive list of all possible things to consider, it will help get you thinking about the application requirements and their impact on the design of the MongoDB schema and database configuration. In the next blog post, “Time Series Data and MongoDB: Part 2 – Schema Design Best Practices,” we will explore a variety of ways to architect a schema for different sets of requirements, and their corresponding effects on the application’s performance and scale. In part 3, “Time Series Data and MongoDB: Part 3 – Querying, Analyzing, and Presenting Time-Series Data,” we will show how to query, analyze and present time-series data.

Introducing the MongoDB Enterprise Operator for Kubernetes and OpenShift

Today more DevOps teams are leveraging the power of containerization, and technologies like Kubernetes and Red Hat OpenShift, to manage containerized database clusters. To support teams building cloud-native apps with Kubernetes and OpenShift, we are introducing a Kubernetes Operator (beta) that integrates with Ops Manager, the enterprise management platform for MongoDB. The operator enables a user to deploy and manage MongoDB clusters from the Kubernetes API, without having to manually configure them in Ops Manager.

With this Kubernetes integration, you can consistently and effortlessly run and deploy workloads wherever they need to be, standing up the same database configuration in different environments, all controlled with a simple, declarative configuration. Operations teams can also offer developers new services like MongoDB-as-a-Service, which could provide them with a fully managed database, alongside other products and services, managed by Kubernetes and OpenShift.

In this blog, we’ll cover the following:

  • Brief discussion on the container revolution
  • Overview of MongoDB Ops Manager
  • How to Install and configure the MongoDB Enterprise Operator for Kubernetes
  • Troubleshooting
  • Where to go for more information

The containerization movement

If you have ever visited an international shipping port or driven down an interstate highway, you may have seen large rectangular metal containers generally referred to as intermodal containers. These containers are designed and built to the same specifications even though their contents can vary greatly. The consistent design not only enables these containers to move freely from ship, to rail, to truck; it also allows that movement without unloading and reloading the cargo contents.

This same concept of a container can be applied to software applications where the application is the contents of the container along with its supporting frameworks and libraries. The container can be freely moved from one platform to another all without disturbing the application. This capability makes it easy to move an application from an on-premise datacenter server to a public cloud provider, or to quickly stand up replica environments for development, test, and production usage.

MongoDB 4.0 introduces the MongoDB Enterprise Operator for Kubernetes, which enables a user to deploy and manage MongoDB clusters from the Kubernetes API, without the user having to connect directly to Ops Manager or Cloud Manager (the hosted version of Ops Manager, delivered as a service).

While MongoDB is fully supported in a containerized environment, you need to make sure that the benefits you get from containerizing the database exceed the cost of managing the configuration. As with any production database workload, these containers should use persistent storage and will require additional configuration depending on the underlying container technology used. To help facilitate the management of the containers themselves, DevOps teams are leveraging the power of orchestration technologies like Kubernetes and Red Hat OpenShift. While these technologies are great at container management, they are not aware of application specific configurations and deployment topologies such as MongoDB replica sets and sharded clusters. For this reason, Kubernetes has Custom Resources and Operators which allow third-parties to extend the Kubernetes API and enable application aware deployments.

Later in this blog you will learn how to install and get started with the MongoDB Enterprise Operator for Kubernetes. First let’s cover MongoDB Ops Manager, which is a key piece in efficient MongoDB cluster management.

Managing MongoDB

Ops Manager is an enterprise-class management platform for MongoDB clusters that you run on your own infrastructure. The capabilities of Ops Manager include monitoring, alerting, disaster recovery, scaling, deploying and upgrading of replica sets and sharded clusters, and other MongoDB products, such as the BI Connector. While a thorough discussion of Ops Manager is out of scope of this blog, it is important to understand the basic components that make up Ops Manager as they will be used by the Kubernetes Operator to create your deployments.

Figure 1: MongoDB Ops Manager deployment screen

A simplified Ops Manager architecture is shown in Figure 2 below. Note that there are other agents that Ops Manager uses to support features like backup but these are outside the scope of this blog and not shown. For complete information on MongoDB Ops Manager architecture see the online documentation found at the following URL: https://docs.opsmanager.mongodb.com/current/

Figure 2: Simplified Ops Manager deployment

The MongoDB HTTP Service provides a web application for administration. These pages are simply a front end to a robust set of Ops Manager REST APIs that are hosted in the Ops Manager HTTP Service. It is through these REST APIs that the Kubernetes Operator will interact with Ops Manager.

MongoDB Automation Agent

With a typical Ops Manager deployment there are many management options, including upgrading the cluster to a different version, adding secondaries to an existing replica set, and converting an existing replica set into a sharded cluster. So how does Ops Manager go about upgrading each node of a cluster or spinning up new MongoD instances? It does this by relying on a locally installed service called the Ops Manager Automation Agent, which runs on every single MongoDB node in the cluster. This lightweight service is available on multiple operating systems, so regardless of whether your MongoDB nodes are running in a Linux container, a Windows Server virtual machine, or on your on-prem PowerPC server, there is an Automation Agent available for that platform. The Automation Agents receive instructions from the Ops Manager REST APIs to perform work on the cluster node.

MongoDB Monitoring Agent

When Ops Manager shows statistics such as database size and inserts per second it is receiving this telemetry from the individual nodes running MongoDB. Ops Manager relies on the Monitoring Agent to connect to your MongoDB processes, collect data about the state of your deployment, then send that data to Ops Manager. There can be one or more Monitoring Agents deployed in your infrastructure for reliability but only one primary agent per Ops Manager Project is collecting data. Ops Manager is all about automation and as soon as you have the automation agent deployed, other supporting agents like the Monitoring agent are deployed for you. In the scenario where the Kubernetes Operator has issued a command to deploy a new MongoDB cluster in a new project, Ops Manager will take care of deploying the monitoring agent into the containers running your new MongoDB cluster.

Getting started with MongoDB Enterprise Operator for Kubernetes

Ops Manager is an integral part of automating a MongoDB cluster with Kubernetes. To get started you will need access to an Ops Manager 4.0+ environment or MongoDB Cloud Manager.

The MongoDB Enterprise Operator for Kubernetes is compatible with Kubernetes v1.9 and above. It has also been tested with OpenShift version 3.9. You will need access to a Kubernetes environment. If you do not have access to a Kubernetes environment, or just want to stand up a test environment, you can use minikube, which deploys a local single-node Kubernetes cluster on your machine. For additional information and setup instructions check out the following URL: https://kubernetes.io/docs/setup/minikube.

The following sections will cover the three-step installation and configuration of the MongoDB Enterprise Operator for Kubernetes. The order of installation will be as follows:

  • Step 1: Installing the MongoDB Enterprise Operator via Helm or a yaml file
  • Step 2: Creating and applying a Kubernetes ConfigMap file
  • Step 3: Creating the Kubernetes Secret object which will store the Ops Manager API Key

Step 1: Installing MongoDB Enterprise Operator for Kubernetes

To install the MongoDB Enterprise Operator for Kubernetes you can use Helm, the Kubernetes package manager, or pass a yaml file to kubectl. The instructions for both of these methods are as follows; pick one and continue to Step 2.

To install the operator via Helm:

To install with Helm you will first need to clone the public repo https://github.com/mongodb/mongodb-enterprise-kubernetes.git

Change directories into the local copy and run the following command on the command line:

helm install helm_chart/ --name mongodb-enterprise

To install the operator via a yaml file:

Run the following command from the command line:

kubectl apply -f https://raw.githubusercontent.com/mongodb/mongodb-enterprise-kubernetes/master/mongodb-enterprise.yaml

At this point the MongoDB Enterprise Operator for Kubernetes is installed and now needs to be configured. First, we must create and apply a Kubernetes ConfigMap file. A Kubernetes ConfigMap holds key-value pairs of configuration data that can be consumed in pods. In this use case the ConfigMap file will store configuration information about the Ops Manager deployment we want to use.

Step 2: Creating the Kubernetes ConfigMap file

For the Kubernetes Operator to know what Ops Manager you want to use you will need to obtain some properties from the Ops Manager console and create a ConfigMap file. These properties are as follows:

Base Url - The URL of your Ops Manager or Cloud Manager.

Project Id - The id of an Ops Manager Project which the Kubernetes Operator will deploy into.

User - An existing Ops Manager username

Public API Key - Used by the Kubernetes Operator to connect to the Ops Manager REST API endpoint

If you already know how to obtain these values, copy them down and proceed to Step 3.

Base Url

The Base Url is the URL of your Ops Manager or Cloud Manager.

Note: If you are using Cloud Manager, the Base Url is “https://cloud.mongodb.com”.

To obtain the Base Url in Ops Manager, copy the URL used to connect to your Ops Manager server from your browser's navigation bar. It should be something similar to http://servername:8080. You can also perform the following:

Login to Ops Manager and click on the Admin button. Next select the “Ops Manager Config” menu item. You will be presented with a screen similar to the figure below:

Figure 3: Ops Manager Config page

Copy down the value displayed in the URL To Access Ops Manager box. Note: If you don’t have access to the Admin drop-down you will have to copy the URL used to connect to your Ops Manager server from your browser's navigation bar.

Project Id

The Project Id is the id of an Ops Manager Project which the Kubernetes Operator will deploy into.

An Ops Manager Project is a logical organization of MongoDB clusters that also provides a security boundary. One or more Projects are part of an Ops Manager Organization. If you need to create an Organization, click on your user name at the upper right side of the screen and select “Organizations”. Next click on the “+ New Organization” button and provide a name for your Organization. Once you have an Organization you can create a Project.

Figure 4: Ops Manager Organizations page

To create a new Project, click on your Organization name. This will bring you to the Projects page and from here click on the “+ New Project” button and provide a unique name for your Project. If you are not an Ops Manager administrator you may not have this option and will have to ask your administrator to create a Project.

Once the Project is created or if you already have a Project created on your behalf by an administrator you can obtain the Project Id by clicking on the Settings menu option as shown in the Figure below.

Figure 5: Project Settings page

Copy the Project ID.

User

The User is an existing Ops Manager username

To see the list of Ops Manager users, return to the Project and click on the “Users & Teams” menu. You can use any Ops Manager user who has at least Project Owner access. If you’d like to create another username, click on the “Add Users & Team” button as shown in Figure 6.

Figure 6: Users & Teams page

Copy down the email of the user you would like the Kubernetes Operator to use when connecting to Ops Manager.

Public API Key

The Ops Manager API Key is used by the Kubernetes Operator to connect to the Ops Manager REST API endpoint. You can create an API Key by clicking on your username in the upper right-hand corner of the Ops Manager console and selecting “Account” from the drop-down menu. This will open the Account Settings page as shown in Figure 7.

Figure 7: Public API Access page

Click on the “Public API Access” tab. To create a new API key click on the “Generate” button and provide a description. Upon completion you will receive an API key as shown in Figure 8.

Figure 8: Confirm API Key dialog

Be sure to copy the API Key as it will be used later as a value in a configuration file. It is important to copy this value while the dialog is up since you cannot read it back once you close the dialog. If you missed writing the value down you will need to delete the API Key and create a new one.

Note: If you are using MongoDB Cloud Manager or have Ops Manager deployed in a secured network you may need to whitelist the IP range of your Kubernetes cluster so that the Operator can make requests to Ops Manager using this API Key.

Now that we have acquired the necessary Ops Manager configuration information we need to create a Kubernetes ConfigMap file for the Kubernetes Project. To do this use a text editor of your choice and create the following yaml file, substituting the bold placeholders for the values you obtained in the Ops Manager console. For sample purposes we can call this file “my-project.yaml”.

apiVersion: v1
kind: ConfigMap
metadata:
  name: <<Name, i.e. name of Ops Manager Project>>
  namespace: mongodb
data:
  projectId: <<Project ID>>
  baseUrl: <<OpsManager URL>>

Figure 9: Sample ConfigMap file

Note: The format of the ConfigMap file may change over time as features and capabilities get added to the Operator. Be sure to check with the MongoDB documentation if you are having problems submitting the ConfigMap file.

Once you create this file you can apply the ConfigMap to Kubernetes using the following command:

kubectl apply -f my-project.yaml

Step 3: Creating the Kubernetes Secret

For a user to be able to create or update objects in an Ops Manager Project they need a Public API Key. Earlier in this section we created a new API Key and you hopefully wrote it down. This API Key will be held by Kubernetes as a Secret object.

You can create this Secret with the following command:

kubectl -n mongodb create secret generic <<Name of credentials>> --from-literal="user=<<User>>" --from-literal="publicApiKey=<<public-api-key>>"

Make sure you replace the User and Public API key values with those you obtained from your Ops Manager console. You can pick any name for the credentials – just make a note of it as you will need it later when you start creating MongoDB clusters.

Now we're ready to start deploying MongoDB Clusters!

Deploying a MongoDB Replica Set

Kubernetes can deploy a MongoDB standalone, replica set, or sharded cluster. To deploy a three-node replica set, create the following yaml file:

apiVersion: mongodb.com/v1
kind: MongoDbReplicaSet
metadata:
  name: <<Name of your new MongoDB replica set>>
  namespace: mongodb
spec:
  members: 3
  version: 3.6.5

  persistent: false

  project: <<Name value specified in metadata.name of ConfigMap file>>
  credentials: <<Name of credentials secret>>
Figure 10: simple-rs.yaml file describing a three node replica set

The name of your new cluster can be any name you choose. The name of the Ops Manager Project ConfigMap and the name of the credentials secret were defined previously.

To submit the request for Kubernetes to create this cluster simply pass the name of the yaml file you created to the following kubectl command:

kubectl apply -f simple-rs.yaml

After a few minutes your new cluster will show up in Ops Manager as shown in Figure 11.

Figure 11: Servers tab of the Deployment page in Ops Manager

Notice that Ops Manager installed not only the Automation Agents on these three containers running MongoDB, but also the Monitoring and Backup Agents.

A word on persistent storage

What good would a database be if anytime the container died your data went to the grave as well? Probably not a good situation, and maybe one where tuning up the resumé might be a good thing to do as well. Up until recently, the lack of persistent storage and consistent DNS mappings were major issues with running databases within containers. Fortunately, recent work in the Kubernetes ecosystem has addressed this concern, and new features like PersistentVolumes and StatefulSets have emerged, allowing you to deploy databases like MongoDB without worrying about losing data because of hardware failure or because the container was moved elsewhere in your datacenter. Additional configuration of the storage is required on the Kubernetes cluster before you can deploy a MongoDB cluster that uses persistent storage. In Kubernetes there are two types of persistent volumes: static and dynamic. The Kubernetes Operator can provision MongoDB objects (i.e. standalone, replica set and sharded clusters) using either type.

Connecting your application

Connecting to MongoDB deployments in Kubernetes is no different than with other deployment topologies. However, it is likely that you'll need to address the network specifics of your Kubernetes configuration. To abstract the deployment-specific information such as hostnames and ports of your MongoDB deployment, the MongoDB Enterprise Operator for Kubernetes uses Kubernetes Services.

Services

Each MongoDB deployment type will have two Kubernetes services generated automatically during provisioning. For example, suppose we have a single three-node replica set called "my-replica-set"; you can then enumerate the services using the following statement:

kubectl get all -n mongodb --selector=app=my-replica-set-svc

This statement yields the following results:

NAME                   READY     STATUS    RESTARTS   AGE
pod/my-replica-set-0   1/1       Running   0          29m
pod/my-replica-set-1   1/1       Running   0          29m
pod/my-replica-set-2   1/1       Running   0          29m

NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)           AGE
service/my-replica-set-svc            ClusterIP   None             <none>        27017/TCP         29m
service/my-replica-set-svc-external   NodePort    10.103.220.236   <none>        27017:30057/TCP   29m

NAME                              DESIRED   CURRENT   AGE
statefulset.apps/my-replica-set   3         3         29m

Note the appended string "-svc" to the name of the replica set.

The service with "-external" is a NodePort, which means it is exposed on a port (30057 in this example) on each Kubernetes node, making the replica set reachable from outside the cluster.

Note: If you are using Minikube you can obtain the IP address of the running replica set by issuing the following:

minikube service list

In our example which used minikube the result set contained the following information: mongodb my-replica-set-svc-external http://192.168.39.95:30057

Now that we know the IP of our MongoDB cluster we can connect using the Mongo Shell or whatever application or tool you would like to use.
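For example, here is a short PyMongo connection test against the NodePort address reported above; the host and port come from the minikube service list output and will differ in your cluster:

from pymongo import MongoClient

# Host and port come from the `minikube service list` output shown above;
# substitute the values reported by your own cluster.
client = MongoClient("mongodb://192.168.39.95:30057")
print(client.admin.command("ping"))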

Basic Troubleshooting

If you are having problems submitting a deployment you should read the logs. Problems like authentication failures and other common errors can be easily detected in the log files. You can view the MongoDB Enterprise Operator for Kubernetes log files via the following command:

kubectl logs -f deployment/mongodb-enterprise-operator -n mongodb

You can also use kubectl to see the logs of the database pods. The main container process continually tails the Automation Agent logs and can be seen with the following statement:

kubectl logs <<name of pod>> -n mongodb

Note: You can enumerate the list of pods using

kubectl get pods -n mongodb

Another common troubleshooting technique is to shell into one of the containers running MongoDB. Here you can use common Linux tools to view the processes, troubleshoot, or even check mongo shell connections (sometimes helpful in diagnosing network issues).

kubectl exec -it <<name of pod>> -n mongodb -- /bin/bash

Once inside the container, a process listing (e.g. via ps -ef) looks like the following:

UID        PID  PPID  C STIME TTY          TIME CMD
mongodb      1     0  0 16:23 ?        00:00:00 /bin/sh -c supervisord -c /mongo
mongodb      6     1  0 16:23 ?        00:00:01 /usr/bin/python /usr/bin/supervi
mongodb      9     6  0 16:23 ?        00:00:00 bash /mongodb-automation/files/a
mongodb     25     9  0 16:23 ?        00:00:00 tail -n 1000 -F /var/log/mongodb
mongodb     26     1  4 16:23 ?        00:04:17 /mongodb-automation/files/mongod
mongodb     45     1  0 16:23 ?        00:00:01 /var/lib/mongodb-mms-automation/
mongodb     56     1  0 16:23 ?        00:00:44 /var/lib/mongodb-mms-automation/
mongodb     76     1  1 16:23 ?        00:01:23 /var/lib/mongodb-mms-automation/
mongodb   8435     0  0 18:07 pts/0    00:00:00 /bin/bash

From inside the container we can make a connection to the local MongoDB node easily by running the mongo shell via the following command:

/var/lib/mongodb-mms-automation/mongodb-linux-x86_64-3.6.5/bin/mongo --port 27017

Note: The version of the Automation Agent may be different from 3.6.5; be sure to check the directory path.

Where to go for more information

More information will be available on the MongoDB documentation website in the near future. Until then check out these resources for more information:

Slack: #enterprise-kubernetes

Sign up @ https://launchpass.com/mongo-db

GitHub: https://github.com/mongodb/mongodb-enterprise-kubernetes

To see all MongoDB operations best practices, download our whitepaper: https://www.mongodb.com/collateral/mongodb-operations-best-practices

Getting Started with Python and MongoDB

You can get started with MongoDB and your favorite programming language by leveraging one of its drivers, many of which are maintained by MongoDB engineers and others by members of the community. MongoDB has a native Python driver, PyMongo, and a team of driver engineers dedicated to making the driver fit the Python community’s needs.

In this article, which is aimed at Python developers who are new to MongoDB, you will learn how to do the following:

  • Create a free hosted MongoDB database using MongoDB Atlas
  • Install PyMongo, the Python Driver
  • Connect to MongoDB
  • Explore MongoDB Collections and Documents
  • Perform basic Create, Retrieve, Update and Delete (CRUD) operations using PyMongo

Let’s get started!

You can start working immediately with MongoDB by using a free MongoDB cluster via MongoDB Atlas. MongoDB Atlas is a hosted database service that allows you to choose your database size and get a connection string! If you are interested in using the free tier follow the instructions in the Appendix section at the end of this article.

Install the Python Driver

For this article we will install the Python driver called “PyMongo”.

Although there are other drivers written by the community, PyMongo is the official Python driver for MongoDB. For detailed documentation on the driver, check out the documentation here.

The easiest way to install the driver is through the pip package management system. Execute the following on a command line:

python -m pip install pymongo

Note: If you are using the Atlas M0 (Free Tier) cluster, you must use Python 2.7.9 or newer, or Python 3.4 or newer. You can check which versions of Python and PyMongo you have installed by issuing the “python --version” and “pip list” commands, respectively.

For variations of driver installation, check out the complete documentation.

Once PyMongo is installed we can write our first application that will return information about the MongoDB server. In your Python development environment or from a text editor enter the following code.

from pymongo import MongoClient
# pprint library is used to make the output look more pretty
from pprint import pprint
# connect to MongoDB, change the << MONGODB URL >> to reflect your own connection string
client = MongoClient("<<MONGODB URL>>")
db=client.admin
# Issue the serverStatus command and print the results
serverStatusResult=db.command("serverStatus")
pprint(serverStatusResult)

Replace “<<MONGODB URL>>” with your connection string to MongoDB. Save this file as “mongodbtest.py” and run it from the command line via “python mongodbtest.py”.

An example output appears as follows:

{u'asserts': {u'msg': 0,
              u'regular': 0,
              u'rollovers': 0,
              u'user': 0,
              u'warning': 0},
 u'connections': {u'available': 96, u'current': 4, u'totalCreated': 174L},
 u'extra_info': {u'note': u'fields vary by platform', u'page_faults': 0},
 u'host': u'cluster0-shard-00-00-6czvq.mongodb.net:27017',
 u'localTime': datetime.datetime(2017, 4, 4, 0, 18, 45, 616000),
.
.
.
}

Note that the ‘u’ prefix comes from Python 2 and indicates that the strings are stored as Unicode; if you run the script with Python 3 you will not see it. This example also uses the pprint library, which is not related to MongoDB but is used here only to make the output structured and visually appealing in a console.

In this example we are connecting to our MongoDB instance and issuing the “db.serverStatus()” command (reference). This command returns information about our MongoDB instance and is used in this example as a way to execute a command against MongoDB.

If your application runs successfully, you are ready to continue!

Exploring Collections and Documents

MongoDB stores data in documents. Documents are not like Microsoft Word or Adobe PDF documents, but rather JSON documents based on the JSON specification.
An example of a JSON document would be as follows:

![JSON document example](https://webassets.mongodb.com/_com_assets/cms/JSON_Example_Python_MongoDB-mzqqz0keng.png)
Figure 1: Sample document

Notice that documents are not just key/value pairs but can include arrays and subdocuments. The data itself can be different data types like geospatial, decimal, and ISODate to name a few. Internally MongoDB stores a binary representation of JSON known as BSON. This allows MongoDB to provide data types like decimal that are not defined in the JSON specification. For more information on the BSON spec check out the following URL: http://bsonspec.org.
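To see how some of these richer types look from Python, here is a minimal sketch; the collection name “examples” and the values are made up for illustration, and it assumes a MongoDB instance running locally on the default port:

from datetime import datetime
from pymongo import MongoClient
from bson.decimal128 import Decimal128

# Assumes a MongoDB instance is listening on localhost:27017
client = MongoClient(port=27017)
db = client.business

# A document mixing strings, a subdocument, an array, a datetime, and a decimal
doc = {
    'name': 'Salty Fish LLC',
    'address': {'city': 'New York', 'zip': '10001'},  # subdocument
    'tags': ['seafood', 'family friendly'],           # array
    'opened': datetime(2017, 4, 4),                   # stored as a BSON date
    'avg_check': Decimal128('23.99')                  # BSON decimal type
}

result = db.examples.insert_one(doc)
print('Inserted document with _id:', result.inserted_id)
print(db.examples.find_one({'_id': result.inserted_id}))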

A collection in MongoDB is a container for documents. A database is the container for collections. This grouping is similar to relational databases and is shown in the table below:

| Relational concept | MongoDB equivalent |
| ------------------ | ------------------ |
| Database           | Database           |
| Tables             | Collections        |
| Rows               | Documents          |
| Index              | Index              |

There are many advantages to storing data in documents. While a deeper discussion is out of the scope of this article, some of the advantages like dynamic, flexible schema, and the ability to store arrays can be seen from our simple Python scripts. For more information on MongoDB document structure take a look at the online documentation at the following URL: https://docs.mongodb.com/manual/core/document/.

Let’s take a look at how to perform basic CRUD operations on documents in MongoDB using PyMongo.

Performing basic CRUD operations using PyMongo

To establish a connection to MongoDB with PyMongo you use the MongoClient class.

from pymongo import MongoClient
client = MongoClient('<<MongoDB URL>>')

The “<<MongoDB URL>>” is a placeholder for the connection string to MongoDB. See the connection string documentation for detailed information on how to create your MongoDB connection string. If you are using Atlas for your MongoDB database, refer to the “Testing your connection” section for more information on obtaining the connection string for MongoDB Atlas.

We can now create a database object referencing a new database, called “business”, as follows:

db = client.business

Once we create this object we can perform our CRUD operations. Since we want something useful to query let’s start by building a sample data generator application.

Generating sample data code example

Create a new file called createsamples.py using your development tool or command line text editor and copy the following code:

from pymongo import MongoClient
from random import randint
#Step 1: Connect to MongoDB - Note: Change connection string as needed
client = MongoClient(port=27017)
db=client.business
#Step 2: Create sample data
names = ['Kitchen','Animal','State', 'Tastey', 'Big','City','Fish', 'Pizza','Goat', 'Salty','Sandwich','Lazy', 'Fun']
company_type = ['LLC','Inc','Company','Corporation']
company_cuisine = ['Pizza', 'Bar Food', 'Fast Food', 'Italian', 'Mexican', 'American', 'Sushi Bar', 'Vegetarian']
for x in range(1, 501):
    business = {
        'name' : names[randint(0, (len(names)-1))] + ' ' + names[randint(0, (len(names)-1))]  + ' ' + company_type[randint(0, (len(company_type)-1))],
        'rating' : randint(1, 5),
        'cuisine' : company_cuisine[randint(0, (len(company_cuisine)-1))]
    }
    #Step 3: Insert business object directly into MongoDB via insert_one
    result=db.reviews.insert_one(business)
    #Step 4: Print to the console the ObjectID of the new document
    print('Created {0} of 500 as {1}'.format(x,result.inserted_id))
#Step 5: Tell us that you are done
print('finished creating 500 business reviews')

Be sure to change the MongoDB client connection URL to one that points to your MongoDB database instance. Once you run this application, 500 randomly named businesses with their corresponding ratings will be created in the MongoDB database called “business”. All of these businesses are created in a single collection called “reviews”. Notice that we do not have to explicitly create a database beforehand in order to use it. This is different from other databases that require statements like “CREATE DATABASE” to be performed first.
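If you want to verify this implicit creation, a quick sketch (assuming PyMongo 3.6 or newer for the list_*_names helpers) is to list the databases and collections after the script has run:

from pymongo import MongoClient

# Assumes the sample data generator above was run against localhost:27017
client = MongoClient(port=27017)

# The business database and reviews collection now exist because documents were inserted into them
print(client.list_database_names())
print(client.business.list_collection_names())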

The command that inserts data into MongoDB in this example is the insert_one() function. A bit self-explanatory, insert_one will insert one document into MongoDB. The result set will return the single ObjectID that was created. This is one of a few methods that insert data. If you wanted to insert multiple documents in one call you can use the insert_many function. In addition to an acknowledgement of the insertion, the result set for insert_many will include a list of the ObjectIDs that were created. For more information on insert_many see the documentation located here.
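As a quick, hypothetical sketch of insert_many against the same reviews collection (note that the result object exposes an inserted_ids list rather than a single inserted_id):

from pymongo import MongoClient

# Assumes a MongoDB instance is listening on localhost:27017
client = MongoClient(port=27017)
db = client.business

more_reviews = [
    {'name': 'Lazy Goat Inc', 'rating': 3, 'cuisine': 'Italian'},
    {'name': 'Big City Pizza Company', 'rating': 4, 'cuisine': 'Pizza'}
]

# insert_many inserts every document in the list with a single call
result = db.reviews.insert_many(more_reviews)
print('Inserted ObjectIDs:', result.inserted_ids)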

For details on the result set of insert_many check out this section of documentation as well.

We are now ready to explore querying and managing data in MongoDB using Python. To guide this exploration we will create another application that will manage our business reviews.

Exploring business review data

Now that we have a good set of data in our database let’s query for some results using PyMongo.

In MongoDB the find_one command is used to query for a single document, much like select statements are used in relational databases. To use the find_one command in PyMongo we pass a Python dictionary that specifies the search criteria. For example, let’s find a single business with a review score of 5 by passing the dictionary “{ 'rating' : 5 }”.

fivestar = db.reviews.find_one({'rating': 5})
print(fivestar)

The result will contain data similar to the following:

{u'rating': 5,
 u'_id': ObjectId('58e65383ea0b650c867ef195'),
 u'name': u'Fish Salty Corporation', 
u'cuisine': u'Sushi Bar'}

Given we created 500 sample pieces of data there is more than one business with rating 5. The find_one method is just one in a series of find statements that support querying MongoDB data. Another statement, called “find”, will return a cursor over all documents that match the search criteria. These cursors also support methods like count() which returns the number of results in the query. To find the total count of businesses that are rated with a 5 we can use the count() method as follows:

fivestarcount = db.reviews.find({'rating': 5}).count()
print(fivestarcount)

Your results may vary since the data was randomly generated but in a test run the value of 103 was returned.
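If you want the matching documents themselves rather than just their count, you can iterate the cursor returned by find. Here is a minimal sketch; note that newer PyMongo releases (3.7 and above) prefer the collection-level count_documents method, and PyMongo 4 removed count() from cursors entirely:

from pymongo import MongoClient

# Assumes the sample data created earlier on localhost:27017
client = MongoClient(port=27017)
db = client.business

# find() returns a cursor over every matching document; limit(5) keeps the output short
for review in db.reviews.find({'rating': 5}).limit(5):
    print(review['name'], '-', review['cuisine'])

# The preferred way to count matches in PyMongo 3.7+ is count_documents
print(db.reviews.count_documents({'rating': 5}))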

MongoDB can easily perform these straightforward queries. However, consider the scenario where you want to sum the occurrence of each rating across the entire data set. In MongoDB you could create 5 separate find queries, execute them and present the results, or you could simply issue a single query using the MongoDB aggregation pipeline as follows:

from pymongo import MongoClient
# Connect to the MongoDB, change the connection string per your MongoDB environment
client = MongoClient(port=27017)
# Set the db object to point to the business database
db=client.business
# Showcasing the count() method of find, count the total number of 5 ratings 
print('The number of 5 star reviews:')
fivestarcount = db.reviews.find({'rating': 5}).count()
print(fivestarcount)
# Now let's use the aggregation framework to sum the occurrence of each rating across the entire data set
print('\nThe sum of each rating occurrence across all data, grouped by rating ')
stargroup=db.reviews.aggregate(
# The Aggregation Pipeline is defined as an array of different operations
[
# The first stage in this pipe is to group data
{ '$group':
    { '_id': "$rating",
     "count" : 
                 { '$sum' :1 }
    }
},
# The second stage in this pipe is to sort the data
{"$sort":  { "_id":1}
}
# Close the array with the ] tag             
] )
# Print the result
for group in stargroup:
    print(group)

A deep dive into the aggregation framework is out of scope of this article, however, if you are interested in learning more about it check out the following URL: https://docs.mongodb.com/manual/aggregation/.

Updating data with PyMongo

Similar to insert_one and insert_many, there are functions to help you update your MongoDB data, including update_one, update_many, and replace_one. The update_one method will update a single document based on a query that matches a document. For example, let’s assume that our business review application now has the ability for users to “like” a business. To illustrate updating a document with this new “likes” field, let’s first take a look at what an existing document looks like from our previous application’s insertion into MongoDB. Next, let’s update the document, query it again, and see the change.

from pymongo import MongoClient
#include pprint for readability of the output
from pprint import pprint

#change the MongoClient connection string to your MongoDB database instance
client = MongoClient(port=27017)
db=client.business

ASingleReview = db.reviews.find_one({})
print('A sample document:')
pprint(ASingleReview)

result = db.reviews.update_one({'_id' : ASingleReview.get('_id') }, {'$inc': {'likes': 1}})
print('Number of documents modified : ' + str(result.modified_count))

UpdatedDocument = db.reviews.find_one({'_id':ASingleReview.get('_id')})
print('The updated document:')
pprint(UpdatedDocument)

When running the sample code above you may see results similar to the following:

A sample document:
{'_id': ObjectId('58eba417ea0b6523b0fded4f'),
 'cuisine': 'Pizza',
 'name': 'Kitchen Goat Corporation',
 'rating': 1}

Number of documents modified : 1

The updated document:
{'_id': ObjectId('58eba417ea0b6523b0fded4f'),
 'cuisine': 'Pizza',
 'likes': 1,
 'name': 'Kitchen Goat Corporation',
 'rating': 1}

Notice that the original document did not have the “likes” field, and an update allowed us to easily add the field to the document. This ability to dynamically add keys without the hassle of costly ALTER TABLE statements is the power of MongoDB’s flexible data model. It makes rapid application development a reality.

If you wanted to update all the fields of the document and keep the same ObjectID you will want to use the replace_one function. For more details on replace_one check out the pymongo documentation here.
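A minimal sketch of replace_one, reusing the sample document from the update example above (the replacement values are made up): the document body is replaced entirely while the _id stays the same.

from pymongo import MongoClient

client = MongoClient(port=27017)
db = client.business

original = db.reviews.find_one({})

# Replace every field of the matched document; the _id is preserved
result = db.reviews.replace_one(
    {'_id': original.get('_id')},
    {'name': 'Fun Sandwich Company', 'rating': 2, 'cuisine': 'American'}
)
print('Documents modified : ' + str(result.modified_count))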

The update functions also support an option called “upsert”. With upsert you can tell MongoDB to create a new document if the document you are trying to update does not exist.
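For example, here is a minimal sketch of an upsert with update_one, using a made-up business name that may not exist yet; if no document matches the query, MongoDB inserts a new one:

from pymongo import MongoClient

client = MongoClient(port=27017)
db = client.business

# upsert=True inserts a new document when no existing document matches the query
result = db.reviews.update_one(
    {'name': 'Hypothetical Salty State LLC'},
    {'$set': {'rating': 4, 'cuisine': 'Vegetarian'}},
    upsert=True
)
print('Matched documents : ' + str(result.matched_count))
print('Upserted _id : ' + str(result.upserted_id))  # None if an existing document was updated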

Deleting documents

Much like the other commands discussed so far, the delete_one and delete_many commands take as the first parameter a query that matches the documents to delete. For example, if you wanted to delete all documents in the reviews collection where the cuisine is “Bar Food”, issue the following:

result = db.reviews.delete_many({'cuisine': 'Bar Food'})

If you are deleting a large number of documents it may be more efficient to drop the collection instead of deleting all the documents.
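For instance, a minimal sketch of dropping the entire reviews collection once you no longer need the sample data:

from pymongo import MongoClient

client = MongoClient(port=27017)
db = client.business

# Dropping the collection removes all of its documents and indexes in one operation
db.reviews.drop()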

Where to go next

There are lots of options when it comes to learning about MongoDB and Python. MongoDB University is a great place to start and learn about administration, development and other topics such as analytics with MongoDB. One course in particular is MongoDB for Developers (Python). This course covers the topics of this article in much more depth including a discussion on the MongoDB aggregation framework. For more information go to the following URL:

https://university.mongodb.com/courses/M101P/about

Appendix: Creating a free tier MongoDB Atlas database

MongoDB Atlas is a hosted database service that allows you to choose your database size and get a connection string! Follow the steps below to start using your free tier cluster.

Build your cluster for free

Follow the steps below to create a free MongoDB database:

  1. Go to the following URL: https://www.mongodb.com/cloud/atlas.
  2. Click the “Start Free” button.
  3. Fill out the form to create an account. You will use this information later to log in and manage your MongoDB deployment.

Once you fill out the form, the website will create your account and you will be presented with the “Build Your New Cluster” pop up as shown in Figure 1.

!["Build your new cluster"](https://webassets.mongodb.com/_com_assets/cms/Build_Cluster_PopUp-d20lxqthmw.png)

To use the free tier, scroll down and select “M0”. When you do this the regions panel will be disabled. The free tier has some restrictions: the ability to select a region is one of them, and your database size will be limited to 512 MB of storage. Given that, when you are ready to use MongoDB for more than just some simple operations, you can easily create another instance by choosing a size from the “Instance Size” list. Before you click “Confirm & Deploy”, scroll down the page and notice the additional options shown in Figure 2.

![Additional options in Build New Cluster dialog](https://webassets.mongodb.com/_com_assets/cms/Additional_options_build_cluster-ylbrcshrxr.png)

From the “Build Your New Cluster” pop up you can see that there are other options available, including choosing a 3, 5 or 7 node replica set and up to a 12 shard cluster. Note that the free tier does not allow you to choose anything more than the 3 node cluster, but if you move into other sizes these options will become available. At this point we are almost ready; the last thing to address is the admin username and password. You may also choose to have a random password generated for you by clicking the “Autogenerate Secure Password” button. Finally, click the “Confirm & Deploy” button to create your Atlas cluster.

Setting up your IP Whitelist

While Atlas is creating your database you will need to define which IPs are allowed access to your new database, since MongoDB Atlas does not allow access from the internet by default. This list of granted IP addresses is called the “IP Whitelist”. To add the IP of your machine to this list, click on the “Security” tab, then “IP Whitelist”, then click the “+ ADD IP ADDRESS” button. This will pop up another dialog, shown in Figure 3 below. You can click the “Add current IP Address” button to add your IP, provide a specific IP address, or enable access to the world by not restricting IPs at all (not a fantastic idea, but it is there in case you have no other choice and need to allow access from any IP).

Figure 3: The “Add Whitelist Entry” dialog

Once you have filled out this dialog click “Confirm” and this will update the firewall settings on your MongoDB Atlas cluster. Next, click on the “Clusters” tab and you should see your new MongoDB database ready for action!

!["Cluster0" ready for action](https://webassets.mongodb.com/_com_assets/cms/Cluster0-sscn19f6w9.png)

Testing your connection

We want to make sure the MongoDB database is accessible from our development box before we start typing in code. A quick way to test is to make a connection using the Mongo Shell command line tool. Be sure to have your MongoDB connection information available. If you are using MongoDB Atlas you can obtain the connection information by clicking on the “Connect” button on the Clusters tab as shown in Figure 5.

![Connect button of the MongoDB Atlas cluster](https://webassets.mongodb.com/_com_assets/cms/Connect_Button_Atlas_Cluster-pesbifigmx.png)

The Connect button will launch a dialog that provides connection information. At the bottom of this dialog you will see a prepared command line ready for you to simply copy and paste in a command prompt.

![Connect with Mongo Shell section of the Connect dialog](https://webassets.mongodb.com/_com_assets/cms/Connect_Mongo_shell-rb8l8ue7et.png)

Note that if you copy the connection text as-is, you will have to replace the password placeholder with the password for the admin user, and the database placeholder with the name of the database to which you wish to connect.

The command text that comes from this dialog is lengthy. For clarity, let’s take a look at each of the parameters individually.

mongo
"mongodb://cluster0-shard-00-00-2ldwo.mongodb.net:27017,cluster0-shard-00-01-2ldwo.mongodb.net:27017,cluster0-shard-00-02-2ldwo.mongodb.net:27017/test?replicaSet=Cluster0-shard-0"
 --authenticationDatabase admin 
--ssl
--username myadmin 
--password S$meComPLeX1!

The first parameter is a string containing the list of all the nodes in our cluster, including the definition of a replica set called “Cluster0-shard-0”. The next parameter, “--authenticationDatabase”, tells the shell which database contains the user we want to authenticate as. The “--ssl” flag forces the connection to be encrypted via the SSL/TLS protocol. Finally we provide the username and password, and we are connected! Note that if you are not using MongoDB Atlas, your MongoDB deployment may not have security enabled or require SSL; in that case, connecting to it could be as simple as typing “mongo” in the command prompt.
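For reference, the same information can be used from Python by passing a single MongoDB URI to MongoClient. This is only a sketch: the hostnames below come from the example dialog above, while the username and password are placeholders you must replace with your own (special characters in the password must be percent-encoded in a URI):

from pymongo import MongoClient

# Placeholder credentials; substitute your own admin user and (URL-encoded) password
uri = ("mongodb://myadmin:MyPassword123@"
       "cluster0-shard-00-00-2ldwo.mongodb.net:27017,"
       "cluster0-shard-00-01-2ldwo.mongodb.net:27017,"
       "cluster0-shard-00-02-2ldwo.mongodb.net:27017/test"
       "?ssl=true&replicaSet=Cluster0-shard-0&authSource=admin")

client = MongoClient(uri)
# Run a simple command to confirm the connection works
print(client.admin.command('serverStatus')['host'])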

You are now ready to use MongoDB!


If you're interested in learning everything you need to know to get started building a MongoDB-based app you can sign up for one of our free online MongoDB University courses.