GIANT Stories at MongoDB

Scaling Your Replica Set: Non-Blocking Secondary Reads in MongoDB 4.0

MongoDB 4.0 adds the ability to read from secondaries while replication is simultaneously processing writes. To see why this is new and important, let's look at secondary read behavior in versions prior to 4.0.

Background

From the outset, MongoDB has been designed so that when you have sequences of writes on the primary, each of the secondary nodes must show the writes in the same order. If you change field "A" in a document and then change field "B", it is not possible to see that document with field "B" changed and field "A" unchanged. Eventually consistent systems may let you see writes out of order, but MongoDB does not, and never has.

On secondary nodes, we apply writes in batches, because applying them sequentially would likely cause secondaries to fall behind the primary. When writes are applied in batches, we must block reads so that applications cannot see data applied in the "wrong" order. This is why, when reading from secondaries, readers periodically have to wait for replication batches to be applied. The heavier the write load, the more likely your secondary reads will hit these occasional "pauses", impacting your latency metrics. Given that applications frequently use secondary reads to reduce query latency (for example, with the "nearest" readPreference), having to wait for replication batches to be applied defeats the goal of getting the lowest possible latency on your reads.
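For illustration, here is a minimal PyMongo sketch of routing reads with the "nearest" read preference; the hosts, replica set name, and namespace below are placeholders, not values from this post.

from pymongo import MongoClient

# Placeholder replica set URI; substitute your own hosts and replica set name.
client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0",
    readPreference="nearest",  # route reads to the lowest-latency member, often a secondary
)

# This read may be served by a secondary; prior to 4.0 it could occasionally
# stall while that secondary applied a replication batch.
doc = client.mydb.mycollection.find_one({"A": 1})
print(doc)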

In addition to readers having to wait for replication batch writes to finish, the writing of batches needs a lock that requires all reads to complete before it can be taken. That means that in the presence of a high number of reads, the replication writes can start lagging – an issue that is compounded when chain replication is enabled.

What was our goal in MongoDB 4.0?

Our goal was to allow reads during oplog application to decrease read latency and secondary lag, and increase maximum throughput of the replica set. For replica sets with a high write load, not having to wait for readers between applying oplog batches allows for lower lag and quicker confirmation of majority writes, resulting in less cache pressure on the primary and better performance overall.

How did we do it?

Starting with MongoDB 4.0, we took advantage of the support for timestamps we had implemented in the storage engine, which allows transactions to get a consistent view of data at a specific "cluster time". For more details about this, see the video: WiredTiger timestamps.

Secondary reads can now also take advantage of these snapshots by reading from the latest consistent snapshot prior to the replication batch that is currently being applied. Reading from that snapshot guarantees a consistent view of the data, and since applying the current replication batch doesn't change these earlier records, we can now relax the replication lock and allow all these secondary reads to proceed while the writes are happening.

How much difference does this make?

A lot! The throughput improvement ranges from none (if you were not impacted by the replication lock – that is, your write load is relatively low) up to 2X.

Most importantly, this improves latency for secondary reads. For those who use readPreference "nearest" because they want to reduce latency from the application to the database, this feature means that latency in the database will also be as low as possible. We saw significant improvements in 95th and 99th percentile latency in these tests.

95th percentile read latency (ms)

Thread levels    8     16    32    64
Feature off      1     2     3     5
Feature on       0     1     1     0

The best part of this new feature? You don't need to do anything to enable it or opt in to it. All secondary reads in 4.0 will read from a snapshot without waiting for replication writes.

This is just one of a number of great new features coming in MongoDB 4.0. Take a look at our blog on the 4.0 release candidate to learn more. And don’t forget, you’ve still got time to register for MongoDB World where you can meet with the engineers who are building all of these great new features.

Introducing the Aggregation Pipeline Builder in MongoDB Compass

Building MongoDB aggregations has never been so easy.

The most efficient way to analyze your data is where it already lives. That’s why we have MongoDB’s built-in aggregation framework. Have you tried it yet? If so, you know that it’s one of the most powerful MongoDB tools at your disposal. If not, you’re missing out on the ability to query your data in incredibly powerful ways. In fact, we like to say that “aggregate is the new find”. Built on the concept of data processing pipelines (as in Unix or PowerShell), the aggregation framework lets users “funnel” their documents through a multi-stage pipeline that filters, transforms, sorts, computes, and aggregates data, and more. The aggregation framework enables you to perform extensive analytics and statistical analysis in real time and to generate pre-aggregated reports for dashboarding.

There are no limits to the number of stages an aggregation pipeline can have – pipelines can be as simple or as complex as you wish. In fact, the only limit is one’s imagination when it comes to deciding how to aggregate data. We’ve seen some very comprehensive pipelines!

With a rich library of over 25 stages and 100 operators (and growing with every release), the aggregation framework is an amazingly versatile tool. To help you be even more successful with it, we decided to build an aggregation construction user interface. The new Aggregation Pipeline Builder is now available with the latest release of Compass for beta testing. It’s available under the Aggregations tab.

The screenshot below depicts a sample pipeline on a movies collection that produces a listing of the title, year, and rating of all movies except crime or horror films, available in English and Japanese, and rated either PG or G, sorted by most recent year and alphabetically by title within each year. Each stage was added gradually, with the ability to preview the result of the aggregation at every step.
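For a sense of what the equivalent pipeline might look like in code, here is a rough PyMongo sketch; the connection string and the field names (genres, languages, rated, title, year) are assumptions based on the description, not taken from the screenshot.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
movies = client.sample.movies                      # hypothetical database and collection

pipeline = [
    # Exclude crime and horror, keep English and Japanese, rated PG or G.
    {"$match": {
        "genres": {"$nin": ["Crime", "Horror"]},
        "languages": {"$all": ["English", "Japanese"]},
        "rated": {"$in": ["PG", "G"]},
    }},
    # Keep only the fields we want to list.
    {"$project": {"_id": 0, "title": 1, "year": 1, "rated": 1}},
    # Most recent year first, then alphabetically by title within each year.
    {"$sort": {"year": -1, "title": 1}},
]

for movie in movies.aggregate(pipeline):
    print(movie)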

This easy-to-use UI lets you build your aggregation queries faster than ever before. There’s no need to worry about bracket matching, reordering stages, or remembering operator syntax with its intuitive drag-and-drop experience and code skeletons. You also get auto-completion for aggregation operators as well as query operators and even document field names.

If you need help understanding a particular operator, click on the info icon next to it and you’ll be taken directly to the appropriate guidance.

As you are building your pipeline, you can easily preview your results. This, in combination with an ability to rearrange and toggle stages on and off, makes it easy to troubleshoot your pipelines. When you are satisfied with the results, the constructed pipeline can be copied to the clipboard for easy pasting in your code, or simply saved in your favorites list for re-use later!

The aggregation authoring experience just got even more incredible with the new Compass aggregation pipeline builder. Why not check it out today?

  • Download the latest beta version of Compass
  • See the documentation for the aggregation pipeline builder in Compass
  • See the aggregation framework quick reference
  • To learn or brush up your aggregation framework skills, take M121 from our MongoDB University – it’s well worth it!

Also, please remember to send us your feedback by filing JIRA tickets or emailing it to: compass@mongodb.com.

MongoDB Enterprise Server for Pivotal Cloud Foundry goes GA

Jason Mimick
April 24, 2018
Technical

Last fall we launched MongoDB Enterprise Server for Pivotal Cloud Foundry (PCF) in beta. Earlier this year, we released v1.0.1, our first GA version of the integration. This week, we're happy to announce the next version, v1.0.3. In this post we'll outline all of the new post-beta features we've added to our PCF tile. For information on how the tile works and more background, please check out my colleague's excellent post.

New Features:

  • MongoDB 3.6 and MongoDB Ops Manager 3.6.2+ are now supported
  • Support for Organizations/Projects in MongoDB Ops Manager
  • Stemcell now follows all published MongoDB production best practices
  • Backups can now be enabled by default
  • T-Shirt sizing for PCF cluster VM specifications
  • TLS/SSL for MongoDB deployments using Bosh DNS

Speed: Developers and users care about getting features out quickly. They are reorganizing their teams to work on strategic initiatives like microservices. They need a database that supports traditional use cases as well as newer ones. These new features, together with PCF, help address these concerns. The MongoDB Enterprise Server for PCF tile allows customers to leverage two powerful solutions for database management simultaneously to enhance their DevOps processes and accelerate development.

Standardization to mitigate risk: The PCF PaaS delivers features such as machine provisioning and initialization, and MongoDB Ops Manager provides runtime database management. Together, these solutions let modern enterprises ensure that consistent configuration, security policies, backup, and monitoring are applied across the board to development, test, and production deployments.

Ease of use: This powerful combination additionally affords application developers the capability to deliver out-of-the-box cloud-ready solutions without needing to worry about complex infrastructure details. If you haven't already, download MongoDB Ops Manager and the latest MongoDB Enterprise Server for PCF Tile today!

MongoDB 3.6 and MongoDB Ops Manager 3.6.2+ are now supported

You must use MongoDB Ops Manager 3.6.2 or later for this version of the tile. The reason for this is the shift from a strict single level hierarchy of "Groups" in MongoDB Ops Manager to the more versatile "Organizations/Projects" structure. Refer to the Ops Manager documentation for more information.

Support for Organizations and Projects in MongoDB Ops Manager

The beta versions of the tile generated a cluster (group) name which wasn't very human-friendly. This version resolves this issue by allowing users to specify a cluster (project) name for their deployment at service provisioning time.

Stemcell now follows all published MongoDB production best practices

We have integrated all of the production best practices defined in the production notes. These best practices ensure that PCF tunes your MongoDB clusters with the optimal operating system settings for maximum performance and least risk.

Backups can now be enabled by default (v1.0.3)

Certain organizations enforce strict policies around backups due to regulatory restrictions. MongoDB Ops Manager offers the ability to backup any MongoDB deployment with point-in-time restores and queryable backups. This version of the tile supports the ability to enable backups by default for all MongoDB clusters. Alternatively, backups can be disabled by default and enabled for a cluster during service provisioning. Organizations no longer need to manually configure backups to ensure all deployments are always backed up for disaster recovery and governance scenarios.

Figure 1: MongoDB Enterprise Server Tile for PCF configuration in PCF Ops Manager

T-Shirt sizing for PCF cluster VM specifications (v1.0.3)

Beta versions of the tile only allowed for a single "VM type" (CPU/RAM/Disk) to be defined for all deployed MongoDB clusters. This version provides the ability to define three different "VM types". These types (Small, Medium, and Large) allow PCF operators to define categories of cluster types and then limit which types specific PCF users are allowed to deploy. The ability to support a wide variety of cluster types and sizes will enable an enterprise to support self-service scenarios and provide MongoDB as a service right out of the box.

TLS/SSL for MongoDB deployments using Bosh DNS

PCF Operators can now deploy MongoDB clusters with TLS/SSL enabled. The tile configuration in PCF Ops Manager now includes a new security tab in which one can enter the appropriate certificates and private PEM key files for database servers and Certificate Authorities. These will then be automatically distributed to each MongoDB server, deployed into known locations which can then easily be entered into the security settings for the corresponding MongoDB Ops Manager project for the deployment.

Conclusion

In our beta releases we focused on core functionality, and now with our GA and subsequent release, we have included some key usability features and other helpful additions. Please download and install the tile, wire it up to your MongoDB Ops Manager instance, and take it for a drive. To get started watch this demo video and refer to our documentation.

Also – check out a recent webinar we delivered with Pivotal for a hands-on look at using the MongoDB Enterprise Service Tile for PCF to refactor legacy monolith applications into microservices: How to Overcome Data Challenges When Refactoring Monoliths to Microservices.

MongoDB 3.6: Here to SRV you with easier replica set connections

If you have logged into MongoDB Atlas recently – and you should, the entry-level tier is free! – you may have noticed a strange new syntax on 3.6 connection strings.

MongoDB Seed Lists

What is this mongodb+srv syntax?

Well, in MongoDB 3.6 we introduced the concept of a seed list that is specified using DNS records, specifically SRV and TXT records. You will recall from using replica sets with MongoDB that the client must specify at least one replica set member (and may specify several of them) when connecting. This allows a client to connect to a replica set even if one of the nodes that the client specifies is unavailable.

You can see an example of this URL on a 3.4 cluster connection string:

Note that without the SRV record configuration we must list several nodes (in the case of Atlas we always include all the cluster members, though this is not required). We also have to specify the ssl and replicaSet options.

With the 3.4 or earlier driver, we have to specify all the options on the command line using the MongoDB URI syntax.

The use of SRV records eliminates the requirement for every client to pass in a complete set of state information for the cluster. Instead, a single SRV record identifies all the nodes associated with the cluster (and their port numbers) and an associated TXT record defines the options for the URI.

Reading SRV and TXT Records

We can see how this works in practice on a MongoDB Atlas cluster with a simple Python script.

import srvlookup  # pip install srvlookup
import sys
import dns.resolver  # pip install dnspython

host = None

if len(sys.argv) > 1:
    host = sys.argv[1]

if host:
    # The SRV record lists the cluster members and their ports.
    services = srvlookup.lookup("mongodb", domain=host)
    for i in services:
        print("%s:%i" % (i.hostname, i.port))
    # The TXT record carries the URI options for the cluster.
    for txtrecord in dns.resolver.query(host, 'TXT'):
        print("%s: %s" % (host, txtrecord))
else:
    print("No host specified")

We can run this script using the node specified in the 3.6 connection string as a parameter.

$ python mongodb_srv_records.py freeclusterjd-ffp4c.mongodb.net
freeclusterjd-shard-00-00-ffp4c.mongodb.net:27017
freeclusterjd-shard-00-01-ffp4c.mongodb.net:27017
freeclusterjd-shard-00-02-ffp4c.mongodb.net:27017
freeclusterjd-ffp4c.mongodb.net: "authSource=admin&replicaSet=FreeClusterJD-shard-0"
$

You can also do this lookup with nslookup:

JD10Gen-old:~ jdrumgoole$ nslookup
> set type=SRV
> _mongodb._tcp.rs.joedrumgoole.com
Server:        10.65.141.1
Address:    10.65.141.1#53

Non-authoritative answer:
_mongodb._tcp.rs.joedrumgoole.com    service = 0 0 27022 rs1.joedrumgoole.com.
_mongodb._tcp.rs.joedrumgoole.com    service = 0 0 27022 rs2.joedrumgoole.com.
_mongodb._tcp.rs.joedrumgoole.com    service = 0 0 27022 rs3.joedrumgoole.com.

Authoritative answers can be found from:
> set type=TXT
> rs.joedrumgoole.com
Server:        10.65.141.1
Address:    10.65.141.1#53

Non-authoritative answer:
rs.joedrumgoole.com    text = "authSource=admin&replicaSet=srvdemo"

You can see how this could be used to construct a 3.4 style connection string by comparing it with the 3.4 connection string above.
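As a small extension of the Python script above, stitching the SRV and TXT lookups into a 3.4-style URI might look like the following sketch; the helper name is hypothetical, and ssl=true is added because the mongodb+srv format implies it.

import dns.resolver  # pip install dnspython
import srvlookup     # pip install srvlookup

def build_legacy_uri(host, database="test"):
    """Hypothetical helper: assemble a 3.4-style URI from the SRV and TXT records."""
    # The SRV record lists the replica set members and their ports.
    members = ",".join("%s:%i" % (rec.hostname, rec.port)
                       for rec in srvlookup.lookup("mongodb", domain=host))
    # The TXT record carries the URI options (authSource, replicaSet, ...).
    options = [str(r).strip('"') for r in dns.resolver.query(host, "TXT")][0]
    # mongodb+srv implicitly adds ssl=true, so include it here as well.
    return "mongodb://%s/%s?%s&ssl=true" % (members, database, options)

print(build_legacy_uri("freeclusterjd-ffp4c.mongodb.net"))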

As you can see, the complexity of the cluster and its configuration parameters are stored in the DNS server and hidden from the end user. If a node's IP address or name changes or we want to change the replica set name, this can all now be done completely transparently from the client’s perspective. We can also add and remove nodes from a cluster without impacting clients.

So now whenever you see mongodb+srv, you know to expect an SRV and a TXT record to deliver the client connection string.

Creating SRV and TXT records

Of course, SRV and TXT records are not just for Atlas. You can also create your own SRV and TXT records for your self-hosted MongoDB clusters. All you need for this is edit access to your DNS server so you can add SRV and TXT records. In the examples that follow we are using the AWS Route 53 DNS service.

I have set up a demo replica set on AWS with a three-node configuration. The nodes are:

rs1.joedrumgoole.com
rs2.joedrumgoole.com
rs3.joedrumgoole.com

Each has a mongod process running on port 27022. I have set up a security group that allows access to my local laptop and the nodes themselves so they can see each other.

I also set up the DNS names for the above nodes in AWS Route 53.

We can start the mongod processes by running the following command on each node.

$ sudo /usr/local/m/versions/3.6.3/bin/mongod --auth --port 27022 --replSet srvdemo --bind_ip 0.0.0.0 --keyFile mdb_keyfile

Now we need to set up the SRV and TXT records for this cluster.

The SRV record points to the server or servers that will comprise the members of the replica set. The TXT record defines the options for the replica set, specifically the database that will be used for authorization and the name of the replica set. It is important to note that the mongodb+srv format URI implicitly adds "ssl=true". In our case, SSL is not used for the demo, so we have to append "&ssl=false" to the client connection string. Note that the SRV record is specifically designed to look up the mongodb service referenced at the start of the URL.

The settings in AWS Route 53 are:

Which leads to the following entry in the zone file for Route 53.

Now we can add the TXT record. By convention, we use the same name as the SRV record (rs.joedrumgoole.com) so that MongoDB knows where to find the TXT record.

We can do this on AWS Route 53 as follows:

This will create the following TXT record.

Now we can access this service as:

mongodb+srv://rs.joedrumgoole.com/test

This will retrieve a complete URL and connection string which can then be used to contact the service.
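A quick PyMongo check of the new URI might look like this sketch; the user name, password, and database are placeholders, and "ssl=false" is appended because the demo cluster does not use SSL.

from pymongo import MongoClient  # requires PyMongo 3.6+ and dnspython for mongodb+srv

# Placeholder credentials for the demo replica set.
client = MongoClient("mongodb+srv://username:password@rs.joedrumgoole.com/test?ssl=false")

print(client.admin.command("ismaster"))  # confirm we can reach the replica set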

The whole process is outlined below:

Once your records are set up, you can easily change port numbers without impacting clients and also add and remove cluster members.

SRV records are another way in which MongoDB is making life easier for database developers everywhere.

You should also check out full documentation on SRV and TXT records in MongoDB 3.6.

---

You can sign up for a free MongoDB Atlas tier which is suitable for single user use.

Find out how to use your favorite programming language with MongoDB via our MongoDB drivers.

Please visit MongoDB University for free online training in all aspects of MongoDB.

Follow Joe Drumgoole on Twitter for more news about MongoDB.


Meet the team that builds MongoDB in-person at MongoDB World.

Modern Distributed Application Deployment with Kubernetes and MongoDB Atlas

Jay Gordon
April 05, 2018
Technical, Cloud

Storytelling is one of the parts of being a Developer Advocate that I enjoy. Sometimes the stories are about the special moments when the team comes together to keep a system running or build it faster. But there are less-than-glorious tales to be told about the software deployments I’ve been involved in. And in situations where we needed to deploy several times a day, we are talking nightmares.

For some time, I worked at a company that believed that deploying to production several times a day was ideal for project velocity. Our team was working to ensure that advertising software across our media platform was always being updated and released. One of the issues was a lack of real automation in the process of applying new code to our application servers.

What both ops and development teams had in common was a desire for improved ease and agility around application and configuration deployments. In this article, I’ll present some of my experiences and cover how MongoDB Atlas and Kubernetes can be leveraged together to simplify the process of deploying and managing applications and their underlying dependencies.

Let's talk about how a typical software deployment unfolded:

  1. The developer would send in a ticket asking for the deployment
  2. The developer and I would agree upon a time to deploy the latest software revision
  3. We would modify an existing bash script with the appropriate git repository version info
  4. We’d need to manually back up the old deployment
  5. We’d need to manually create a backup of our current database
  6. We’d watch the bash script perform this "Deploy" on about six servers in parallel
  7. Wave a dead chicken over my keyboard

Some of these deployments would fail, requiring a return to the previous version of the application code. This process of "rolling back" to a prior version involved manually copying the repository to the older version, performing manual database restores, and finally confirming with the team that used this system that all was working properly. It was a real mess, and I really wasn't in a position to change it.

I eventually moved into a position which gave me greater visibility into what other teams of developers, specifically those in the open source space, were doing for software deployments. I noticed that — surprise! — people were no longer interested in doing the same work over and over again.

Developers and their supporting ops teams have been given keys to a whole new world in the last few years by utilizing containers and automation platforms. Rather than doing manual work required to produce the environment that your app will live in, you can deploy applications quickly thanks to tools like Kubernetes.

What's Kubernetes?

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes can help reduce the amount of work your team will have to do when deploying your application. Along with MongoDB Atlas, you can build scalable and resilient applications that stand up to high traffic or can easily be scaled down to reduce costs. Kubernetes runs just about anywhere and can use almost any infrastructure. If you're using a public cloud, a hybrid cloud or even a bare metal solution, you can leverage Kubernetes to quickly deploy and scale your applications.

The Google Kubernetes Engine is built into the Google Cloud Platform and helps you quickly deploy your containerized applications.

For the purposes of this tutorial, I will upload our image to GCP and then deploy to a Kubernetes cluster so I can quickly scale up or down our application as needed. When I create new versions of our app or make incremental changes, I can simply create a new image and deploy again with Kubernetes.

Why Atlas with Kubernetes?

By using these tools together for your MongoDB Application, you can quickly produce and deploy applications without worrying much about infrastructure management. Atlas provides you with a persistent data-store for your application data without the need to manage the actual database software, replication, upgrades, or monitoring. All of these features are delivered out of the box, allowing you to build and then deploy quickly.

In this tutorial, I will build a MongoDB Atlas cluster where our data will live for a simple Node.js application. I will then turn the app and configuration data for Atlas into a container-ready image with Docker.

MongoDB Atlas is available across most regions on GCP so no matter where your application lives, you can keep your data close by (or distributed) across the cloud.

Figure 1: MongoDB Atlas runs in most GCP regions

Requirements

To follow along with this tutorial, you will need a few prerequisites to get started:

First, I will download the repository for the code I will use. In this case, it's a basic record keeping app using MongoDB, Express, React, and Node (MERN).

bash-3.2$ git clone git@github.com:cefjoeii/mern-crud.git
Cloning into 'mern-crud'...
remote: Counting objects: 326, done.
remote: Total 326 (delta 0), reused 0 (delta 0), pack-reused 326
Receiving objects: 100% (326/326), 3.26 MiB | 2.40 MiB/s, done.
Resolving deltas: 100% (137/137), done.

cd mern-crud

Next, I will run npm install to get all the required npm packages installed for working with our app:

> uws@9.14.0 install /Users/jaygordon/work/mern-crud/node_modules/uws
> node-gyp rebuild > build_log.txt 2>&1 || exit 0

Selecting your GCP Region for Atlas

Each GCP region includes a set number of independent zones. Each zone has power, cooling, networking, and control planes that are isolated from other zones. For regions that have at least three zones (3Z), Atlas deploys clusters across three zones. For regions that only have two zones (2Z), Atlas deploys clusters across two zones.

The Atlas Add New Cluster form marks regions that support 3Z clusters as Recommended, as they provide higher availability. If your preferred region only has two zones, consider enabling cross-region replication and placing a replica set member in another region to increase the likelihood that your cluster will be available during partial region outages.

The number of zones in a region has no effect on the number of MongoDB nodes Atlas can deploy. MongoDB Atlas clusters are always made of replica sets with a minimum of three MongoDB nodes.

For general information on GCP regions and zones, see the Google documentation on regions and zones.

Create Cluster and Add a User

In the provided image below you can see I have selected the Cloud Provider "Google Cloud Platform." Next, I selected an instance size, in this case an M10. Deployments using M10 instances are ideal for development. If I were to take this application to production immediately, I may want to consider using an M30 deployment. Since this is a demo, an M10 is sufficient for our application. For a full view of all of the cluster sizes, check out the Atlas pricing page. Once I’ve completed these steps, I can click the "Confirm & Deploy" button. Atlas will spin up my deployment automatically in a few minutes.

Let’s create a username and password for our database that our Kubernetes deployed application will use to access MongoDB.

  • Click "Security" at the top of the page.
  • Click "MongoDB Users"
  • Click "Add New User"
  • Click "Show Advanced Options"
  • We'll then add a user "mernuser" for our mern-crud app that only has access to a database named "mern-crud" and give it a complex password. We'll specify readWrite privileges for this user:

Click "Add User"

Your database is now created and your user is added. You still need the connection string and to whitelist access via the network.

Connection String

Get your connection string by clicking "Clusters" and then clicking "CONNECT" next to your cluster details in your Atlas admin panel. After selecting connect, you are provided with several options for connecting to your cluster. Click "Connect your application."

Options for the 3.6 or the 3.4 versions of the MongoDB driver are given. I built mine using the 3.4 driver, so I will just select the connection string for this version.

I will typically paste this into an editor and then modify the info to match my application credentials and my database name:

I will now add this to the app's database configuration file and save it.

Next, I will package this up into an image with Docker and ship it to Google Kubernetes Engine!

Docker and Google Kubernetes Engine

Get started by creating an account at Google Cloud, then follow the quickstart to create a Google Kubernetes Project.

Once your project is created, you can find it within the Google Cloud Platform control panel:

It's time to create a container on your local workstation:

Set the PROJECT_ID environment variable in your shell by retrieving the pre-configured project ID on gcloud with the commands below:

export PROJECT_ID="jaygordon-mongodb"
gcloud config set project $PROJECT_ID
gcloud config set compute/zone us-central1-b

Next, place a Dockerfile in the root of your repository with the following:

FROM node:boron

RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app

COPY . /usr/src/app

EXPOSE 3000

CMD ["npm", "start"]

To build the container image of this application and tag it for uploading, run the following command:

bash-3.2$ docker build -t gcr.io/${PROJECT_ID}/mern-crud:v1 .
Sending build context to Docker daemon  40.66MB
Successfully built b8c5be5def8f
Successfully tagged gcr.io/jgordon-gc/mern-crud:v1

Upload the container image to the Container Registry so we can deploy it:

Successfully tagged gcr.io/jaygordon-mongodb/mern-crud:v1
bash-3.2$ gcloud docker -- push gcr.io/${PROJECT_ID}/mern-crud:v1
The push refers to repository [gcr.io/jaygordon-mongodb/mern-crud]

Next, I will test it locally on my workstation to make sure the app loads:

docker run --rm -p 3000:3000 gcr.io/${PROJECT_ID}/mern-crud:v1
> mern-crud@0.1.0 start /usr/src/app
> node server
Listening on port 3000

Great — pointing my browser to http://localhost:3000 brings me to the site. Now it's time to create a Kubernetes cluster and deploy our application to it.

Build Your Cluster With Google Kubernetes Engine

I will be using the Google Cloud Shell within the Google Cloud control panel to manage my deployment. The Cloud Shell comes with all the required applications and tools installed, allowing me to deploy the Docker image I uploaded to the image registry without installing any additional software on my local workstation.

Now I will create the Kubernetes cluster where the image will be deployed, which will help bring our application to production. I will include three nodes to ensure uptime of our app.

Set up our environment first:

export PROJECT_ID="jaygordon-mongodb"
gcloud config set project $PROJECT_ID
gcloud config set compute/zone us-central1-b

Launch the cluster

gcloud container clusters create mern-crud --num-nodes=3

When completed, you will have a three-node Kubernetes cluster visible in your control panel. After a few minutes, the console will respond with the following output:

Creating cluster mern-crud...done.
Created [https://container.googleapis.com/v1/projects/jaygordon-mongodb/zones/us-central1-b/clusters/mern-crud].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1-b/mern-crud?project=jaygordon-mongodb
kubeconfig entry generated for mern-crud.
NAME       LOCATION       MASTER_VERSION  MASTER_IP       MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
mern-crud  us-central1-b  1.8.7-gke.1     35.225.138.208  n1-standard-1  1.8.7-gke.1   3          RUNNING

Just a few more steps left. Now we'll deploy our app with kubectl to our cluster from the Google Cloud Shell:

kubectl run mern-crud --image=gcr.io/${PROJECT_ID}/mern-crud:v1 --port 3000

The output when completed should be:

jay_gordon@jaygordon-mongodb:~$ kubectl run mern-crud --image=gcr.io/${PROJECT_ID}/mern-crud:v1 --port 3000
deployment "mern-crud" created

Now review the application deployment status:

jay_gordon@jaygordon-mongodb:~$ kubectl get pods
NAME                         READY     STATUS    RESTARTS   AGE
mern-crud-6b96b59dfd-4kqrr   1/1       Running   0          1m
jay_gordon@jaygordon-mongodb:~$

We'll create a load balancer for the three nodes in the cluster so they can be served properly to the web for our application:

jay_gordon@jaygordon-mongodb:~$ kubectl expose deployment mern-crud --type=LoadBalancer --port 80 --target-port 3000 
service "mern-crud" exposed

Now get the IP of the load balancer so that, if needed, it can be bound to a DNS name and you can go live!

jay_gordon@jaygordon-mongodb:~$ kubectl get service
NAME         TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)        AGE
kubernetes   ClusterIP      10.27.240.1     <none>         443/TCP        11m
mern-crud    LoadBalancer   10.27.243.208   35.226.15.67   80:30684/TCP   2m

A quick curl test shows me that my app is online!

bash-3.2$ curl -v 35.226.15.67
* Rebuilt URL to: 35.226.15.67/
*   Trying 35.226.15.67...
* TCP_NODELAY set
* Connected to 35.226.15.67 (35.226.15.67) port 80 (#0)
> GET / HTTP/1.1
> Host: 35.226.15.67
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< X-Powered-By: Express

I have added some test data and, as we can see, it shows up in my application deployed to GCP via Kubernetes, with the persistent data stored in MongoDB Atlas.

When I am done working with the Kubernetes cluster, I can destroy it easily:

gcloud container clusters delete mern-crud

What's Next?

You've now got all the tools in front of you to build something HUGE with MongoDB Atlas and Kubernetes.

Check out the rest of the Google Kubernetes Engine's tutorials for more information on how to build applications with Kubernetes. For more information on MongoDB Atlas, click here.

Have more questions? Join the MongoDB Community Slack!

Continue to learn via high-quality technical talks, workshops, and hands-on tutorials. Join us at MongoDB World.

MongoDB Drops ACID

MongoDB 4.0 will add support for multi-document transactions, making it the only database to combine the speed, flexibility, and power of the document model with ACID data integrity guarantees. Through snapshot isolation, transactions provide a globally consistent view of data, and enforce all-or-nothing execution to maintain data integrity.

Transactions in MongoDB will feel just like transactions developers are familiar with from relational databases. They will be multi-statement, with similar syntax (e.g. start_transaction and commit_transaction), making them familiar to anyone with prior transaction experience. The changes to MongoDB that enable multi-document transactions will not impact performance for workloads that do not require them. In MongoDB 4.0, which will be released this summer*, transactions will work across a single replica set, and MongoDB 4.2* will support transactions across a sharded deployment.
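To give a flavor of what that will look like from a driver, here is a minimal Python sketch based on the start_transaction / commit_transaction syntax described above. The connection string, namespace, and exact driver surface are assumptions until MongoDB 4.0 and compatible drivers ship.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # placeholder URI
db = client.bank  # hypothetical database

# The session carries the transaction: both updates commit together or not at all.
with client.start_session() as session:
    session.start_transaction()
    try:
        db.accounts.update_one({"_id": "alice"}, {"$inc": {"balance": -100}}, session=session)
        db.accounts.update_one({"_id": "bob"}, {"$inc": {"balance": 100}}, session=session)
        session.commit_transaction()
    except Exception:
        session.abort_transaction()
        raise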

Because documents can bring together related data that would otherwise be modeled across separate parent-child tables in a relational schema, MongoDB’s atomic single-document operations already provide transaction semantics that meet the data integrity needs of the majority of applications. But multi-document transactions will make it easier than ever for developers to address a complete range of use-cases, while for many, simply knowing that they are available will provide critical peace of mind. With MongoDB 4.0, you’ll be able to rely on transactional integrity, regardless of how you model your data.

The imminent arrival of transactions is the culmination of a multi-year engineering effort, beginning over 3 years ago with the integration of the WiredTiger storage engine. We’ve laid the groundwork in almost every part of the server – from the storage layer itself, to the replication consensus protocol, to the sharding architecture. We’ve built out fine-grained consistency and durability guarantees, introduced a global logical clock, refactored cluster metadata management, and more. We’ve also exposed all of these enhancements through APIs that are fully consumable by our drivers. We’re now about 85% of the way through the backlog of features that enable transactions, as this diagram summarizes:

You can read more about our drive to multi-document transactions here. And if you can’t wait to take transactions for a spin, we’d love to have you join our beta program; all the details are at http://mongodb.com/transactions/.


About the Author

Eliot Horowitz is the CTO and Co-Founder of MongoDB. He wrote the core code base for MongoDB starting in 2007, and subsequently built the engineering and product teams. Today, Eliot oversees those teams, and continues to drive technology innovations at MongoDB. Prior to MongoDB, Eliot co-founded and built ShopWiki, a groundbreaking online retail search engine. He built its technology, its team, and presided over its private sale in 2010. Before that, Eliot was a software developer in the R&D group at DoubleClick.

Eliot is on the board of the NY Tech Talent Pipeline. In 2006, he was selected as one of BusinessWeek’s Top 25 Entrepreneurs Under Age 25, and in 2015 was named to the Business Insider “Under 35 and Crushing it” list. He was also recently named to Crain’s NY Business 40 Under 40 Class of 2017 list. Eliot received a B.S. in Computer Science from Brown University.


* Safe Harbour Statement

This post contains “forward-looking statements” within the meaning of Section 27A of the Securities Act of 1933, as amended, and Section 21E of the Securities Exchange Act of 1934, as amended. Such forward-looking statements are subject to a number of risks, uncertainties, assumptions and other factors that could cause actual results and the timing of certain events to differ materially from future results expressed or implied by the forward-looking statements. Factors that could cause or contribute to such differences include, but are not limited to, those identified in our filings with the Securities and Exchange Commission. You should not rely upon forward-looking statements as predictions of future events. Furthermore, such forward-looking statements speak only as of the date of this presentation.

In particular, the development, release, and timing of any features or functionality described for MongoDB products remains at MongoDB’s sole discretion. This information is merely intended to outline our general product direction and it should not be relied on in making a purchasing decision nor is this a commitment, promise or legal obligation to deliver any material, code, or functionality. Except as required by law, we undertake no obligation to update any forward-looking statements to reflect events or circumstances after the date of such statements.

MongoDB’s Drive to Multi-Document Transactions

Transactions are important. Any database needs to offer transactional guarantees to enforce data integrity. But they don’t all do it in the same way – different database technologies take different approaches:

  • Relational databases model an entity’s data across multiple rows and parent-child tables, and so transactions need to span those rows and tables.
  • With subdocuments and arrays, document databases allow related data to be unified hierarchically inside a single data structure. The document can be updated with an atomic operation, giving it the same data integrity guarantees as a multi-table transaction in a relational database.

Because of this fundamental difference in data modeling, MongoDB’s existing atomicity guarantees are able to meet the data integrity needs of most applications. In fact, we estimate 80%-90% of applications don’t need multi-document transactions at all. However, there are some legitimate use cases and workloads where transactions across multiple documents are needed. In those cases, without transactions, a developer would have to implement complex logic on their own in the application layer. Also, some developers and DBAs have been conditioned by 40 years of relational data modeling to assume multi-table/document transactions are a requirement for any database, irrespective of the data model they are built upon. Others are concerned that while multi-document transactions aren’t needed by their apps today, they might be in the future and they don’t want to outgrow their database.

And so, the addition of multi-document ACID transactions makes it easier than ever for developers to address a complete range of use-cases on MongoDB.

As one can imagine, multi-document transactions are a much more complex thing to build in a distributed database than in a monolithic, scale-up database. In fact, we have been working on bringing multi-document transactions to MongoDB as part of a massive multi-year engineering investment. We have made enhancements to practically every part of the system – the storage layer itself, our replication consensus protocol, sharding architecture, consistency and durability guarantees, the introduction of a global logical clock, and refactored cluster metadata management and more. And we’ve exposed all of these enhancements through APIs that are fully consumable by our drivers.

The figure below represents the evolution of these enhancements as well as the work in progress to enable multi-document transactions. As you can see, we are nearly done.

In MongoDB 4.0, coming in summer 2018*, multi-document transactions will work across a replica set. We will extend support for transactions across a sharded deployment in the following release.

Importantly, the green boxes highlight all of the critical dependencies to transactions that have already been delivered over the past 3 years. And, frankly, that was the hardest part of the project – how to balance building the stepping stones we needed to get to transactions with delivering useful features to our users straightaway to improve their development experience along this journey. Wherever we could, we built components that suited both goals. For example, the introduction of the global logical clock and timestamps in the storage layer enforces consistent time across every operation in a distributed cluster. These enhancements are needed for transactions in order to provide snapshot isolation, but they also allowed us to implement change stream resumability and causal consistency in MongoDB 3.6, which are immediately valuable on their own. Change streams enable developers to build reactive applications that can view, filter, and act on data changes as they occur in the database in real-time, and recover from transient failures. Causal consistency allows developers to maintain the benefits of strong data consistency with “read your own write” guarantees, while taking advantage of scalability and availability of our intelligent distributed data platform.
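As a concrete illustration of how immediately useful that groundwork is, consuming a change stream from Python takes only a few lines. This is a minimal sketch; the connection string and namespace are placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # placeholder URI
inventory = client.shop.inventory  # hypothetical collection

# watch() (MongoDB 3.6+) streams change events; the resume token carried in
# each event lets the stream recover from transient failures.
with inventory.watch() as stream:
    for change in stream:
        print(change["operationType"], change.get("documentKey"))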

The global logical clock is just one example. A selection of other key enhancements along the way illustrates how our engineering team deliberately laid the groundwork for transactions in such a way that we consistently surfaced additional benefits to our users:

  • The acquisition of WiredTiger Inc. and integration of its storage engine way back in MongoDB 3.0 brought massive scalability gains with document level concurrency control and compression to MongoDB. And with MVCC support, it also provided the storage layer foundations for transactions coming in MongoDB 4.0.
  • In MongoDB 3.2, the enhanced consensus protocol allowed for faster and more deterministic recovery from the failure or network partition of the primary replica set member, along with stricter durability guarantees for writes. These enhancements were immediately useful to MongoDB users then, and they are also essential capabilities for transactions.
  • The introduction of readConcern in 3.2 allowed applications to specify read isolation level on a per operation basis, providing powerful and granular consistency controls.
  • Logical sessions in MongoDB 3.6 gave our users causal consistency and retryable writes, but as a foundation for transactions, they provide MongoDB the ability to coordinate client and server operations across the nodes of a distributed cluster, managing the execution context for each statement in a transaction.
  • Similarly, retryable writes, implemented in MongoDB 3.6, simplify the development of applications in the face of elections (or other transient failures) while the server enforces at most once processing semantics.
  • Replica set point in time reads in 4.0 are essential for transactional consistency, but it’s also highly valuable to regular read operations that don’t need to be executed in a transaction. With this feature, reads will only show a view of the data that is consistent at the point the find() operation starts, irrespective of which replica serves the read, or what data has been modified by concurrent operations.

The number of remaining pieces on the roadmap to transactions is small. Once complete, multi-document distributed transactions will provide a globally consistent view of data (both in replica set and sharded deployments) through snapshot isolation and maintain all-or-nothing guarantees in cases of node failures. This will greatly simplify your application code. After all, MongoDB’s job is to take hard problems and solve them for as many developers as possible, so that you can focus on adding value to your applications and not dealing with the underlying plumbing.

We’re really excited about the release of multi-document transactions, and what they will allow you to build with MongoDB going forward. You should view our multi-document transactions page to learn more, and we invite you to sign up for the beta program so that you can start to put all of the work we’ve done through its paces.


* Safe Harbour Statement

This post contains “forward-looking statements” within the meaning of Section 27A of the Securities Act of 1933, as amended, and Section 21E of the Securities Exchange Act of 1934, as amended. Such forward-looking statements are subject to a number of risks, uncertainties, assumptions and other factors that could cause actual results and the timing of certain events to differ materially from future results expressed or implied by the forward-looking statements. Factors that could cause or contribute to such differences include, but are not limited to, those identified in our filings with the Securities and Exchange Commission. You should not rely upon forward-looking statements as predictions of future events. Furthermore, such forward-looking statements speak only as of the date of this presentation.

In particular, the development, release, and timing of any features or functionality described for MongoDB products remains at MongoDB’s sole discretion. This information is merely intended to outline our general product direction and it should not be relied on in making a purchasing decision nor is this a commitment, promise or legal obligation to deliver any material, code, or functionality. Except as required by law, we undertake no obligation to update any forward-looking statements to reflect events or circumstances after the date of such statements.

Improving MongoDB Performance with Automatically Generated Index Suggestions

Jay Gordon
February 01, 2018
Technical, Cloud

Beyond good data modeling, there are a few processes that teams responsible for optimizing query performance can leverage: looking for COLLSCAN entries in logs, analyzing explain results, or relying on third-party tools. While these efforts may help you resolve some of the problems you’re noticing, they often require time, money, and digging through documentation, all while your application remains bogged down with issues.

MongoDB Atlas, the fully managed database service, helps you resolve performance issues with a greater level of ease by providing you with tools to ensure that your data is accessed as efficiently as possible. This post will provide a basic overview of how to access the MongoDB Atlas Performance Advisor, a tool that reviews your queries for up to two weeks and provides recommended indexes where appropriate.

Getting Started

This short tutorial makes use of the following:

  • A demo data set generated with mgodatagen
  • A dedicated MongoDB Atlas cluster (the Performance Advisor is available for M10s or larger)
  • MongoDB shell install (to create indexes)

My database has two million documents in two separate collections:

If an application tries to access these documents without the right indexes in place, a collection scan will take place. The database will scan the full collection to find the required documents, and any documents that are not in memory are read from disk. This can dramatically reduce performance and cause your application to respond slower than expected.

Case in point, when I try to run an unindexed query against my collections, MongoDB Atlas will automatically create an alert indicating that the query is not well targeted.
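If you want to confirm this from a driver, you can inspect the query plan yourself: an unindexed query's winning plan contains a COLLSCAN stage. Here is a minimal PyMongo sketch with a placeholder connection string and hypothetical namespace.

from pymongo import MongoClient

client = MongoClient("mongodb+srv://<your-atlas-uri>")  # placeholder Atlas URI
collection = client.mgodatagen.test                     # hypothetical database and collection

plan = collection.find({"name": "some value"}).explain()
print(plan["queryPlanner"]["winningPlan"])  # an unindexed query shows a "COLLSCAN" stage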

Reviewing Performance Advisor

The Performance Advisor monitors slow-running queries (anything that takes longer than 100 milliseconds to execute) and suggests new indexes to improve query performance.

To access this tool, go to your Atlas control panel and click your cluster's name. You’ll then find "Performance Advisor" at the top.

Click the link and you'll be taken to the page where you'll see any relevant index recommendations, based on the fixed time period at the top of the page.

In this example, I will review the performance of my queries from the last 24 hours. The Performance Advisor provides me with some recommendations on how to improve the speed of my slow queries:

It looks like the test collection with the field "name" could use an index. We can review the specific changes to be made by clicking the "More Info" button.

I can copy the contents of this recommendation and paste it into my MongoDB shell to create the recommended index. You’ll notice a special option, { background: true }, is passed with the createIndex command. Using this option ensures that index creation does not block any operations. If you’re building new indexes on production systems, I highly recommend you read more about index build operations.
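If you would rather create the suggested index from a driver than from the shell, the PyMongo equivalent looks roughly like the sketch below; the connection string is a placeholder and the database and collection names follow the example above.

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb+srv://<your-atlas-uri>")  # placeholder Atlas URI
collection = client.mgodatagen.test                     # hypothetical database and collection

# background=True mirrors { background: true } in the shell, so the index
# build does not block other operations on the collection.
collection.create_index([("name", ASCENDING)], background=True)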

Now that the recommended index is created, I can review my application's performance and see if it meets the requirements of my users. The Atlas alerts I received earlier have been resolved, which is a good sign:

Noticeable slowdowns in performance from unindexed queries damage the user experience of your application, which may result in reduced engagement or customer attrition. The Performance Advisor in MongoDB Atlas gives you a simple and cost-efficient way to ensure that you’re getting the most out of the resources you’ve provisioned.

To get started, sign up for MongoDB Atlas and deploy a cluster in minutes.

Training Machine Learning Models with MongoDB

Nicholas Png
January 18, 2018
Technical
This is a guest post by Data Scientist Nicholas Png.

Over the last four months, I attended an immersive data science program at Galvanize in San Francisco. As a graduation requirement, the last three weeks of the program are reserved for a student-selected project that puts to use the skills learned throughout the course. The project that I chose to tackle utilized natural language processing in tandem with sentiment analysis to parse and classify news articles. With the controversy surrounding our nation’s media and the concept of “fake news” floated around every corner, I decided to take a pragmatic approach to address bias in the media.

My resulting model identified three topics within an article and classified the sentiments towards each topic. Next, for each classified topic, the model returned a new article with the opposite sentiment, resulting in three articles provided to the user for each input article. With this model, I hoped to negate some of the inherent bias within an individual news article by providing counter arguments from other sources. The algorithms used were the following (in training order): TFIDF Vectorizer (text preprocessing), Latent Dirichlet Allocation (topic extraction), Scipy’s Implementation of Hierarchical Clustering (document similarity), and Multinomial Naive Bayes (sentiment classifier).

Initially, I was hesitant to use any database, let alone a non-relational one. However, as I progressed through the experiment, managing the plethora of CSV tables became more and more difficult. I needed the flexibility to add additional features to my data as the model engineered them. This is a major drawback of relational databases. Using SQL, there are two options: generate a new table for each new feature and use a multitude of JOINs to retrieve all the necessary data, or use ALTER TABLE to add a new column for each new feature. However, due to the varied algorithms I used, some features were generated one data point at a time, while others were returned as a single Python list. Neither option was well suited to my needs. As a result, I turned to MongoDB to resolve my data storage, processing, and analysis issues.

To begin with, I used MongoDB to store the training data scraped from the web. I stored raw text data as individual documents on an AWS EC2 instance running a MongoDB database. Running a simple Python script on my EC2 instance, I generated a list of public news articles URLs to scrape and stored the scraped data (such as the article title and body) into my MongoDB database. I appreciated that, with MongoDB, I could employ indexes to ensure that duplicate URLs, and their associated text data, were not added to the database.
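A minimal sketch of that ingestion step might look like the following; the connection string, database, collection, and field names are assumptions for illustration.

from pymongo import MongoClient, errors

client = MongoClient("mongodb://<ec2-host>:27017")  # placeholder EC2 instance URI
articles = client.news.articles                     # hypothetical database and collection

# A unique index on the URL rejects duplicate articles at insert time.
articles.create_index("url", unique=True)

def store_article(url, title, body):
    try:
        articles.insert_one({"url": url, "title": title, "body": body})
    except errors.DuplicateKeyError:
        pass  # this URL has already been scraped; skip it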

Next, the entire dataset needed to be parsed using NLP and passed in as training data for the TFIDF Vectorizer (in the scikit-learn toolkit) and the Latent Dirichlet Allocation (LDA) model. Since both TFIDF and LDA require training on the entire dataset (represented by a matrix of ~70k rows x ~250k columns), I needed to store a lot of information in memory. LDA requires training on non-reduced data in order to identify correlations between all features in their original space. Scikit Learn’s implementations of TFIDF and LDA are trained iteratively, from the first data point to the last. I was able to reduce the total load on memory and allocate more to actual training, by passing a Python generator function to the model that called my MongoDB database for each new data point. This also enabled me to use a smaller EC2 instance, thereby optimizing costs.
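The generator itself can be as simple as a cursor wrapped in a function, something along these lines; the field name is an assumption.

def article_text_stream(collection, batch_size=100):
    """Yield one article body at a time so the full corpus never sits in memory."""
    cursor = collection.find({}, {"body": 1}, no_cursor_timeout=True).batch_size(batch_size)
    try:
        for doc in cursor:
            yield doc["body"]
    finally:
        cursor.close()

# e.g. TfidfVectorizer().fit(article_text_stream(articles))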

Once the vectorizer and LDA model were trained, I utilized the LDA model to extract 3 topics from each document, storing the top 50 words pertaining to each topic back in MongoDB. These top 50 words were used as the features to train my hierarchical clustering algorithm. The clustering algorithm functions much like a decision tree, and I generated pseudo-labels for each document by determining which leaf the document fell into. Since I could use dimensionally reduced data at this point, memory was not an issue, but all these labels needed to be referenced later in other parts of the pipeline. Rather than assigning several variables and allowing the labels to remain indefinitely in memory, I inserted new key-value pairs into each corresponding document in the collection: the top words associated with each topic, the topic labels according to the clustering algorithm, and the sentiment labels. As each article was analyzed, the resulting labels and topic information were stored in the article’s document in MongoDB. As a result, there was no chance of data loss, and any method could query the database for needed information regardless of whether other processes running in parallel were complete.
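Writing those derived fields back is a single update per document, along the lines of this sketch; the field names are assumptions.

def annotate_article(articles, doc_id, topics, cluster_label, sentiment):
    """Attach derived features to an article document without touching its other fields."""
    articles.update_one(
        {"_id": doc_id},
        {"$set": {
            "topics": topics,                # top 50 words for each extracted topic
            "cluster_label": cluster_label,  # pseudo-label from hierarchical clustering
            "sentiment": sentiment,          # output of the Naive Bayes sentiment classifier
        }},
    )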

Sentiment analysis was the most difficult part of the project. There is currently no valuable labeled data related to politics and news so I initially tried to train the base models on a data set of Amazon product reviews. Unsurprisingly, this proved to be a poor choice of training data because the resulting models consistently graded sentences such as “The governor's speech reeked of subtle racism and blatant lack of political savvy” as having positive sentiment with a ~90% probability, which is questionable at best. As a result I had to manually label ~100k data points, which was time-intensive, but resulted in a much more reliable training set. The model trained on manual labels significantly outperformed the base model, trained on the Amazon product reviews data. No changes were made to the sentiment analysis algorithm itself; the only difference was the training set. This highlights the importance of accurate and relevant data for training ML models – and the necessity, more often than not, of human intervention in machine learning. Finally, by code freeze, the model was successfully extracting topics from each article and clustering the topics based on similarity to the topics in other articles.

Conclusion

In conclusion, MongoDB provides several capabilities, such as a flexible data model, indexing, and high-speed querying, that make training and using machine learning algorithms much easier than with traditional, relational databases. Running MongoDB as the backend database to store and enrich ML training data allows for persistence and increased efficiency.

A final look at the MongoDB pipeline used for this project

If you are interested in this project, feel free to take a look at the code on GitHub, or feel free to contact me via LinkedIn or email.

About the author - Nicholas Png

Nicholas Png is a Data Scientist who recently graduated from the Data Science Immersive Program at Galvanize in San Francisco. He is a passionate practitioner of Machine Learning and Artificial Intelligence, focused on Natural Language Processing, Image Recognition, and Unsupervised Learning. He is familiar with several open source databases, including MongoDB, Redis, and HDFS. He has a Bachelor of Science in Mechanical Engineering as well as multiple years of experience in both software and business development.

Download the AI and Deep Learning white paper

In case you missed it: plugins, table view, auto-complete, and more in MongoDB Compass

Sam Weaver
January 16, 2018
Technical

We’ve released several new versions of MongoDB Compass in the past few months, and we’re excited about the new features that we’ve introduced. Read on for details or download the latest version here.

If you’re new to Compass, the best way to learn to use it is with the free online tutorial: M001: MongoDB Basics. In this series of online videos and hands-on exercises, you will use Compass to explore MongoDB data models, learn the MongoDB query language, and deploy and connect to MongoDB clusters in Atlas, MongoDB's fully managed cloud service.

Compass plugins: choose your own adventure

With the introduction of a new plugin API, MongoDB Compass is fully extensible. From examining database users and roles to generating sample data, from viewing GridFS files to checking sharding status – if there’s a specific feature you need that’s not yet available in Compass, you can build a plugin for it. And if you need it, it might be useful to others as well! Plugins can be shared with the community and added to any build of Compass 1.11 or later.

You can learn more about creating plugins for Compass here or work through a tutorial to build an example plugin.

View & manipulate documents in a table view

Documents can now be viewed and edited easily in a new table view, which allows for a quick visual comparison between records:

More auth options: X.509

We added X.509 support, so our customers now have full coverage of authentication options when it comes to connecting to production deployments of MongoDB. (Authentication options already include username/password, Kerberos, and LDAP.)

Type queries faster and store them for later

Typing queries became quicker and easier with an intelligent autocomplete bar that matches brackets and completes field names for you. There’s also a new button for query history: use it to review queries you’ve run, run them again, or save common queries as favorites.

Free Compass Community version

With the launch of MongoDB 3.6, we introduced a new distribution called Compass Community, which contains a subset of Compass functionality, but doesn’t require a paid subscription to use in production. Compass Community has the core building blocks you need to get started with MongoDB: CRUD, indexes, explain plans, along with the new plug-in API.

You can get Compass Community from the download center. It also comes as one of the components of the MongoDB Community Server download.

Read-only Compass

If you want to view your data with Compass but don’t need to edit it (or allow other developers to edit it!), you have a new option: a read-only build of Compass. No need to stress about unintended edits with this version, now available in the download center.

We hope you enjoy this latest release!