Want to try out MongoDB on your laptop? Execute a single command and you have a lightweight, self-contained sandbox; another command removes all trace when you're done.
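The sandbox workflow above can be sketched in a few commands; the container name `mongo-sandbox` is just an example:

```shell
# Start a throwaway MongoDB sandbox; the image is pulled automatically
# on first use
docker run --name mongo-sandbox -d mongo

# Open a mongo shell inside the running container to experiment
docker exec -it mongo-sandbox mongo

# When finished, stop the container and remove all trace of it
docker stop mongo-sandbox
docker rm mongo-sandbox
```

The `-d` flag runs the container in the background, so the sandbox keeps running until you explicitly stop it.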
Need an identical copy of your application stack in multiple environments? Build your own container image and then your entire development, test, operations, and support teams can launch an identical clone environment.
Orchestration tools manage how multiple containers are created, upgraded, and made highly available. Orchestration also controls how containers are connected to build sophisticated applications from multiple, microservice containers.
The rich functionality, simple tools, and powerful APIs make container and orchestration functionality a favorite for DevOps teams who integrate them into Continuous Integration (CI) and Continuous Delivery (CD) workflows.
Shipping containers are efficiently moved using different modes of transport – perhaps initially being carried by a truck to a port, then neatly stacked alongside thousands of other shipping containers on a huge container ship that carries them to the other side of the world. At no point in the journey do the contents of that container need to be repacked or modified in any way.
Shipping containers are ubiquitous, standardized, and available anywhere in the world, and they're extremely simple to use – just open them up, load in your cargo, and lock the doors shut.
The contents of each container are kept isolated from those of the others; the container full of Mentos can safely sit next to the container full of soda without any risk of a reaction. Once a spot on the container ship has been booked, you can be confident that there's room for all of your packed cargo for the whole trip – there's no way for a neighboring container to steal more than its share of space.
Software containers fulfill a similar role for your application. Packing the container involves defining what needs to be there for your application to work – operating system, libraries, configuration files, application binaries, and other parts of your technology stack. Once the container has been defined, that image is used to create containers that run in any environment, from the developer's laptop to your test/QA rig, to the production data center, on-premises or in the cloud, without any changes. This consistency can be very useful: for example, a support engineer can spin up a container to replicate an issue and be confident that it exactly matches what's running in the field.
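Packing a container is typically done with a Dockerfile that declares everything the application needs. A minimal sketch, using a hypothetical Python web app (the file names and image tag are examples):

```shell
# Declare the full stack in one file: base OS + runtime, dependencies,
# application code, and the command that starts it
cat > Dockerfile <<'EOF'
# Base operating system and language runtime
FROM python:3.9-slim
# Application code and dependency list (hypothetical files)
COPY requirements.txt app.py /app/
RUN pip install -r /app/requirements.txt
# Command run when a container is started from this image
CMD ["python", "/app/app.py"]
EOF

# Build the image once...
docker build -t myorg/myapp:1.0 .

# ...then run the identical image anywhere: laptop, QA rig, or production
docker run -d myorg/myapp:1.0
```

Because the image pins every layer of the stack, a support engineer running it locally gets byte-for-byte the same environment as production.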
Containers are very efficient and many of them can run on the same machine, allowing full use of all available resources. Linux namespaces and cgroups are used to make sure that there's no cross-contamination between containers: data files, libraries, ports, namespaces, and memory contents are all kept isolated. They also enforce upper boundaries on how much system resource (memory, storage, CPU, network bandwidth, and disk I/O) a container can consume so that a critical application isn't squeezed out by noisy neighbors.
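Docker exposes these cgroup limits directly as flags on `docker run`. A sketch with illustrative values (the container name and the specific limits are examples, not recommendations):

```shell
# Cap a container's resources so a noisy neighbor can't starve
# critical applications on the same host:
#   --memory        hard RAM limit
#   --cpus          fraction of CPU cores the container may use
#   --blkio-weight  relative share of block-device I/O (10-1000)
docker run -d \
  --name throttled-mongo \
  --memory 512m \
  --cpus 0.5 \
  --blkio-weight 300 \
  mongo
```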
Metaphors tend to fall apart at some point, and that's true with this one as well. There are exceptions but shipping containers typically don't interact with each other – each has its job to fulfill (keep its contents together and safe during shipping) and it doesn't need help from any of its peers to achieve that. In contrast, it can be very powerful to have software containers interact with each other through well-defined interfaces – e.g., one container provides a database service that an application running in another container can access through an agreed port. The modular container model is a great way to implement microservice architectures.
There are a number of similarities between virtual machines (VMs) and containers – in particular, they both allow you to create an image and spin up one or more instances, then safely work in isolation within each one. Containers, however, have a number of advantages which make them better suited to building and deploying applications.
Each instance of a VM must contain an entire operating system, all required libraries, and of course the actual application binaries. All of that software consumes several gigabytes of storage and memory. In contrast, each container holds its application and any dependencies, but the same Linux kernel and libraries can be shared between multiple containers running on the host. The fact that each container imposes minimal overhead on storage, RAM, and CPU means that many can run on the same host, and each takes just a couple of seconds to launch.
Running many containers allows each one to focus on a specific task; multiple containers then work in concert to implement sophisticated applications. In such microservice architectures, each container can use different versions of programming languages and libraries that can be upgraded independently.
Due to the isolation of capabilities within containers, the effort and risk associated with updating any given container is far lower than with a more monolithic architecture. This lends itself to Continuous Delivery – an approach that involves fast software development iterations and frequent, safe updates to the deployed application.
The tools and APIs provided with container technologies such as Docker are very powerful and more developer-focused than those available with VMs. These APIs allow the management of containers to be integrated into automated systems – such as Chef and Puppet – used by DevOps teams to cover the entire software development lifecycle. This has led to wide scale adoption by DevOps-oriented groups.
Virtual machines still have an essential role to play, as you'll very often be running your containers within VMs – including when using the cloud services provided by Amazon, Google, or Microsoft.
DevOps & Continuous Delivery. When the application consists of multiple containers with clear interfaces between them, it is a simple and low-risk matter to update a container, assess the impact, and then either revert to the old version or roll the update out across similar containers. By having multiple containers provide the same capability, upgrading each container can be done without negatively affecting service.
Replicating Environments. When using containers, it's a trivial matter to instantiate identical copies of your full application stack and configuration. These can then be used by new hires, partners, support teams, and others to safely experiment in isolation.
Accurate Testing. You can have confidence that your QA environment exactly matches what will be deployed – down to the exact version of every library.
Scalability. By architecting an application to be built from multiple container instances, adding more containers scales out capacity and throughput. Similarly, containers can be removed when demand falls. Using orchestration frameworks – such as Kubernetes and Apache Mesos – further simplifies elastic scaling.
Isolation. Every container running on the same host is independent and isolated from the others as well as from the host itself. The same equipment can simultaneously host development, support, test, and production versions of your application – even running different versions of tools, languages, databases, and libraries without any risk that one environment will impact another.
Performance. Unlike VMs (whether used directly or through Vagrant), containers are lightweight and have minimal impact on performance.
High Availability. By running with multiple containers, redundancy can be built into the application. If one container fails, then the surviving peers – which are providing the same capability – continue to provide service. With the addition of some automation (see the orchestration section of this paper), failed containers can be automatically recreated (rescheduled) either on the same or a different host, restoring full capacity and redundancy.
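The Scalability and High Availability benefits above can be sketched with kubectl; the deployment name `web` and replica counts are hypothetical:

```shell
# Scale out to meet demand: Kubernetes creates the extra containers
# and spreads them across the cluster's hosts
kubectl scale deployment web --replicas=10

# Scale back in when demand falls
kubectl scale deployment web --replicas=3

# Redundancy is maintained automatically: if a host dies, Kubernetes
# reschedules its containers elsewhere to restore the requested count
kubectl get pods -l app=web
```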
The simplicity of Docker and its rich ecosystem make it extremely powerful and easy to use.
Specific Docker containers are created from images which have been designed to provide a particular capability – whether that be, for example, just a base operating system, a web server, or a database. Docker images are constructed from layered filesystems so they can share common files, reducing disk usage and speeding up image download. Docker Hub provides thousands of images that can be extended or used as-is to quickly create a container that's running the software you want to use – for example, all it takes to get MongoDB up and running is the command
docker run --name my-mongodb -d mongo, which will download the image (if it's not already on the machine) and use it to start the container. Proprietary images can be made available within the enterprise using a local, private registry rather than Docker Hub.
Docker containers are based on open standards, allowing containers to run on all major Linux distributions. They support bare metal, VMs, and cloud infrastructure from vendors such as Amazon, Google, and Microsoft. Integration with cloud services – e.g., with the Google Container Engine (GCE) – means that running your software in a scalable, highly available configuration is just a few clicks away.
Docker provides strong isolation where each container has its own root filesystem, processes, memory, network ports, namespace, and devices. But to be of use, containers need to be able to communicate with the outside world as well as other containers. To this end, Docker containers can be configured to expose ports as well as map volumes to directories on the host. Alternatively, Docker containers can be linked so that they communicate without opening up these resources to other systems.
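A sketch of the three options just described – publishing a port, mapping a volume, and linking containers; the names, host path, and app image are examples:

```shell
# Publish MongoDB's port to the outside world and map its data
# directory to a directory on the host
docker run -d --name my-mongodb \
  -p 27017:27017 \
  -v /data/mongo:/data/db \
  mongo

# Alternatively, link an application container to the database so the
# two can talk without publishing the port to other systems
# (--link is Docker's legacy linking mechanism, shown for illustration)
docker run -d --name my-app \
  --link my-mongodb:mongodb \
  myorg/myapp:1.0
```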
Clearly, the process of deploying multiple containers to implement an application can be optimized through automation. This becomes more and more valuable as the number of containers and hosts grows. This type of automation is referred to as orchestration. Orchestration can include a number of features, including:
Instantiating a set of containers
Rescheduling failed containers
Linking containers together through agreed interfaces
Exposing services to machines outside of the cluster
Scaling out or down the cluster by adding or removing containers
There are many orchestration tools available for Docker; some of the most common are described here.
Docker Machine: Provisions hosts and installs Docker Engine (the lightweight runtime and tooling used to run Docker containers) software on them.
Docker Swarm: Produces a single, virtual Docker host by clustering multiple Docker hosts together. It presents the same Docker API, allowing it to integrate with any tool that works with a single Docker host.
Docker Compose: Takes a file defining a multi-container application (including dependencies) and deploys the described application by creating the required containers. It is mostly aimed at development, testing, and staging environments.
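A minimal sketch of such a Compose file for a hypothetical two-container application (the service names and app image are examples):

```shell
# Define a web app and the MongoDB instance it depends on
cat > docker-compose.yml <<'EOF'
version: "2"
services:
  web:
    image: myorg/myapp:1.0
    ports:
      - "80:5000"
    depends_on:
      - mongodb
  mongodb:
    image: mongo
EOF

# Create and start both containers, in dependency order, with one command
docker-compose up -d
```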
Kubernetes: Created by Google, Kubernetes is one of the most feature-rich and widely used orchestration frameworks. Its key features include:
Automated deployment and replication of containers
Online scale-in or scale-out of container clusters
Load balancing over groups of containers
Rolling upgrades of application containers
Resilience, with automated rescheduling of failed containers
Controlled exposure of network ports to systems outside of the cluster
Kubernetes is designed to work in multiple environments, including bare metal, on-premises VMs, and public clouds. Google Container Engine provides a tightly integrated platform which includes hosting of the Kubernetes and Docker software, as well as provisioning the host VMs and orchestrating the containers.
The key components making up Kubernetes are:
A Cluster is a collection of one or more bare-metal servers or virtual machines (referred to as nodes) providing the resources used by Kubernetes to run one or more applications.
Pods are groups of containers and volumes co-located on the same host. Containers in the same Pod share the same network namespace and can communicate with each other using localhost. Pods are considered to be ephemeral rather than durable entities, and are the basic scheduling unit.
Labels are tags assigned to entities such as containers that allow them to be managed as a group – e.g., to be exposed as a service to the outside world.
Services act as basic load balancers and ambassadors for other containers, exposing them to the outside world.
The Replication Controller ensures that the requested number of pod replicas is running across the cluster at all times; if a pod fails or is deleted, a replacement is scheduled.
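The components above can be sketched as manifests fed to kubectl; the names and label are examples:

```shell
# A single-container Pod carrying a label, plus a Service that exposes
# every Pod matching that label as one stable endpoint
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mongo-pod
  labels:
    app: mongodb       # Label used to group this Pod with its peers
spec:
  containers:
  - name: mongo
    image: mongo
    ports:
    - containerPort: 27017
---
apiVersion: v1
kind: Service
metadata:
  name: mongo-service
spec:
  selector:
    app: mongodb       # The Service load balances over matching Pods
  ports:
  - port: 27017
EOF
```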
Apache Mesos is designed to scale to tens of thousands of physical machines. Mesos is in production with a number of large enterprises such as Twitter, Airbnb, and Apple. An application running on top of Mesos is made up of one or more containers and is referred to as a framework. Mesos offers resources to each framework, and each framework must then decide which to accept. Mesos is less feature-rich than Kubernetes and may involve extra integration work – defining services or batch jobs for Mesos is programmatic while it is declarative for Kubernetes.
There is currently a project to run Kubernetes as a Mesos framework. Mesos provides the fine-grained resource allocation of Kubernetes pods across the nodes in a cluster. Kubernetes adds the higher-level functions such as load balancing, high availability through failover (rescheduling), and elastic scaling.
Mesos is particularly suited to environments where the application needs to be co-located with other services such as Hadoop, Kafka, and Spark. Mesos is also the foundation for a number of distributed systems such as:
- Apache Aurora – a highly scalable service scheduler for long-running services and cron jobs; it's used by Twitter. Aurora extends Mesos by adding rolling updates, service registration, and resource quotas.
- Chronos – a fault-tolerant service scheduler, to be used as a replacement for cron, to orchestrate scheduled jobs within Mesos.
- Marathon – a simple-to-use service scheduler; it builds upon Mesos and Chronos by ensuring that two Chronos instances are running.
Each orchestration platform has advantages relative to the others and so users should evaluate which are best suited to their needs. Aspects to consider include:
Does your enterprise have an existing DevOps framework that the orchestration must fit within and what APIs does it require?
How many hosts will be used? Mesos is proven to work over thousands of physical machines.
Will the containers be run on bare metal, private VMs, or in the cloud? Kubernetes is widely used in cloud deployments.
Are there requirements for automated high availability? Kubernetes’ Replication Controller will automatically reschedule failed pods/containers; Mesos considers that the role of an application’s framework code.
Is grouping and load balancing required for services? Kubernetes provides this but Mesos considers it a responsibility of the application’s framework code.
What skills do you have within your organization? Mesos typically requires custom coding to allow your application to run as a framework; Kubernetes is more declarative.
Setting up the infrastructure to run containers is simple but the same is not true for some orchestration frameworks – including Kubernetes and Mesos. Consider using hosted services such as Google Container Engine for Kubernetes, particularly for proofs of concept.
While many of the concerns when using containers are common to bare metal deployments, containers provide an opportunity to improve levels of security if used properly. Because containers are so lightweight and easy to use, it's easy to deploy them for very specific purposes, and the container technology helps ensure that only the minimum required capabilities are exposed.
Within a container, the ability for malicious or buggy software to cause harm can be reduced by using resource isolation and rationing.
It's important to ensure that the container images are regularly scanned for vulnerabilities, and that the images are digitally signed. There are now many projects that provide scripts and scanning tools that can check if images and packages are up to date and free of security defects. Note that updating the images has no impact on existing containers; fortunately, Kubernetes and Aurora have the ability to perform rolling updates of containers.
Running MongoDB with containers and orchestration introduces some additional considerations:
MongoDB database nodes are stateful. In the event that a container fails, and is rescheduled, it's undesirable for the data to be lost (it could be recovered from other nodes in the replica set, but that takes time). To solve this, features such as the Volume abstraction in Kubernetes can be used to map what would otherwise be an ephemeral MongoDB data directory in the container to a persistent location where the data survives container failure and rescheduling.
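A sketch of that mapping, assuming a cluster with dynamic volume provisioning; the claim name, Pod name, and storage size are examples:

```shell
# Claim persistent storage, then mount it over MongoDB's data directory
# so the data survives container failure and rescheduling
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongo-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: mongo-node1
  labels:
    app: mongo-node1
spec:
  containers:
  - name: mongo
    image: mongo
    volumeMounts:
    - name: data
      mountPath: /data/db    # MongoDB's default data directory
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: mongo-data
EOF
```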
MongoDB database nodes within a replica set must communicate with each other – including after rescheduling. All of the nodes within a replica set must know the addresses of all of their peers, but when a container is rescheduled, it is likely to be restarted with a different IP address. For example, all containers within a Kubernetes Pod share a single IP address, which changes when the pod is rescheduled. With Kubernetes, this can be handled by associating a Kubernetes Service with each MongoDB node, which uses the Kubernetes DNS service to provide a hostname for the service that remains constant through rescheduling.
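A sketch of one such per-node Service; the names are examples, and the resulting stable DNS name would follow the usual Kubernetes pattern (e.g. mongo-node1.default.svc.cluster.local):

```shell
# One Service per replica set member gives that member a hostname
# that survives rescheduling, even though the Pod's IP changes
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: mongo-node1
spec:
  selector:
    app: mongo-node1    # Matches the label on that member's Pod
  ports:
  - port: 27017
EOF
```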
Once each of the individual MongoDB nodes is running (each within its own container), the replica set must be initialized and each node added. This is likely to require some additional logic beyond that offered by off the shelf orchestration tools. Specifically, one MongoDB node within the intended replica set must be used to execute the rs.initiate() and rs.add() commands.
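A sketch of that extra step, run against one member from outside the orchestration tooling; the container and host names are examples (in practice you would wait for rs.initiate() to elect a primary before adding members):

```shell
# Initialize the replica set from one member, then add the others
# using their stable hostnames
docker exec -it mongo-node1 mongo --eval '
  rs.initiate();
  rs.add("mongo-node2:27017");
  rs.add("mongo-node3:27017");
'
```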
If the orchestration framework provides automated rescheduling of containers (as Kubernetes does, for instance) then this can increase MongoDB's resiliency as a failed replica set member can be automatically recreated, restoring full redundancy levels without human intervention.
It should be noted that while the orchestration framework might monitor the state of the containers, it is unlikely to monitor the applications running within the containers, or backup their data. That means it's important to use a strong monitoring and backup solution such as MongoDB Cloud Manager, included with MongoDB Enterprise Advanced.
fuboTV provides a soccer streaming service in North America and runs its full stack (including MongoDB) on Docker and Kubernetes; find out the benefits it sees from this and how it's achieved in this case study.
Square Enix is one of the world’s leading providers of gaming experiences, publishing such iconic titles as Tomb Raider and Final Fantasy. They have produced an internal multi-tenant database-as-a-service using MongoDB and Docker – find out more in this case study.