GIANT Stories at MongoDB

Participate in a 2-day celebration of MongoDB and open source

Danielle James
April 13, 2018
Events

At MongoDB World, you'll be able to amp up your skills in deep-dive sessions that cover the latest features and give insight into the product roadmap. You’ll also have plenty of opportunities to get to know fellow community members.

From classroom-style technical workshops to receptions, the largest MongoDB conference offers a ton of activities you can get involved in.

Ask the Experts

Sign up for a consulting slot to get your questions answered by MongoDB experts who will help whiteboard solutions to your top-of-mind questions.

Black Network Happy Hour

Get to know event attendees and members of the MongoDB Black Network over (alcoholic and non-alcoholic) drinks on Day 1 of the conference.

Community Day

A one-day celebration of MongoDB and open source, complete with Q&A time with MongoDB's Co-Founder and CTO Eliot Horowitz, and plenty of unconference and hack time.

Craft Beer Tasting

Enjoy local beers at the MongoDB World closing celebration. Non-alcoholic options will be available.

Leaf Lounge

Socialize, learn, or just relax and rejuvenate with your new friends. We promise to keep the coffee flowing.

LGBTQ Happy Hour

The MongoDB LGBTQ group invites anyone who identifies as queer to join them for drinks. Mocktails will be available.

N00bs Reception

First time attending MongoDB World? Connect with other first-time attendees at the N00bs social. There will be snacks.

Workshops

Enhance your MongoDB World learning experience by attending an in-depth technical workshop on Day 1 of MongoDB World. Workshops cost $499 and run 9:00am - 5:00pm. Participants get access to on-demand MongoDB University courses and special swag. Class size is strictly limited to ensure individual attention.

Women and Trans Coders Lounge

Run by MongoDB’s Women and Trans Coders group, the purpose of this space is to amplify the voices of non-binary people, women, and trans people of all genders within our engineering community.

Women in Tech Happy Hour

Connect with other women in tech over drinks (including non-alcoholic options).

Join us on June 26-27 for two days of geeking out and having fun. Register by May 11 to pay only $399 for your conference pass.


Event Details:
Date: June 26-27, 2018
Location: New York Hilton Midtown, 1335 6th Ave, New York, NY 10019
mongodbworld.com


*All happy hours will also serve non-alcoholic drink options.

Training Machine Learning Models with MongoDB

Nicholas Png
January 18, 2018
Technical
This is a guest post by Data Scientist Nicholas Png.

Over the last four months, I attended an immersive data science program at Galvanize in San Francisco. As a graduation requirement, the last three weeks of the program are reserved for a student-selected project that puts to use the skills learned throughout the course. The project I chose to tackle used natural language processing in tandem with sentiment analysis to parse and classify news articles. With the controversy surrounding our nation’s media and the concept of “fake news” floating around every corner, I decided to take a pragmatic approach to addressing bias in the media.

My resulting model identified three topics within an article and classified the sentiments towards each topic. Next, for each classified topic, the model returned a new article with the opposite sentiment, resulting in three articles provided to the user for each input article. With this model, I hoped to negate some of the inherent bias within an individual news article by providing counter arguments from other sources. The algorithms used were the following (in training order): TFIDF Vectorizer (text preprocessing), Latent Dirichlet Allocation (topic extraction), Scipy’s Implementation of Hierarchical Clustering (document similarity), and Multinomial Naive Bayes (sentiment classifier).

Initially, I was hesitant to use any database, let alone a non-relational one. However, as I progressed through the experiment, managing the plethora of CSV tables became more and more difficult. I needed the flexibility to add additional features to my data as the model engineered them. This is a major drawback of relational databases. Using SQL, there are two options: generate a new table for each new feature and use a multitude of JOINs to retrieve all the necessary data, or use ALTER TABLE to add a new column for each new feature. However, due to the varied algorithms I used, some features were generated one data point at a time, while others were returned as a single Python list. Neither option was well suited to my needs. As a result, I turned to MongoDB to resolve my data storage, processing, and analysis issues.

To begin with, I used MongoDB to store the training data scraped from the web. I stored raw text data as individual documents on an AWS EC2 instance running a MongoDB database. Running a simple Python script on my EC2 instance, I generated a list of public news article URLs to scrape and stored the scraped data (such as the article title and body) into my MongoDB database. I appreciated that, with MongoDB, I could employ indexes to ensure that duplicate URLs, and their associated text data, were not added to the database.
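
A minimal PyMongo sketch of that dedupe step might look like the following; the database, collection, and field names are illustrative, not taken from the project:

    # Illustrative PyMongo sketch; database, collection, and field names are assumptions.
    from pymongo import MongoClient, ASCENDING
    from pymongo.errors import DuplicateKeyError

    client = MongoClient("mongodb://localhost:27017")
    articles = client.news.articles

    # A unique index on "url" makes MongoDB reject duplicate scrapes outright.
    articles.create_index([("url", ASCENDING)], unique=True)

    def store_article(url, title, body):
        try:
            articles.insert_one({"url": url, "title": title, "body": body})
        except DuplicateKeyError:
            pass  # URL already scraped; nothing to do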

Next, the entire dataset needed to be parsed using NLP and passed in as training data for the TFIDF Vectorizer (in the scikit-learn toolkit) and the Latent Dirichlet Allocation (LDA) model. Since both TFIDF and LDA require training on the entire dataset (represented by a matrix of ~70k rows x ~250k columns), I needed to store a lot of information in memory. LDA requires training on non-reduced data in order to identify correlations between all features in their original space. scikit-learn's implementations of TFIDF and LDA are trained iteratively, from the first data point to the last. I was able to reduce the total memory load, and allocate more of it to actual training, by passing the model a Python generator function that called my MongoDB database for each new data point. This also enabled me to use a smaller EC2 instance, thereby optimizing costs.
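
A generator along these lines would do the job; this is only a sketch, and the field names are assumptions rather than the project's actual schema:

    # Illustrative: stream article text from MongoDB so the whole corpus never
    # needs to be held in memory at once.
    def article_bodies(collection):
        # Project only the text field to keep each fetched document small.
        for doc in collection.find({}, {"body": 1, "_id": 0}):
            yield doc["body"]

    # scikit-learn's TfidfVectorizer accepts any iterable of strings, so the
    # generator can be handed to fit() directly:
    # vectorizer.fit(article_bodies(articles))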

Once the vectorizer and LDA model were trained, I utilized the LDA model to extract 3 topics from each document, storing the top 50 words pertaining to each topic back in MongoDB. These top 50 words were used as the features to train my hierarchical clustering algorithm. The clustering algorithm functions much like a decision tree, and I generated pseudo-labels for each document by determining which leaf the document fell into. Since I could use dimensionally reduced data at this point, memory was not an issue, but all these labels needed to be referenced later in other parts of the pipeline. Rather than assigning several variables and allowing the labels to remain indefinitely in memory, I inserted new key-value pairs for the top words associated with each topic, topic labels according to the clustering algorithm, and sentiment labels into each corresponding document in the collection. As each article was analyzed, the resulting labels and topic information were stored in the article’s document in MongoDB. As a result, there was no chance of data loss, and any method could query the database for needed information regardless of whether other processes running in parallel were complete.
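
That enrichment step could look roughly like the sketch below; the field names and variables are hypothetical stand-ins for the model outputs, not the project's actual code:

    # Write the model outputs back onto the article's document so any later
    # pipeline stage can query them independently.
    articles.update_one(
        {"_id": article_id},
        {"$set": {
            "topic_words": top_words_per_topic,   # e.g. three lists of 50 words
            "cluster_label": cluster_label,       # leaf from hierarchical clustering
            "sentiment": sentiment_label,         # Multinomial Naive Bayes output
        }},
    )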

Sentiment analysis was the most difficult part of the project. There is currently no valuable labeled data related to politics and news, so I initially tried to train the base models on a dataset of Amazon product reviews. Unsurprisingly, this proved to be a poor choice of training data because the resulting models consistently graded sentences such as “The governor's speech reeked of subtle racism and blatant lack of political savvy” as having positive sentiment with a ~90% probability, which is questionable at best. As a result, I had to manually label ~100k data points, which was time-intensive but resulted in a much more reliable training set. The model trained on manual labels significantly outperformed the base model trained on the Amazon product reviews data. No changes were made to the sentiment analysis algorithm itself; the only difference was the training set. This highlights the importance of accurate and relevant data for training ML models – and the necessity, more often than not, of human intervention in machine learning. Finally, by code freeze, the model was successfully extracting topics from each article and clustering the topics based on similarity to the topics in other articles.

Conclusion

In conclusion, MongoDB provides several capabilities – such as a flexible data model, indexing, and high-speed querying – that make training and using machine learning algorithms much easier than with traditional, relational databases. Running MongoDB as the backend database to store and enrich ML training data allows for persistence and increased efficiency.

A final look at the MongoDB pipeline used for this project

If you are interested in this project, feel free to take a look at the code on GitHub, or feel free to contact me via LinkedIn or email.

About the author - Nicholas Png

Nicholas Png is a Data Scientist who recently graduated from the Data Science Immersive Program at Galvanize in San Francisco. He is a passionate practitioner of Machine Learning and Artificial Intelligence, focused on Natural Language Processing, Image Recognition, and Unsupervised Learning. He is familiar with several open source databases, including MongoDB, Redis, and HDFS. He has a Bachelor of Science in Mechanical Engineering as well as multiple years of experience in both software and business development.

Download the AI and Deep Learning white paper

Active-Active Application Architectures with MongoDB

Jay Runkel
December 08, 2017
Technical

Introduction

Determining the best database for a modern application to be deployed across multiple data centers requires careful evaluation to accommodate a variety of complex application requirements. The database will be responsible for processing reads and writes in multiple geographies, replicating changes among them, and providing the highest possible availability, consistency, and durability guarantees. But not all technology choices are equal. For example, one database technology might provide a higher guarantee of availability while providing lower data consistency and durability guarantees than another technology. The tradeoffs made by an individual database technology will affect the behavior of the application upon which it is built.

Unfortunately, there is limited understanding among many application architects as to the specific tradeoffs made by various modern databases. The popular belief appears to be that if an application must accept writes concurrently in multiple data centers, then it needs to use a multi-master database – where multiple masters are responsible for a single copy or partition of the data. This is a misconception and it is compounded by a limited understanding of the (potentially negative) implications this choice has on application behavior.

To provide some clarity on this topic, this post will begin by describing the database capabilities required by modern multi-data center applications. Next, it will describe the categories of database architectures used to realize these requirements and summarize the pros and cons of each. Finally, it will look at MongoDB specifically, describe how it fits into these categories, and list some of the specific capabilities and design choices offered by MongoDB that make it suited for global application deployments.

Active-Active Requirements

When organizations consider deploying applications across multiple data centers (or cloud regions) they typically want to use an active-active architecture. At a high-level, this means deploying an application across multiple data centers where application servers in all data centers are simultaneously processing requests (Figure 1). This architecture aims to achieve a number of objectives:

  • Serve a globally distributed audience by providing local processing (low latencies)
  • Maintain always-on availability, even in the face of complete regional outages
  • Provide the best utilization of platform resources by allowing server resources in multiple data centers to be used in parallel to process application requests.

Figure 1 - Active-Active Application Architecture

An alternative to an active-active architecture is an active-disaster recovery (also known as active-passive) architecture consisting of a primary data center (region) and one or more disaster recovery (DR) regions (Figure 2). Under normal operating conditions, the primary data center processes requests and the DR center is idle. The DR site only starts processing requests (becomes active) if the primary data center fails. (Under normal conditions, data is replicated from the primary to the DR sites, so that the DR sites can take over if the primary data center fails.)

The definition of an active-active architecture is not universally agreed upon. Often, it is also used to describe application architectures that are similar to the active-DR architecture described above, with the distinction being that the failover from primary to DR site is fast (typically a few seconds) and automatic (no human intervention required). In this interpretation, an active-active architecture implies that application downtime is minimal (near zero).

Figure 2 - Active-DR architecture

A common misconception is that an active-active application architecture requires a multi-master database. This is not only false, but using a multi-master database means relaxing requirements that most data owners hold dear: consistency and data durability. Consistency ensures that reads reflect the results of previous writes. Data durability ensures that committed writes will persist permanently: no data is lost due to the resolution of conflicting writes or node failures. Both these database requirements are essential for building applications that behave in the predictable and deterministic way users expect.

To address the multi-master misconception, let’s start by looking at the various database architectures that could be used to achieve an active-active application, and the pros and cons of each. Once we have done this, we will drill into MongoDB’s architecture and look at how it can be used to deploy an Active-Active application architecture.

Database Requirements for Active-Active Applications

When designing an active-active application architecture, the database tier must meet four architectural requirements (in addition to standard database functionality: powerful query language with rich secondary indexes, low latency access to data, native drivers, comprehensive operational tooling, etc.):

  1. Performance - low latency reads and writes. This typically means processing reads and writes on nodes in a data center local to the application.
  2. Data durability - Implemented by replicating writes to multiple nodes so that data persists when system failures occur.
  3. Consistency - Ensuring that readers see the results of previous writes, readers to various nodes in different regions get the same results, etc.
  4. Availability - The database must continue to operate when nodes, data centers, or network connections fail. In addition, the recovery from these failures should be as short as possible. A typical requirement is a few seconds.

Due to the laws of physics, e.g., the speed of light, it is not possible for any database to completely satisfy all these requirements at the same time. The important consideration for any engineering team building an application is therefore to understand the tradeoffs made by each database and select the one that provides for the application’s most critical requirements.

Let’s look at each of these requirements in more detail.

Performance

For performance reasons, it is necessary for application servers in a data center to be able to perform reads and writes to database nodes in the same data center, as most applications require millisecond (a few to tens) response times from databases. Communication among nodes across multiple data centers can make it difficult to achieve performance SLAs. If local reads and writes are not possible, then the latency associated with sending queries to remote servers significantly impacts application response time. For example, customers in Australia would not expect to have a far worse user experience than customers in the eastern US, where the e-commerce vendor’s primary data center is located. In addition, the lack of network bandwidth between data centers can also be a limiting factor.

Data Durability

Replication is a critical feature in a distributed database. The database must ensure that writes made to one node are replicated to the other nodes that maintain replicas of the same record, even if these nodes are in different physical locations. The replication speed and data durability guarantees provided will vary among databases, and are influenced by:

  • The set of nodes that accept writes for a given record
  • The situations when data loss can occur
  • Whether conflicting writes (two different writes occurring to the same record in different data centers at about the same time) are allowed, and how they are resolved when they occur

Consistency

The consistency guarantees of a distributed database vary significantly. This variance depends upon a number of factors, including whether indexes are updated atomically with data, the replication mechanisms used, how much information individual nodes have about the status of corresponding records on other nodes, etc.

The weakest level of consistency offered by most distributed databases is eventual consistency. It simply guarantees that, if all writes stop, the value for a record across all nodes in the database will eventually converge to the same value. It provides few guarantees about whether an individual application process will read the results of its own write, or whether the value read is the latest value for a record.

The strongest consistency guarantee that can be provided by distributed databases without severe impact to performance is causal consistency. As described by Wikipedia, causal consistency provides the following guarantees:

  • Read Your Writes: reads issued by a process reflect that process’s own preceding writes.
  • Monotonic Reads: successive reads by a process never observe an older state of the data than earlier reads did.
  • Writes Follow Reads: a write that is influenced by a previously read value is ordered after the write that produced that value.
  • Monotonic Writes: writes issued by a process are applied in the order in which they were issued.

Most distributed databases will provide consistency guarantees between eventual and causal consistency. The closer a database is to causal consistency, the more an application will behave as users expect: queries will return the values of previous writes, data won’t appear to be lost, and data values will not change in non-deterministic ways.

Availability

The availability of a database describes how well the database survives the loss of a node, a data center, or network communication. The degree to which the database continues to process reads and writes in the event of different types of failures and the amount of time required to recover from failures will determine its availability. Some architectures will allow reads and writes to nodes isolated from the rest of the database cluster by a network partition, and thus provide a high level of availability. Also, different databases will vary in the amount of time it takes to detect and recover from failures, with some requiring manual operator intervention to restore a healthy database cluster.

Distributed Database Architectures

There are three broad categories of database architectures deployed to meet these requirements:

  1. Distributed transactions using two-phase commit
  2. Multi-Master, sometimes also called “masterless”
  3. Partitioned (sharded) database with multiple primaries each responsible for a unique partition of the data

Let’s look at each of these options in more detail, as well as the pros and cons of each.

Distributed Transactions with Two-Phase Commit

A distributed transaction approach updates all nodes containing a record as part of a single transaction, instead of having writes made to one node and then (asynchronously) replicated to other nodes. The transaction guarantees that either all nodes receive the update or, if any type of failure occurs, the transaction fails and all nodes revert to the previous state.

A common protocol for implementing this functionality is called a two-phase commit. The two-phase commit protocol ensures durability and multi-node consistency, but it sacrifices performance. The protocol requires two phases of communication among all the nodes involved in the transaction, with requests and acknowledgments sent at each phase of the operation to ensure every node commits the same write at the same time. When database nodes are distributed across multiple data centers, this often pushes query latency from the millisecond range to the multi-second range. Most applications, especially those where the clients are users (mobile devices, web browsers, client applications, etc.), find this level of response time unacceptable.

Multi-Master

A multi-master database is a distributed database that allows a record to be updated in one of many possible clustered nodes. (Writes are usually replicated so records exist on multiple nodes and in multiple data centers.) On the surface, a multi-master database seems like the ideal platform to realize an active-active architecture. It enables each application server to read and write to a local copy of the data with no restrictions. It has serious limitations, however, when it comes to data consistency.

The challenge is that two (or more) copies of the same record may be updated simultaneously by different sessions in different locations. This leads to two different versions of the same record and the database, or sometimes the application itself, must perform conflict resolution to resolve this inconsistency. Most often, a conflict resolution strategy, such as most recent update wins or the record with the larger number of modifications wins, is used since performance would be significantly impacted if some other more sophisticated resolution strategy was applied. This also means that readers in different data centers may see a different and conflicting value for the same record for the time between the writes being applied and the completion of the conflict resolution mechanism.

For example, let’s assume we are using a multi-master database as the persistence store for a shopping cart application and this application is deployed in two data centers: East and West. At roughly the same time, a user in San Francisco adds an item to his shopping cart (a flashlight) while an inventory management process in the East data center invalidates a different shopping cart item (game console) for that same user in response to a supplier notification that the release date had been delayed (See times 0 to 1 in Figure 3).

At time 1, the shopping cart records in the two data centers are different. The database will use its replication and conflict resolution mechanisms to resolve this inconsistency and eventually one of the two versions of the shopping cart (See time 2 in Figure 3) will be selected. Using the conflict resolution heuristics most often applied by multi-master databases (last update wins or most updated wins), it is impossible for the user or application to predict which version will be selected. In either case, data is lost and unexpected behavior occurs: if the East version is selected, the user’s selection of a flashlight is lost; if the West version is selected, the game console is still in the cart. Finally, any other process inspecting the shopping cart between times 1 and 2 will see non-deterministic behavior as well. For example, a background process that selects the fulfillment warehouse and updates the cart shipping costs would produce results that conflict with the eventual contents of the cart. If the process is running in the West and alternative 1 becomes reality, it would compute the shipping costs for all three items, even though the cart may soon have just one item, the book.

Figure 3 - Example inconsistency in multi-master database

The set of use cases for multi-master databases is limited to the capture of non-mission-critical data, like log data, where the occasional lost record is acceptable. Most use cases cannot tolerate the combination of data loss resulting from throwing away one version of a record during conflict resolution, and inconsistent reads that occur during this process.

Partitioned (Sharded) Database

A partitioned database divides the database into partitions, called shards. Each shard is implemented by a set of servers each of which contains a complete copy of the partition’s data. What is key here is that each shard maintains exclusive control of its partition of the data. At any given time, for each shard, one server acts as the primary and the other servers act as secondary replicas. Reads and writes are issued to the primary copy of the data. If the primary server fails for any reason (e.g., hardware failure, network partition) one of the secondary servers is automatically elected to primary.

Each record in the database belongs to a specific partition, and is managed by exactly one shard, ensuring that it can only be written to the shard’s primary. The mapping of records to shards and the existence of exactly one primary per shard ensures consistency. Since the cluster contains multiple shards, and hence multiple primaries (multiple masters), these primaries may be distributed among the data centers to ensure that writes can occur locally in each datacenter (Figure 4).

Figure 4 - Partitioned database

A sharded database can be used to implement an active-active application architecture by deploying at least as many shards as data centers and placing the primaries for the shards so that each data center has at least one primary (Figure 5). In addition, the shards are configured so that each shard has at least one replica (copy of the data) in each of the datacenters. For example, the diagram in Figure 5 depicts a database architecture distributed across three datacenters: New York (NYC), London (LON), and Sydney (SYD). The cluster has three shards where each shard has three replicas.

  • The NYC shard has a primary in New York and secondaries in London and Sydney
  • The LON shard has a primary in London and secondaries in New York and Sydney
  • The SYD shard has a primary in Sydney and secondaries in New York and London

In this way, each data center has secondaries from all the shards, so the local app servers can read the entire data set, and a primary for one shard, so that writes can be made locally as well.

Figure 5 - Active Active architecture with sharded database

The sharded database meets most of the consistency and performance requirements for a majority of use cases. Performance is great because reads and writes happen to local servers. When reading from the primaries, consistency is assured since each record is assigned to exactly one primary. This option requires architecting the application so that users/queries are routed to the data center that manages the data (contains the primary) for the query. Often this is done via geography. For example, if we have two data centers in the United States (New Jersey and Oregon), we might shard the data set by geography (East and West) and route traffic for East Coast users to the New Jersey data center, which contains the primary for the Eastern shard, and route traffic for West Coast users to the Oregon data center, which contains the primary for the Western shard.

Let’s revisit the shopping cart example using a sharded database. Again, let’s assume two data centers: East and West. For this implementation, we would shard (partition) the shopping carts by their shopping cart ID plus a data center field identifying the data center in which the shopping cart was created. The partitioning (Figure 6) would ensure that all shopping carts with a DataCenter field value of “East” would be managed by the shard with the primary in the East data center. The other shard would manage carts with the value of “West”. In addition, we would need two instances of the inventory management service, one deployed in each data center, with responsibility for updating the carts owned by the local data center.

Figure 6 - Shard key partitioning for shopping cart example

This design assumes that there is some external process routing traffic to the correct data center. When a new cart is created, the user’s session will be routed to the geographically closest data center and then assigned a DataCenter value for that data center. For an existing cart, the router can use the cart’s DataCenter field to identify the correct data center.

From this example, we can see that the sharded database gives us all the benefits of a multi-master database without the complexities that come from data inconsistency. Application servers can read and write from their local primary, but because each cart is owned by a single primary, no inconsistencies can occur. In contrast, multi-master solutions have the potential for data loss and inconsistent reads.

Database Architecture Comparison

The pros and cons of how well each database architecture meets active-active application requirements are summarized in Figure 7. In choosing between multi-master and sharded databases, the decision comes down to whether or not the application can tolerate potentially inconsistent reads and data loss. If it can, then a multi-master database might be slightly easier to deploy. If it cannot, then a sharded database is the best option. Since inconsistency and data loss are not acceptable for most applications, a sharded database is usually the best option.

MongoDB Active-Active Applications

MongoDB is an example of a sharded database architecture. In MongoDB, the construct of a primary server and set of secondary servers is called a replica set. Replica sets provide high availability for each shard and a mechanism, called Zone Sharding, is used to configure the set of data managed by each shard. Zone sharding makes it possible to implement the geographical partitioning described in the previous section. The details of how to accomplish this are described in the “MongoDB Multi-Data Center Deployments” white paper and Zone Sharding documentation, but MongoDB operates as described in the “Partitioned (Sharded) Database” section.
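
As an illustrative sketch only – the shard names, namespace, and shard key below are assumptions, not taken from the white paper or documentation cited above – zone sharding can be configured with the zone admin commands, here issued through PyMongo against a mongos:

    # Hypothetical zone sharding setup for a collection already sharded on
    # {"region": 1, "userId": 1}; all names here are illustrative.
    from pymongo import MongoClient
    from bson.min_key import MinKey
    from bson.max_key import MaxKey

    admin = MongoClient("mongodb://mongos-host:27017").admin

    # Associate each shard with the data center (zone) it lives in.
    admin.command("addShardToZone", "shard-nyc", zone="NYC")
    admin.command("addShardToZone", "shard-lon", zone="LON")
    admin.command("addShardToZone", "shard-syd", zone="SYD")

    # Pin each region's key range to its zone so writes land on a local primary.
    for region in ("NYC", "LON", "SYD"):
        admin.command("updateZoneKeyRange", "app.users",
                      min={"region": region, "userId": MinKey()},
                      max={"region": region, "userId": MaxKey()},
                      zone=region)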

Numerous organizations use MongoDB to implement active-active application architectures. For example:

  • eBay has codified the use of zone sharding to enable local reads and writes as one of its standard architecture patterns.
  • YouGov deploys MongoDB for their flagship survey system, called Gryphon, in a “write local, read global” pattern that facilitates active-active multi data center deployments spanning data centers in North America and Europe.
  • Ogilvy & Mather uses MongoDB as the persistence store for its core auditing application. Their sharded cluster spans three data centers in North America and Europe, with active data centers in North America and mainland Europe and a DR data center in London. This architecture minimizes write latency and also supports local reads for centralized analytics and reporting against the entire data set.

In addition to the standard sharded database functionality, MongoDB provides fine-grained controls for write durability and read consistency that make it ideal for multi-data center deployments. For writes, a write concern can be specified to control write durability. The write concern enables the application to specify the number of replica set members that must apply the write before MongoDB acknowledges the write to the application. By providing a write concern, an application can be sure that when MongoDB acknowledges the write, the servers in one or more remote data centers have also applied the write. This ensures that database changes will not be lost in the event of a node or data center failure.
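
For example, a write concern of "majority" might be attached to a collection along the lines of this minimal PyMongo sketch (the host and collection names are made up):

    # Illustrative only: acknowledge the write once a majority of replica set
    # members (which can span data centers) have applied it.
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://app-host:27017")
    carts = client.shop.get_collection(
        "carts", write_concern=WriteConcern(w="majority", wtimeout=5000))

    carts.insert_one({"cartId": "abc123", "items": ["flashlight"]})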

In addition, MongoDB addresses one of the potential downsides of a sharded database: less than 100% write availability. Since there is only one primary for each record, if that primary fails, then there is a period of time when writes to the partition cannot occur. MongoDB combines extremely fast failover times with retryable writes. With retryable writes, MongoDB provides automated support for retrying writes that have failed due to transient system errors such as network failures or primary elections, significantly simplifying application code.
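
With MongoDB 3.6-era drivers, retryable writes are enabled through a client option rather than extra application logic; a sketch (the host name is a placeholder):

    # Illustrative: enable retryable writes via the connection string (or the
    # equivalent retryWrites=True keyword argument on MongoClient).
    from pymongo import MongoClient

    client = MongoClient("mongodb://app-host:27017/?retryWrites=true")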

The speed of MongoDB’s automated failover is another distinguishing feature that makes MongoDB ideally suited for multi-data center deployments. MongoDB is able to fail over in 2-5 seconds (depending upon configuration and network reliability) when a node or data center fails or a network split occurs. (Note that secondary reads can continue during the failover period.) After a failure occurs, the remaining replica set members will elect a new primary and MongoDB’s driver, upon which most applications are built, will automatically identify this new primary. The recovery process is automatic and writes continue after the failover process completes.

For reads, MongoDB provides two capabilities for specifying the desired level of consistency. First, when reading from secondaries, an application can specify a maximum staleness value (maxStalenessSeconds). This ensures that the secondary’s replication lag from the primary cannot be greater than the specified duration, and thus guarantees the recency of the data being returned by the secondary. In addition, a read can also be associated with a ReadConcern to control the consistency of the data returned by the query. For example, a ReadConcern of majority tells MongoDB to only return data that has been replicated to a majority of nodes in the replica set. This ensures that the query is only reading data that will not be lost due to a node or data center failure, and gives the application a consistent view of the data over time.
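
A rough sketch of both read-side controls in PyMongo (the collection name and staleness window are illustrative):

    # Illustrative only.
    from pymongo import MongoClient
    from pymongo.read_concern import ReadConcern
    from pymongo.read_preferences import SecondaryPreferred

    client = MongoClient("mongodb://app-host:27017")

    # Read from a secondary only if it is at most 90 seconds behind the primary.
    local_reads = client.shop.get_collection(
        "carts", read_preference=SecondaryPreferred(max_staleness=90))

    # Only return data that has been replicated to a majority of members.
    durable_reads = client.shop.get_collection(
        "carts", read_concern=ReadConcern("majority"))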

MongoDB 3.6 also introduced causal consistency – guaranteeing that every read operation within a client session will always see the previous write operation, regardless of which replica is serving the request. By enforcing strict, causal ordering of operations within a session, causal consistency ensures every read is always logically consistent, enabling monotonic reads from a distributed system – guarantees that cannot be met by most multi-node databases. Causal consistency allows developers to maintain the benefits of strict data consistency enforced by legacy single node relational databases, while modernizing their infrastructure to take advantage of the scalability and availability benefits of modern distributed data platforms.
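
Causal consistency is opted into per session; a minimal sketch, again with made-up collection and field names:

    # Illustrative: reads within a causally consistent session observe the
    # session's own earlier writes, regardless of which member serves them.
    from pymongo import MongoClient

    client = MongoClient("mongodb://app-host:27017")
    carts = client.shop.carts

    with client.start_session(causal_consistency=True) as session:
        carts.update_one({"cartId": "abc123"},
                         {"$push": {"items": "game console"}},
                         session=session)
        cart = carts.find_one({"cartId": "abc123"}, session=session)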

Conclusion

In this post we have shown that sharded databases provide the best support for the replication, performance, consistency, and local-write, local-read requirements of active-active applications. The performance of distributed transaction databases is too slow and multi-master databases do not provide the required consistency guarantees. In addition, MongoDB is especially suited for multi-data center deployments due to its distributed architecture, fast failover and ability for applications to specify desired consistency and durability guarantees through Read and Write Concerns.

View MongoDB Architect Hub

Leaf in the Wild: Appsee Shapes the Mobile Revolution with Real-Time Analytics Service Powered by MongoDB

20 billion documents, 15TB of data scaled across a MongoDB cluster in the cloud. New apps each adding 1 billion+ data points per month

With mobile now accounting for more than 90% of all time spent online in some countries, delivering rich app experiences is essential. Appsee is a new generation of mobile analytics company providing business owners with deep insights into user behavior, enabling them to increase engagement, conversion, and monetization. Customers include eBay, AVIS, Virgin, Samsung, Argos, Upwork, and many more. Appsee is also featured in Google’s Fabric platform. Appsee relies on MongoDB to ingest the firehose of time-series session data collected from its customers’ mobile apps, and then makes sense of it all.

I met with Yoni Douek, CTO and co-founder of Appsee, to learn more.

Can you start by telling us a little bit about your company?

Appsee is a real-time mobile app analytics platform that provides what we call “qualitative analytics”. Our goal is to help companies understand exactly how users are using their app with a very deep set of tools, so that they can improve their app. Traditional analytics help you see numbers – we also provide reasons behind these numbers.

One of Appsee's key qualitative tools is session recordings and replay, which enable our customers to obtain the most accurate and complete picture of each user's in-app experience. We also offer touch heatmaps, which highlight exactly where users are interacting with each screen, and where users are encountering any usability issues. Our platform also provides real-time analytics, user flows, UX insights, crash statistics, and many more insights that help companies in optimizing their app.

Figure 1: Appsee session replay provides developers with deep customer experience insights

Please describe how you’re using MongoDB

Our service has two distinct data storage requirements, both of which are powered by MongoDB.

Session database: storing activity from each user as they interact with the mobile application. Session data is captured as a time-series data stream into MongoDB, including which screens they visit, for how long, what they are tapping or swiping, crashes, and so on. This raw session data is modeled in a single document, typically around 6KB in size. Complete user sessions can be reconstructed from this single MongoDB document, allowing mobile app professionals to watch the session as a video, and to optimize their app experiences.
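
To make the model concrete, a session document might be shaped roughly like the sketch below; every field name is a hypothetical stand-in, since Appsee's actual schema isn't described here:

    # Hypothetical ~6KB session document; all fields are illustrative.
    from datetime import datetime

    session_doc = {
        "appId": "com.example.shop",
        "deviceId": "device-1234",
        "appVersion": "2.1.0",
        "startedAt": datetime(2017, 5, 1, 9, 30, 12),
        "durationSeconds": 84,
        "screens": [
            {"name": "Home", "enteredAt": 0.0, "leftAt": 12.4},
            {"name": "ProductPage", "enteredAt": 12.4, "leftAt": 84.0},
        ],
        "events": [
            {"t": 3.2, "type": "tap", "x": 120, "y": 418},
            {"t": 14.7, "type": "swipe", "direction": "left"},
        ],
        "crashed": False,
    }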

Real-time mobile analytics database: session data is aggregated and stored in MongoDB to provide in-depth analysis of user behavior. Through 50+ charts and graphs, app owners can track a range of critical metrics including total installs, app version adoption, conversion funnels and cohorts, user flows, crash analytics, average session times, user retention rates, and more.

We make extensive use of MongoDB’s aggregation pipeline, relying on it for matching, grouping, projecting, and sorting of raw session data, combined with the various accumulator and array manipulation expressions to transform and analyze data. MongoDB’s secondary indexes allow us to efficiently access data by any dimension and query pattern. We typically enforce three or four secondary indexes on each collection.
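
As an illustration of that pattern (not Appsee's actual queries), an aggregation over documents shaped like the sketch above might roll up sessions per app version, with a compound secondary index to support the match stage:

    # Illustrative aggregation and index; names follow the hypothetical schema above.
    from datetime import datetime
    from pymongo import MongoClient

    db = MongoClient("mongodb://analytics-host:27017").analytics

    pipeline = [
        {"$match": {"appId": "com.example.shop",
                    "startedAt": {"$gte": datetime(2017, 5, 1)}}},
        {"$group": {"_id": "$appVersion",
                    "sessions": {"$sum": 1},
                    "avgDuration": {"$avg": "$durationSeconds"},
                    "crashes": {"$sum": {"$cond": ["$crashed", 1, 0]}}}},
        {"$sort": {"sessions": -1}},
    ]
    results = list(db.sessions.aggregate(pipeline))

    # One of the three or four secondary indexes mentioned above might look like:
    db.sessions.create_index([("appId", 1), ("startedAt", -1)])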

Figure 2: Appsee dashboard provides qualitative analytics to mobile app owners

Why did you select MongoDB?

When we began developing the Appsee service, we had a long list of requirements for the database:

  • Elastic scalability, with almost infinite capacity for user and data growth
  • High insert performance required for time-series data, coupled with low latency read performance for analytics
  • Reliable data storage so we would never lose a single user session
  • Flexible data model so we could persist highly complex, rapidly changing data generated by new generations of mobile apps
  • Developer ease-of-use, to allow us to maximize the productivity of our team, and shorten time to market. This was especially important in the early days of the company, as at the time, we only had two developers
  • Support for rich, in-database analytics so we could deliver real insights to app owners, without having to move data into dedicated analytics nodes or data warehousing infrastructure.

As we came from a relational database background, we initially thought about MySQL. But the restrictive data model imposed by its relational schema, and its inability to scale writes beyond a single node, made us realize it wouldn’t meet our needs.

We reviewed a whole host of NoSQL alternatives, and it soon became clear that the best multi-purpose database that met all of our requirements was MongoDB. Its ability to handle high velocity streams of time series data, and analyze it in place was key. And the database was backed by a vibrant and helpful community, with mature documentation that would reduce our learning curve.

Can you tell us what your MongoDB deployment looks like?

We currently run a 5-shard MongoDB cluster, with a 3-node replica set provisioned for each shard, providing self-healing recovery. We run on top of AWS, with CPU-optimized Linux-based instances.

We are on MongoDB 3.2, using the Python and C# drivers, with plans to upgrade to the latest MongoDB 3.4 release later in the year. This will help us take advantage of parallel chunk migrations for faster cluster balancing as we continue to elastically scale-out.

Can you share how MongoDB is performing for you?

MongoDB is currently storing 20 billion documents, amounting to 15TB of data, which we expect to double over the next 12 months. The cluster is currently handling around 50,000 operations per second, with a 50:50 split between reads and inserts. Through our load testing, we know we can support 2x growth on the same hardware footprint.

Can you share best practices for scaling?

My top recommendation would be to shard before you actually need to – this will ensure you have plenty of capacity to respond to sudden growth requirements when you need to. Don’t leave sharding until you are close to maximum utilization on your current setup.

To put our scale into context, every app that goes live with Appsee can send us 1 billion+ data points per month as soon as it launches. Every few weeks we run a load test that simulates 2x the data we are currently processing. From those tests, we adapt our shards, collections, and servers to be able to handle that doubling in load.

How do you monitor your cluster?

We are using Cloud Manager to monitor MongoDB, and our own monitoring system based on Grafana, Telegraf, and Kapacitor for the rest of the application stack.

Figure 3: Appsee heat maps enable app designers to visualize complex data sets

How are you measuring the impact of MongoDB on your business?

Speed to market, application functionality, customer experience, and platform efficiency.

  • We can build new features and functionality faster with MongoDB. When we hire new developers, MongoDB University training and documentation gets them productive within days.
  • MongoDB simplifies our data architecture. It is a truly multi-purpose platform – supporting high speed ingest of time-series data, coupled with the ability to perform rich analytics against that data, without having to use multiple technologies.
  • Our service is able to sustain high uptime. Using MongoDB’s distributed, self-healing replica set architecture, we deploy across AWS availability zones and regions for fault tolerance and zero downtime upgrades.
  • Each generation of MongoDB brings greater platform efficiency. For example, upgrading to the WiredTiger storage engine cut our storage consumption by 30% overnight.

MongoDB development is open and collaborative – giving us the opportunity to help shape the product. Through the MongoDB Jira project, we engage directly with the MongoDB engineering team, filing feature requests and bug reports. It is as though MongoDB engineers are an extension of our team!

Yoni, thanks for taking the time to share your experience with the company

To learn more about best practices for deploying and running MongoDB on AWS, download our guide.

Leaf in the Wild: Qumram Migrates to MongoDB to Deliver Single Customer View for Regulatory Compliance & Customer Experience

Every financial services organization is tasked with two, often conflicting, priorities:

  1. The need for rapid digital transformation
  2. Implementing compliance controls that go beyond internal systems, extending all the way to their digital borders.

However, capturing and analyzing billions of customer interactions in real time across web, social, and mobile channels is a major data engineering challenge. Qumram solves that challenge with its Q-Suite portfolio of on-premise and cloud services.

Qumram’s software is used by the most heavily regulated industries in the world to help organizations capture every moment of the customer’s journey. Every keystroke, every mouse movement, and every button click, across all digital channels. Then store it for years. As you can imagine this generates an extraordinary volume and variety of data. Some Qumram customers ingest and store multiple terabytes of this sensitive data every day.

Starting out with relational databases, Qumram quickly hit the scalability wall. After extensive evaluation of alternatives, the company selected MongoDB to provide a single source of truth for all customer interactions across any digital channel.

I met with Simon Scheurer, CTO of Qumram AG, to learn more.

Can you start by telling us a little bit about your company?

Qumram provides a single view of all customer interactions across an organization’s digital channels, helping our customers to ensure compliance, prevent fraud, and enrich the experience they deliver to their users. Our unique session recording, replay, and web archival solution captures every user interaction across web, mobile, and social channels. This means that any user session can be replayed at a moment’s notice, in a movie-like form, giving an exact replica of the activity that occurred, when, and for how long. It’s pretty rare to provide a solution that meets the needs of compliance and risk officers while also empowering marketing teams – but that is what our customers can do with Q-Suite, built on modern technologies like MongoDB.

Figure 1: Q-Suite recording of all digital interactions for regulatory compliance

Most of our customers operate in the highly regulated financial services industry, providing banking and insurance services. Qumram customers include UBS, Basler Kantonalbank, Luzerner Kantonalbank, Russell Investments, and Suva.

How are you using MongoDB?

Our solution provides indisputable evidence of all digital interactions, in accordance with the global regulatory requirements of SEC, US Department of Labor (DOL), FTC, FINRA, ESMA, MiFID II, FFSA, and more. Qumram also enables fraud detection, and customer experience analysis that is used to enhance the customer journey through online systems – increasing conversions and growing sales.

Because of the critical nature of regulatory compliance, we cannot afford to lose a single user session or interaction – unlike competitors, our system provides lossless data collection for compliance-mandated recording.

We use MongoDB to ingest, store, and analyze the firehose of data generated by user interactions across our customers’ digital properties. This includes session metadata, and the thousands of events that are generated per session, for example, every mouse click, button selection, keystroke, and swipe. MongoDB stores events of all sizes, from those that are contained in small documents typically just 100-200 bytes, through to session and web objects that can grow to several megabytes each. We also use GridFS to store binary content such as screenshots, CSS, and HTML.
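
As a minimal GridFS sketch (filenames and metadata are invented for illustration), binary assets such as screenshots can be stored and retrieved from the same database that holds the session documents:

    # Illustrative only: store a captured screenshot in GridFS and read it back.
    import gridfs
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").recordings
    fs = gridfs.GridFS(db)

    with open("screenshot.png", "rb") as f:
        file_id = fs.put(f, filename="session-42/screenshot-001.png",
                         session_id="session-42")  # extra metadata field

    screenshot_bytes = fs.get(file_id).read()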

Capturing and storing all of the session data in a single database, rather than splitting content across a database and a separate file system, massively simplifies our application development and operations. With this design, MongoDB provides a single source of truth, enabling any session to be replayed and analyzed on demand.

You started out with a relational database. What were the challenges you faced there?

We initially built our products on one of the popular relational databases, but we quickly concluded that there was no way to scale the database to support billions of sessions every year, with each session generating thousands of discrete events. Also, as digital channels grew, our data structures evolved to become richer and more complex. These structures were difficult to map into the rigid row and column format of a relational schema. So in Autumn 2014, we started to explore non-relational databases as an alternative.

What databases did you look at?

There was no shortage of choice, but we narrowed our evaluation down to Apache Cassandra, Couchbase, and MongoDB.

What drove your decision to select MongoDB?

We wanted a database that would enable us to break free of the constraints imposed by relational databases. We were also looking for a technology that was best-in-class among modern alternatives. There were three drivers for selecting MongoDB:

  1. Flexible data model with rich analytics. Session data is richly structured – there may be up to four levels of nesting and over 100 different attributes. These complex structures map well to JSON documents, allowing us to embed all related data into a single document, which provides two advantages:

    1. Boosting developer productivity by representing data in the same structure as objects in our application code.
    2. Making our application faster, as we only need to issue a single query to the database to replay a session. At the same time, we need to be able to analyze the data in place, without the latency of moving it to an analytics cluster. MongoDB’s rich query language and secondary indexes allow us to access data by single keys, ranges, full text search, graph traversals, and geospatial queries, through to complex aggregations.
  2. Scalability. The ability to grow seamlessly by scaling the database horizontally across commodity servers deployed on-premise and in the cloud, while at the same time maintaining data consistency and integrity.

  3. Proven. We surveyed customers across our target markets, and the overwhelming feedback was that they wanted us to use a database they were already familiar with. Many global financial institutions had already deployed MongoDB and didn’t want to handle the complexity that came from running yet another database for our application. They knew MongoDB could meet the critical needs of regulatory compliant services, and that it was backed by excellent technical support, coupled with extensive management tools and rock-solid security controls.

As a result, we began development on MongoDB in early 2015.

How do your customers deploy and integrate your solution?

We offer two deployment models: on-premise and as a cloud service.

Many of the larger financial institutions deploy the Q-Suite with MongoDB within their own data centers, due to data sensitivity. From our application, they can instantly replay customer sessions. We also expose the session data from MongoDB with a REST API, which allows them to integrate it with their back-office processes, such as records management systems and CRM suites, often using message queues such as Apache Kafka.

We are also rolling out the Q-Suite as a “Compliance-as-a-Service” offering in the cloud. This option is typically used by smaller banks and insurers, as well as the FinTech community.

How do you handle analytics against the collected session data?

Our application relies heavily on the MongoDB aggregation pipeline for native, in-database analytics, allowing us to roll up session data for analysis and reporting. We use the new $graphLookup operator for graph processing of the session data, identifying complex relationships between events, users, and devices. For example, we can detect if a user keeps returning to a loan application form to adjust salary in order to secure a loan that is beyond his or her capability to repay. Using MongoDB’s built-in text search along with geospatial indexes and queries, we can explore session data to generate behavioral insights and actionable fraud intelligence.
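
As a rough illustration of how $graphLookup can walk relationships in session data – the collection and field names below are assumptions, not Qumram's schema – a pipeline could follow a chain of sessions linked by a previous-session reference:

    # Illustrative $graphLookup: starting from one session, recursively collect
    # the earlier sessions it links to via a hypothetical "previousSessionId".
    # Assumes `db` is a pymongo Database and `session_id` is an existing _id.
    pipeline = [
        {"$match": {"_id": session_id}},
        {"$graphLookup": {
            "from": "sessions",
            "startWith": "$previousSessionId",
            "connectFromField": "previousSessionId",
            "connectToField": "_id",
            "as": "priorSessions",
            "maxDepth": 10,
        }},
    ]
    history = list(db.sessions.aggregate(pipeline))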

Doing all of this within MongoDB, rather than having to couple the database with separate search engines, graph data stores, and geospatial engines dramatically simplifies development and ongoing operations. It means our developers have a single API to program against, and operations teams have a single database to deploy, scale, and secure.

I understand you are also using Apache Spark. Can you tell us a little more about that?

We use the MongoDB Connector for Apache Spark to feed session data from the database into Spark processes for machine learning, and then persist the models back into MongoDB. We use Spark to generate user behavior analytics that are applied to both fraud detection, and for optimization of customer experience across digital channels.
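
In PySpark, that round trip might look something like the sketch below; the URIs, namespaces, and placeholder model step are assumptions rather than Qumram's actual pipeline:

    # Illustrative use of the MongoDB Connector for Apache Spark (2.x-era API);
    # requires the mongo-spark-connector package on the Spark classpath.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("session-ml")
             .config("spark.mongodb.input.uri",
                     "mongodb://localhost:27017/recordings.sessions")
             .config("spark.mongodb.output.uri",
                     "mongodb://localhost:27017/recordings.model_scores")
             .getOrCreate())

    # Load session documents as a DataFrame.
    sessions = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

    # ... train or score a model here (omitted) ...
    scores = sessions.select("_id")  # placeholder for real model output

    # Persist the results back into MongoDB.
    scores.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()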

We are also starting to use Spark with MongoDB for Natural Language Processing (NLP) to extract customer sentiment from their digital interactions, and other deep learning techniques for anti-money laundering initiatives.

What does a typical installation look like?

The minimum MongoDB configuration for Q-Suite is a 3-node replica set, though we have many customers running larger MongoDB clusters deployed across multiple active data centers for disaster recovery and data locality. Most customers deploy on Linux, but because MongoDB is multi-platform, we can also serve those institutions that run on Windows.

We support both MongoDB 3.2 and the latest MongoDB 3.4 release, which gives our users the new graph processing functionality and faceted navigation with full text search. We recommend customers use MongoDB Enterprise Advanced, especially to access the additional security functionality, including the Encrypted storage engine to protect data at rest.

For our Compliance-as-a-Service offering, we are currently evaluating the MongoDB Atlas managed service in the cloud. This would allow our teams to focus on the application, rather than operations.

What sort of data volumes are you capturing?

Capturing user interactions is a typical time-series data stream. A single MongoDB node can support around 300,000 sessions per day, with each session generating up to 3,000 unique events. To give an indication of scale in production deployments, one of our Swiss customers is ingesting multiple terabytes of data into MongoDB every day. Another in the US needs to retain session data for 10 years, and so they are scaling MongoDB to store around 3 trillion documents.

Of course, capturing the data is only part of the solution – we also need to expose it to analytics, without impacting write-volume. MongoDB replica sets enable us to separate out these two workloads within a single database cluster, simultaneously supporting transaction and analytics processing.

Figure 2: Analysis of funnel metrics to monitor customer conversion through digital properties

How are you measuring the impact of MongoDB on your business?

Companies operating in highly regulated industries, from financial services to healthcare to communications, are facing a whole host of new government and industry directives designed to protect digital boundaries. The Q-Suite solution, backed by MongoDB, is enabling us to respond to our customers’ compliance requirements. By using MongoDB, we can accelerate feature development to meet new regulatory demands, and implement solutions faster, with lower operational complexity.

The security controls enforced by MongoDB further enable our customers to achieve regulatory compliance.

Simon, thanks for sharing your time and experiences with the MongoDB community

To learn more about cybersecurity and MongoDB, download our whitepaper Building the Next Generation of Threat Intelligence with MongoDB

10-Step Methodology to Creating a Single View of your Business: Part 3

Mat Keep
April 24, 2017
Technical

Welcome to the final part of our single view blog series:

  • In Part 1 we reviewed the business drivers behind single view projects, introduced a proven and repeatable 10-step methodology to creating the single view, and discussed the initial “Discovery” stage of the project.
  • In Part 2 we dove deeper into the methodology by looking at the development and deployment phases of the project.
  • In this final part, we wrap up with the single view maturity model, look at required database capabilities to support the single view, and present a selection of case studies.

If you want to get started right now, download the complete 10-Step Methodology to Creating a Single View whitepaper.

10-Step Single View Methodology

As a reminder, figure 1 shows the 10-step methodology to creating the single view.

Single View Methodology

Figure 1: Single view methodology

In parts 1 and 2 of the blog series, we stepped through each of the methodology’s steps. Let’s now take a look at a roadmap for the single view – something we call the Maturity Model.

Single View Maturity Model

As discussed earlier in the series, most single view projects start by offering a read-only view of data aggregated from the source systems. But as projects mature, we have seen customers start to write to the single view. Initially they may start writing simultaneously to the source systems and single view to prove efficacy – before then writing to the single view first, and propagating updates back to the source systems. The evolution path of single view maturity is shown below.

Maturity model

Figure 2: Single view maturity model

What are the advantages of writing directly to the single view?

  1. Real-time view of the data. Users are consuming the freshest version of the data, rather than waiting for updates to propagate from the source systems to the single view.
  2. Reduced application complexity. Read and write operations no longer need to be segregated between different systems. Of course, it is necessary to then implement a change data capture process that pushes writes against the single view back to the source databases. However, in a well-designed system, the mechanism need only be implemented once for all applications, rather than read/write segregation duplicated across the application estate.
  3. Enhanced application agility. With traditional relational databases running the source systems, it can take weeks or months of developer and DBA effort to update schemas to support new application functionality. MongoDB’s flexible data model with a dynamic schema makes the addition of new fields a runtime operation, allowing the organization to evolve applications more rapidly.

Figure 3 shows an architectural approach to synchronizing writes against the single view back to the source systems. Writes to the single view are pushed into a dedicated update queue, or directly into an ETL pipeline or message queue. Again, MongoDB consulting engineers can assist with defining the most appropriate architecture.
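
As a rough illustration of that update queue, the sketch below tails the replica set oplog for writes against a hypothetical single view collection and hands each entry to a queue for propagation back to the source systems. On MongoDB 3.6 and later, a change stream (watch()) would be the supported way to build this; the collection name and in-process queue here are placeholders for a real message broker or ETL pipeline.

```python
# Sketch of a change-data-capture worker tailing the oplog; names are hypothetical.
import queue
from pymongo import MongoClient, CursorType

client = MongoClient("mongodb://localhost:27017")
oplog = client["local"]["oplog.rs"]
update_queue = queue.Queue()   # stand-in for Kafka, SQS, or an ETL pipeline

cursor = oplog.find(
    {"ns": "single_view.customers", "op": {"$in": ["i", "u", "d"]}},
    cursor_type=CursorType.TAILABLE_AWAIT)

while cursor.alive:
    for entry in cursor:
        # Hand the raw oplog entry to whatever propagates changes
        # back to the source databases.
        update_queue.put(entry)
```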

Writing to the single view

Figure 3: Writing to the single view

Required Database Capabilities to Support the Single View

The database used to store and manage the single view provides the core technology foundation for the project. Selection of the right database to power the single view is critical to determining success or failure.

Relational databases, once the default choice for enterprise applications, are unsuitable for single view use cases. The database is forced to simultaneously accommodate the schema complexity of all source systems, requiring significant upfront schema design effort. Any subsequent changes in any of the source systems’ schema – for example, when adding new application functionality – will break the single view schema. The schema must be updated, often causing application downtime. Adding new data sources multiplies the complexity of adapting the relational schema.

MongoDB provides a mature, proven alternative to the relational database for enterprise applications, including single view projects. As discussed below, the required capabilities demanded by a single view project are well served by MongoDB:

Flexible Data Model

MongoDB's document data model makes it easy for developers to store and combine data of any structure within the database, without giving up sophisticated validation rules to govern data quality. The schema can be dynamically modified without application or database downtime. If, for example, we want to start to store geospatial data associated with a specific customer event, the application simply writes the updated object to the database, without costly schema modifications or redesign.
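
A minimal sketch of that scenario, with hypothetical collection and field names: a new customer event simply arrives with a GeoJSON location attached, and a 2dsphere index makes the new field queryable, all without a schema migration.

```python
# Adding geospatial data to customer events without a schema change (hypothetical names).
from datetime import datetime
from pymongo import MongoClient, GEOSPHERE

events = MongoClient("mongodb://localhost:27017")["single_view"]["customer_events"]

# Older events never carried a location; newer ones include a GeoJSON point.
# Both shapes coexist in the same collection.
events.insert_one({
    "customer_id": "C-42",
    "type": "store_visit",
    "ts": datetime.utcnow(),
    "location": {"type": "Point", "coordinates": [-73.98, 40.76]}
})

# Index the new field so it can be queried efficiently, again with no downtime.
events.create_index([("location", GEOSPHERE)])
```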

MongoDB documents are typically modeled to localize all data for a given entity – such as a financial asset class or user – into a single document, rather than spreading it across multiple relational tables. Document access can be completed in a single MongoDB operation, rather than having to JOIN separate tables spread across the database. As a result of this data localization, application performance is often much higher when using MongoDB, which can be the decisive factor in improving customer experience.

Intelligent Insights, Delivered in Real Time

With all relevant data for our business entity consolidated into a single view, it is possible to run sophisticated analytics against it. For example, we can start to analyze customer behavior to better identify cross-sell and upsell opportunities, or risk of churn or fraud. Analytics and machine learning must be able to run across vast swathes of data stored in the single view. Traditional data warehouse technologies are unable to economically store and process these data volumes at scale. Hadoop-based platforms are unable to serve the models generated from this analysis, or perform ad-hoc investigative queries with the low latency demanded by real-time operational systems.

The MongoDB query language and rich secondary indexes enable developers to build applications that can query and analyze the data in multiple ways. Data can be accessed by single keys, ranges, text search, graph, and geospatial queries through to complex aggregations and MapReduce jobs, returning responses in milliseconds. Data can be dynamically enriched with elements such as user identity, location, and last access time to add context to events, providing behavioral insights and actionable customer intelligence. Complex queries are executed natively in the database without having to use additional analytics frameworks or tools, and avoiding the latency that comes from ETL processes that are necessary to move data between operational and analytical systems in legacy enterprise architectures.
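
For example, a text index turns free-form fields in the single view into something searchable with relevance ranking. The sketch below assumes a hypothetical customers collection with notes and interaction summary fields.

```python
# Sketch of full-text search against the single view (hypothetical names).
from pymongo import MongoClient, TEXT

customers = MongoClient("mongodb://localhost:27017")["single_view"]["customers"]

# Text index over free-form fields such as notes and interaction summaries
customers.create_index([("notes", TEXT), ("interactions.summary", TEXT)])

# Customers whose records mention a mortgage but not a refinance, ranked by relevance
results = customers.find(
    {"$text": {"$search": "mortgage -refinance"}},
    {"score": {"$meta": "textScore"}, "customer_id": 1}
).sort([("score", {"$meta": "textScore"})]).limit(10)

for doc in results:
    print(doc["customer_id"], doc["score"])
```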

Single view platform serving operational and analytical workloads

Figure 4: Single view platform serving operational and analytical workloads

MongoDB replica sets can be provisioned with dedicated analytics nodes. This allows data scientists and business analysts to simultaneously run exploratory queries and generate reports and machine learning models against live data, without impacting nodes serving the single view to operational applications, again avoiding lengthy ETL cycles.
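
One way to wire that up from the application side is a tag-aware read preference, so exploratory queries only ever touch members tagged for analytics. The "workload" tag below is a hypothetical tag that would be set in the replica set configuration.

```python
# Directing analytical reads at dedicated analytics nodes (hypothetical tag and names).
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

client = MongoClient("mongodb://rs0-a,rs0-b,rs0-c/?replicaSet=rs0")

analytics_db = client.get_database(
    "single_view",
    read_preference=Secondary(tag_sets=[{"workload": "analytics"}]))

# Long-running exploratory queries now run only on the tagged analytics
# members, leaving the primary free for operational traffic.
report = list(analytics_db["customers"].aggregate([
    {"$group": {"_id": "$segment", "count": {"$sum": 1}}}
]))
```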

Predictable Scalability with Always-on Availability

Successful single view projects tend to become very popular, very quickly. As new data sources and attributes are added, and additional consumers such as applications, channels, and users are onboarded, demands for processing and storage capacity quickly grow.

To address these demands, MongoDB provides horizontal scale-out for the single view database on low-cost, commodity hardware using a technique called sharding, which is transparent to applications. Sharding distributes data across multiple database instances, allowing MongoDB deployments to overcome the hardware limitations of a single server – such as bottlenecks in CPU, RAM, or storage I/O – without adding complexity to the application. MongoDB automatically balances single view data in the cluster as the data set grows or the size of the cluster increases or decreases.
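
In practice, enabling this is a pair of administrative commands issued through a mongos router; the database name, collection, and shard key below are hypothetical.

```python
# Sketch of sharding a single view collection (hypothetical names and shard key).
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos-host:27017")

# Enable sharding for the database, then shard the collection on a hashed
# customer_id so documents distribute evenly across the shards.
mongos.admin.command("enableSharding", "single_view")
mongos.admin.command("shardCollection", "single_view.customers",
                     key={"customer_id": "hashed"})
```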

MongoDB maintains multiple replicas of the data to maintain database availability. Replica failures are self-healing, and so single view applications remain unaffected by underlying system outages or planned maintenance. Replicas can be distributed across regions for disaster recovery and data locality to support global user bases.

Global distribution of the single view

Figure 5: Global distribution of the single view

Enterprise Deployment Model

MongoDB can be run on a variety of platforms – from commodity x86 and ARM-based servers, through to IBM Power and zSeries systems. You can deploy MongoDB onto servers running in your own data center, or public and hybrid clouds. With the MongoDB Atlas service, we can even run the database for you.

MongoDB Enterprise Advanced is the production-certified, secure, and supported version of MongoDB, offering:

  • Advanced Security. Robust access controls via LDAP, Active Directory, Kerberos, x.509 PKI certificates, and role-based access control ensure a separation of privileges across applications and users. Data anonymization can be enforced by read-only views to protect sensitive, personally identifiable information (see the sketch after this list). Data in flight and at rest can be encrypted to FIPS 140-2 standards, and an auditing framework for forensic analysis is provided.
  • Automated Deployment and Upgrades. With Ops Manager, operations teams can deploy and upgrade distributed MongoDB clusters in seconds, using a powerful GUI or programmatic API.
  • Point-in-time Recovery. Continuous backup and consistent snapshots of distributed clusters allow seamless data recovery in the event of system failures or application errors.
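
As a sketch of the read-only view mentioned above, the snippet below creates a view that strips personally identifiable fields before analysts or downstream applications can query the data. The collection and field names are hypothetical.

```python
# Read-only view that redacts PII (hypothetical collection and field names).
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["single_view"]

db.command(
    "create", "customers_redacted",
    viewOn="customers",
    pipeline=[{"$project": {"ssn": 0, "date_of_birth": 0, "full_address": 0}}])

# Consumers query the view like any collection, but never see the excluded
# fields; the view itself cannot be written to.
for doc in db["customers_redacted"].find().limit(5):
    print(doc)
```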

Single View in Action

MongoDB has been used in many single view projects. The following case studies highlight several examples.

MetLife: From Stalled to Success in 3 Months

In 2011, MetLife’s new executive team knew they had to transform how the insurance giant catered to customers. The business wanted to harness data to create a 360-degree view of its customers so it could know and talk to each of its more than 100 million clients as individuals. But the Fortune 50 company had already spent many years trying unsuccessfully to develop this kind of centralized system using relational databases.

That is why the 150-year-old insurer turned to MongoDB. In just two weeks, MetLife used MongoDB to create a working prototype of a new system that pulled together every relevant piece of customer information about each client. Three months later, the finished version of this new system, called the 'MetLife Wall,' was in production across MetLife’s call centers.

The Wall collects vast amounts of structured and unstructured information from MetLife’s more than 70 different administrative systems. After many years of trying, MetLife solved one of the biggest data challenges dogging companies today, all by using MongoDB’s innovative approach to organizing massive amounts of data. You can learn more from the case study.

CERN: Delivering a Single View of Data from the LHC to Accelerate Scientific Research and Discovery

The European Organisation for Nuclear Research, known as CERN, plays a leading role in fundamental studies of physics. It has been instrumental in many key global innovations and breakthroughs, and today operates the world's largest particle physics laboratory. The Large Hadron Collider (LHC), nestled beneath the Franco-Swiss border, is central to its research into the origins of the universe.

Using MongoDB, CERN built a multi-data center Data Aggregation System accessed by over 3,000 physicists from nearly 200 research institutions across the globe. MongoDB provides the ability for researchers to search and aggregate information distributed across all of the backend data services, and bring that data into a single view.

MongoDB was selected for the project based on its flexible schema, providing the ability to ingest and store data of any structure. In addition, its rich query language and extensive secondary indexes give users fast and flexible access to data by any query pattern. This can range from simple key-value look-ups through to complex search, traversals, and aggregations across rich data structures, including embedded sub-documents and arrays.

You can learn more from the case study.

Wrapping Up Part 3

That wraps up our 3-part blog series. Bringing together disparate data into a single view is a challenging undertaking. However, by applying the proven methodologies, tools, and technologies, organizations can innovate faster, with lower risk and cost.

Remember, if you want to get started right now, download the complete 10-Step Methodology to Creating a Single View whitepaper.


Leaf in the Wild: World’s Most Installed Learning Record Store Migrates to MongoDB Atlas to Scale Data 5x, while Reducing Costs

Learning Locker moves away from ObjectRocket to scale its learning data warehouse, used by the likes of Xerox, Raytheon, and U.K. universities.

From Amazon’s recommendations to the Facebook News Feed, personalization has become ingrained in the consumer experience, so it should come as no surprise that resourceful educators are now trying to improve learning outcomes with that same concept. After all, no two students are identical, in much the same way that no two consumers are exactly alike. Developing a truly personalized educational experience is no easy feat, but emerging standards like the xAPI are helping to make this lofty goal a reality.

xAPI is an emerging specification that enables communication between disparate learning systems in a way that standardizes learning data. That data could include things like a student’s attendance in classes or participation in online tools, but can also stretch to performance measures in the real world, such as how students apply their learning. This data-led approach to Learning Analytics is helping educators improve learning practices, tailor teaching, and intervene early if it looks like a student is moving in the wrong direction.

Learning Locker SaaS
Figure 1: Learning Locker’s software as a service (SaaS) Enterprise Dashboard

But the implications of this go far beyond the classroom, and increasingly companies are using these same techniques to support their employees’ development and to measure the impact of training on performance outcomes. Whilst educators are predicting the chances of a particular student dropping out, businesses can use these same tools to forecast organizational risk, based on compliance training and performance data, for example.

We recently spoke with James Mullaney, Lead Developer at HT2 Labs, a company at the forefront of the learning-data movement. HT2 Labs’ flagship product, Learning Locker, is an open source data warehouse used by the likes of Xerox, Raytheon, and a wide range of universities to prove the impact of training and to make more informed decisions on future learning design. To continue to scale the project, better manage its operations, and reduce costs, Learning Locker migrated from ObjectRocket to the MongoDB Atlas database as a service.

Tell us about HT2 Labs and Learning Locker.

HT2 Labs is the creator of Learning Locker, which is a data warehouse for learning activity data (commonly referred to as a Learning Record Store or LRS). We have a suite of other learning products that are all integrated; Learning Locker acts as the hub that binds everything together. Our LRS uses the xAPI, which is a specification developed in part by the U.S. Department of Defense to help track military training initiatives. It allows multiple learning technology providers to send data into a single data store in a common format.

We started playing around with xAPI around four years ago as we were curious about the technology and had our own Social Learning Management System (LMS), Curatr. Today, Learning Locker receives learning events via an API, analyzes the data stored, and is instrumental in creating reports for our end customers.

Who is using Learning Locker?

The software is open source so our users range from hobbyists to enterprise companies, like Xerox, who use our LRS to track internal employee training.

Another example is Jisc, the R&D organization that advances technologies in UK Higher & Further Education. Jisc are running one of the largest national-level initiatives to implement Learning Analytics across universities in the UK, and our LRS is used to ingest data and act as a single source of data for predictive models. This increased level of insight into individual behavior allows Jisc to do some interesting things, such as predict and preempt student dropouts.

Visualized learning data
Figure 2: Visualizing learning data

How has Learning Locker evolved?

We’re currently on version two of Learning Locker. We’ve open sourced the product and we’ve also launched it as a hosted Software as a service (SaaS) product. Today we have clients using our LRS in on-premise installations and in the cloud. Each on-prem installation comes packaged with MongoDB. The SaaS version of Learning Locker typically runs in AWS supported by MongoDB Atlas, the managed MongoDB as a Service.

Tell us about your decision to go with MongoDB for the underlying database.

MongoDB was a very natural choice for us as the xAPI specification calls for student activities to be sent as JSON. These documents are immutable. For example, you might send a document that says, “James completed course XYZ.” You can’t edit that document to say that he didn’t complete it. You would have to send another document to indicate a change. This means that scale is very important as there is a constant stream of student activity that needs to be ingested and stored. We’ve been very happy with how MongoDB, with its horizontal scale-out architecture, is handling increased data volume; to be frank, MongoDB can handle more than our application can throw at it.
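
A simplified sketch of what one of those immutable statements might look like when stored (the real xAPI specification defines many more properties, and the values here are purely illustrative):

```python
# Storing a simplified xAPI-style statement; values are illustrative.
from datetime import datetime
from pymongo import MongoClient

statements = MongoClient("mongodb://localhost:27017")["lrs"]["statements"]

statements.insert_one({
    "actor": {"name": "James", "mbox": "mailto:james@example.com"},
    "verb": {"id": "http://adlnet.gov/expapi/verbs/completed",
             "display": {"en-US": "completed"}},
    "object": {"id": "http://example.com/courses/xyz",
               "definition": {"name": {"en-US": "Course XYZ"}}},
    "timestamp": datetime.utcnow()
})
# A change of state is recorded by inserting a new statement,
# never by editing an existing one.
```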

In fact, our use of MongoDB is actually award-winning: Last year we picked up the MongoDB Innovation Award for best open source project.

Beyond using the database for ingesting and storing data in Learning Locker, how else are you using MongoDB?

As mentioned earlier, our LRS runs analytics on the data stored, and those analytics are then used in reporting for our end users. For running those queries, we use MongoDB’s aggregation framework and the associated aggregation APIs. This allows our end users to get quick reports on information they’re interested in, such as course completion rates, score distribution, etc.
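
The sketch below shows the shape of that kind of reporting aggregation – counting statements per course and verb – using the same illustrative collection and verb IRIs as the statement example above; it is not Learning Locker's actual pipeline.

```python
# Illustrative reporting aggregation: statement counts per course and verb.
from pymongo import MongoClient

statements = MongoClient("mongodb://localhost:27017")["lrs"]["statements"]

pipeline = [
    {"$match": {"verb.id": {"$in": [
        "http://adlnet.gov/expapi/verbs/completed",
        "http://adlnet.gov/expapi/verbs/attempted"]}}},
    {"$group": {"_id": {"course": "$object.id", "verb": "$verb.id"},
                "count": {"$sum": 1}}},
    {"$sort": {"count": -1}}
]

for row in statements.aggregate(pipeline):
    print(row["_id"]["course"], row["_id"]["verb"], row["count"])
```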

Our indexes are also rather large compared to the data. We index on a lot of different fields using MongoDB’s secondary indexes. This is absolutely necessary for real-time analytics, especially when the end user wants to ask many different questions. We work closely with our clients to figure out the indexes that make the most sense based on the queries they want to run against the data.

Tell us about your decision to run MongoDB in the cloud. Did you start with MongoDB Atlas or were you using a third party vendor?

Our decision to use a MongoDB as a service provider was pretty simple — we wanted someone else to manage the database for us. Initially we were using ObjectRocket and that made sense for us at the time because we were hosting our application servers on Rackspace.

Interesting. Can you describe your early experiences with MongoDB Atlas and the migration process?

We witnessed the launch of MongoDB Atlas last year at MongoDB World 2016 and spun up our first cluster with Atlas in October. It became pretty clear early on that it would work for what we needed. First we migrated our Jisc deployment and our hosted SaaS product to MongoDB Atlas and we also moved our application servers to AWS for lower latency. The migration was completed in December with no issues.

Why did you migrate to MongoDB Atlas from ObjectRocket?

Cost was a major driving force for our migration from ObjectRocket. We’ve been growing and are now storing five times as much data in MongoDB Atlas at about the same cost.

ObjectRocket was also pretty opaque about what was happening in the background and that’s not the case with MongoDB Atlas, which gives you greater visibility and control. I can see, for example, exactly how much RAM I’m using at any point in time.

And finally, nobody is going to tell you that security isn’t important, especially in an industry where we’re responsible for handling potentially sensitive student data. We were very happy with the native security features in MongoDB Atlas and the fact that we aren’t charged a percentage uplift for encryption, which was not the case with ObjectRocket.

Learning Locker Enterprise
Figure 3: Understanding student progress with Learning Locker Enterprise

Do you have any plans to integrate MongoDB with any other technologies to build more functionality for Learning Locker?

We’re looking into Hadoop, Spark, and Tableau for a few of our clients. MongoDB’s native connectors for Hadoop, Spark, and BI platforms come in handy for those projects.

Any advice for people looking into MongoDB and MongoDB Atlas?

Plan for scale. Think about what you’re doing right now and ask yourself, “Will this work when I have 100x more data? Can we afford this at 100x the scale?”

The MongoDB Atlas UI makes most things extremely easy, but remember that some things you can only do through the mongo shell. You should ensure your employees learn or retain the skills necessary to be dangerous in the CLI.

And this isn’t specific to just MongoDB, but think about the technology you’re partnering with and the surrounding community. For us, it’s incredibly important that MongoDB is a leader in the NoSQL space as it’s made it that much easier to talk about Learning Locker to prospective users and clients. We view it as a symbiotic relationship; if MongoDB is successful then so are we.

James, thanks for taking the time to share your experiences with the MongoDB community and we look forward to seeing you at MongoDB World 2017.

For deploying and running MongoDB, MongoDB Atlas is the best mix of speed, scalability, security, and ease-of-use.

Learn more about MongoDB Atlas





Leaf in the Wild: Powering Smart Factory IoT with MongoDB

Jason Ma
February 13, 2017
Customer Stories

BEET Analytics OEMs MongoDB for its Envision manufacturing IoT platform. MongoDB helps Envision deliver 1-2 orders of magnitude better performance than SQL Server, resulting in increased manufacturing throughput and reduced costs.

Leaf in the Wild posts highlight real world MongoDB deployments. Read other stories about how companies are using MongoDB for their mission-critical projects.

BEET Analytics Technology creates solutions to help the manufacturing industry transition to smart IoT factories for the next evolution of manufacturing. BEET’s Process Visibility System, Envision, makes the assembly line machine process visible and measurable down to every motion and event. Built on MongoDB, Envision is able to precisely analyze telemetry data streamed from sensors on the production line to help improve the manufacturing process.

At MongoDB World 2016, BEET Analytics was a recipient of a MongoDB Innovation Award, which recognizes organizations and individuals that took a giant idea and made a tangible impact on the world.

I had the opportunity to sit down with Girish Rao, Director of Core Development, to discuss how BEET Analytics harnesses MongoDB to power its Envision platform.

Can you tell us a little bit about BEET Analytics?

Founded in June 2011, BEET Analytics Technology is a smart manufacturing solution provider. We provide a Process Visibility System (PVS) built upon Envision, the software created by BEET. Envision monitors the automated assembly line for any potential issues in throughput, performance, and availability – and alerts users about possible breakdowns before they occur. For our customers, one minute of lost production time can result in a significant loss of revenue, so we collect and monitor the vital details of an automated assembly line. This provides predictive analytics and insights that avoid unplanned downtime and help sustain higher manufacturing throughput.

Why did you decide to OEM MongoDB?

When we started using MongoDB about 4 years ago, it was not as well known as it is now – at least not in the manufacturing industry. Our strategy was to build a complete product with MongoDB embedded within our system. We could then bring our complete system, deploy it in the plant, and have it run out of the box. This helped minimize the burden on our customer plant’s IT department to manage multiple software and hardware products. This model has worked well for us to introduce our product into several customer plants. Not only have we been able to provide a seamless customer experience, but MongoDB’s expertise both in development and production has helped us to accelerate our own product development. Additionally, co-marketing activities that promote our joint solution have been extremely beneficial to us.

How does BEET Analytics use MongoDB today?

The Envision platform consists of multiple data collectors, which are PC-based devices that are deployed close to the assembly line and stream data from the Programmable Logic Controllers (PLC). The PLCs (0.05 - 0.08 second scan cycle) continuously monitor the “motion” of hundreds of moving parts in the manufacturing facility. Each “motion” is captured by data collectors and stored in MongoDB. The transactional data for an assembly line creates about 1-3 million MongoDB documents per day, and we typically keep between 3-6 months’ worth of data, which comes out to about 500 million documents.
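
The interview doesn't say how that retention window is enforced, but one common way to implement a rolling 3-6 month window on time-stamped documents is a TTL index; the collection and field names below are hypothetical.

```python
# One possible way to enforce a ~180 day retention window (hypothetical names).
from pymongo import MongoClient

motions = MongoClient("mongodb://localhost:27017")["envision"]["motions"]

# Documents are removed automatically ~180 days after their capture timestamp.
motions.create_index("captured_at", expireAfterSeconds=180 * 24 * 3600)
```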

Can you describe your MongoDB deployment and how it’s configured?

Each data collector on the assembly line runs its own standalone MongoDB instance. For a medium sized assembly line, we will typically have 1-2 data collectors, while a larger assembly line can have 4-6 data collectors. The data collectors transfer the data through a web service up to a central repository that is backed by a MongoDB replica set and where the Envision application server runs. The central MongoDB replica set consists of a primary node, running Linux, and two secondaries that run Windows. We use Windows as a secondary because we also run Internet Information Services (IIS) for our application. This architecture is cost effective for us. In the future, we will probably run both the primary and secondary on Linux. We have failed over a few times to the secondary without any application downtime. Users interact with the application server through a browser to visualize the “heartbeat” of the entire manufacturing process. We use the MongoDB aggregation framework and map-reduce to aggregate the data and create analytics reports.

Were you using something different before MongoDB?

Our first version of the Envision platform was developed about 6 years ago using a Microsoft SQL Server database. SQL Server was okay up to a certain point, but we couldn’t scale up without using very expensive hardware. Our primary requirement was to support the throughput that our system needed without resorting to massive server arrays. In our internal benchmarks, MongoDB had 1-2 orders of magnitude better performance than SQL Server for the same hardware. At that point, we decided to build the solution using MongoDB.

Are there specific tools you use to manage your MongoDB deployment?

We currently use Ops Manager internally for development servers, and are looking to implement Ops Manager in production. Ops Manager has been extremely useful in helping us automate our deployment of MongoDB and ensuring we follow MongoDB best practices. It’s also invaluable that Ops Manager provides visibility into key metrics, so we are able to diagnose any potential problems before they happen.

Are there any best practices for deploying MongoDB that you think are pertinent and would like to share with the community?

Understanding your dataset is a critical component. As we understood our dataset better, we were able to size the hardware more appropriately. Another important practice is indexing. Make sure you have index coverage for most of the queries to avoid full collection scans. MongoDB offers an extensive range of secondary indexes that you typically don’t get in a NoSQL database. Capped collections work really well for log type data that does not need to be saved for a long period of time. Finally, use a replica set to help you with performance, always-on availability, and scalability.
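
For the capped collection suggestion, a minimal sketch (names and sizes are illustrative): the collection is created with a fixed size, and once that size is reached the oldest documents are overwritten automatically, so no clean-up job is needed.

```python
# Capped collection for short-lived, log-style data (illustrative name and size).
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["envision"]

db.create_collection("plc_event_log", capped=True, size=512 * 1024 * 1024)
```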

How are you measuring the impact of MongoDB?

MongoDB has allowed BEET to reduce the overall infrastructure cost and provide better value to customers. From a development perspective, MongoDB’s flexible data model with dynamic schema has allowed us to make application changes faster, and rapidly add new features to our product to maintain competitive advantage and better serve our customers.

What advice would you give for someone using MongoDB for their project?

MongoDB has best practices guides and whitepapers that are really helpful to ensure you follow the right guidelines. Also, we have been using Ops Manager in our development environment and it has been a huge advantage for troubleshooting any performance or setup issues. This is something we plan to implement in production and recommend other users do as well.

Girish, thank you so much for taking the time to share your experiences with the MongoDB community.


Learn how to harness MongoDB for your IoT solutions


Leaf in the Wild: How EG Built a Modern, Event-Driven Architecture with MongoDB to Power Digital Transformation

Mat Keep
January 30, 2017
Customer Stories

UK’s Leading Commercial Property Data Service Delivers 50x More Releases Per Month, Achieving Always-On Availability

The total value of UK commercial property is estimated at close to £1.7 trillion¹. Investment decisions on numbers that big require big data, especially the ability to handle a big variety of multi-structured data. And that is why EG, the UK’s leading commercial property data service, turned to MongoDB.

I met with Chris Fleetwood, Senior Director of Technology, and Chad Macey, Principal Architect at EG. We discussed how they are using agile methodologies with devops, cloud computing, and MongoDB as the foundation for the company’s digital transformation – moving from a century old magazine, into a data driven technology service.

Can you start by telling us a little bit about your company?

Building on over 150 years of experience, EG (formerly Estates Gazette) delivers market-leading data & analytics, insight, and decision support tools covering the UK commercial property market. Our services are used by real estate agents, lawyers, investors, surveyors, and property developers. We enable them to make faster, better informed decisions, and to win more business in the property market. We offer a comprehensive range of market data products with information on hundreds of thousands of properties across the UK, accessible across multiple channels including print, digital, online, and events. EG is part of Reed Business Information, providing data and analytics to business professionals around the world.

What is driving digital transformation in your business?

Our business was built on print media, with the Estates Gazette journal serving as the authoritative source on commercial property across the UK for well over a century. Back in the 1990s, we were quick to identify the disruptive potential of the Internet, embracing it as a new channel for information distribution. Pairing our rich catalog of property data and market intelligence with new data sources from mobile and location services – and the ability to run sophisticated analytics across all of it in the cloud – we are accelerating our move into enriched market insights, complemented with decision support systems.

IT was once just a supporting function for the traditional print media side of our business, but digital has now become our core engine for growth and revenue.

Can you describe your platform architecture?

Our data products are built on a microservices-inspired architecture that we call “pods”. Each pod serves a specific data product and audience. For example, the agent pod provides market intelligence for each geographic area such as recent sales, estimated values, local amenities, and zone regulations, along with lists of potential buyers and renters. Meanwhile the surveyor pod will maintain data used to generate valuations of true market worth. The pods also implement the business rules that drive the workflow for each of our different user audiences.

The advantage of our pod-based architecture is that it improves on our deployment and operational capabilities, supporting the transition to continuous delivery – giving us faster time to market for new features demanded by our customers. Each pod is owned by a “squad” with up to half a dozen members, comprising devops engineers, architects, and product managers.

EG Pod Architecture

Figure 1: EG Pod Architecture

How are you using MongoDB in this architecture?

Each pod maintains a local instance of the data it needs, pulling from a centralized system of record that stores all property details, transactions, locations, market data, availability, and so on. The system of record – or the “data-core pod” as we call it – in addition to each of the data product pods all run on MongoDB.

MongoDB is at the center of our event-driven architecture. All updates are committed to the system of record – for example, when a new property comes onto the market – which then uses the Amazon Web Services (AWS) SNS push notification and SQS message queue services to publish the update to all the other product pods. This approach means all pods are working with the latest copy of the data, while avoiding tight coupling between each pod and the core database.
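
A rough sketch of that publish step is shown below: commit the update to the system of record, then notify the product pods through SNS so each pod can refresh its own local copy. The topic ARN, collection, and field names are hypothetical.

```python
# Sketch of commit-then-publish in an event-driven flow (hypothetical names).
import json

import boto3
from pymongo import MongoClient

properties = MongoClient("mongodb://localhost:27017")["data_core"]["properties"]
sns = boto3.client("sns", region_name="eu-west-1")

new_property = {"property_id": "P-1001", "postcode": "EC1A 1BB",
                "status": "on_market"}
properties.insert_one(new_property)

# Fan the event out; each pod consumes it from its own SQS queue and
# updates its local MongoDB instance.
sns.publish(
    TopicArn="arn:aws:sns:eu-west-1:123456789012:property-events",
    Message=json.dumps({"event": "property_listed",
                        "property_id": new_property["property_id"]}))
```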

Why did you select MongoDB?

Agility and time to market are critical. We decided to use a JavaScript-based technology stack that allows a consistent developer experience from the client, to the server, through to the database, without having to deal with any translations between layers.

We evaluated multiple database options as part of our platform modernization:

  • Having used relational databases extensively in the past, we knew the pain and friction involved with having to define a schema in the database, and then re-implement that same schema again in an ORM at the application layer. And we would need to repeat this process for each pod we developed, and for each change in the data model as we evolved application functionality.
  • We also took a look at an alternative NoSQL document database, but it did not provide the development speed we needed as we found it was far too complex and difficult to use.

As the M in the MEAN stack, we knew MongoDB would work well with JavaScript and Node.js. I spun up a local instance on my laptop, and was up and running in less than 5 minutes, and productive within the hour. I judge all technology by my one hour rule. Basically, if within an hour I can start to understand and be productive with the technology, that tells me it's got a really good developer experience, supported by comprehensive documentation. If it's harder than that, I’m not likely to get along with that technology in the longer term. We didn’t look back from that point onwards – we put MongoDB through its paces to ensure it delivered the schema flexibility, query functionality, performance, and scale we needed, and it passed all of our tests.

Can you describe your MongoDB deployment?

We run MongoDB Enterprise Advanced on AWS. We get access to MongoDB support, in addition to advanced tooling. We are in the process of installing Ops Manager to take advantage of fine-grained monitoring telemetry delivered to our devops team. This insight enables them to continue to scale the service as we onboard more data products and customers.

MongoDB Compass is a new tool we’ve started evaluating. The schema visualizations can help our developers to explore and optimize each pod’s data model. The new integrated geospatial querying capability is especially valuable for our research teams. They can use the Compass GUI to construct sophisticated queries with a simple point and click interface, returning results graphically, and as sets of JSON documents. Without this functionality, our valuable developer resource would have been tied up creating the queries for them.

We will also be upgrading to the latest MongoDB release to take advantage of the MongoDB Encrypted storage engine to extend our security profile, and to prepare for the new EU General Data Protection Regulation (GDPR), which comes into force in 2018.

How is MongoDB performing for you?

MongoDB has been rock solid for us. With very few operational issues, our team is able to focus on building new products. The flexible data model, rich query language, and indexing make development super-fast. Geospatial search is the starting point for the user experience – a map is the first thing a customer sees when they access our data products. MongoDB’s geospatial queries and indexes allow our customers to easily navigate market data by issuing polygon searches that display properties within specific coordinates of interest.
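
A minimal sketch of that polygon search, with hypothetical collection and field names (GeoJSON coordinates are [longitude, latitude] pairs, and the ring must close on its starting point):

```python
# Finding properties inside a drawn polygon (hypothetical names and coordinates).
from pymongo import MongoClient

properties = MongoClient("mongodb://localhost:27017")["eg"]["properties"]

polygon = {
    "type": "Polygon",
    "coordinates": [[
        [-0.12, 51.50], [-0.10, 51.50], [-0.10, 51.52],
        [-0.12, 51.52], [-0.12, 51.50]   # last point repeats the first
    ]]
}

for prop in properties.find({"location": {"$geoWithin": {"$geometry": polygon}}}):
    print(prop["property_id"], prop.get("asking_price"))
```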

Navigating property market data with geospatial search UI

Figure 2: Navigating property market data with sophisticated geospatial search UI

We also depend on the MongoDB aggregation pipeline to generate the analytics and insights the business, and our customers, rely on. For example, we can quickly generate averages for rents achieved in a specific area, aggregated against the total number of transactions over a given time period. Each MongoDB release enriches the aggregation pipeline, so we always have new classes of analytics we can build on top of the database.

How are you measuring the impact of MongoDB on your business?

It’s been a core part of our team transitioning from being a business support function to being an enabler of new revenue streams. With our pod-based architecture powered by MongoDB, we can get products to market faster, and release more frequently.

With relational databases, we were constrained to one new application deployment per month. With MongoDB, we are deploying 50 to 60 times per month. And with MongoDB’s self-healing replica set architecture, we’ve improved uptime, delivering always-on availability to the business.

Chris and Chad, thanks for taking the time to share your experiences with the MongoDB community.

To learn more about microservices, download our new whitepaper:

Microservices and the Evolution of Building Modern Apps



1. http://www.bpf.org.uk/about-real-estate

Leaf in the Wild: How Loopd uses MongoDB to Power its Advanced Location-Tracking Platform for Conferences

Leo Zheng
November 10, 2016
Customer Stories

Conferences can be incredibly hectic experiences for everyone involved. You have attendees wanting to meet and exchange information, sponsors and exhibitors looking to maximize foot traffic to their booths, and the conference hosts trying to get a sense of how they can optimize their event and if it was all worth it in the end.

While sponsors usually do get a lead list immediately after an event for their troubles, attendees often struggle to remember who they actually spoke to, and event hosts are often left in the dark about what they can do to maximize the returns on their investments. Enter Loopd, an advanced events engagement platform.

I sat down with their CEO, Brian Friedman, to understand how they’re using MongoDB to help conference attendees and event hosts separate the signal from the noise.

Tell us about Loopd.
Loopd provides physical intelligence for corporate events. We help corporate marketers learn how people interact with each other, with their company, and with their company's products. The Loopd event engagement system is the industry's only bi-directional CRM solution that enables the exchange of content and contact information passively and automatically. We equip conference attendees with Loopd wearable badges, which can be used to easily exchange contact information or gain entry into sessions. Through our enterprise IoT analytics and sensors, we then gather and interpret rich data so that marketers have a more sophisticated understanding of business relationships and interactions at conferences, exhibits and product activation events.

Some of our clients include Intel, Box, Twilio, and MongoDB.

loopd bluetooth badges

Bluetooth LE Loopd Badges

How are you using MongoDB?
We use MongoDB to store millions of datapoints from connected advertising and Bluetooth LE Loopd Badges on the conference floor. All of the attendee movement data captured by the Loopd Badge at an event can be thought of as time series data associated with location information. We track each Loopd Badge’s location and movement path in real time during the event. As a result, we handle heavy write operations during an event to make sure any and all calculations are consistent, timely, and accurate.

We also use the database for real-time analysis. For example, we calculate the number of attendee visits & returns, and average time durations in near real time. We use the aggregation framework in MongoDB to make this happen.

What did you use before MongoDB?
Before MongoDB, we used PostgreSQL as our main data store. We used Redis as a temporary data buffer queue for storing new movement data. The data was dumped, inserted, and updated into rows in the SQL database once per second. The raw location data was read and parsed from the SQL database into a user-readable format. We needed a temporary buffer because the high volume of insert and update requests drained available resources.

What challenges did you face with PostgreSQL?
With PostgreSQL, we needed a separate Redis caching server to buffer write and update operations before storing them in the database, which added architectural and operational complexity. It also wasn’t easy to scale as it’s not designed to be deployed across multiple instances.

How did MongoDB help you resolve those challenges?
When we switched to MongoDB from PostgreSQL, our write throughput significantly increased, removing the need for a separate caching server in between the client and the database. We were able to halve our VM resource consumption (CPU power and memory), which translated to significant cost savings. As a bonus, our simplified underlying architecture is now much easier to manage.

Finally, one of the great things about MongoDB is its data model flexibility, which allows us to rapidly adapt our schema to support new application demands, without the need to incur downtime or manage complex schema migrations.

Please describe your MongoDB deployment.
We typically run one replica set per event. The database size depends on the event — for MongoDB World 2016, we generated about 2 million documents over the course of a couple of days. We don’t shard our MongoDB deployments yet but having that ability in our back pocket will be very important for us going forward.

At the moment, all of our read queries are executed on the secondaries in the replica set, which means write throughput isn’t impacted by read operations. The smallest analytics window in our application is a minute, which means we can tolerate any eventual consistency from secondary reads.

Our MongoDB deployments are hosted in Google Cloud VM instances. We’re exploring using containers but they’re currently not in use for any production environments. We’re also evaluating Spark and Hadoop for doing some more interesting things with the data we have in MongoDB.

What version of MongoDB are you running?
We use MongoDB 3.2. We find the added document validation feature very valuable for checking data types. While we will still perform application-level error validation, we appreciate this added level of security.
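
As a sketch of the MongoDB 3.2-era document validation Brian describes, the snippet below rejects movement documents whose fields have the wrong types. The collection and field names are hypothetical; the numeric $type codes (2 for string, 9 for date) are used because string aliases for $type arrived in a later release.

```python
# Basic type-checking with document validation (hypothetical names).
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["loopd"]

db.create_collection(
    "movements",
    validator={
        "badge_id": {"$type": 2},     # BSON type 2 = string
        "recorded_at": {"$type": 9}   # BSON type 9 = date
    },
    validationAction="error")         # reject documents that fail validation
```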

What advice do you have for other companies looking to start with MongoDB?
MongoDB is flexible, scalable, and quite developer and DBA friendly, even if you’re used to RDBMS.

We would recommend familiarizing yourself with the basic concepts of MongoDB first, leaning heavily on the community as you learn. I’d also recommend reading the production notes to optimize system configuration and operational parameters.

Brian, thanks for taking the time to share your story with the MongoDB community.

Thinking of migrating to MongoDB from a relational database? Learn more from our guide:

Download RDBMS Migration Guide