GIANT Stories at MongoDB

MongoDB University MCP Spotlight: Nestor Campos

MongoDB

Releases

As part of MongoDB University’s two-year anniversary, we are sharing stories from MongoDB University students to showcase how they got started with MongoDB and where they’ve gone since graduation.

This is a guest post by Nestor Campos, a software engineer and consultant from Chile and a Certified MongoDB Developer, sharing his experience learning MongoDB through MongoDB University.

As a software engineer, I know I always need to keep up with new technologies to stay relevant in the market.

After several years working with relational databases, I found that they were not meeting the needs of all my projects, so after doing some research I discovered “non-relational” (NoSQL) databases and started experimenting with them. I decided to work with MongoDB for its simplicity, ease of use, and wide range of use cases.

As soon as I started with MongoDB, MongoDB University opened, and I enrolled in the first course, M101P. It was difficult at first, but I learned quickly and working with MongoDB became more natural to me every day of the course.

After passing the M101P course, it was time to validate my work and get certified. I signed up for the Developer Certification exam, studied for several months, and passed it successfully. I was very excited, and passing the exam motivated me to continue working with MongoDB.

All of the courses and the certification I’ve completed through MongoDB University have helped give a new vision to the projects I work on, and opened up a range of possibilities for my entire organization.

The knowledge I built through MongoDB University allowed me to give back to the community and get ahead at work. After getting certified, I taught some workshops on Big Data. Next week, we are starting a large project that will involve collecting and manipulating data from social networks, and MongoDB has already been chosen as the core database to meet our objectives. Now I feel like a real Big Data engineer.

I am now certified as a MongoDB Developer and my goal is to become a certified MongoDB DBA next year. I hope more community members will take advantage of MongoDB certification and share the benefits of MongoDB within their organizations and to their clients.

How to Perform Fuzzy-Matching with Mongo Connector and Elastic Search

MongoDB

Releases

By Luke Lovett, Python Engineer at MongoDB

Introduction

Suppose you’re running MongoDB. Great! Now you can find exact matches to all the queries you can throw at the database. Now, imagine that you’re also building a text-search feature into your application. It has to draw words out of misspelled noise, and results may match on synonyms, too! For this daunting task you’ve chosen to use one of the Lucene-based projects, Elasticsearch or Solr. But now you have a problem: how will this tool search through your documents stored in MongoDB? And how will you keep the contents of the search engine up to date?

Mongo Connector fills a gap between MongoDB and some of the best search tools out there, such as Elasticsearch and Solr. It is not only capable of exporting data from your MongoDB replica set or sharded cluster to these systems, but also keeps your data consistent between these systems: as you insert, update, and remove documents in MongoDB, these changes are soon reflected on the other side through Mongo Connector. You may even use Mongo Connector to stream changes performed on one replica set primary to another, thus simulating a “multi-master” cluster.

When Mongo Connector saw its first release in August of 2012, it was very simplistic in its capabilities and lacked fault tolerance. I’ve been working on Mongo Connector since November, 2013 with the help of the MongoDB Python team, and I’m glad to say that Mongo Connector has come a long way in terms of the features it provides and (especially) stability. This post will show off some of these new features and give an example of how to replicate operations from MongoDB to Elasticsearch, an open-source search engine, using Mongo Connector. At the end of this post, we’ll be able to make fuzzy-match text queries against data streaming into Elasticsearch.

Getting our Dataset

For this post, we’ll be pulling in posts from the popular link aggregation website, Reddit. We recently added safe encoding of the data types supported by MongoDB (i.e., BSON types) into types that external database drivers (in this case, elasticsearch-py) can handle. This makes Mongo Connector safe to use for replicating documents whose content we may not have much control over (e.g., from web scraping). Using this script, which pulls new posts from Reddit, we’ll stream them into MongoDB:

./reddit2mongo --mongo-host localhost --mongo-port 27017

As each post is processed, you should see the first 20 characters of its title. This (admittedly slowly, thanks to Reddit API limits) emulates the inserts into MongoDB that your application would be making.

Firing up the Connector

Next, we’ll start Mongo Connector. To download and install Mongo Connector, you can use pip:

pip install mongo-connector

For this demonstration, we’ll assume that you already have Elasticsearch set up and running on your local machine, listening on port 9200. You can start replicating from MongoDB to Elasticsearch using the following command:

mongo-connector -m localhost:27017 -t localhost:9200 -d mongo_connector/doc_managers/elastic_doc_manager.py

Of course, if we only want to perform text search on post titles and text, we can restrict which fields are passed through to Elasticsearch using the --fields option. This way, we can minimize the amount of data we are actually duplicating:

mongo-connector -m localhost:27017 -t localhost:9200 --fields title,text -d mongo_connector/doc_managers/elastic_doc_manager.py

Just as you see the Reddit posts printed to STDOUT by reddit2mongo, you should see output coming from Mongo Connector logging the fact that each document has been forwarded to ES at about the same time! What a beautiful scene!

Searching, Elastically

Now we’re ready to use Elasticsearch to perform fuzzy text queries on our dataset as it arrives from MongoDB. Because we’re streaming directly from Reddit’s website, I can’t really say what results you’ll find in your dataset, but as this particular corner of the internet seems to love cats almost as much as we love search engines, it’s probably safe to say that a query for kitten will get you somewhere:

curl -XPOST 'http://localhost:9200/reddit.posts/_search' -d '{
  "query": {
    "match": {
      "title": {
        "query": "kitten",
        "fuzziness": 2,
        "prefix_length": 1
      }
    }
  }
}'

Because we’re performing a fuzzy search, we can even do a search for the non-word kiten. Since most people aren’t too careful with their spelling, you can imagine how powerful this feature is when performing text searches based directly on a user’s input:

curl -XPOST 'http://localhost:9200/reddit.posts/_search' -d '{
  "query": {
    "match": {
      "title": {
        "query": "kiten",
        "fuzziness": 2,
        "prefix_length": 1
      }
    }
  }
}'

The fuzziness parameter determines the maximum “edit distance” allowed between the text query and a field for them to match. The prefix_length parameter says that results have to match the first letter of the query. This article offers a great explanation of how this works. This search yielded the same results for me as its properly-spelled version.
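
To make the “edit distance” idea concrete, here’s a small, purely illustrative Java sketch of the classic Levenshtein calculation; Elasticsearch’s fuzzy matching is more sophisticated than this, but the intuition is the same: “kiten” is one edit away from “kitten”, so it falls comfortably within a fuzziness of 2.

public class EditDistance {
    // Classic Levenshtein distance: the number of single-character insertions,
    // deletions, or substitutions needed to turn one string into the other.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // deletion
                                            d[i][j - 1] + 1),   // insertion
                                   d[i - 1][j - 1] + cost);     // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "kiten")); // prints 1
    }
}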

More than just Inserts

Although our demo just took advantage of continuously streaming documents from MongoDB to Elasticsearch, Mongo Connector is more than just an import/export tool. When you update or delete documents in MongoDB, those operations are replicated to your other systems as well, keeping all systems in sync with the current primary of the replica set. If the primary fails over and a rollback occurs, Mongo Connector can detect these events and do the right thing to maintain consistency.

Recap

The really great thing about this is that we’re performing operations in MongoDB and Elasticsearch at the same time. Without a tool like Mongo Connector, we would have to use a tool like mongoexport to dump data from MongoDB periodically into JSON, then upload this data into an empty Elasticsearch index, so we don’t have previously-deleted documents hanging around. This would probably be an enormous hassle, and we would lose the near real-time capability of our ES-powered search engine.

Although Mongo Connector has improved substantially since its first release, it’s still an experimental project and has a ways to go before official support by MongoDB, Inc. However, I am committed to answering questions as well as reviewing feature requests and bug reports filed on Mongo Connector’s issues page on GitHub. Also be sure to check out the full documentation on its GitHub wiki page.

Resources

Setting up Java Applications to Communicate with MongoDB, Kerberos and SSL

MongoDB

Releases

By Alex Komyagin, Technical Services Engineer at MongoDB

Setting up Kerberos authentication and SSL encryption in a MongoDB Java application is not as simple as it is in other languages. In this post, I’m going to show you how to create a Kerberos- and SSL-enabled Java application that communicates with MongoDB.

My original setup consists of the following:

1) KDC server:

kdc.mongotest.com

kerberos config file (/etc/krb5.conf):

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = MONGOTEST.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 MONGOTEST.COM = {
  kdc = kdc.mongotest.com
  admin_server = kdc.mongotest.com
 }

[domain_realm]
 .mongotest.com = MONGOTEST.COM
 mongotest.com = MONGOTEST.COM

KDC has the following principals:

  • gssapitest@MONGOTEST.COM - user principal (for the Java app)
  • mongodb/rhel64.mongotest.com@MONGOTEST.COM - service principal (for the MongoDB server)

2) MongoDB server:

rhel64.mongotest.com

MongoDB version: 2.6.0

MongoDB config file:

dbpath=<some path>
logpath=<some path>
fork=true
auth = true
setParameter = authenticationMechanisms=GSSAPI
sslOnNormalPorts = true
sslPEMKeyFile = /etc/ssl/mongodb.pem

This server also has the global environment variable $KRB5_KTNAME set to the keytab file exported from KDC.

Application user is configured in the admin database like this:

{ "_id" : "$external.gssapitest@MONGOTEST.COM", "user" : "gssapitest@MONGOTEST.COM", "db" : "$external", "credentials" : { "external" : true }, "roles" : [ { "role" : "readWrite", "db" : "test" } ] }

3) Application server: has stock OS with krb5 installed

All servers are running with RHEL6.4 onboard.

Now let’s talk about how to create a Java application with Kerberos and SSL enabled, and that will run on the application server. Here is the sample code that we will use (SSLApp.java):

import com.mongodb.*;
import javax.net.ssl.SSLSocketFactory;
import java.util.Arrays;

public class SSLApp {
   public static void main(String args[])  throws Exception {

            MongoClient m = new MongoClient(new MongoClientURI("mongodb://gssapitest%40MONGOTEST.COM@rhel64.mongotest.com/?authMechanism=GSSAPI&ssl=true"));

            DB db = m.getDB( "test" );
            DBCollection c = db.getCollection( "test");

            System.out.println( c.findOne() );
        }
    }

Download the java driver:

wget http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.12.1/mongo-java-driver-2.12.1.jar

Install java and jdk:

sudo yum install java-1.7.0
sudo yum install java-1.7.0-devel

Create a certificate store for Java and store the server certificate there, so that Java knows who it should trust:

keytool -importcert -file mongodb.crt -alias mongoCert -keystore firstTrustStore

(mongodb.crt is just a public certificate part of mongodb.pem)

Copy the Kerberos config file to the application server: /etc/krb5.conf (or C:\WINDOWS\krb5.ini on Windows); otherwise you’ll have to specify the KDC and realm as Java runtime options.

Use kinit to store the principal password on the application server:

kinit gssapitest@MONGOTEST.COM

As an alternative to kinit, you can use JAAS to cache kerberos credentials.

Compile and run the Java program

javac -cp ../mongo-java-driver-2.12.1.jar SSLApp.java
java -cp .:../mongo-java-driver-2.12.1.jar -Djavax.net.ssl.trustStore=firstTrustStore -Djavax.net.ssl.trustStorePassword=changeme -Djavax.security.auth.useSubjectCredsOnly=false SSLApp

It is important to specify useSubjectCredsOnly=false; otherwise you’ll get the “No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)” exception from Java. As we discovered, this is not strictly necessary in all cases, but it is if you are relying on kinit to get the service ticket.
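
If you’d rather not pass these settings as -D flags, the same system properties can be set programmatically before the MongoClient is constructed. This is just a sketch of that alternative, reusing the trust store and password from the keytool step above:

// Equivalent to the -D runtime options above; call these before creating the MongoClient.
System.setProperty("javax.net.ssl.trustStore", "firstTrustStore");       // keystore created with keytool
System.setProperty("javax.net.ssl.trustStorePassword", "changeme");      // the password you chose for it
System.setProperty("javax.security.auth.useSubjectCredsOnly", "false");  // let GSSAPI use the kinit ticket cache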

The Java driver needs to construct the MongoDB service principal name in order to request the Kerberos ticket. The service principal is constructed from the server name you provide (unless you explicitly ask it to canonicalize the server name). For example, if I change rhel64.mongotest.com to the host’s IP address in the connection URI, I get the Kerberos exception “No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - UNKNOWN_SERVER)”. So be sure to specify the same server host name that you used in the Kerberos service principal. Adding -Dsun.security.krb5.debug=true to the Java runtime options helps a lot in debugging Kerberos auth issues.

These steps should help simplify the process of connecting Java applications to MongoDB with Kerberos and SSL. Before deploying any application with MongoDB, be sure to read through our 12 tips for going into production and the Security Checklist, which outlines recommended security measures to protect your MongoDB installation. More information on configuring MongoDB security can be found in the MongoDB Manual.

For further questions, feel free to reach out to the MongoDB team through Google Groups.

Now Available, By Popular Demand: Production Support for MongoDB Community Edition

Meghan Gill

Company

Editor's Note: We no longer offer a standalone support option. If you're looking for a way to run MongoDB according to operations best practices, try our hosted solution, MongoDB Atlas.

Over the past few years, we’ve heard from many users that they love our Community Edition, and wish there was a simple way to buy support for it. Now there is.

We’re pleased to announce the availability of Production Support for MongoDB, making it easier for you to get expert guidance from the same team that builds MongoDB. Our Production Support offering is now available as a standalone service -- separate from our MongoDB Enterprise software. This means that Community Edition users now have access to our world-class team of support engineers.

Here’s what you get with Production Support:

  • 24 x 7 x 365 access to world-class support
  • 2-Hour SLA
  • Proactive and consultative advice from our support engineers

When you purchase Production Support, our global support organization becomes an extension of your team, available to guide you through every stage of the application lifecycle. You’ll rest easy knowing that should you have a critical database issue, you won’t need to solve it alone. Our support engineers will be by your side, helping you keep your app running smoothly.

Even better, Production Support goes beyond the typical “break/fix” scenario. Our highly-experienced support team offers consultative, proactive assistance on topics ranging from schema design and index optimization to performance testing and scaling out. We support thousands of MongoDB systems with simple and complex deployment topologies. We can help ensure that you are following best practices and getting the best possible performance from MongoDB.

When you log a ticket with MongoDB Production Support, you can expect a response from someone with deep expertise. Our Technical Services Engineers typically have 10+ years of experience in IT and complete 9 months of intensive MongoDB training before answering tickets. Our customers are consistently impressed by the depth and breadth of knowledge across the team.

If you love the Community Edition and want support, your wait is over. Now you can access the same high quality support that our Enterprise customers have enjoyed for years. Contact us to learn more.

EA Scores With MongoDB-based FIFA Online 3

MongoDB

case study, EA, Business

Think the World Series is big? Or the Super Bowl? Neither comes close to the billions of people that tune in to watch the World Cup, soccer's (football to everyone outside North America) quadrennial event.

But what about the most played game in your household? That’s likely Electronic Arts’ EA Sports FIFA, the world's best-selling sports video game franchise. EA Sports FIFA offers otherwise average athletes the chance to take on and beat the world’s best, weaving intricate passing plays and mastering Messi-esque dribbling with the flick of a controller.

All without leaving the comfort of their couch.

Not everyone chooses to play FIFA on their Xbox or PlayStation, however. Throughout Asia, one of the most popular ways to bend it like Beckham is with EA’s FIFA Online 3. The massively multiplayer online soccer game is the most popular sports game in Korea, allowing players to choose and customize a team from any of over 30 leagues and 15,000 real-world players.

Players like Ronaldo. Like Özil. Like Ibrahimovic. Or Park Ji-sung. (More on him later.)

Because EA FIFA Online 3, developed by Spearhead, one of EA’s leading development studios, needs to scale to millions of players, Spearhead built the game to run on MongoDB, the industry's most scalable database. EA already runs 250 MongoDB servers, spread across 80 shards. As EA FIFA Online 3 continues to grow in popularity, EA expects MongoDB's autosharding and other features to make it simple to scale.

Not content to win accolades on the field, EA FIFA Online 3 has also garnered honors from the industry, most recently winning a MongoDB Innovation Award, due to its creative use of MongoDB.

Even better, EA's Spearhead, recipient of the $2,500 award for its work, donated it to the Park Ji-sung JS Foundation. Park, who played for years for Manchester United and tormented Arsenal defenses, “congratulate[d] Spearhead on the great performance of FIFA Online 3,” performance enabled by its underlying MongoDB data infrastructure.

In addition to EA, Kixeye and a variety of other gaming companies use MongoDB to improve the gaming experience.

Getting Started with MongoDB and Java: Part II

MongoDB

Releases

By Trisha Gee, Java Engineer and Advocate at MongoDB

In the last article, we covered the basics of installing and connecting to MongoDB via a Java application. In this post, I’ll give an introduction to CRUD (Create, Read, Update, Delete) operations using the Java driver. As in the previous article, if you want to follow along and code as we go, you can use these tips to get the tests in the Getting Started project to go green.

Creating documents

In the last article, we introduced documents and how to create them from Java and insert them into MongoDB, so I’m not going to repeat that here. But if you want a reminder, or simply want to skip to playing with the code, you can take a look at Exercise3InsertTest.

Querying

Putting stuff in the database is all well and good, but you’ll probably want to query the database to get data from it.

In the last article we covered some basics on using find() to get data from the database. We also showed an example in Exercise4RetrieveTest. But MongoDB supports more than simply getting a single document by ID or getting all the documents in a collection. As I mentioned, you can query by example, building up a query document with a similar shape to the documents you want.

For the following examples I’m going to assume a document which looks something like this:

person = {
  _id: "anId",
  name: "A Name",
  address: {
    street: "Street Address",
    city: "City",
    phone: "12345"
  },
  books: [ 27464, 747854, ...]
}  

Find a document by ID

To recap, you can easily get a document back from the database using the unique ID:

DBCursor cursor = collection.find(new BasicDBObject("_id", "theId"));

…and you get the values out of the document (represented as a DBObject) using a Map-like syntax:

(String) cursor.one().get("name")

In the above example, because you’ve queried by ID (and you knew that ID existed), you can be sure that the cursor has a single document that matches the query. Therefore you can use cursor.one() to get it.

Find all documents matching some criteria

In the real world, you won’t always know the ID of the document you want. You could be looking for all the people with a particular name, for example.

In this case, you can create a query document that has the criteria you want:

DBCursor results = collection.find(new BasicDBObject("name", "The name I want to find"));

You can find out the number of results:

results.size();

and you can, naturally, iterate over them:

for (DBObject result : results) {
    // do something with each result
}

A note on batching

The cursor will fetch results in batches from the database, so if you run a query that matches a lot of documents, you don’t have to worry that every document is loaded into memory immediately. For most queries, the first batch returned will be 101 documents. But as you iterate over the cursor, the driver will automatically fetch further batches from the server. So you don’t have to worry about managing batching in your application. But you do need to be aware that if you iterate over the whole of the cursor (for example to put it into a List), you will end up fetching all the results and putting them in memory.
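
As a hypothetical illustration of that trade-off, using the same collection variable as the examples above: the first loop streams batch by batch, while toArray() pulls every matching document into memory at once.

DBCursor results = collection.find(new BasicDBObject("name", "Smith"));

// Streams the results; only the current batch is held in memory.
for (DBObject result : results) {
    // process each result
}

// Materialises the entire result set into one in-memory list - fine for small
// result sets, risky for very large ones.
List<DBObject> everything = collection.find(new BasicDBObject("name", "Smith")).toArray();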

You can get started with Exercise5SimpleQueryTest.

Selecting Fields

Most of the time you will read entire documents from MongoDB. However, you can choose to return just the fields that you care about (for example, you might have a large document and not need all the values). You do this by passing a second parameter into the find method: another DBObject defining the fields you want to return. In this example, we’ll search for people called “Smith”, and return only the name field. To do this we pass in a DBObject representing {name: 1}:

DBCursor results = collection.find(new BasicDBObject("name", "SomeName"), 
                                   new BasicDBObject("name", 1));

You can also use this method to exclude fields from the results. Maybe we want to exclude an unnecessary subdocument from the results - let’s say we want to find everyone called “Smith”, but we don’t want to return the address. We do this by passing in a zero for this field name, i.e. {address: 0}:

DBCursor results = collection.find(new BasicDBObject("name", "SomeName"),
                                   new BasicDBObject("address", 0));

With this information, you’re ready to tackle Exercise6SelectFieldsTest.

Query Operators

As I mentioned in the previous article, your fields can be one of a number of types, including numeric. This means that you can do queries for numeric values as well. Let’s assume, for example, that our person has a numberOfOrders field, and we wanted to find everyone who had ordered more than, let’s say, 10 items. You can do this using the $gt operator:

DBCursor results = collection.find(new BasicDBObject("numberOfOrders", new BasicDBObject("$gt", 10)));

Note that you have to create a further subdocument containing the $gt condition to use this operator. All of the query operators are documented, and work in a similar way to this example.
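
For instance, here’s a hedged sketch of a range query that combines two operators on the same field, reusing the hypothetical numberOfOrders field from above; it corresponds to {numberOfOrders: {$gt: 10, $lte: 100}}.

// Find people with more than 10 but at most 100 orders.
DBCursor results = collection.find(
        new BasicDBObject("numberOfOrders", new BasicDBObject("$gt", 10).append("$lte", 100)));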

You might be wondering what terrible things could happen if you try to perform some sort of numeric comparison on a field that is a String, since the database supports any type of value in any of the fields (and in Java the values are Objects so you don’t get the benefit of type safety). So, what happens if you do this?

DBCursor results = collection.find(new BasicDBObject("name", new BasicDBObject("$gt", 10)));

The answer is you get zero results (assuming all your documents contain names that are Strings), and you don’t get any errors. The flexible nature of the document schema allows you to mix and match types and query without error.

You can use this technique to get the test in Exercise7QueryOperatorsTest to go green - it’s a bit of a daft example, but you get the idea.

Querying Subdocuments

So far we’ve assumed that we only want to query values in our top-level fields. However, we might want to query for values in a subdocument - for example, with our person document, we might want to find everyone who lives in the same city. We can use dot notation like this:

DBObject findLondoners = new BasicDBObject("address.city", "London");
collection.find(findLondoners);

We’re not going to use this technique in a query test, but we will use it later when we’re doing updates.

Familiar methods

I mentioned earlier that you can iterate over a cursor, and that the driver will fetch results in batches. However, you can also use the familiar-looking skip() and limit() methods. You can use these to fix up the test in Exercise8SkipAndLimitTest.
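
For example, a minimal paging sketch (the sort key and page size here are arbitrary choices, not part of the exercise):

// "Page 3" of the results: sort for a stable order, skip the first 20 matches, return at most 10.
DBCursor page = collection.find(new BasicDBObject("name", "Smith"))
                          .sort(new BasicDBObject("_id", 1))
                          .skip(20)
                          .limit(10);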

A last note on querying: Indexes

Like a traditional database, you can add indexes onto the database to improve the speed of regular queries. There’s extensive documentation on indexes which you can read at your leisure. However, it is worth pointing out that, if necessary, you can programmatically create indexes via the Java driver, using createIndex. For example:

collection.createIndex(new BasicDBObject("fieldToIndex", 1));

There is a very simple example for creating an index in Exercise9IndexTest, but indexes are a full topic on their own, and the purpose of this part of the tutorial is to merely make you aware of their existence rather than provide a comprehensive tutorial on their purpose and uses.

Updating values

Now you can insert into and read from the database. But your data is probably not static, especially as one of the benefits of MongoDB is a flexible schema that can evolve with your needs over time.

In order to update values in the database, you’ll have to define the query criteria that states which document(s) you want to update, and you’ll have to pass in the document that represents the updates you want to make.

There are a few things to be aware of when you’re updating documents in MongoDB; once you understand these, it’s as simple as everything else we’ve seen so far.

Firstly, by default only the first document that matches the query criteria is updated.

Secondly, if you pass in a whole document as the new value, this new document will replace the whole existing document. If you think about it, the common use-case will be: you retrieve something from the database; you modify it based on some criteria from your application or the user; then you save the updated document to the database.

I’ll show the various types of updates (and point you to the code in the test class) to walk you through these different cases.

Simple Update: Find a document and replace it with an updated one

We’ll carry on using our simple Person document for our examples. Let’s assume we’ve got a document in our database that looks like:

person = {
  _id: "jo",
  name: "Jo Bloggs",
  address: {
    street: "123 Fake St",
    city: "Faketon",
    phone: "5559991234"
  },
  books: [ 27464, 747854, ...]
} 

Maybe Jo goes into witness protection and needs to change his/her name. Assuming we’ve got jo populated in a DBObject, we can make the appropriate changes to the document and save it into the database:

DBObject jo = collection.findOne(new BasicDBObject("_id", "jo"));  // get the document representing jo
jo.put("name", "Jo In Disguise");                                  // replace the old name with the new one
collection.update(new BasicDBObject("_id", "jo"),                  // find jo by ID
                  jo);                                             // set the document in the DB to the new document for Jo

You can make a start with Exercise10UpdateByReplacementTest.

Update Operators: Change a field

But sometimes you won’t have the whole document to replace the old one, sometimes you just want to update a single field in whichever document matched your criteria.

Let’s imagine that we only want to change Jo’s phone number, and we don’t have a DBObject with all of Jo’s details but we do have the ID of the document. If we use the $set operator, we’ll replace only the field we want to change:

collection.update(new BasicDBObject("_id", "jo"),
                  new BasicDBObject("$set", new BasicDBObject("phone", "5559874321")));

There are a number of other operators for performing updates on documents, for example $inc which will increment a numeric field by a given amount.
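
For example, a quick sketch of $inc, reusing the hypothetical numberOfOrders field from earlier:

// Increment Jo's numberOfOrders by 1: {$inc: {numberOfOrders: 1}}
collection.update(new BasicDBObject("_id", "jo"),
                  new BasicDBObject("$inc", new BasicDBObject("numberOfOrders", 1)));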

Now you can do Exercise11UpdateAFieldTest.

Update Multiple

As I mentioned earlier, by default the update operation updates the first document it finds and no more. You can, however, set the multi flag on update to update every document that matches.

So maybe we want to update everyone in the database to have a country field, and for now we’re going to assume all the current people are in the UK:

collection.update(new BasicDBObject(),
                  new BasicDBObject("$set", new BasicDBObject("country", "UK")), false, true);

The query parameter is an empty document which finds everything; the second boolean (set to true) is the multi flag, which says to update all the documents that were found.

Now we’ve learnt enough to complete the two tests in Exercise12UpdateMultipleDocumentsTest.

Upsert

Finally, the last thing to mention when updating documents is Upsert (Update-or-Insert). This will search for a document matching the criteria and either: update it if it’s there; or insert it into the database if it wasn’t.

Like “update multiple”, you define an upsert operation with a magic boolean. It shouldn’t come as a surprise to find it’s the first boolean param in the update statement (since “multi” was the second):

collection.update(query, personDocument, true, false);
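
As a concrete (and entirely made-up) example, the following either replaces an existing “charlie” document or inserts one if no match is found:

DBObject query = new BasicDBObject("_id", "charlie");
DBObject personDocument = new BasicDBObject("_id", "charlie").append("name", "Charlie Smith");

// upsert = true, multi = false: insert the document if no match exists, otherwise replace it.
collection.update(query, personDocument, true, false);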

Now you know everything you need to complete the test in Exercise13UpsertTest.

Removing from the database

Finally the D in CRUD - Delete. The syntax of a remove should look familiar now that we’ve got this far: you pass a document that represents your selection criteria into the remove method. So if we wanted to delete jo from our database, we’d do:

collection.remove(new BasicDBObject("_id", "jo"));

Unlike update, if the query matches more than one document, all those documents will be deleted (something to be aware of!). If we wanted to remove everyone who lives in London, we’d need to do:

collection.remove(new BasicDBObject("address.city", "London"));

That’s all there is to remove; you’re ready to finish off Exercise14RemoveTest.

Conclusion

Unlike traditional databases, you don’t create SQL queries in MongoDB to perform CRUD operations. Instead, operations are done by constructing documents both to query the database, and to define the operations to perform.

While we’ve covered what the basics look like in Java, there’s loads more documentation on all the core concepts in the MongoDB documentation.

How Medtronic Manages Machine Data in MongoDB

Matt Asay

Business

While many think Big Data is all about “big,” the reality for most organizations is that data variety is a far thornier problem to tackle.

Just ask Medtronic.

Medical equipment maker Medtronic, perhaps best known for its pacemakers, offers devices and therapies that address more than 30 diseases. Last year the company served 9 million patients, and this year it announced that it serves a patient in some way every three seconds. In addition, Medtronic collects more than 30 million data samples about its devices every day.

Matthew Chimento, principal test engineer and project manager at Medtronic, notes that more than 150 data collection and processing steps have been added to Medtronic’s manufacturing process in the last three years, and 40% of all of Medtronic’s stored data has been collected in the last two. Humans aren’t great at collecting data, but machines are, and “we have a lot of machines.”

Now if only those machines all spoke the same language.

Data Variety: Problem And Opportunity

Unfortunately, with a proliferation of machines comes a proliferation of different data types. And while the media likes to talk about “Big Data” as if it were all about volume, companies like Medtronic realize Big Data is primarily a matter of data variety, as a NewVantage Partners survey discovered.

Furthermore, for regulatory reasons, Medtronic must save device data for 10 years after the last implant of the device. Since those devices can last 20 years, some data is 30 years old, which means that Medtronic must contend with information spread across a multitude of obscure database systems, in a wide variety of formats.

Does Your Data Speak MongoDB?

To manage this data complexity, Medtronic turned to MongoDB.

About two years ago, Chimento’s colleague, Jeff Lemmerman, a senior software engineer at Medtronic, heard about MongoDB. Intrigued by the NoSQL database and its potential to help Medtronic tame its ever-changing data requirements, Lemmerman launched a proof of concept, which “basically consisted of choosing one battery model that we manufacture.” When the battery goes through electrolyte fill – a step in the manufacturing process – all of the component data is loaded into MongoDB. “This is a very simple place to start,” he said.

Lemmerman has high hopes for the next steps with MongoDB. He hopes to begin loading manufacturing data on every component Medtronic makes directly into MongoDB and aggregating that data into a device-level view; MongoDB’s data model will make that easy. “You’re trying to facilitate analysis across components, and you really want simple, fast queries … instead of doing those nasty joins that we saw in my relational example, I’m able to find the complete history for a component with a very simple query.”

What Can MongoDB Do For You?

Like Medtronic, your data changes constantly as business requirements change. And, like Medtronic and most enterprises, you likely use a relational database to manage that data. For reasons noted above, as well as here, an RDBMS is a poor fit for data that changes often or for applications that need to scale.

We therefore invite you to check out the RDBMS to MongoDB Migration Guide to determine how best to migrate data from your RDBMS to MongoDB.

Getting Started with MongoDB and Java: Part I

MongoDB

Releases

By Trisha Gee, Java Engineer and Advocate at MongoDB

Java is one of the most popular programming languages in the MongoDB Community. For new users, it’s important to provide an overview of how to work with the MongoDB Java driver and how to use MongoDB as a Java developer.

In this post, which is aimed at Java/JVM developers who are new to MongoDB, we’re going to give you a guide on how to get started, including:

  • Installation
  • Setting up your dependencies
  • Connecting
  • What are Collections and Documents?
  • The basics of writing to and reading from the database
  • An overview of some of the JVM libraries

Installation

The installation instructions for MongoDB are extensively documented, so I’m not going to repeat any of that here. If you want to follow along with this “getting started” guide, you’ll want to download the appropriate version of MongoDB and unzip/install it. At the time of writing, the latest version of MongoDB is 2.6.3, which is the version I’ll be using.

A note about security

In a real production environment, of course you’re going to want to consider authentication. This is something that MongoDB takes seriously and there’s a whole section of documentation on security. But for the purpose of this demonstration, I’m going to assume you’ve either got that working or you’re running in “trusted mode” (i.e. that you’re in a development environment that isn’t open to the public).

Take a look around

Once you’ve got MongoDB installed and started (a process that should only take a few minutes), you can connect to the MongoDB shell. Most of the MongoDB technical documentation is written for the shell, so it’s always useful to know how to access it, and how to use it to troubleshoot problems or prototype solutions.

When you’ve connected, you should see something like:

MongoDB shell version: 2.6.3                           
connecting to: test
> _  

Since you’re in the console, let’s take it for a spin. Firstly we’ll have a look at all the databases that are there right now:

> show dbs

Assuming this is a clean installation, there shouldn’t be much to see:

> show dbs
admin  (empty)
local  0.078GB
>

That’s great, but as you can see there’s loads of documentation on how to play with MongoDB from the shell. The shell is a really great environment for trying out queries and looking at things from the point-of-view of the server. However, I promised you Java, so we’re going to step away from the shell and get on with connecting via Java.

Getting started with Java

First, you’re going to want to set up your project/IDE to use the MongoDB Java Driver. These days IDEs tend to pick up the correct dependencies through your Gradle or Maven configuration, so I’m just going to cover configuring these.

At the time of writing, the latest version of the Java driver is 2.12.3 - this is designed to work with the MongoDB 2.6 series.

Gradle

You’ll need to add the following to your dependencies in build.gradle:

compile 'org.mongodb:mongo-java-driver:2.12.3'

Maven

For maven, you’ll want:

<dependencies>
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>mongo-java-driver</artifactId>
        <version>2.12.3</version>
    </dependency>
</dependencies>

Alternatively, if you’re really old-school and like maintaining your dependencies the hard way, you can always download the JAR file.

If you don’t already have a project that you want to try with MongoDB, I’ve created a series of unit tests on github which you can use to get a feel for working with MongoDB and Java.

Connecting via Java

Assuming you’ve resolved your dependencies and you’ve set up your project, you’re ready to connect to MongoDB from your Java application.

Since MongoDB is a document database, you might not be surprised to learn that you don’t connect to it via traditional SQL/relational DB methods like JDBC. But it’s simple all the same:

MongoClient mongoClient = new MongoClient(new MongoClientURI("mongodb://localhost:27017"));

Where I’ve put mongodb://localhost:27017, you’ll want to put the address of where you’ve installed MongoDB. There’s more detailed information on how to create the correct URI, including how to connect to a Replica Set, in the MongoClientURI documentation.

If you’re connecting to a local instance on the default port, you can simply use:

MongoClient mongoClient = new MongoClient();

Note that this does throw a checked Exception, UnknownHostException. You’ll either have to catch this or declare it, depending upon what your policy is for exception handling.

The MongoClient is your route into MongoDB; from this you’ll get your database and collections to work with (more on this later). Your instance of MongoClient (e.g. mongoClient above) will ordinarily be a singleton in your application. However, if you need to connect via different credentials (different user names and passwords) you’ll want a MongoClient per set of credentials.

It is important to limit the number of MongoClient instances in your application, hence the suggestion to use a singleton - the MongoClient is effectively the connection pool, so for every new MongoClient, you are opening a new pool. Using a single MongoClient (and optionally configuring its settings) will allow the driver to correctly manage your connections to the server. This MongoClient singleton is safe to be used by multiple threads.
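
If you do want to tune the pool rather than accept the defaults, here is a minimal sketch; the option values are arbitrary examples, not recommendations:

import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;
import java.net.UnknownHostException;

public class MongoHolder {
    // One MongoClient for the whole application; tune the connection pool via MongoClientOptions.
    public static MongoClient create() throws UnknownHostException {
        MongoClientOptions options = MongoClientOptions.builder()
                .connectionsPerHost(50)    // size of the connection pool (example value)
                .connectTimeout(10000)     // connection timeout in milliseconds (example value)
                .build();
        return new MongoClient(new ServerAddress("localhost", 27017), options);
    }
}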

One final thing you need to be aware of: you want your application to shut down the connections to MongoDB when it finishes running. Always make sure your application or web server calls MongoClient.close() when it shuts down.

Try out connecting to MongoDB by getting the test in Exercise1ConnectingTest to pass.

Where are my tables?

MongoDB doesn’t have tables, rows, columns, joins etc. There are some new concepts to learn when you’re using it, but nothing too challenging.

While you still have the concept of a database, the documents (which we’ll cover in more detail later) are stored in collections, rather than your database being made up of tables of data. But it can be helpful to think of documents like rows and collections like tables in a traditional database. And collections can have indexes like you’d expect.

Selecting Databases and Collections

You’re going to want to define which databases and collections you’re using in your Java application. If you remember, a few sections ago we used the MongoDB shell to show the databases in your MongoDB instance, and you had an admin and a local.

Creating and getting a database or collection is extremely easy in MongoDB:

DB database = mongoClient.getDB("TheDatabaseName");

You can replace "TheDatabaseName" with whatever the name of your database is. If the database doesn’t already exist, it will be created automatically the first time you insert anything into it, so there’s no need for null checks or exception handling on the off-chance the database doesn’t exist.

Getting the collection you want from the database is simple too:

DBCollection collection = database.getCollection("TheCollectionName");

Again, replacing "TheCollectionName" with whatever your collection is called.

If you’re playing along with the test code, you now know enough to get the tests
in Exercise2MongoClientTest to pass.

An introduction to documents

Something that is, hopefully, becoming clear to you as you work through the examples in this blog, is that MongoDB is different from the traditional relational databases you’ve used. As I’ve mentioned, there are collections, rather than tables, and documents, rather than rows and columns.

Documents are much more flexible than a traditional row, as you have a dynamic schema rather than an enforced one. You can evolve the document over time without incurring the cost of schema migrations and tedious update scripts. But I’m getting ahead of myself.

Although documents don’t look like the tables, columns and rows you’re used to, they should look familiar if you’ve done anything even remotely JSON-like. Here’s an example:

person = {
  _id: "jo",
  name: "Jo Bloggs",
  age: 34,
  address: {
    street: "123 Fake St",
    city: "Faketon",
    state: "MA",
    zip: "12345"
  },
  books: [ 27464, 747854, ...]
}  

There are a few interesting things to note:

  1. Like JSON, documents are structures of name/value pairs, and the values can be one of a number of primitive types, including Strings and various number types.
  2. It also supports nested documents - in the example above, address is a subdocument inside the person document. Unlike a relational database, where you might store this in a separate table and provide a reference to it, in MongoDB if that data benefits from always being associated with its parent, you can embed it in its parent.
  3. You can even store an array of values. The books field in the example above is an array of integers that might represent, for example, IDs of books the person has bought or borrowed.

You can find out more detailed information about Documents in the documentation.

Creating a document and saving it to the database

In Java, if you wanted to create a document like the one above, you’d do something like:

List<Integer> books = Arrays.asList(27464, 747854);
DBObject person = new BasicDBObject("_id", "jo")
                            .append("name", "Jo Bloggs")
                            .append("address", new BasicDBObject("street", "123 Fake St")
                                                         .append("city", "Faketon")
                                                         .append("state", "MA")
                                                         .append("zip", "12345"))
                            .append("books", books);

At this point, it’s really easy to save it into your database:

MongoClient mongoClient = new MongoClient();
DB database = mongoClient.getDB("Examples");
DBCollection collection = database.getCollection("people");
    
collection.insert(person);

Note that the first three lines are set-up, and you don’t need to re-initialize those every time.

Now if we look inside MongoDB, we can see that the database has been created:

> show dbs
Examples  0.078GB
admin     (empty)
local     0.078GB
> _

…and we can see the collection has been created as well:

> use Examples
switched to db Examples
> show collections
people
system.indexes
> _ 

…finally, we can see that our person, “Jo”, was inserted:

> db.people.findOne()
{
    "_id" : "jo",
    "name" : "Jo Bloggs",
        "age": 34,
    "address" : {
        "street" : "123 Fake St",
        "city" : "Faketon",
        "state" : "MA",
        "zip" : "12345"
    },
    "books" : [
        27464,
        747854
    ]
}
> _

As a Java developer, you can see the similarities between the Document that’s stored in MongoDB, and your domain object. In your code, that person would probably be a Person class, with simple primitive fields, an array field, and an Address field.

So rather than building your DBObject manually like the above example, you’re more likely to be converting your domain object into a DBObject. It’s best not to have the MongoDB-specific DBObject class in your domain objects, so you might want to create a PersonAdaptor that converts your Person domain object to a DBObject:

public static final DBObject toDBObject(Person person) {
    return new BasicDBObject("_id", person.getId())
                     .append("name", person.getName())
                     .append("address", new BasicDBObject("street", person.getAddress().getStreet())
                                                  .append("city", person.getAddress().getTown())
                                                  .append("phone", person.getAddress().getPhone()))
                     .append("books", person.getBookIds());
}

As before, once you have the DBObject, you can save this into MongoDB:

collection.insert(PersonAdaptor.toDBObject(myPerson));

Now you’ve got all the basics to get the tests in Exercise3InsertTest to pass.

Getting documents back out again

Now that you’ve saved a Person to the database, and seen it in the database using the shell, you’re going to want to get it back out into your Java application. In this post, we’re going to cover the very basics of retrieving a document - in a later post we’ll cover more complex querying.

You’ll have guessed by the fact that MongoDB is a document database that we’re not going to be using SQL to query. Instead, we query by example, building up a document that looks like the document we’re looking for. So if we wanted to look for the person we saved into the database, “Jo Bloggs”, we remember that the _id field had the value of “jo”, and we create a document that matches this shape:

DBObject query = new BasicDBObject("_id", "jo");
DBCursor cursor = collection.find(query);

As you can see, the find method returns a cursor for the results. Since _id needs to be unique, we know that if we look for a document with this ID, we will find only one document, and it will be the one we want:

DBObject jo = cursor.one();

Earlier we saw that documents are simply made up of name/value pairs, where the value can be anything from a simple String or primitive, to more complex types like arrays or subdocuments. Therefore in Java, we can more or less treat DBObject as a Map<String, Object>. So if we wanted to look at the fields of the document we got back from the database, we can get them with:

(String)cursor.one().get("name");

Note that you’ll need to cast the value to a String, as the compiler only knows that it’s an Object.
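
In a real application you’d typically push this unpacking into the same kind of adaptor we used for inserting, converting the DBObject back into your domain object. A hedged sketch going the other way, assuming hypothetical Person and Address classes with constructors matching these fields:

@SuppressWarnings("unchecked")
public static Person fromDBObject(DBObject dbObject) {
    DBObject address = (DBObject) dbObject.get("address");
    return new Person((String) dbObject.get("_id"),
                      (String) dbObject.get("name"),
                      new Address((String) address.get("street"),
                                  (String) address.get("city"),
                                  (String) address.get("phone")),
                      (List<Integer>) dbObject.get("books"));
}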

If you’re still playing along with the example code, you’re now ready to take on all the tests in Exercise4RetrieveTest.

Overview of JVM Libraries

So far I’ve shown you the basics of the official Java Driver, but you’ll notice that it’s quite low-level - you have to do a lot of taking things out of your domain objects and poking them into MongoDB-shaped DBObjects, and vice-versa. If this is the level of control you want, then the Java driver makes this easy for you. But if it seems like this is extra work that you shouldn’t have to do, there are plenty of other options for you.

The tools I’m about to describe all use the MongoDB Java Driver at their core to interact with MongoDB. They provide a high-level abstraction for converting your domain objects into MongoDB documents, whilst also giving you a way to get to the underlying driver as well in case you need to use it at a lower level.

Morphia

Morphia is a really lightweight ODM (Object Document Mapper), so it’s similar to ORMs like Hibernate. Documents can be in a fairly similar shape to your Java domain objects, so this mapping can be automatic, but Morphia allows you to point the mapper in the right direction.

Morphia is open source, and has contributors from MongoDB. Sample code and documentation can be found here.

Spring Data

Another frequently used ODM is Spring Data. This supports traditional relational and non-relational databases, including MongoDB. If you’re already using Spring in your application, this should be a familiar way to work.

As always with Spring projects, there’s a lot of really great documentation, including a Getting Started guide with example code.

MongoJack

If you’re working with web services or something else that supports JSON, and you’re using Jackson to work with this data, it probably seems like a waste to be turning it from this form into a Java object and then into a MongoDB DBObject. But MongoJack might make your job easier, as it’s designed to map JSON objects directly into MongoDB. Take a look at the example code and documentation.

Jongo

This is another Jackson-based ODM, but provides an interesting extra in the form of supporting queries the way you’d write them in the shell. Documentation and example code is available on the website.

Grails MongoDB GORM

The Grails web application framework also supports its own Object-Relational Mapping (GORM), including support for MongoDB. More documentation for this plugin can be found here.

Casbah

This isn’t an ODM like the other tools mentioned, but the officially supported Scala driver for MongoDB. Like the previous libraries, it uses the MongoDB Java Driver under the covers, but it provides a Scala API for application developers to work with. If you like working with Scala but are searching for an async solution, consider ReactiveMongo, a community-supported driver that provides async and non-blocking operations.

Other libraries and tools

This is far from an extensive list, and I apologise if I’ve left a favourite out. But we’ve compiled a list of many more libraries for the JVM, which includes community projects and officially supported drivers.

Conclusion

We’ve covered the basics of using MongoDB from Java - we’ve touched on what MongoDB is, and you can find out a lot more detailed information about it from the manual; we’ve installed it somewhere that lets us play with it; we’ve talked a bit about collections and documents, and what these look like in Java; and we’ve started inserting things into MongoDB and getting them back out again.

If you haven’t already started playing with the test code, you can find it in this github repository. And if you get desperate and look hard enough, you’ll even find the answers there too.

Finally, there are more examples of using the Java Driver in the Quick Tour, and there is example code in github, including examples for authentication.

If you want to learn more, try our 7-week online course, “Intro to MongoDB and Java”.

Try it out, and hopefully you’ll see how easy it is to use MongoDB from Java.

Read Part II


If you're looking for more resources to learn MongoDB, view our Getting Started Kit.

MongoDB & The Soundwave Music Map

MongoDB

Releases

By David Lynch, Principal Engineer at Soundwave

Soundwave is a smartphone app that tracks music as quickly as it is played. Soundwave tracks each user’s music-listening habits across multiple platforms and streaming services, creating a listening profile. Users follow each other, facilitating listening, sharing, discovery, and discussion of music old and new, popular and niche.

MongoDB is our database of choice. We track around 250,000 plays per day from a user base that has grown to over 1 million in the past 13 months. Soundwave has users in all time zones, making availability critical. MongoDB replica sets provide us fault tolerance and redundancy, allowing us to perform scheduled maintenance without affecting users.

We consider responsiveness to be a key part of the Soundwave user experience so we use a sharded configuration that allows us to add compute and storage resources as needed.

We use indexes liberally, and app usage patterns require maintaining a fairly large working set in memory. Sharding allows us to deploy a larger number of relatively smaller, and disproportionately cheaper, AWS instances while maintaining the snappy user experience we require.

The volume of data we process today is significant, but not huge. It was important for us to architect a system from the outset that would scale easily and economically.

The complexity of data retrieval is moderate but well suited to MongoDB. Our App has been featured a couple of times by both Apple and Google. At our peak rate, we handle 30,000 new sign-ups per day. Our MongoDB configuration handles this load comfortably. In fact, with respect to scale and schema, our deployment is pretty boring and by the book.

Music Map

One of Soundwave’s most compelling features is the Music Map. Our users can draw a free-form enclosing polygon on a Google Map of any area, at any zoom level, and see what music was recently played in that area. This is captured in the screenshot below. These constraints required some interesting engineering outside the scope of the MongoDB playbook.

We’re kind of proud of our implementation. Many location-aware apps with similar features reduce the user experience to some form of circular $geoNear around a point. In our minds, this is not as interesting as a randomly drawn polygon - nor is it a great user experience to promise polygons and return some other approximation of what a user wanted.

Constraints

We set a target maximum 95th percentile latency of 1.5 seconds for the map feature. From a client perspective, we decided anything over this wait time is just boring for the user. We discovered that users perform searches that are either pretty-zoomed-out - like Ireland - or pretty-zoomed-in - like Landsdowne Road.

For lots of reasons, we have clusters of data around certain areas. Big cities typically have lots of dense data points, but some cities are better represented than others. This makes pretty-zoomed-out and pretty-zoomed-in hard to quantify. It also follows that the zoom level of the map is somewhat useless as an optimization datum. We wanted all our location data to be available to the user - there are a few plays in Antarctica, but they happened a while ago, and it would be a shame to cut the search space by time and miss them.

Schema Design

Our deployment contains around 90 million play actions, each one represented by a single document. For any interaction with the app, for example capturing a play, there is an associated geo-coordinate and document. Our first implementation of this feature leveraged the 2d index type with a legacy location data point associated with each action document.

{
  actionType : "PLAY",
  ..
  location : { lon :  Double , lat : Double },
  time : ISODate(..) 
}  

Our geo-query therefore took the form of a $within statement using a polygon built of points from the free-form finger-drag.

db.action.find({ location: { $geoWithin: { $polygon:
            [ [ -5.577327758073807, 53.92722913826002 ],
              [ -5.628384947776794, 53.94341400538505 ],
              ..
              [ -7.603911347687244, 53.48786460176994 ] ] }
            }
}).limit(1000).sort({ "time" : 1 });

The supporting index was constructed as follows.

db.action.ensureIndex({ location : "2d", time : 1 })

This actually worked quite nicely out of the box. For legacy queries, MongoDB’s $within is not too fussy about self-intersection, nor is it concerned with the uniqueness of each point or the completeness of the polygon.

The query, however, was dog-slow. For dense areas at city-sized zoom levels, we could not get response times under multiple seconds. For areas where significant numbers of plays were happening, hinting an index on action time performed much better than the 2d index in all cases.

For areas of low activity, this strategy didn’t work particularly well. Searches at the country or continent scale didn’t complete in any reasonable time (tens of seconds) and began to affect the latency of other parts of the app. MongoDB’s explain() showed scans over half, if not more, of the records in the collection for city-sized searches, yet also confirmed that the 2d index was used. This led us to conclude that the 2d index might not be powerful enough for our needs.
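
For context, this is the sort of check we mean (polygon coordinates illustrative; on MongoDB 2.4 the fields to watch were "cursor", confirming which index was chosen, and "nscanned", the number of records examined):

db.action.find({
  location: { $geoWithin: { $polygon: [
    [ -5.577327758073807, 53.92722913826002 ],
    [ -5.628384947776794, 53.94341400538505 ],
    [ -7.603911347687244, 53.48786460176994 ]
  ] } }
}).explain();
// On 2.4, "cursor" confirmed the 2d index was used, while "nscanned"
// showed that over half the collection was being examined.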

Migration to GeoJSON & 2dsphere

After some hints from the MongoDB community and migration to MongoDB 2.4, we decided to move to the 2dsphere index. This required a rework of our schema and a rework of our query engine. Firstly, we needed GeoJSON points.

{
  "location" : {
    "type" : "Point",
    "coordinates" : [  Double,  Double ]
   }
}

Next, the index:

db.action_loc.ensureIndex({ location : "2dsphere", time  : 1 })

Alert readers will notice the emergence of a new collection, action_loc, in our schema. Version 2.4 2dsphere indexes do not respect the sparse property, but this was changed in 2.6. Roughly 65% of our Action collection consists of documents that do not have an associated location.

Creating the 2dsphere index on the action collection would result in 57 million documents indexed on location with no benefit. In practice this resulted in an index roughly 10GB in size. Moving to a collection where all records have locations resulted in a 7.3GB decrease in the size of the index. Given that reads dominate writes in our application, we were happy to incur the latency of one additional write per play to the action_loc collection when location is available for a specific action.
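
The article does not show the reworked query itself, but against the new collection it would presumably take a form like the following (coordinates are illustrative; note that the linear ring is closed by repeating the first point):

db.action_loc.find({
  location: { $geoWithin: { $geometry: {
    type: "Polygon",
    coordinates: [ [
      [ -5.577327758073807, 53.92722913826002 ],
      [ -5.628384947776794, 53.94341400538505 ],
      [ -7.603911347687244, 53.48786460176994 ],
      [ -5.577327758073807, 53.92722913826002 ]  // first point repeated to close the ring
    ] ]
  } } }
}).limit(1000).sort({ time: 1 });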

GeoJSON support has some tougher constraints for queries. There can be no duplicate points, and polygons must form a closed linear ring. In practice, our client application sometimes produced duplicate points (for example, where the polygon self-intersected) and never provided a closed linear ring. Finally, self-intersecting polygons, which are a common side effect of finger-led free-form drawing and of interesting random searches, are not acceptable. Technically, a self-intersecting polygon could be treated as a geometry containing n distinct non-self-intersecting polygons; however, multi-polygon geometries are not supported on MongoDB 2.4 and were only added in 2.5.1.

The JTS Java Library helped us fix all of this at the application tier. De-duplication and closing of linear rings were trivial; supporting self-intersecting polygons, however, was a little trickier. The final solution involved calculating the convex hull of the search polygon. This guarantees a closed linear ring around any polygon geometry and removes any chance of self-intersection. As illustrated in the graphic above, it does, however, result in a larger result set than intended.
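
Soundwave did this with JTS in Java; purely as an illustration of the idea, a minimal JavaScript sketch of the same cleanup (de-duplicate, take the convex hull via a monotone-chain algorithm, then close the ring) might look like this:

// Illustrative sketch only; Soundwave used the JTS Java library for this.
// points is an array of [lon, lat] pairs taken from the finger-drawn polygon.
function cleanPolygon(points) {
    // 1. De-duplicate points
    var seen = {};
    var unique = points.filter(function (p) {
        var key = p[0] + "," + p[1];
        if (seen[key]) return false;
        seen[key] = true;
        return true;
    });

    // 2. Convex hull (Andrew's monotone chain); a hull can never self-intersect
    unique.sort(function (a, b) { return a[0] - b[0] || a[1] - b[1]; });
    function cross(o, a, b) {
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
    }
    var lower = [], upper = [];
    unique.forEach(function (p) {
        while (lower.length >= 2 && cross(lower[lower.length - 2], lower[lower.length - 1], p) <= 0) lower.pop();
        lower.push(p);
    });
    unique.slice().reverse().forEach(function (p) {
        while (upper.length >= 2 && cross(upper[upper.length - 2], upper[upper.length - 1], p) <= 0) upper.pop();
        upper.push(p);
    });
    var hull = lower.slice(0, -1).concat(upper.slice(0, -1));

    // 3. Close the linear ring, as GeoJSON requires
    hull.push(hull[0]);
    return hull;
}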

Rather than show the extra points that the enlarged hull pulls in and confuse the user, we cull them from the response. This preserves the bounding-box user experience we were after; in other words, users get what they would expect. With regard to wasted resources this is not ideal in theory, but it works well in practice. Users typically draw self-intersecting polygons by accident3. The second polygon is usually an order of magnitude smaller than the larger of the two polygons that result, so the wasted search space is typically small.

Once migrated, city- and street-level searches were an order of magnitude faster at generating results. We attribute this to the 2dsphere index and the S2 cursor. It remained the case, though, that low-zoom-level searches at state and country level pushed the 95th percentile latency beyond 1.5 seconds on average and were often painfully slow for the user.

Optimization

We limit results of music map searches to 100 plays. If there are more than 100, we show the 100 most recent.

Queries at a low zoom level are usually dominated by plays that have happened recently. Tests showed that for low-zoom-level, wide-area searches an inverse time index was much faster at finding records than the 2dsphere index. Hence we created two indexes: one based on time, the other the 2dsphere index shown above.
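
The 2dsphere index is shown earlier; the time-based index is not shown in the article, but assuming a simple descending key on time it would look something like this:

// Assumed shape of the inverse time index: a plain descending index on time
db.action_loc.ensureIndex({ time : -1 })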

However, in our experience, MongoDB’s query planner always chose the 2dsphere index over the time-based index for $geoWithin queries, even when tests showed that the time-based index would return results faster for a specific query.

MongoDB’s query planner works by periodically running candidate indexes in parallel and seeing which one returns faster. The problem is that, most of the time, the right answer is to use the 2dsphere index, and MongoDB only periodically re-evaluates this choice. If the best choice of index varies between two queries that have the same signature, MongoDB will make the wrong choice some percentage of the time. In other words, MongoDB makes an index choice that works more often than not, but leaves the remaining cases unacceptably slow.

Some conjunction of polygon area and pre-generated metrics of data-point density for the area was explored, but quickly abandoned due to the difficulty of implementation.

Finally, a simple strategy of parallel execution was used. We run the query both ways in parallel: one query hints the 2dsphere index, the other hints the time index. The winner is whichever query returns results first. If the loser runs for longer than 10 seconds, it is forcibly killed to avoid wasting resources.
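
In mongo-shell terms, the two speculative queries might look something like the following (they would be issued in parallel from the application tier; geoQuery here is a placeholder for the $geoWithin document built from the cleaned-up polygon, and the hint keys assume the index definitions above):

// Speculative execution: same query, two hints; the first to return wins
var byGeo  = db.action_loc.find(geoQuery)
                 .hint({ location: "2dsphere", time: 1 })
                 .sort({ time: -1 }).limit(100);
var byTime = db.action_loc.find(geoQuery)
                 .hint({ time: -1 })
                 .sort({ time: -1 }).limit(100);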

In production, we found that if we did not kill long-running speculative queries, the resulting load on the database adversely affected the read latency for other queries, and thus user experience.

With MongoDB 2.6 it is possible to set a user-defined time limit for queries, so this step could be avoided. At the time of writing we were not using that feature in production, so we needed a JavaScript worker that periodically killed queries on the collection running for over 10 seconds. The function is below.

// Kill any query on the action_loc collection that has been running
// longer than max_running_secs.
function cleanUpLong(max_running_secs) {
    var currentOps = db.currentOp({
        $and: [
            { op: "query" },
            { msg: { $exists: false } },                      // not a sharding operation!
            { secs_running: { $gt: max_running_secs } },
            { ns: "backstage.action_loc" }
        ]
    });
    currentOps.inprog.forEach(function (op) {
        db.killOp(op.opid);
    });
}

Some care is needed to ensure that long-running operations on the collection are not related to the redistribution of chunks across shards.
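
For comparison, on MongoDB 2.6 and later the same 10-second cap can be applied per query with maxTimeMS, making an external killer unnecessary; for example (geoQuery again being a placeholder):

// MongoDB 2.6+: have the server abort the speculative query itself after 10s
db.action_loc.find(geoQuery)
    .hint({ time: -1 })
    .sort({ time: -1 }).limit(100)
    .maxTimeMS(10000);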

Explaining Results

In production, our inverse time index beats our 2dsphere index roughly one third of the time, typically for pretty-zoomed-out searches. Intuitively this makes sense: a continent is large and encompasses a significant portion of the geographic space. For example, if you evaluate the 1,000 most recent plays, around 100 of them are likely to also satisfy your geographic constraint when that constraint is the size of a continent.

Using $geoWithin when very zoomed out performs poorly. Again taking the example of a continent, at least 10% of all plays are probably within the continent, but because we only want to return the most recent data, we wind up discarding nearly all of the data we pull from the database while searching for the most recent plays in the continent.

In the first example [Tbl. 1], the 2dsphere index wins by a huge margin. The cells covered by the areas considered in the query do not have very many recent plays in relation to the global set of recent plays. Therefore, the thread using the time index needs to work much harder to fill the 100-pin bucket from a global perspective, scanning a factor of 40 more records to do the same work. Meanwhile, the query using the 2dsphere index finds that a relatively high proportion of the plays satisfying the geographic constraint also satisfy the time constraint.

In the second example [Tbl. 2], the probability that a recent play is also geographically compatible is high, so the time-based index wins comfortably, scanning half the records that the 2dsphere query did and bringing us back under our desired latency limit.

To reiterate, we always use the result from the faster query. The result from the slower query is discarded if it returns prior to being killed. By speculatively going down the path of both queries, we are able to get a result back to the user within our requirements without analyzing the query and its parameters. We don’t need to understand whether it’s zoomed out or zoomed in.

Parallel Index Usage

The graphic below shows a few days’ worth of queries and the respective index used. Data points are plotted for both winning and losing queries. For many of the points shown, results were not used in user responses, but latency was measured anyway.

The graph serves to show that the combination of indexes, that is, both the red and the green points, is responsible for the p95 latency shown in the graphic below. Moreover, it illustrates roughly how the p95 would be affected if we used only one of the indexes in question. The proportion of 2dsphere searches to time searches can also be roughly deduced.

We process roughly 0.4 map searches per second on average. The p95 response time is well under 1.5 seconds. We are able to achieve these results even though we perform every query two ways, for up to 10 seconds. The graph below shows these results over a week in March.

Conclusion

MongoDB’s geo-search features enable one of Soundwave’s most distinctive features: the Music Map. The Music Map is one of the main reasons we chose MongoDB as our primary data store.

Compared to older features like automatic sharded cluster rebalancing and aggregation, geo-search and supporting documentation are still somewhat immature. That said, with some extra work we managed to build a reasonably complex search feature without having to deploy a separate GIS stack.

Furthermore, MongoDB’s strategy of incorporating the S2 library and more GeoJSON geometries leaves us confident that we can easily build upon our map feature, now and in the future. This allows us to keep location awareness front and center as a feature of our app. Hopefully this article will clear up a few pain points for users implementing similar features.

Footnotes

1. Soundwave is available for iOS and Android. Installation details at http://www.soundwave.com

2. We use r3.4xlarge instances.

3. We observed queries on self-intersecting polygons about half the time. This is mostly accidental; it’s hard to draw an exact linear ring with your finger. However, sometimes people just want to draw figures of eight.

A Mobile-First, Cloud-First Stack at Pearson

Pearson, the global online education leader, has a simple yet grand mission: to educate the world; to have 1 billion students around the globe touching their content on a regular basis.

They are growing quickly, especially in emerging markets where the primary way to consume content is via mobile phones. But to reach global users, they need to deploy in a multitude of private and public data centers around the globe. This demands a mobile-first, cloud-first platform, with the underlying goal of improving education efficacy.

In 2018, Pearson will be announcing to the public markets what percentage of revenue is associated with the company’s efficacy. There’s no question; that’s a bold move. As a result, apps have to be built in a way to measure how users are interacting with them.

Front and center in Pearson’s strategy is MongoDB.

With MongoDB, as Pearson CTO Aref Matin told the audience at MongoDB World (full video presentation here), Pearson was able to replace a double-digit number of siloed, independent platforms with a consolidated platform that allows for measuring efficacy.

“A platform should be open, usable by all who want to access functionality and services. But it’s not a platform until you’ve opened up APIs to the external world to introduce new apps on top of it,” declared Matin.

A key part of Pearson’s redesigned technology stack, MongoDB proved to be a good fit for a multitude of reasons, including its agility and scalability, document model and ability to perform fast reads and ad hoc queries. Also important to Matin was the ability to capture the growing treasure trove of unstructured data, such as peer-to-peer and social interactions that are increasingly part of education.

So far, Pearson has leveraged MongoDB for use cases such as:

  • Identity and access management for 120 million user accounts, with nearly 50 million per day at peak;
  • Adaptive learning and analytics to detect, in near real-time, what content is most effective and identify areas for improvement; and
  • The Pearson Activity Framework (akin to a “Google DoubleClick” according to Matin), which collects data on how users interact with apps and feeds the analytics engine.

All of this feeds into Matin’s personal vision of increasing the pace of learning.

“Increasing the pace of learning will be a disruptive force,” said Matin. “If you can reduce the length of time spent on educating yourself, you can learn a lot more and not spend as much on it. That will help us be able to really educate the world at a more rapid pace.”


**Sign up to receive videos and content from MongoDB World.**