Sam Weaver


In case you missed it: plugins, table view, auto-complete, and more in MongoDB Compass

We’ve released several new versions of MongoDB Compass in the past few months, and we’re excited about the new features we’ve introduced. Read on for details, or download the latest version here. If you’re new to Compass, the best way to learn to use it is with the free online tutorial, M001: MongoDB Basics. In this series of online videos and hands-on exercises, you will use Compass to explore MongoDB data models, learn the MongoDB query language, and deploy and connect to MongoDB clusters in Atlas, MongoDB's fully managed cloud service.

Compass plugins: choose your own adventure

With the introduction of a new plugin API, MongoDB Compass is fully extensible. From examining database users and roles to generating sample data, from viewing GridFS files to checking sharding status: if there’s a specific feature you need that’s not yet available in Compass, you can build a plugin for it. And if you need it, it might be useful to others as well! Plugins can be shared with the community and added to any build of Compass 1.11 or later. You can learn more about creating plugins for Compass here, or work through a tutorial to build an example plugin.

View & manipulate documents in a table view

Documents can now be viewed and edited easily in a new table view, which allows for a quick visual comparison between records.

More auth options: X.509

We added X.509 support, so our customers now have full coverage of authentication options when connecting to production deployments of MongoDB. (Authentication options already include username/password, Kerberos, and LDAP.)

Type queries faster and store them for later

Typing queries is now quicker and easier with an intelligent autocomplete bar that matches brackets and completes field names for you. There’s also a new button for query history: use it to review queries you’ve run, run them again, or save common queries as favorites.
Free Compass Community version

With the launch of MongoDB 3.6, we introduced a new distribution called Compass Community, which contains a subset of Compass functionality but doesn’t require a paid subscription to use in production. Compass Community has the core building blocks you need to get started with MongoDB: CRUD, indexes, and explain plans, along with the new plugin API. You can get Compass Community from the download center. It also comes as one of the components of the MongoDB Community Server download.

Read-only Compass

If you want to view your data with Compass but don’t need to edit it (or allow other developers to edit it!), you have a new option: a read-only build of Compass. No need to stress about unintended edits with this version, now available in the download center.

We hope you enjoy this latest release!
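As a footnote to the X.509 support mentioned above: with MongoDB, client-certificate authentication is selected through options in the connection string. Here is a minimal Python sketch of assembling such a URI. The host is a placeholder and this shows only the string format (in Compass itself you pick the mechanism from the connection form rather than typing a URI):

```python
from urllib.parse import urlencode

# Placeholder host; in practice this is your MongoDB server.
host = "db.example.com:27017"

# Connection-string options selecting X.509 client-certificate auth.
# X.509 authentication requires an SSL/TLS connection.
options = {
    "authMechanism": "MONGODB-X509",
    "ssl": "true",
}

uri = "mongodb://{}/?{}".format(host, urlencode(options))
print(uri)  # mongodb://db.example.com:27017/?authMechanism=MONGODB-X509&ssl=true
```

The client certificate itself is supplied separately (as certificate/key files) by whatever driver or tool makes the connection.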

January 16, 2018

The New MongoDB Connector for Apache Spark In Action: Building a Movie Recommendation Engine

Introduction

We are delighted to announce general availability of the new, native MongoDB Connector for Apache Spark. It provides higher performance, greater ease of use, and access to more advanced Spark functionality than other connectors. With certification from Databricks, the company founded by the creators of the Apache Spark project, developers can focus on building modern, data-driven applications, knowing that the connector provides seamless integration and complete API compatibility between Spark processes and MongoDB.

Written in Scala, Apache Spark’s native language, the connector provides a more natural development experience for Spark users. The connector exposes all of Spark’s libraries, enabling MongoDB data to be materialized as DataFrames and Datasets for analysis with machine learning, graph, streaming, and SQL APIs, further benefiting from automatic schema inference.

The connector also takes advantage of MongoDB’s aggregation pipeline and rich secondary indexes to extract, filter, and process only the range of data it needs (for example, analyzing all customers located in a specific geography). This is very different from simple NoSQL datastores that offer neither secondary indexes nor in-database aggregations. In those cases, Apache Spark would need to extract all data based on a simple primary key, even if only a subset of that data is required for the Spark process. This means more processing overhead, more hardware, and longer time-to-insight for the analyst.

To maximize performance across large, distributed data sets, the Spark connector is aware of data locality in a MongoDB cluster. RDDs are automatically processed on workers co-located with the associated MongoDB shard to minimize data movement across the cluster. The nearest read preference can be used to route Spark queries to the closest physical node in a MongoDB replica set, thus reducing latency.
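To make the pushdown idea concrete, here is a plain-Python sketch of what a `$match` stage does on the server before any documents reach Spark. The pipeline document follows MongoDB's aggregation syntax, but the sample collection, field names, and the `apply_match` helper are all illustrative stand-ins, not the connector's API:

```python
# A $match stage of the kind that can be pushed down to MongoDB,
# so only matching documents are ever shipped to Spark workers.
pipeline = [{"$match": {"geo": "EMEA"}}]

# An illustrative in-memory "collection" of customer documents.
customers = [
    {"_id": 1, "name": "Acme", "geo": "EMEA"},
    {"_id": 2, "name": "Globex", "geo": "APAC"},
    {"_id": 3, "name": "Initech", "geo": "EMEA"},
]

def apply_match(docs, stage):
    """Hypothetical stand-in for the server-side $match stage
    (exact-equality criteria only)."""
    criteria = stage["$match"]
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

filtered = apply_match(customers, pipeline[0])
# Only the two EMEA customers would be handed to Spark; a datastore without
# in-database filtering would have to ship all three and filter afterwards.
```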
“Users are already combining Apache Spark and MongoDB to build sophisticated analytics applications. The new native MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Apache Spark functionality than any MongoDB connector available today.”
-- Reynold Xin, co-founder and chief architect of Databricks

To demonstrate how to use the connector, we’ve created a tutorial that uses MongoDB together with Apache Spark’s machine learning libraries to build a movie recommendation system. This example presumes you are familiar with Spark. If you are new to Spark but would like to learn the basics of using Spark and MongoDB together, we encourage you to check out our new MongoDB University course.

Getting started

To get started, please ensure you have downloaded and installed Apache Spark. Note: this tutorial uses Spark v1.6 with Hadoop. You will also need MongoDB running on localhost and listening on the default port (27017); you can follow the documentation to get MongoDB up and running. The complete code can be found in the GitHub repository. Ensure you have downloaded the data and imported it with mongorestore; you can find instructions on using mongorestore here.

Tutorial

To illustrate how to use MongoDB with Apache Spark, here is a simple tutorial that uses Spark machine learning to generate a list of movie recommendations for a user. Here is what we will cover:

- How to read data from MongoDB into Spark. The data contains a list of different users' ratings of various movies, as well as a list of personal ratings for a handful of movies for one particular user.
- Using Spark's ALS machine learning library, we will generate personalized recommendations for that user based on the movie ratings of other people in the dataset.
- Once the recommendations have been generated, we will save them back to MongoDB.

Ready?
Let’s get started! As Spark plays particularly nicely with Scala, this tutorial will use Scala code snippets; a Python example can be found in the GitHub repository. Throughout each step of this tutorial we will flesh out the following code template in order to have a working example by the end.

```scala
package example

import org.apache.log4j.{Level, Logger}
import{ALS, ALSModel}
import{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.{ReadConfig, WriteConfig}

/** Represents a user's movie rating */
case class UserMovieRating(user_id: Int, movie_id: Int, rating: Double)

object MovieRecommendation {

  /**
   * Run this main method to see the output of this quick example,
   * or copy the code into the Spark shell.
   *
   * @param args takes an optional single argument for the connection string
   * @throws Throwable if an operation fails
   */
  def main(args: Array[String]): Unit = {
  }

  /** Gets or creates the Spark Context */
  def getSparkContext(): SparkContext = {
  }
}
```

1. Setting up Spark

Before we can do any work with Apache Spark we must first set up the Spark environment and assign the SparkContext, which represents the connection to a Spark cluster and can be used to create RDDs and DataFrames. We declare a name for the application and decide how many worker threads to use. Let’s flesh out the getSparkContext() method first.

```scala
/**
 * Gets or creates the Spark Context
 */
def getSparkContext(): SparkContext = {
  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("MovieRatings")

  val sc = SparkContext.getOrCreate(conf)
  sc.setCheckpointDir("/tmp/checkpoint/")
  sc
}
```

local[*] runs Spark locally with as many worker threads as there are logical cores on your machine. setCheckpointDir sets a directory under which RDDs will be checkpointed should the operations fill up memory and need to spill to disk.
We’re building out this example on our laptops, but if you’re running on a cluster the directory must be a valid HDFS path.

2. Setting up reading and writing to MongoDB

We’ll also want to make sure that we read data from MongoDB into a DataFrame. A DataFrame is a distributed collection of data organized into named columns; it is conceptually equivalent to a table in a relational database. This means we can run nice SELECT-style operations on DataFrames, so we apply a SQLContext to our SparkContext in order to be able to query the DataFrame with SQL. We’ll also want to make sure that we save the data back into MongoDB once we are done processing it in Spark. The user with userId 0 is the person for whom we will generate movie recommendations. The URI in this example assumes MongoDB is running on localhost (

```scala
def main(args: Array[String]): Unit = {
  // Set up configurations
  val sc = getSparkContext()
  val sqlContext = SQLContext.getOrCreate(sc)
  val readConfig = ReadConfig(Map("uri" -> "mongodb://"))   // point this at your ratings database.collection
  val writeConfig = WriteConfig(Map("uri" -> "mongodb://")) // point this at the output database.collection
  val userId = 0

  // Load the movie rating data
  val movieRatings = MongoSpark.load(sc, readConfig).toDF[UserMovieRating]
```

3. Creating a machine learning model for movie recommendations

We are going to use Apache Spark’s ALS (alternating least squares) library to learn from our dataset in order to make predictions for a user. You can learn more about how ALS generates predictions in the Spark documentation.

```scala
// Create the ALS instance and map the movie data
val als = new ALS()
  .setCheckpointInterval(2)
  .setUserCol("user_id")
  .setItemCol("movie_id")
  .setRatingCol("rating")
```

We can build a grid of parameters in order to get the most accurate model possible. We’ll want to define some variables that we can use to try different permutations during the training:

```scala
// We use a ParamGridBuilder to construct a grid of parameters to search over.
// TrainValidationSplit will try all combinations of values and determine the best model using the ALS evaluator.
val paramGrid = new ParamGridBuilder()
  .addGrid(als.regParam, Array(0.1, 10.0))
  .addGrid(als.rank, Array(8, 10))
  .addGrid(als.maxIter, Array(10, 20))
  .build()
```

For training purposes, we must also split our complete data set into smaller partitions, known as the training, validation, and test data. In this case, we use 80% of the data for training and the rest to validate the model.

```scala
val trainedAndValidatedModel = new TrainValidationSplit()
  .setEstimator(als)
  .setEvaluator(new RegressionEvaluator()
    .setMetricName("rmse")
    .setLabelCol("rating")
    .setPredictionCol("prediction"))
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)
```

Once we have our data set split up and we have trained our model, we can see which model had the best fit for our data:

```scala
// Calculating the best model: fit the split over the ratings and keep
// the best-scoring model from the grid search
val bestModel =
```

4. Combine our personal ratings with the rest of the data set

Once we have our model, we will want to take the personal ratings and combine them with the rest of the dataset in order to train a new model based on the complete set:

```scala
// Combine the datasets
val userRatings = MongoSpark.load(sc, readConfig.copy(collectionName = "personal_ratings")).toDF[UserMovieRating]
val combinedRatings = movieRatings.unionAll(userRatings)

// Retrain using the combined ratings and the best model's parameters
val combinedModel =, bestModel.extractParamMap())
```

5. Get user recommendations

Now we are ready to generate user recommendations. To get user recommendations, we have to make sure our data set only includes movies that have not yet been rated by the user, and that it doesn’t contain any duplicates. We create a new DataFrame to hold user recommendations.
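In plain Python, the filter just described can be previewed before we write the Spark version. The rows below are made up, shaped like UserMovieRating, and the sketch assumes we simply want the distinct movies our user has not yet rated:

```python
user_id = 0

# Made-up (user_id, movie_id, rating) rows, shaped like UserMovieRating.
ratings = [
    (0, 1, 4.0),  # a movie our user has already rated
    (5, 1, 3.0),
    (5, 2, 5.0),
    (7, 2, 4.0),
    (7, 3, 2.0),
]

# Movies our user has already rated...
seen = {movie for (user, movie, _) in ratings if user == user_id}

# ...and the distinct candidates rated by everyone else, minus those.
unrated = sorted({movie for (user, movie, _) in ratings if user != user_id} - seen)
# Each remaining movie_id becomes a (user_id, movie_id) pair for the model to score.
```

The Scala version does the same job against the DataFrame, with the trained model then scoring each remaining (user_id, movie_id) pair.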
```scala
// Get user recommendations
import sqlContext.implicits._

val unratedMovies = movieRatings.filter(s"user_id != $userId")
  .select("movie_id")
  .distinct()
  .map(r => (userId, r.getAs[Int]("movie_id")))
  .toDF("user_id", "movie_id")

val recommendations = combinedModel.transform(unratedMovies)

// Convert the recommendations into UserMovieRatings
val userRecommendations =
  .map(r => UserMovieRating(0, r.getAs[Int]("movie_id"), r.getAs[Float]("prediction").toInt))
  .toDF()
```

6. Save recommendations to MongoDB

Once we have our recommendations generated, it makes sense to save them back into MongoDB for fast lookup in the future:

```scala
// Save to MongoDB"overwrite"), writeConfig)
```

7. Don’t forget to clean up

Finally, let’s clean up the Spark context when we are finished with it. If you are running on Databricks, you don’t need to do this step.

```scala
sc.stop()
```

8. Running the code

You can run the code by using the script in the GitHub repo, which will automatically pull down the connector from the online repository.
```
$ ./
```

At the end of the execution you should have a new collection of user recommendations stored in MongoDB:

```
> db.personal_ratings.find()
{ "_id" : ObjectId("57226a50a45eff77e4dc3fce"), "user_id" : "0", "movie_id" : "1", "rating" : "4" }
{ "_id" : ObjectId("57226a50a45eff77e4dc3fcf"), "user_id" : "0", "movie_id" : "2", "rating" : "4" }
{ "_id" : ObjectId("57226a50a45eff77e4dc3fd0"), "user_id" : "0", "movie_id" : "16", "rating" : "5" }
{ "_id" : ObjectId("57226a50a45eff77e4dc3fd1"), "user_id" : "0", "movie_id" : "19", "rating" : "3" }
{ "_id" : ObjectId("57226a50a45eff77e4dc3fd2"), "user_id" : "0", "movie_id" : "47", "rating" : "4" }
{ "_id" : ObjectId("57226a50a45eff77e4dc3fd3"), "user_id" : "0", "movie_id" : "70", "rating" : "4" }
{ "_id" : ObjectId("57226a50a45eff77e4dc3fd4"), "user_id" : "0", "movie_id" : "163", "rating" : "5" }
{ "_id" : ObjectId("57226a50a45eff77e4dc3fd5"), "user_id" : "0", "movie_id" : "173", "rating" : "1" }
{ "_id" : ObjectId("57226a50a45eff77e4dc3fd6"), "user_id" : "0", "movie_id" : "356", "rating" : "5" }
{ "_id" : ObjectId("57226a50a45eff77e4dc3fd7"), "user_id" : "0", "movie_id" : "364", "rating" : "5" }
>
```

That’s it! You just created a program that gets and stores data with MongoDB, processes it in Spark, and creates intelligent recommendations for users.

Ready to get started?

- You can find the entire example on GitHub
- You can download MongoDB from
- You can download Spark from here
- You can read the MongoDB-Spark connector documentation
- Sign up for our MongoDB University course
- Sign up for the webinar. Register now: Introducing the Spark Connector for MongoDB

About the Author - Sam Weaver

Sam is the Product Manager for Developer Experience at MongoDB, based in New York. Prior to MongoDB, he worked at Red Hat doing technical presales on Linux, Virtualisation, and Middleware. Originally from Cheltenham, England, he received his Bachelors in Computer Science from Cardiff University.
Sam has also cycled from London to Paris, competed in several extreme sports tournaments such as ToughMudder, and swam with great white sharks.

June 28, 2016

Getting Started with MongoDB Compass

MongoDB’s flexible schema and rich document structure allow developers to quickly build applications with rich data structures. However, this flexibility can also make it difficult to understand the structure of the data in an existing database. Until now, if you wanted to understand the structure of your data, you had to use the MongoDB shell to issue queries and view data at the command line. There has to be a better way: enter MongoDB Compass.

What is MongoDB Compass?

MongoDB 3.2 introduces MongoDB Compass, a graphical tool that allows you to easily analyse and understand your database schema, and to visually construct queries, all without having to know MongoDB’s query syntax. MongoDB Compass was built to address three main goals:

- Schema discovery
- Data discovery
- Visual construction of queries

Schema Discovery

Compass displays the data types of fields in a collection’s schema. The example below is taken from a mock dataset that I use when test-driving Compass. It reports that there are documents in the collection that contain a field last_login with the type date. Compass also displays a percentage breakdown for fields with varying data types across documents. In this example, 81% of documents store phone_no as a string, and the remaining 19% store it as a number. For sparse fields, where some documents omit a value, Compass displays the percentage of missing values as “undefined.” Here, the age field is missing in 40% of the sampled documents.

This is exceptionally useful for understanding whether your application is storing data the way that you expect it to. Imagine the case where a field shows a mix of strings and numbers: perhaps an application bug has crept in somewhere and is storing data with a different type than it should be.

Data Discovery

Compass can show histograms representing the frequency and distribution of data within a collection. For example, here is a data set containing the age of users.
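Before moving on, it is worth seeing how little machinery the per-field type breakdown described above needs. This plain-Python sketch mimics that analysis over a sample of documents; the sample data and the `type_breakdown` helper are made up to mirror the phone_no and age examples (and report Python type names rather than Compass's labels):

```python
from collections import Counter

# A small made-up sample: phone_no is inconsistently typed and age is sparse.
sample = [
    {"phone_no": "555-0100", "age": 34},
    {"phone_no": "555-0101"},
    {"phone_no": 5550102, "age": 41},
    {"phone_no": "555-0103", "age": 29},
    {"phone_no": "555-0104"},
]

def type_breakdown(docs, field):
    """Percentage of each value type for `field`; missing values
    are reported as 'undefined', as in the Compass UI."""
    counts = Counter(
        type(d[field]).__name__ if field in d else "undefined" for d in docs
    )
    return {t: 100 * n / len(docs) for t, n in counts.items()}

phone_types = type_breakdown(sample, "phone_no")  # mostly 'str', some 'int'
age_types = type_breakdown(sample, "age")         # includes 'undefined'
```

Compass performs this kind of analysis over a sample of the collection, which is what makes a mistyped or missing field jump out visually.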
We can see the minimum age is 16, the maximum age is 56, and the most popular age is in the late 30s (the exact value is shown by hovering over the bar itself). Here’s another example using a field that stores names; Compass will display a random selection of string values for the field.

Visual Construction of Queries

Do you want an easier way to type out a MongoDB query? Charts in Compass are fully interactive. Clicking on a chart value or bar automatically builds a MongoDB query that matches the selected range in the interface. In the example below, clicking on the “JFK” bar builds a query matching all documents whose departureAirportFsCode field matches “JFK”. Clicking on other field values adds the field and range to the selection, creating a more complex query. Continuing with our example, we can select a particular flightId in addition to departures from JFK Airport. Once you hit the Apply button, Compass executes the query and brings back the results! It’s as easy as it sounds: you can be building queries with a few clicks of a button in no time at all.

One final thing to mention: we didn’t forget about the JSON. Documents can be examined in the document viewing pane, which can be expanded by clicking on the Document Viewer icon on the right-hand side of the page.

I know you must be wondering: where can I get this thing?! MongoDB Compass is available in the download center. It comes included for production use with our subscriptions, both MongoDB Professional and MongoDB Enterprise Advanced. MongoDB Compass can also be used for free in a development environment. This is only version 1.0 of Compass; there is lots of great functionality to come. I’m super excited to be part of the Compass team and I can’t wait for the next set of releases. Give MongoDB Compass a try today.

Download MongoDB Compass

About the author - Sam Weaver

Sam Weaver is the Product Manager for Developer Experience at MongoDB, based in New York.
Prior to MongoDB, he worked at Red Hat doing technical presales on Linux, Virtualisation, and Middleware. Originally from Cheltenham, England, he received his Bachelors in Computer Science from Cardiff University. Sam has also cycled from London to Paris, competed in several extreme sports tournaments such as ToughMudder, and swam with great white sharks.

January 20, 2016