Jeremy Mikola


A Consistent CRUD API for Next Generation MongoDB Drivers

One of the more notable challenges with maintaining a suite of drivers across many languages has been following individual language idioms while still keeping their APIs consistent with each other. For example, the Ruby driver should feel like any other Ruby library when it comes to design and naming conventions. At the same time, the behavior for API calls should be the same across all of the drivers. Towards the end of 2014, a handful of MongoDB driver developers started working on a CRUD API specification for our next generation drivers. The CRUD acronym refers to create, read, update, and delete operations, which are commonly found on each driver's Collection interface. In truth, the spec covers a bit more than those four methods:

Create
Read
Update
Delete
Count
Replace
Aggregate
Distinct
Bulk, One or Many
Find and Modify

For obvious reasons, we decided to do without the full CRUDCRADBOoMFaM acronym and stick with CRUD.

Compared to the Server Selection and SDAM specifications, which deal with internal driver behavior, the CRUD API is a high-level specification; however, the goal of improving consistency across our drivers is one and the same. To ensure that multiple language viewpoints were considered in drafting the spec, the team included Craig Wilson (C#), Jeff Yemin (Java), Tyler Brock (C and C++), and myself (representing PHP and other dynamic languages).

What's in a Name?

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

The spec's position on function and option names perhaps best illustrates the balancing act between language idiomaticity and cross-driver consistency. While the spec is flexible on style (e.g. snake_case or camelCase, common suffixes), certain root words are non-negotiable.
The spec doesn't attempt to define an exhaustive list of permitted deviations, but it does provide a few examples for guidance:

batchSize and batch_size are both acceptable, but batchCount is not, since "batch" and "size" are the root words.
maxTimeMS can be abbreviated as maxTime if the language provides a data type with millisecond precision (e.g. TimeSpan in C#), but maximumTime is too verbose.
If a driver's find() method needs a typed options class (e.g. Java) in lieu of a hash literal (e.g. JavaScript) or named parameters (e.g. Python), FindOptions or FindArgs are both OK, but QueryParams would be inconsistent.
Some languages may prefer to prefix boolean options with "is" or "has", so a bulk write's ordered option could be named isOrdered.

Several Options for Handling Options

In addition to naming conventions, the spec acknowledges that each language has its own conventions for expressing optional parameters to functions. Ruby and Python support named parameters, JavaScript and PHP might use hash literals, C++ or C# may use an options class, and Java could opt for a fluent builder class. Ultimately, we decided not to require method overloading, since it is only supported by a few languages. Required parameters, such as the fieldName for a distinct command or the pipeline for an aggregation, must always be positional arguments on the CRUD method. This ensures that all drivers will present a consistent public API for each method and its essential inputs.

Query Modifiers and Cursor Flags

The query API found in our legacy drivers differentiates between query modifiers and wire protocol flags. Commonly used query modifiers include $orderBy, for sorting query results, and $hint, for suggesting an index. Wire protocol flags, on the other hand, might be used to instruct the server to create a tailable cursor. Depending on the driver, these options might be specified via arguments to find() or any of various setter methods on a mutable Cursor object.
The CRUD API now enforces consistent naming for these options and ensures they will all be specified in the same manner, be it an options structure for find() or a fluent interface. Ultimately, users should never have to think about whether these query options are modifiers within the query document or bit flags at the protocol level. That distinction is an implementation detail of today's server API. Similar to how MongoDB 2.6 introduced write commands and deprecated write operations in the wire protocol, we expect a future version of the server to do the same for queries. In fact, progress on find and getMore commands has already begun in SERVER-15176. By abstracting away these details in the CRUD API, we can achieve a bit of future-proofing for our drivers and the applications that use them.

A Step Towards Self-documenting Code

One of the common pain points with our legacy API, especially for beginners, was that update operations affected only a single document by default, while deletes would remove everything matching the criteria. The inconsistency around the name of this limit option (is it multi, multiple, or justOne?) was icing on the cake. This is definitely something we wanted to fix in the CRUD spec, but one has to tread carefully when changing the behavior of methods that can modify or delete data. In the interest of not surprising any users by silently changing defaults, we opted to define some new, more descriptive methods:

deleteOne(filter)
deleteMany(filter)
replaceOne(filter, replacement, options)
updateOne(filter, update, options)
updateMany(filter, update, options)

The most striking change is that we've moved the limit option into the name of each method. This allows drivers to leave their existing update() and delete() (or remove()) methods as-is. Secondly, delete operations now require a filter argument, which means it will take a bit more effort to inadvertently wipe out a collection (deleteMany({}) instead of remove()).
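As a hypothetical illustration of the one-vs-many semantics now baked into each method name, here is a sketch over an in-memory array. Everything here is invented for illustration: the matcher only supports top-level equality, whereas real drivers delegate matching to the server.

```javascript
// Hypothetical in-memory sketch of the new write methods' semantics.
// matches() is a stand-in for MongoDB's query matching and only
// supports top-level equality, which is enough to show how the
// one-vs-many limit now lives in the method name itself.
function matches(doc, filter) {
  return Object.keys(filter).every((key) => doc[key] === filter[key]);
}

// deleteOne: removes at most one matching document.
function deleteOne(docs, filter) {
  const i = docs.findIndex((doc) => matches(doc, filter));
  if (i !== -1) docs.splice(i, 1);
  return docs;
}

// deleteMany: the filter is required, so wiping a collection must be
// spelled out explicitly as deleteMany({}) rather than a bare remove().
function deleteMany(docs, filter) {
  if (filter === undefined) throw new Error('a filter is required');
  return docs.filter((doc) => !matches(doc, filter));
}

const docs = [
  { _id: 1, status: 'old' },
  { _id: 2, status: 'old' },
  { _id: 3, status: 'new' },
];

deleteOne(docs, { status: 'old' });                    // removes only _id: 1
const remaining = deleteMany(docs, { status: 'old' }); // removes _id: 2
```

Under these semantics, a reader can tell at the call site whether one or many documents are affected, without remembering a driver-specific multi or justOne default.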
And lastly, we wanted to acknowledge the difference between replacing an entire document and updating specific fields in one or many documents. By having each method check whether the document contains atomic modifiers, we hope to help users avoid the mistake of clobbering an entire document when they expected to modify specific fields, or vice versa.

Less is More

Some things are better left unsaid. While the CRUD spec contains a lot of detail, there are a few subjects which aren't addressed:

Read preferences
Write concerns
Fluent API for bulk writes
Explaining queries

With regard to read preferences and write concerns, we noted that not every driver allows those options to be specified on a per-operation basis. For some, read preferences and write concerns are only set on the Client, Database, or Collection objects. Nevertheless, the spec happily permits drivers to support additional options on their read and write methods.

The Bulk API, which first appeared in the MongoDB shell and select drivers around the time MongoDB 2.6 was released, was left alone. The CRUD spec defines a single bulkWrite() method, which receives an array of models, each describing the parameters for an insert, update, or delete operation. We felt this method was more versatile, as it does not impose a fluent API (with all of its method calls) upon the user, nor does it hide the list of operations within a builder object. Users can create, examine, or modify the list however they like before executing it through the new method, or even re-use it entirely in a subsequent call.

Lastly, we spent a fair amount of time discussing (and bikeshedding) the API for explaining queries, aggregation pipelines, and any other operations that might be supported by MongoDB 3.0 and beyond (e.g. SERVER-10448). Ultimately, we determined that explain is not a typical use case for drivers, in contrast to the shell.
We also did not want to effectively double the public API of the CRUD specification by defining explainable variants of each method. That said, all drivers will continue to provide the necessary tools to execute explains (either through queries or command execution).

Wrapping Up

If you're interested in digging deeper into any of the topics discussed in this article (and some that weren't, such as error reporting), do give the CRUD API spec a look. We've also published a set of standardized acceptance tests in YAML and JSON formats, which are being used by many of our next generation drivers that implement the spec. To learn more about what's new in MongoDB 3.0, download the white paper below.

About the Author - Jeremy

Jeremy Mikola is a software engineer at MongoDB's NYC office. As a member of the driver and evangelism team, he helps develop the PHP driver and contributes to various OSS projects, such as Doctrine ODM and React PHP. Jeremy lives in Hoboken, NJ and is known to enjoy a good sandwich.

API: Application programming interface
CRUD: Create, read, update, delete
JSON: JavaScript object notation
SDAM: Server discovery and monitoring
YAML: YAML ain't markup language

April 16, 2015

Call for Feedback: The New PHP and HHVM Drivers

In the beginning Kristina created the MongoDB PHP driver. Now the PECL mongo extension was new and untested, write operations tended to be fire-and-forget, and Boolean parameters made more sense than $options arrays. And Kristina said, "Let there be MongoCollection," and there was basic functionality.

Since the PHP driver first appeared on the scene, MongoDB has gone through many changes. Replica sets and sharding arrived early on, but things like the aggregation framework and command cursors were little more than a twinkle in Eliot's eye at the time. The early drivers were designed with many assumptions in mind: write operations and commands were very different; the largest replica set would have no more than a dozen nodes; cursors were only returned by basic queries. In 2015, we know that these assumptions no longer hold true.

Beyond MongoDB's features, our ecosystem has also changed. When the PHP driver, a C extension, was first implemented, there wasn't yet a C driver that we could utilize. Therefore, the 1.x PHP driver contains its own BSON and connection management C libraries. HHVM, an alternative PHP runtime with its own C++ extension API, also did not exist years ago, nor was PHP 7.0 on the horizon. Lastly, methods of packaging and distributing libraries have changed. Composer has superseded PEAR as the de facto standard for PHP libraries, and support for extensions (currently handled by PECL) is forthcoming.

During the spring of 2014, we worked with a team of students from Facebook's Open Academy program to prototype an HHVM driver modeled after the 1.x API. The purpose of that project was twofold: research HHVM's extension API and determine the feasibility of building a driver atop libmongoc (our then new C driver) and libbson. Although the final result was not feature complete, the project was a valuable learning experience.
The C driver proved quite up to the task, and HNI, which allows an HHVM extension to be written with a combination of PHP and C++, highlighted critical areas of the driver for which we'd want to use C. This all leads up to the question of how best to support PHP 5.x, HHVM, and PHP 7.0 with our next-generation driver. Maintaining three disparate, monolithic extensions is not sustainable. We also cannot eschew the extension layer for a pure PHP library, like mongofill, without sacrificing performance. Thankfully, we can compromise! Here is a look at the architecture for our next-generation PHP driver:

At the top of this stack sits a pure PHP library, which we will distribute as a Composer package. This library will provide an API similar to what users have come to expect from the 1.x driver (e.g. CRUD methods, database and collection objects, command helpers) and we expect it to be a common dependency for most applications built with MongoDB. This library will also implement common specifications, in the interest of improving API consistency across all of the drivers maintained by MongoDB (and hopefully some community drivers, too).

Sitting below that library, we have the lower-level drivers (one per platform). These extensions will effectively form the glue between PHP and HHVM and our system libraries (libmongoc and libbson). These extensions will expose an identical public API for the most essential and performance-sensitive functionality:

Connection management
BSON encoding and decoding
Object document serialization (to support ODM libraries)
Executing commands and write operations
Handling queries and cursors

By decoupling the driver internals and the high-level API into extensions and PHP libraries, respectively, we hope to reduce our maintenance burden and allow for faster iteration on new features. As a welcome side effect, this also makes it easier for anyone to contribute to the driver.
Additionally, an identical public API for these extensions will make it that much easier to port an application across PHP runtimes, whether the application uses the low-level driver directly or a higher-level PHP library.

GridFS is a great example of why we chose this direction. Although we implemented GridFS in C for our 1.x driver, it is actually quite a high-level specification. Its API is just an abstraction for accessing two collections: files (i.e. metadata) and chunks (i.e. blocks of data). Likewise, all of the syntactic sugar found in the 1.x driver, such as processing uploaded files or exposing GridFS files as PHP streams, can be implemented in pure PHP. Provided we have performant methods for reading from and writing to GridFS' collections – and thanks to our low-level extensions, we will – shifting this API to PHP is a win-win.

Earlier I mentioned that we expect the PHP library to be a common dependency for most applications, but not all. Some users may prefer to stick to the no-frills API offered by the extensions, or create their own high-level abstraction (akin to Doctrine MongoDB for the 1.x driver), and that's great! Hannes has talked about creating a PHP library geared for MongoDB administration, which provides an API for various user management and ops commands. I'm looking forward to building the next major version of Doctrine MongoDB ODM directly atop the extensions.

While we will continue to maintain and support the 1.x driver and its users for the foreseeable future, we invite everyone to check out our next-generation driver and consider it for any new projects going forward.
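To illustrate the GridFS point above, here is a hypothetical sketch (in JavaScript rather than PHP, purely for brevity) of how a file download reduces to reads against the files and chunks collections. The collections are modeled as plain arrays; the files_id, n, and data field names follow the GridFS convention, while everything else is invented.

```javascript
// Hypothetical sketch: a GridFS download is just metadata lookup in
// "files" plus ordered reads from "chunks" — exactly the kind of
// high-level sugar that can live in a pure PHP (or JS) library, as
// long as the low-level driver reads the collections quickly.
function downloadFile(filesCollection, chunksCollection, filename) {
  const file = filesCollection.find((f) => f.filename === filename);
  if (!file) throw new Error('file not found: ' + filename);

  // Chunks are keyed by the file's _id and ordered by chunk number (n).
  const parts = chunksCollection
    .filter((c) => c.files_id === file._id)
    .sort((a, b) => a.n - b.n)
    .map((c) => c.data);

  // The final chunk may be padded, so trim to the recorded length.
  return Buffer.concat(parts).slice(0, file.length);
}

const files = [{ _id: 1, filename: 'hello.txt', length: 11 }];
const chunks = [
  { files_id: 1, n: 1, data: Buffer.from('lo w') },
  { files_id: 1, n: 0, data: Buffer.from('hel') },
  { files_id: 1, n: 2, data: Buffer.from('orld') },
];

const contents = downloadFile(files, chunks, 'hello.txt').toString();
// contents is 'hello world'
```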
You can find all of the essential components across GitHub and JIRA:

Project                  GitHub                      JIRA
PHP Library              mongodb/mongo-php-library   PHPLIB
PHP 5.x Driver (phongo)  mongodb/mongo-php-driver    PHPC
HHVM Driver (hippo)      mongodb/mongo-hhvm-driver   HHVM

The existing PHP project in JIRA will remain open for reporting bugs against the 1.x driver, but we would ask that you use the new projects above for anything pertaining to our next-generation drivers.

If you're interested in hearing more about our upcoming PHP and HHVM drivers, Derick Rethans is presenting a new talk entitled One Extension, Two Engines at php[tek] 2015 in May.

March 10, 2015

2dsphere, GeoJSON, and Doctrine MongoDB

By Jeremy Mikola, 10gen software engineer and maintainer of Doctrine MongoDB ODM.

It seems that GeoJSON is all the rage these days. Last month, Ian Bentley shared a bit about the new geospatial features in MongoDB 2.4. Derick Rethans, one of my PHP driver teammates and a renowned OpenStreetMap aficionado, recently blogged about importing OSM data into MongoDB as GeoJSON objects. A few days later, GitHub added support for rendering .geojson files in repositories, using a combination of Leaflet.js, MapBox, and OpenStreetMap data. Coincidentally, I visited a local CloudCamp meetup last week to present on geospatial data, and for the past two weeks I've been working on adding support for MongoDB 2.4's geospatial query operators to Doctrine MongoDB.

Doctrine MongoDB is an abstraction for the PHP driver that provides a fluent query builder API, among other useful features. It's used internally by Doctrine MongoDB ODM, but is completely usable on its own. One of the challenges in developing the library has been supporting multiple versions of MongoDB and the PHP driver. The introduction of read preferences last year is one such example. We wanted to still allow users to set slaveOk bits for older server and driver versions, but allow read preferences to apply for newer versions, all without breaking our API and while abiding by semantic versioning. Now, the setSlaveOkay() method in Doctrine MongoDB will invoke setReadPreference() if it exists in the driver, and fall back to the deprecated setSlaveOkay() driver method otherwise.

Query Builder API

Before diving into the geospatial changes for Doctrine MongoDB, let's take a quick look at the query builder API. Suppose we had a collection, test.places, with some OpenStreetMap annotations (key=value strings) stored in a tags array and a loc field containing longitude/latitude coordinates in MongoDB's legacy point format (a float tuple) for a 2d index.
Doctrine's API allows queries to be constructed like so:

    $connection = new \Doctrine\MongoDB\Connection();
    $collection = $connection->selectCollection('test', 'places');

    $qb = $collection->createQueryBuilder()
        ->field('loc')
        ->near(-73.987415, 40.757113)
        ->maxDistance(0.00899928)
        ->field('tags')
        ->equals('amenity=restaurant');

    $cursor = $qb->getQuery()->execute();

The above example executes the following query:

    {
        "loc": {
            "$near": [-73.987415, 40.757113],
            "$maxDistance": 0.00899928
        },
        "tags": "amenity=restaurant"
    }

This simple query will return restaurants within half a kilometer of 10gen's NYC office at 229 West 43rd Street. If only it were so easy to find good restaurants near Times Square!

Supporting New and Old Geospatial Queries

When the new 2dsphere index type was introduced in MongoDB 2.4, operators such as $near and $geoWithin were changed to accept GeoJSON geometry objects in addition to their legacy point and shape arguments. $near was particularly problematic because of its optional $maxDistance argument. As shown above, $maxDistance previously sat alongside $near and was measured in radians. It now sits within $near and is measured in meters. Using a 2dsphere index and GeoJSON points, the same query takes on a whole new shape:

    {
        "loc": {
            "$near": {
                "$geometry": {
                    "type": "Point",
                    "coordinates": [-73.987415, 40.757113]
                },
                "$maxDistance": 500
            }
        },
        "tags": "amenity=restaurant"
    }

This posed a hurdle for Doctrine MongoDB's query builder, because we wanted to support 2dsphere queries without drastically changing the API. Unfortunately, there was no obvious way for near() to discern whether a pair of floats denoted a legacy or GeoJSON point, or whether a number signified radians or meters in the case of maxDistance(). I also anticipated we might run into a similar quandary for the $geoWithin builder method, which accepts an array of point coordinates.
Method overloading seemed preferable to creating separate builder methods or introducing a new "mode" parameter to handle 2dsphere queries. Although PHP has no language-level support for overloading, it is commonly implemented by inspecting an argument's type at runtime. In our case, this would necessitate having classes for GeoJSON geometries (e.g. Point, LineString, Polygon), which we could differentiate from the legacy geometry arrays.

Introducing a GeoJSON Library for PHP

A cursory search for GeoJSON PHP libraries turned up php-geojson, from the MapFish project, and geoPHP. I was pleased to see that geoPHP was available via Composer (PHP's de facto package manager), but neither library implemented the GeoJSON spec in its entirety. This seemed like a ripe opportunity to create such a library, and so geojson was born a few days later.

At the time of this writing, 2dsphere support for Doctrine's query builder is still being developed; however, I envision it will take the following form when complete:

    use GeoJson\Geometry\Point;

    // ...

    $qb = $collection->createQueryBuilder()
        ->field('loc')
        ->near(new Point([-73.987415, 40.757113]))
        ->maxDistance(0.00899928)
        ->field('tags')
        ->equals('amenity=restaurant');

All of the GeoJson classes implement JsonSerializable, one of the newer interfaces introduced in PHP 5.4, which will allow Doctrine to prepare them for MongoDB queries with a single method call. One clear benefit over the legacy geometry arrays is that the GeoJson library performs its own validation. When a Polygon is passed to geoWithin(), Doctrine won't have to worry about whether all of its rings are closed LineStrings; the library would catch such an error in the constructor. This helps achieve a separation of concerns, which in turn increases the maintainability of both libraries. I look forward to finishing up 2dsphere support for Doctrine MongoDB in the coming weeks.
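The constructor validation described above might look something like the following sketch. This is JavaScript rather than PHP and the coordinates are invented, but the rule it enforces comes straight from the GeoJSON spec: a Polygon ring must be a closed LinearRing with at least four positions, the first equal to the last.

```javascript
// Hypothetical sketch of the ring validation a GeoJSON library can
// perform in its constructors, so malformed geometry fails fast
// instead of reaching a $geoWithin query.
function assertLinearRing(ring) {
  if (ring.length < 4) {
    throw new Error('a ring requires at least four positions');
  }
  const first = ring[0];
  const last = ring[ring.length - 1];
  if (first[0] !== last[0] || first[1] !== last[1]) {
    throw new Error('a ring must be closed (first position === last)');
  }
}

function polygon(rings) {
  rings.forEach(assertLinearRing); // validate before any query is built
  return { type: 'Polygon', coordinates: rings };
}

// A closed ring (invented coordinates around midtown Manhattan):
const midtown = polygon([[
  [-73.99, 40.75], [-73.98, 40.75], [-73.98, 40.76],
  [-73.99, 40.76], [-73.99, 40.75],
]]);

// An open ring is rejected in the constructor, long before $geoWithin:
// polygon([[[-73.99, 40.75], [-73.98, 40.75], [-73.98, 40.76]]]); // throws
```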
In the meantime, if you happen to fall in the fabled demographic of PHP developers in need of a full GeoJSON implementation, please give geojson a look and share some feedback.

June 24, 2013

MongoQP: MongoDB Slow Query Profiler

Two times a year, 10gen's Drivers and Innovations team gathers for a face-to-face meeting to work together and set goals for the upcoming six months. This year, the group broke up into teams for an evening hackathon. MongoQP, a query profiler, was one of the hacks presented by Jeremy Mikola, PHP Evangelist at 10gen.

Logging slow queries is essential for any database application, and MongoDB makes doing so relatively painless with its database profiler. Unfortunately, making sense of the system.profile collection and tying its contents back to your application requires a bit more effort. The heart of mongoqp (Mongo Query Profiler) is a bit of map/reduce JS that aggregates those queries by their BSON skeleton (i.e. keys preserved, but values removed). With queries reduced to their bare structure, any of their statistics can be aggregated, such as average query time, index scans, counts, etc.

As a fan of Genghis, a single-file MongoDB admin app, I originally intended to contribute a new UI with the profiler results, but one night was not enough time to wrap my head around Backbone.js and develop the query aggregation. Instead, I whipped up a quick frontend using the Silex PHP micro-framework. But with the hack day deadline no longer looming, there should be plenty of time to get this functionality ported over to Genghis. Additionally, the map/reduce JS may also show up in Tyler Brock's mongo-hacker shell enhancement package.

While presenting mongoqp to my co-workers, I also learned about Dan Crosta's professor, which already provides many of the features I hoped to implement, such as incremental data collection. I think there is still a benefit to developing the JS innards of mongoqp and getting its functionality ported over to other projects, but I would definitely encourage you to check out professor if you'd like a stand-alone query profile viewer. Contributions are welcome through GitHub.
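The BSON-skeleton idea at the heart of mongoqp can be sketched as follows. This is a hypothetical JavaScript rendition for illustration, not the project's actual map/reduce code; it keeps keys, strips values, and sorts keys so that queries with the same shape collapse to one skeleton.

```javascript
// Hypothetical sketch of mongoqp's core idea: reduce a query document
// to its skeleton (keys preserved, values removed) so queries with the
// same shape can be grouped and their profiler stats aggregated.
function skeleton(query) {
  if (Array.isArray(query)) return query.map(skeleton);
  if (query !== null && typeof query === 'object') {
    const result = {};
    // Sort keys so field order doesn't produce distinct skeletons.
    for (const key of Object.keys(query).sort()) {
      result[key] = skeleton(query[key]);
    }
    return result;
  }
  return null; // scalar values are stripped
}

// Two queries with different values share one skeleton, so their
// execution times can be averaged together in the profiler output:
const a = skeleton({ tags: 'amenity=restaurant', loc: { $near: [-73.9, 40.7] } });
const b = skeleton({ loc: { $near: [-70.0, 42.0] }, tags: 'amenity=pub' });
// JSON.stringify(a) === JSON.stringify(b)
```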

December 5, 2012