Announcing PyMongo 3.0

< View all blog posts
MongoDB
April 08, 2015
Category: Company

PyMongo 3.0 is a partial rewrite of the Python driver for MongoDB. More than six years after the first release of the driver, this is the biggest release in PyMongo’s history. Bernie Hackett, Luke Lovett, Anna Herlihy, and I are proud of its many improvements and eager for you to try it out. I will shoehorn the major improvements into four shoes: conformance, responsiveness, robustness, and modernity.

Conformance

The motivation for PyMongo’s overhaul is to supersede or remove its many idiosyncratic APIs. We want you to have a clean interface that is easy to learn and, more importantly, closely matches the interfaces of our other drivers.

CRUD API

Mainly, “conformance” means we have implemented the same interface for create, read, update, and delete operations as the other drivers have, as standardized in the Driver CRUD API. My colleague Jeremy Mikola will explain the full spec in a separate article soon, but here is the gist. The familiar old methods work the same in PyMongo 3, but they are deprecated:

  • save()
  • insert()
  • update()
  • remove()
  • find_and_modify()

These methods were vaguely named. For example, “update” updates or replaces some or all matching documents depending on its arguments. The arguments to “save” and “remove” are likewise finicky, and the many options for “find_and_modify” are intimidating. Other MongoDB drivers do not have exactly the same arguments in the same order for all these methods. If you or other developers on your team are using a driver from a different language, it makes life a lot easier to have consistent interfaces.

The new CRUD API names its methods like “update_one”, “insert_many”, “find_one_and_delete”: they say what they mean and mean what they say. Even better, all MongoDB drivers have exactly the same methods with the same arguments. See the spec for details.

One Client Class

In the past we had three client classes: Connection for any one server, and ReplicaSetConnection to connect to a replica set. We also had a MasterSlaveConnection that could distribute reads to slaves in a master-slave set. In November 2012 we created new classes, MongoClient and MongoReplicaSetClient, with better default settings, so now PyMongo had five clients. Even more confusingly, MongoClient could connect to a set of mongos servers and do hot failover.

As I wrote on my blog, the fine distinctions between the client classes baffled users. And the set of clients we provided did not conform with other drivers. But since PyMongo is among the most-used of all Python libraries we waited long, and thought hard, before making major changes.

The day has come. MongoClient is now the one and only client class for a single server, a set of mongoses, or a replica set. It includes the functionality that had been split into MongoReplicaSetClient: it can connect to a replica set, discover all its members, and monitor the set for stepdowns, elections, and reconfigs. MongoClient now also supports the full ReadPreference API. MongoReplicaSetClient lives on for a time, for compatibility’s sake, but new code should use MongoClient exclusively. The obsolete Connection, ReplicaSetConnection, and MasterSlaveConnection are removed.

The options you pass to MongoClient in the URI or as arguments now completely control the client’s behavior:

This is exciting because PyMongo applications are now so easy to deploy: your code simply loads a MongoDB URI from an environment variable or config file and passes it to a MongoClient. Code and configuration are cleanly separated. You can move smoothly from your laptop to a test server to the cloud, simply by changing the URI.

Non-Conforming Features

PyMongo 2 had some quirky features it did not share with other drivers. For one, we had a “copy_database” method that only one other driver had, and which almost no one used. It was hard to maintain and we believe you want us to focus on the features you use, so we removed it.

A more pernicious misfeature was the “start_request” method. It bound a thread to a socket, which hurt performance without actually guaranteeing monotonic write consistency. It was overwhelmingly misused, too: new PyMongo users naturally called “start_request” before starting a request, but in truth the feature had nothing to do with its name. For the history and details, including some entertaining (in retrospect) tales of Python threadlocal bugs, see my article on the removal of start_request.

Finally, the Python team rewrote our distributed-systems internals to conform to the new standards we have specified for all our drivers. But if you are a Python programmer you may care only a little that the new code conforms to a spec; it is more interesting to you that the new code is responsive and robust.

Responsiveness

PyMongo 3’s MongoClient can connect to a single server, a replica set, or a set of mongoses. It finds servers and reacts to changing conditions according to the Server Discovery And Monitoring spec, and it chooses which server to use for each operation according to the Server Selection Spec. David Golden and I explained these specs in general in the linked articles, but I can describe PyMongo’s implementation here.

Replica Set Discovery And Monitoring

In PyMongo 2, MongoReplicaSetClient used a single background thread to monitor all replica set members in series. So a slow or unresponsive member could block the thread for some time before the thread moved on to discover information about the other members, like their network latencies or which member is primary. If your application was waiting for that information—say, to write to the new primary after an election—these delays caused unneeded seconds of downtime.

When PyMongo 3’s new MongoClient connects to a replica set it starts one thread per mongod server. The threads fan out to connect to all members of the set in parallel, and they start additional threads as they discover more members. As soon as any thread discovers the primary, your application is unblocked, even while the monitor threads collect more information about the set. This new design improves PyMongo’s response time tremendously. If some members are slow or down, or you have many members in your set, PyMongo’s discovery is still just as fast.

I will explain and demonstrate the new design in my MongoDB World talk, “Drivers And High Availability: Deep Dive.

Mongos Load-Balancing

Our multi-mongos behavior is improved, too. A MongoClient can connect to a set of mongos servers:

The behavior in PyMongo 2 was “high availability”: the client connected to the lowest-latency mongos in the list, and used it until a network error prompted it to re-evaluate their latencies and reconnect to one of them. If the driver chose unwisely at first, it stayed pinned to a higher-latency mongos for some time. In PyMongo 3, the background threads monitor the client’s network latency to all the mongoses continuously, and the client distributes operations evenly among those with the lowest latency. See "mongos Load Balancing" for more information.

Throughput

Besides PyMongo’s improved responsiveness to changing conditions in your deployment, its throughput is better too. We have written a faster and more memory efficient pure python BSON module, which is particularly important for PyPy, and made substantial optimizations in our C extensions.

Robustness

Disconnected Startup
The first change you may notice is, MongoClient’s constructor no longer blocks while connecting. It does not raise ConnectionFailure if it cannot connect:

The constructor returns immediately and launches the connection process on background threads. Of course, foreground operations might time out:

Meanwhile, the client’s background threads keep trying to reach the server. This is a big win for web applications that use PyMongo—in a crisis, your app servers might be restarted while your MongoDB servers are unreachable. Your applications should not throw an exception at startup, when they construct the client object. In PyMongo 3 the client can now start up disconnected; it tries to reach your servers until it succeeds.

On the other hand if you wrote code like this to check if mongod is up:

This will not work any more, since the constructor never throws ConnectionFailure now. Instead, choose how long to wait before giving up by setting “serverSelectionTimeoutMS”:

One Monitor Thread Per Server

Even during regular operations, connections may hang up or time out, and servers go down for periods; monitoring each on a separate thread keeps PyMongo abreast of changes before they cause errors. You will see fewer network exceptions than with PyMongo 2, and the new driver will recover much faster from the unexpected.

Thread Safety

Another source of fragility in PyMongo 2 was APIs that were not designed for multithreading. Too many of PyMongo’s options could be changed at runtime. For example, if you created a database handle:

… and changed the handle’s read preference on a thread, the change appeared in all threads:

Making these options mutable encouraged such mistakes, so we made them immutable. Now you configure handles to databases and collections using thread-safe APIs:

Modernity

Last, and most satisfying to the team, we have completed our transition to modern Python.

While PyMongo 2 already supported the latest version of Python 3, it did so tortuously by executing “auto2to3” on its source at install time. This made it too hard for the open source community to contribute to our code, and it led to some absurdly obscure bugs. We have updated to a single code base that is compatible with Python 2 and 3. We had to drop support for the ancient Pythons 2.4 and 2.5; we were encouraged by recent download statistics to believe that these zombie Python versions are finally at rest.



If you’re interested in learning more about the architecture of MongoDB, download our guide:

Download the Architecture Guide
comments powered by Disqus