Testing and Packaging a Python Library
This is the second tutorial in a series! Feel free to check out the first tutorial if you like, but it's not necessary if you want to just read on.
This tutorial is loosely based on the second episode of a new livestream I host, called "Coding with Mark." I'm streaming on Wednesdays at 2 p.m. GMT (that's 9 a.m. Eastern or 6 a.m. Pacific, if you're an early riser!). If that time doesn't work for you, you can always catch up by watching the recording!
Currently, I'm building an experimental data access layer library that should provide a toolkit for abstracting complex document models from the business logic layer of the application that's using them.
Testing is easier when the code you're testing is relatively standalone and can be tested in isolation. Sadly, code that works with data within MongoDB is at the other end of the spectrum — it's an integration test by definition because you're testing your integration with MongoDB.
You have two options when writing tests that work with MongoDB:
- Mock out MongoDB, so instead of working with MongoDB, your code works with an object that just looks like MongoDB but doesn't really store data. mongomock is a good solution if you're following this technique.
- Work directly with MongoDB, but ensure the database is in a known state before your tests run (by loading test data into an empty database) and then clean up any changes you make after your tests are run.
The first approach is architecturally simpler: your tests don't run against MongoDB, so you don't need to configure or run a real MongoDB server. On the other hand, you need to manage an object that pretends to be a MongoClient, or a Database, or a Collection, so that it responds in accurate ways to any calls made against it. And because it's not a real MongoDB connection, it's easy to use those objects in ways that don't accurately reflect a real MongoDB connection.
My preferred approach is the latter: My tests will run against a real MongoDB instance, and I will have the test framework clean up my database after each run using transactions. This makes it harder to run the tests and they may run more slowly, but it should do a better job of highlighting real problems interacting with MongoDB itself.
Before I rushed in and wrote my own plugin for pytest, I decided to see what others had done before me. I am building my own ODM, after all, and there's only so much room for Not Invented Here™ in my life. There are two reasonably popular pytest integrations for use with MongoDB: pytest-mongo and pytest-mongodb. Sadly, neither did quite what I wanted. But they both look good, so if they do what you want, then I recommend using them.
Pytest-mongo is a pytest plugin that enables you to test code that relies on a running MongoDB database. It allows you to specify fixtures for the MongoDB process and client, and it will spin up a MongoDB process to run tests against, if you configure it to do so.
Pytest-mongodb is a pytest plugin that enables you to test code that relies on a database connection to a MongoDB instance and expects certain data to be present. It allows you to specify fixtures for database collections in JSON/BSON or YAML format. Under the hood, it uses mongomock to simulate a MongoDB connection, or you can use a real MongoDB connection, if you prefer.
Both of these offer useful features, especially the ability to provide fixture data that's specified in files on disk. Pytest-mongo even provides the ability to clean up the database after each test! When I looked a bit further, though, I found that it does this by deleting all the collections in the test database, which is not the behavior I was looking for.
I want to use MongoDB transactions to automatically roll back any changes that are made by each test. This way, the test won't actually commit any changes to MongoDB, and only the changes it would have made are rolled back, so the database will be efficiently left in the correct state after each test run.
I'm going to use pytest's fixtures feature to provide both a MongoDB connection object and a transaction session to each test that requires them. Behind the scenes, each fixture object will clean up after itself when it is finished.
Fixtures in pytest are defined as functions, usually in a file called conftest.py. The thing that often surprises people new to fixtures, however, is that pytest will magically provide them to any test function with a parameter with the same name as the fixture. It's a form of dependency injection: pytest matches each parameter name to a fixture function and passes in the value that function provides.
As well as pytest providing fixture values to test functions, it will also do the same with other fixture functions. I'll be making use of this in the second fixture I write.
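As a sketch of that name-based injection, here is a small example. The fixture and test names (sample_fixture, shouting_fixture, and so on) are made up for illustration and aren't part of my library:

```python
import pytest


@pytest.fixture
def sample_fixture():
    # Hypothetical fixture: whatever this function returns is the fixture's value.
    return "Hello, World"


@pytest.fixture
def shouting_fixture(sample_fixture):
    # Fixtures can request other fixtures by parameter name, too.
    return sample_fixture.upper()


def test_sample_fixture(sample_fixture):
    # pytest matches the parameter name to the fixture function above.
    assert sample_fixture == "Hello, World"


def test_shouting_fixture(shouting_fixture):
    # This value was built from sample_fixture by the second fixture.
    assert shouting_fixture == "HELLO, WORLD"
```

Neither test calls a fixture function directly. pytest does that and passes the result in.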
Fixtures are called once for their scope, and by default, a fixture's scope is "function" which means it'll be called once for each test function. I want my "session" fixture to be called (and rolled back) for each function, but it will be much more efficient for my "mongodb" client fixture to be called once per session — i.e., at the start of my whole test run.
The final bit of pytest fixture theory I want to explain is that if you want something cleaned up after a scope is over — for example, when the test function is complete — the easiest way to accomplish this is to write a generator function using yield instead of return, like this:
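A minimal sketch of such a generator fixture follows. The sample_resource name and the dictionary it yields are placeholders I've invented to make the set-up and clean-up phases visible:

```python
import pytest


@pytest.fixture
def sample_resource():
    # Set-up: runs before the test function is called.
    resource = {"open": True}
    # The test runs while the fixture is paused at this yield.
    yield resource
    # Clean-up: runs once the fixture's scope (here, one test function) is over.
    resource["open"] = False


def test_resource_is_open(sample_resource):
    # The test receives the yielded value, not the generator itself.
    assert sample_resource["open"]
```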
I don't know about you, but despite the magic, I really like this setup. It's nice and consistent, once you know how to use it.
The first fixture I need is one that returns a MongoClient instance that is connected to a MongoDB cluster.
Incidentally, MongoDB Atlas Serverless clusters are perfect for this as they don't cost anything when you're not using them. If you're only running your tests a few times a day, or even less, then this could be a good way to save on hosting costs for test infrastructure.
I want to provide configuration to the test runner via an environment variable, MDB_URI, which will be the connection string provided by Atlas. In the future, I may want to provide the connection string via a command-line flag, which is something you can do with pytest, but I'll leave that to later.
As I mentioned before, the scope of the fixture should be "session" so that the client is configured once at the start of the test run and then closed at the end. I'm actually going to leave clean-up to Python, so I won't do that explicitly myself.
Here's the fixture:
The above code means that I can write a test that reads from a MongoDB cluster:
As I mentioned, the fixture above is fine for reading from an existing database, but any changes made to the data would be persisted after the tests were finished. In order to correctly clean up after the test run, I need to start a transaction before the test run and then abort the transaction after the test run so that any changes are rolled back. This is how Django's test runner works with relational databases!
In MongoDB, to create a transaction, you first need to start a session which is done with the start_session method on the MongoClient object. Once you have a session, you can call its start_transaction method to start a transaction and its abort_transaction method to roll back any database updates that were run between the two calls.
One warning here: You must provide the session object to all your queries or they won't be considered part of the session you've started. All of this together looks like this:
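Here is a sketch of those calls in order, wrapped in a function so it stands alone. The function name and the "docbridge_test" and "scratch" database and collection names are placeholders I've chosen for illustration:

```python
def insert_then_roll_back(client):
    # `client` is expected to be a connected pymongo MongoClient.
    collection = client.get_database("docbridge_test").get_collection("scratch")
    session = client.start_session()
    session.start_transaction()
    try:
        # Passing session= is what makes this write part of the transaction.
        collection.insert_one({"this document": "will be erased"}, session=session)
    finally:
        # Discard everything written since start_transaction().
        session.abort_transaction()
        session.end_session()
```

Bear in mind that MongoDB transactions need a replica set or sharded cluster, so this won't work against a standalone server. Atlas clusters are fine.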
That's not too bad. Now, I'll show you how to wrap up that logic in a fixture.
The fixture takes the code above, replaces the middle with a yield statement, and wraps it in a fixture function. This time, I haven't specified the scope of the fixture, so it defaults to "function" which means that the abort_transaction call will be made after each test function is executed.
Just to be sure that the test fixture both rolls back changes and allows subsequent queries to access data inserted during the transaction, I have a test in my test_docbridge.py file. Note that the calls to insert_one and find_one both provide the rollback_session fixture value as a session argument. If you forget it, unexpected things will happen!
Packaging a Python library has always been slightly daunting, and it's made more so by the fact that these days, the packaging ecosystem changes quite a bit. At the time of writing, a good back end for building Python packages is hatchling from the Hatch project.
In broad terms, for a simple Python package, the steps to publishing your package are these:
- Describe your package.
- Build your package.
- Push the package to PyPI.
Before you go through these steps, it's worth installing the following packages into your development environment:
- build - used for installing your build dependencies and packaging your project
- twine - used for securely pushing your packages to PyPI
You can install both of these with: python -m pip install --upgrade build twine
First, you need to describe your project. Once upon a time, this would have required a setup.py file. These days, pyproject.toml is the way to go. I'm just going to link to the pyproject.toml file in GitHub. You'll see that the file describes the project. It lists pymongo as a dependency. It also states that "hatchling.build" is the build back end in a couple of lines toward the top of the file.
It's not super interesting, but it does allow you to do the next step...
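If you'd rather not follow the link, a minimal pyproject.toml along the same lines might look like the sketch below. It is not a copy of the real file: the version, description, and Python requirement are placeholder assumptions, and only the pymongo dependency and the hatchling back end come from the description above.

```toml
[build-system]
# Tells build front ends to use hatchling to build the package.
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "docbridge"
version = "0.0.1"                                  # placeholder
description = "An experimental data access layer"  # placeholder
requires-python = ">=3.8"                          # assumption
dependencies = ["pymongo"]
```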
Once you've described your project, you can build a distribution from it by running the following command: python -m build
Once the wheel and gzipped tarball have been created, they can be published to PyPI (assuming the library name is still unique!) by running Twine: python -m twine upload dist/*
And that's it! I don't know about you, but I always go and check that it really worked.
This week's work has been super satisfying. I know this work done upfront to make tests run in transactions and to publish the library, while it's still relatively simple, will pay for itself in time.
In my next tutorial, I'm going to start looking at data embedded in documents. I'm going to extend the docbridge framework so that it can deal with embedded arrays, as work toward what I really want to get done next. I'm going to teach docbridge that arrays aren't always entirely contained within a document: sometimes they're subsets, and sometimes they're extended references!
I'm really excited about some of the abstraction building blocks I have planned, so make sure to read my next tutorial, or if you prefer, join me on the livestream at 2 p.m. GMT on Wednesdays!