
Coding With Mark: Abstracting Joins & Subsets in Python

Mark Smith • 11 min read • Published Mar 19, 2024 • Updated Mar 19, 2024
MongoDB • Aggregation Framework • Python
This tutorial will talk about MongoDB design patterns — specifically, the Subset Pattern — and show how you can build an abstraction in your Python data model that hides how data is actually modeled within your database.
This is the third tutorial in a series! Feel free to check out the first tutorial or second tutorial if you like, but it's not necessary if you want to just read on.

Coding with Mark?

This tutorial is loosely based on some episodes of a livestream I host, called "Coding with Mark." I'm streaming on Wednesdays at 2 p.m. GMT (that's 9 a.m. ET or 6 a.m. PT, if you're an early riser!). If that time doesn't work for you, you can always catch up by watching the recordings!
Currently, I'm building an experimental data access layer library that should provide a toolkit for abstracting complex document models from the business logic layer of the application that's using them.
You can check out the code in the project's GitHub repository!

Setting the scene

The purpose of docbridge, my Object-Document Mapper, is to abstract the data model used within MongoDB from the data model used by a Python program. With a codebase of any size, you need something like this because otherwise, every time you change your data model (in your database), you need to change the object model (in your code). By having an abstraction layer, you localize all of this mapping into a single area of your codebase, and that's then the only part that needs to change when you change your data model. This ability to change your data model really allows you to take advantage of the flexibility of MongoDB's document model.
In the first tutorial, I showed a very simple abstraction, the FallbackField, that would try various different field names in a document until it found one that existed, and then would return that value. This was a very simple implementation of the Schema Versioning pattern.
In this tutorial, I'm going to abstract something more complex: the Subset Pattern.

The Subset Pattern

MongoDB allows you to store arrays in your documents, natively. The values in those arrays can be primitive types, like numbers, strings, dates, or even subdocuments. But sometimes, those arrays can get too big, and the Subset Pattern describes a technique where the most important subset of the array (often just the first few items) is stored directly in the embedded array, and any overflow items are stored in other documents and looked up only when necessary.
This solves two design problems: First, we recommend that you don't store more than 200 items in an array, as the more items you have, the slower the database is at traversing the fields in each document. Second, the subset pattern also answers a question that I've seen many times when we've been teaching data modeling: "How do I stop my array from growing so big that the document becomes bigger than the 16MB limit?" While we're on the subject, do avoid your documents getting this big — it usually implies that you could improve your data model, for example, by separating out data into separate documents, or if you're storing lots of binary data, you could keep it outside your database, in an object store.

Implementing the SequenceField type

Before delving into how to abstract a lookup for the extra array items that aren't embedded in the source document, I'll first implement a wrapper type for a BSON array. This can be used to declare array fields on a Document class, instead of the Field type that I implemented in previous articles.
I'm going to define a SequenceField to map a document's array into my access layer's object model. The core functionality of a SequenceField is you can specify a type for the array's items, and then when you iterate through the sequence, it will return you objects of that type, instead of just yielding the type that's stored in the document.
A concrete example would be a social media API's UserProfile class, which would store a list of Follower objects. I've created some sample documents with a Python script using Faker. A sample document looks like this:
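The article's sample document isn't reproduced here, so the following is a plausible shape based on the surrounding description (a profile with the first 20 followers embedded); the exact field names and values are my assumptions:

```python
# Hypothetical sample profile document (field names are assumptions):
profile_doc = {
    "user_id": "4",
    "user_name": "@tanya15",
    "full_name": "Deborah White",
    "follower_count": 59,
    "followers": [
        {"user_id": "21", "user_name": "@rduncan"},
        {"user_id": "44", "user_name": "@paynericky"},
        # ... up to 20 embedded follower subdocuments
    ],
}
```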
I can model this data using two classes — one for the top-level Profile data, and one for the summary data for that profile's followers (embedded in the array).
If I want to loop through all the followers of a profile instance, each item should be a Follower instance:
This behavior can be implemented in a similar way to the Field class, by implementing it as a descriptor, with a __get__ method that, in this case, yields a Follower constructed for each item in the underlying BSON array. The code looks a little like this:
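The real docbridge code isn't shown here, so this is a minimal self-contained reconstruction of the idea: a descriptor whose __get__ yields each array item wrapped in the configured type. The Field and Document classes are simplified stand-ins for the ones from the earlier tutorials.

```python
class Field:
    """Simplified stand-in for docbridge's Field descriptor."""
    def __set_name__(self, owner, name):
        self.name = name
    def __get__(self, ob, cls=None):
        return ob._doc[self.name]

class SequenceField:
    """Wraps a BSON array, yielding each item wrapped in `type`."""
    def __init__(self, type):
        self._type = type
    def __set_name__(self, owner, name):
        self.name = name
    def __get__(self, ob, cls=None):
        # Yield a wrapped instance for each item in the underlying array:
        return (self._type(item, ob._db) for item in ob._doc[self.name])

class Document:
    def __init__(self, doc, db=None):
        self._doc = doc
        self._db = db

class Follower(Document):
    user_name = Field()

class Profile(Document):
    user_id = Field()
    followers = SequenceField(type=Follower)

profile = Profile({"user_id": "4", "followers": [{"user_name": "@rduncan"}]})
```

Iterating `profile.followers` now yields Follower instances rather than raw dicts, which is the behavior the article describes.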
That's a lot of code, but quite a lot of it is duplicated from Field - I'll fix that with some inheritance at some point. The most important part is near the end:
In the concrete example above, this would resolve to something like this fictional code:

Adding in the extra followers

The dataset I've created for working with this only stores the first 20 followers within a profile document. The rest are stored in a "followers" collection, and they're bucketed to store up to 20 followers per document, in a field called "followers." The "user_id" field says who the followers belong to. A single document in the "followers" collection looks like this:
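The bucket document itself isn't reproduced here; based on the description, a document in the "followers" collection might look like this (values are illustrative):

```python
# Hypothetical bucket document from the "followers" collection:
bucket_doc = {
    "user_id": "4",  # which profile these followers belong to
    "followers": [
        {"user_id": "58", "user_name": "@mheath"},
        {"user_id": "62", "user_name": "@susan44"},
        # ... up to 20 followers per bucket document
    ],
}
```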
A document containing a "followers" field that contains some more followers for the user with a "user_id" of "4"
The Bucket Pattern is a technique for putting lots of small subdocuments together in a bucket document, which can make it more efficient to retrieve documents that are usually retrieved together, and it can keep index sizes down. The downside is that it makes updating individual subdocuments slightly slower and more complex.

How to query documents in buckets

I have a collection where each document contains an array of followers — a "bucket" of followers. But what I want is a query that returns individual follower documents. Let's break down how this query will work:
  1. I want to look up all the documents for a particular user_id.
  2. For each item in followers — each item is a follower — I want to yield a single document for that follower.
  3. I want to restructure each document so that it only contains the follower information, not the bucket information.
This is what I love about aggregation pipelines — once I've come up with those steps, I can often convert each step into an aggregation pipeline stage.
Step 1: Look up all the documents for a particular user:
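Expressed as a Python dict (the form PyMongo pipelines take), this stage might look like the following, with the hard-coded value discussed below:

```python
# Stage 1: select the bucket documents for one user.
match_stage = {"$match": {"user_id": "4"}}
```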
Note that this stage has hard-coded the value "4" for the "user_id" field. I'll explain later how dynamic values can be inserted into these queries. This outputs a single document, a bucket, containing many followers, in a field called "followers":
Step 2: Yield a document for each follower — the $unwind stage can do exactly this:
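As a sketch, the stage is:

```python
# Stage 2: emit one document per item in the "followers" array.
unwind_stage = {"$unwind": "$followers"}
```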
This instructs MongoDB to return one document for each item in the "followers" array. All of the document contents will be included, but the followers array will be replaced with the single follower subdocument each time. This outputs several documents, each containing a single follower in the "followers" field:
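The unwound documents might look like this (illustrative values, matching the hypothetical bucket above):

```python
# Shape of the $unwind output: one doc per follower, bucket fields retained.
unwound_docs = [
    {"user_id": "4", "followers": {"user_id": "58", "user_name": "@mheath"}},
    {"user_id": "4", "followers": {"user_id": "62", "user_name": "@susan44"}},
]
```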
Step 3: Restructure the document, pulling the "follower" value up to the top-level of the document. There's a special stage for doing this — $replaceRoot:
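As a sketch:

```python
# Stage 3: promote the single follower subdocument to the top level.
replace_root_stage = {"$replaceRoot": {"newRoot": "$followers"}}
```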
Adding the stage above results in each document containing a single follower, at the top level:
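Continuing the illustrative example, the final output documents would be plain follower documents:

```python
# Shape of the pipeline's final output (illustrative values):
final_docs = [
    {"user_id": "58", "user_name": "@mheath"},
    {"user_id": "62", "user_name": "@susan44"},
]
```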
Putting it all together, the query looks like this:
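Combining the three stages gives the full pipeline, matching the steps above:

```python
# The complete aggregation pipeline (user_id still hard-coded for now):
pipeline = [
    {"$match": {"user_id": "4"}},
    {"$unwind": "$followers"},
    {"$replaceRoot": {"newRoot": "$followers"}},
]
```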
I've explained the query that I want to be run each time I iterate through the followers field in my data abstraction library. Now, I'll show you how to hide this query (or whatever query is required) away in the SequenceField implementation.

Abstracting out the Lookup

Now, I would like to change the behavior of the SequenceField so that it does the following:
  • Iterate through the embedded subdocuments and yield each one, wrapped by type (the callable that wraps each subdocument).
  • If the user gets to the end of the embedded array, make a query to look up the rest of the followers and yield them one by one, also wrapped by type.
First, I'll change the __init__ method so that the user can provide two extra parameters:
  • The collection that contains the extra documents, superset_collection
  • The query to run against that collection to return individual documents, superset_query
The result looks like this:
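A sketch of the extended initializer, using the parameter names from the article (the rest of the class is omitted here, and defaults are my assumptions):

```python
class SequenceField:
    """Sketch: only the extended __init__ is shown."""
    def __init__(
        self,
        type,
        field_name=None,
        superset_collection=None,
        superset_query=None,
    ):
        self._type = type
        self.field_name = field_name
        self.superset_collection = superset_collection
        self.superset_query = superset_query

field = SequenceField(
    type=dict,
    superset_collection="followers",
    superset_query=lambda ob: [{"$match": {"user_id": ob._doc["user_id"]}}],
)
```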
The query will have to be provided as a callable, i.e., a function, lambda expression, or method. The reason for that is that generating the query will usually need access to some of the state of the document (in this case, the user_id, to construct the query to look up the correct follower documents). The callable is stored in the Field instance, and then when the lookup is needed, it calls the callable, passing it the Document that contains the Field, so the callable can look up the user "_id" in the wrapped _doc dictionary.
Now that the user can provide enough information to look up the extra followers (the superset), I changed the __get__ method to perform the lookup when it runs out of embedded followers. To make this simpler to write, I took advantage of laziness. Twice! Here's how:
Laziness Part 1: When you execute a query by calling find or aggregate, the query is not executed immediately. Instead, the method immediately returns a cursor. Cursors are lazy — which means they don't do anything until you start to use them, by iterating over their contents. As soon as you start to iterate, or loop, over the cursor, it then queries the database and starts to yield results.
Laziness Part 2: Most of the functions in the core Python itertools module are also lazy, including the chain function. Chain is called with one or more iterables as arguments and then only starts to loop through the later arguments when the earlier iterables are exhausted (meaning the code has looped through all of the contents of the iterable.)
These can be combined to create a single iterable that will never request any extra followers from the database, unless the code specifically requests more items after looping through the embedded items:
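The combination can be demonstrated without a database. Here, a generator function stands in for a lazy PyMongo cursor, and a list records when the "query" actually runs; nothing happens until the embedded items are exhausted:

```python
from itertools import chain

queries_run = []

def fake_cursor():
    # Stands in for a lazy pymongo cursor: the body doesn't run
    # until something iterates over it.
    queries_run.append("superset query")
    yield {"user_name": "@extra1"}
    yield {"user_name": "@extra2"}

# The embedded subset, as it would come from the profile document:
embedded = iter([{"user_name": "@a"}, {"user_name": "@b"}])

# chain only moves on to the "cursor" once `embedded` is exhausted:
combined = chain(embedded, fake_cursor())

first_two = [next(combined) for _ in range(2)]
ran_before_exhaustion = list(queries_run)  # still empty: no round trip yet

rest = list(combined)  # only now does the "cursor" execute
```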
The real code is a bit more flexible, because it supports both find and aggregate queries. It recognises the type because find queries are provided as dicts, and aggregate queries are lists.
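The dispatch might be sketched like this; the function name and the stub collection are mine, not docbridge's, but the dict-vs-list convention is the one the article describes:

```python
class StubCollection:
    """Minimal stand-in for a pymongo Collection, just for this demo."""
    def find(self, filter):
        return ("find", filter)
    def aggregate(self, pipeline):
        return ("aggregate", pipeline)

def run_superset_query(collection, query_callable, doc):
    # A dict means a find query; a list means an aggregation pipeline.
    query = query_callable(doc)
    if isinstance(query, dict):
        return collection.find(query)
    if isinstance(query, list):
        return collection.aggregate(query)
    raise TypeError("superset_query must return a dict (find) or a list (aggregate)")

find_result = run_superset_query(StubCollection(), lambda doc: {"user_id": "4"}, None)
agg_result = run_superset_query(
    StubCollection(), lambda doc: [{"$unwind": "$followers"}], None
)
```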
I've added quite a few comments to the code above, so hopefully you can see the relationship between the simplified code above it and the real code here.

Using the SequenceField to declare relationships

Implementing Profile and Follower is now a matter of providing the query (wrapped in a lambda expression) and the collection that should be queried.
An application that used the above Profile definition could look up the Profile with "user_id" of "4" and then print out the user names of all their followers with some code like this:
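Putting the whole thing together, here is a self-contained sketch of both the mapping declaration and its use. The real docbridge API may differ, the collection and field names are assumptions, and the "database" is an in-memory stand-in so the example runs without MongoDB:

```python
from itertools import chain

class Field:
    def __set_name__(self, owner, name):
        self.name = name
    def __get__(self, ob, cls=None):
        return ob._doc[self.name]

class SequenceField:
    def __init__(self, type, superset_collection=None, superset_query=None):
        self._type = type
        self._collection = superset_collection
        self._query = superset_query
    def __set_name__(self, owner, name):
        self.name = name
    def __get__(self, ob, cls=None):
        # Embedded subset first, then the lazily-queried superset:
        embedded = (self._type(item, ob._db) for item in ob._doc[self.name])
        extra = (
            self._type(item, ob._db)
            for item in ob._db[self._collection].aggregate(self._query(ob))
        )
        return chain(embedded, extra)

class Document:
    def __init__(self, doc, db=None):
        self._doc = doc
        self._db = db

class Follower(Document):
    user_name = Field()

class Profile(Document):
    user_id = Field()
    followers = SequenceField(
        type=Follower,
        superset_collection="followers",
        superset_query=lambda ob: [
            {"$match": {"user_id": ob._doc["user_id"]}},
            {"$unwind": "$followers"},
            {"$replaceRoot": {"newRoot": "$followers"}},
        ],
    )

class FakeCollection:
    """In-memory stand-in supporting only the pipeline above."""
    def __init__(self, docs):
        self._docs = docs
    def aggregate(self, pipeline):
        user_id = pipeline[0]["$match"]["user_id"]
        for doc in self._docs:
            if doc["user_id"] == user_id:
                yield from doc["followers"]

db = {
    "followers": FakeCollection(
        [{"user_id": "4", "followers": [{"user_name": "@extra_follower"}]}]
    )
}

profile = Profile(
    {"user_id": "4", "followers": [{"user_name": "@embedded_follower"}]}, db
)
names = [f.user_name for f in profile.followers]
```

With a real deployment, `db` would be a PyMongo Database and the profile document would come from a `find_one` call; the iteration code stays the same.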
See how the extra query is now part of the type's mapping definition and not the code dealing with the data? That's the kind of abstraction I wanted to provide when I started building this experimental library. I have more plans, so stick with me! But before I implement more data abstractions, I first need to implement updates — that's something I'll describe in my next tutorial.

Conclusion

This is now the third tutorial in my Python data abstraction series, and I'll admit that this was the code I envisioned when I first came up with the idea of the docbridge library. It's been super satisfying to get to this point, and because I've been developing the whole thing with test-driven development practices, there's already good code coverage.
If you're looking for more information on aggregation pipelines, you should have a look at Practical MongoDB Aggregations — or now, you can buy an expanded version of the book in paperback.
If you're interested in the abstraction topics and Python code architecture in general, you can buy the Architecture Patterns with Python book, or read it online.
I livestream most weeks, usually at 2 p.m. UTC on Wednesdays. If that sounds interesting, check out the MongoDB YouTube channel. I look forward to seeing you there!
