Coding With Mark: Abstracting Joins & Subsets in Python

This tutorial will talk about MongoDB design patterns — specifically, the Subset Pattern — and show how you can build an abstraction in your Python data model that hides how data is actually modeled within your database.

This is the third tutorial in a series! Feel free to check out the first tutorial or second tutorial if you like, but it's not necessary if you want to just read on.

Coding with Mark?

This tutorial is loosely based on some episodes of a livestream I host, called "Coding with Mark." I'm streaming on Wednesdays at 2 p.m. GMT (that's 9 a.m. ET or 6 a.m. PT, if you're an early riser!). If that time doesn't work for you, you can always catch up by watching the recordings!

Currently, I'm building an experimental data access layer library that should provide a toolkit for abstracting complex document models from the business logic layer of the application that's using them.

You can check out the code in the project's GitHub repository!

Setting the scene

The purpose of docbridge, my Object-Document Mapper, is to abstract the data model used within MongoDB from the data model used by a Python program. With a codebase of any size, you need something like this because otherwise, every time you change your data model (in your database), you need to change the object model (in your code). By having an abstraction layer, you localize all of this mapping into a single area of your codebase, and that's then the only part that needs to change when you change your data model. This ability to change your data model really allows you to take advantage of the flexibility of MongoDB's document model.

In the first tutorial, I showed a very simple abstraction, the FallbackField, that would try various different field names in a document until it found one that existed, and then would return that value. This was a very simple implementation of the Schema Versioning pattern.
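As a quick reminder of the idea, a fallback descriptor can be sketched in a few lines. This is a simplified illustration rather than docbridge's actual implementation: the constructor signature, the minimal `Document` stand-in, and the `Person` class are all assumptions made for this example.

```python
class FallbackField:
    """Return the first candidate field that exists in the wrapped document.

    A sketch of the idea only, not docbridge's own code.
    """

    def __init__(self, field_names):
        self.field_names = field_names

    def __set_name__(self, owner, name):
        self.name = name  # Attribute name, used in error messages.

    def __get__(self, ob, cls):
        if ob is None:
            return self  # Accessed on the class, not an instance.
        for field_name in self.field_names:
            if field_name in ob._doc:
                return ob._doc[field_name]
        raise ValueError(
            f"Attribute {self.name!r} is not mapped to any of {self.field_names!r}."
        )


class Document:
    """Minimal stand-in for a document wrapper (hypothetical)."""

    def __init__(self, doc, db=None):
        self._doc = doc
        self._db = db


class Person(Document):
    # Newer documents use "full_name"; older ones only have "name".
    full_name = FallbackField(["full_name", "name"])


old_style = Person({"name": "Deborah White"})
new_style = Person({"full_name": "Deborah White", "name": "Deb"})
print(old_style.full_name)  # Deborah White
print(new_style.full_name)  # Deborah White
```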

In this tutorial, I'm going to abstract something more complex: the Subset Pattern.

The Subset Pattern

MongoDB allows you to store arrays natively in your documents. The values in those arrays can be primitive types, like numbers, strings, and dates, or they can be subdocuments. But sometimes those arrays can get too big. The Subset Pattern describes a technique where the most important subset of the array (often just the first few items) is stored directly in the embedded array, and any overflow items are stored in other documents and looked up only when necessary.

This solves two design problems: First, we recommend that you don't store more than 200 items in an array, as the more items you have, the slower the database is at traversing the fields in each document. Second, the subset pattern also answers a question that I've seen many times when we've been teaching data modeling: "How do I stop my array from growing so big that the document becomes bigger than the 16MB limit?" While we're on the subject, do avoid your documents getting this big — it usually implies that you could improve your data model, for example, by separating out data into separate documents, or if you're storing lots of binary data, you could keep it outside your database, in an object store.

Implementing the SequenceField type

Before delving into how to abstract a lookup for the extra array items that aren't embedded in the source document, I'll first implement a wrapper type for a BSON array. This can be used to declare array fields on a Document class, instead of the Field type that I implemented in previous articles.

I'm going to define a SequenceField to map a document's array into my access layer's object model. The core feature of a SequenceField is that you can specify a type for the array's items. When you iterate through the sequence, it returns objects of that type, instead of the raw values stored in the document.

A concrete example would be a social media API's Profile class, which would store a list of Follower objects. I've created some sample documents with a Python script using Faker. A sample document looks like this:

Code Snippet

I can model this data using two classes — one for the top-level Profile data, and one for the summary data for that profile's followers (embedded in the array).

Code Snippet

If I want to loop through all the followers of a profile instance, each item should be a Follower instance:

Code Snippet
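The two classes and the loop might be sketched as follows. The `Document` and `Field` classes here are simplified stand-ins written for this example, and the `SequenceField` is a cut-down version of the one shown in full further down, so treat the exact signatures as assumptions rather than docbridge's API. The sample values are taken from the example documents later in this tutorial.

```python
class Document:
    """Minimal stand-in: wraps a raw document and a database handle."""

    def __init__(self, doc, db):
        self._doc = doc
        self._db = db


class Field:
    """Minimal stand-in: maps an attribute to a field in the document."""

    def __init__(self, field_name=None):
        self.field_name = field_name

    def __set_name__(self, owner, name):
        if self.field_name is None:
            self.field_name = name

    def __get__(self, ob, cls):
        return ob._doc[self.field_name]


class SequenceField(Field):
    """Cut-down version: wraps each array item in the given type."""

    def __init__(self, type, field_name=None):
        super().__init__(field_name)
        self._type = type

    def __get__(self, ob, cls):
        return [self._type(item, ob._db) for item in ob._doc[self.field_name]]


class Follower(Document):
    user_id = Field()
    user_name = Field()


class Profile(Document):
    user_id = Field()
    user_name = Field()
    followers = SequenceField(type=Follower)


profile = Profile(
    {
        "user_id": "4",
        "user_name": "@tanya15",
        "followers": [
            {"user_id": "58", "user_name": "@rduncan"},
            {"user_id": "45", "user_name": "@paynericky"},
        ],
    },
    db=None,  # No database is needed until overflow lookups are added.
)

for follower in profile.followers:
    assert isinstance(follower, Follower)
    print(follower.user_name)
```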

This behavior can be implemented in a similar way to the Field class, by implementing it as a descriptor, with a __get__ method that, in this case, returns a list containing a Follower constructed for each item in the underlying BSON array. The code looks a little like this:

Code Snippet

class SequenceField:
    """
    Allows an underlying array to have its elements wrapped in
    Document instances.
    """

    def __init__(
        self,
        type,
        field_name=None,
    ):
        self._type = type
        self.field_name = field_name

    def __set_name__(self, owner, name):
        """
        Called when the enclosing Document subclass (owner) is defined.
        """
        self.name = name  # Store the attribute name.

        # If a field-name mapping hasn't been provided,
        # the BSON field will have the same name as the attribute name.
        if self.field_name is None:
            self.field_name = name

    def __get__(self, ob, cls):
        """
        Called when the SequenceField attribute is accessed on the enclosed
        Document subclass.
        """
        try:
            # Lookup the field in the BSON, and return an array where each item
            # is wrapped by the class defined as type in __init__:
            return [
                self._type(item, ob._db)
                for item in ob._doc[self.field_name]
            ]
        except KeyError as ke:
            raise ValueError(
                f"Attribute {self.name!r} is mapped to missing document property {self.field_name!r}."
            ) from ke

That's a lot of code, but quite a lot of it is duplicated from Field, and I'll fix that with some inheritance at some point. The most important part is near the end:

Code Snippet

In the concrete example above, this would resolve to something like this fictional code:

Code Snippet

Adding in the extra followers

The dataset I've created for working with this only stores the first 20 followers within a profile document. The rest are stored in a "followers" collection, and they're bucketed to store up to 20 followers per document, in a field called "followers." The "user_id" field says who the followers belong to. A single document in the "followers" collection looks like this:

The Bucket Pattern is a technique for putting lots of small subdocuments together in a bucket document, which can make it more efficient to retrieve documents that are usually retrieved together, and it can keep index sizes down. The downside is that it makes updating individual subdocuments slightly slower and more complex.
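To make the bucketing concrete, here is a small sketch of how a list of overflow followers could be grouped into bucket documents of up to 20 items. The `bucket_followers` helper and the generated user names are hypothetical; only the "user_id" and "followers" field names come from the dataset described above.

```python
def bucket_followers(user_id, followers, bucket_size=20):
    """Yield bucket documents, each holding up to bucket_size followers."""
    for start in range(0, len(followers), bucket_size):
        yield {
            "user_id": user_id,
            "followers": followers[start : start + bucket_size],
        }


# 39 overflow followers: a profile with 59 followers, minus the 20 embedded ones.
overflow = [{"user_id": str(n), "user_name": f"@user{n}"} for n in range(39)]

buckets = list(bucket_followers("4", overflow))
print([len(bucket["followers"]) for bucket in buckets])  # [20, 19]
```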

How to query documents in buckets

I have a collection where each document contains an array of followers — a "bucket" of followers. But what I want is a query that returns individual follower documents. Let's break down how this query will work:

I want to look up all the documents for a particular user_id.
For each item in followers — each item is a follower — I want to yield a single document for that follower.
I want to restructure each document so that it only contains the follower information, not the bucket information.

This is what I love about aggregation pipelines — once I've come up with those steps, I can often convert each step into an aggregation pipeline stage.

Step 1: Look up all the documents for a particular user:

Code Snippet
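In PyMongo, a stage like this can be written as a plain Python dict. The sketch below is based on the description in this tutorial; the `matches` helper is a hypothetical, pure-Python illustration of what an equality-only $match does, not part of any library.

```python
# Stage 1: keep only the bucket documents that belong to user "4".
match_stage = {"$match": {"user_id": "4"}}


def matches(document, criteria):
    """Rough equivalent of a $match that only uses equality tests."""
    return all(document.get(key) == value for key, value in criteria.items())


sample_buckets = [
    {"user_id": "4", "followers": [{"user_name": "@rduncan"}]},
    {"user_id": "5", "followers": [{"user_name": "@paynericky"}]},
]

selected = [b for b in sample_buckets if matches(b, match_stage["$match"])]
print(selected)  # Only the bucket with "user_id" of "4" remains.
```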

Note that this stage has hard-coded the value "4" for the "user_id" field. I'll explain later how dynamic values can be inserted into these queries. This outputs a single document, a bucket, containing many followers, in a field called "followers":

Code Snippet

{
  "user_name": "@tanya15",
  "full_name": "Deborah White",
  "birth_date": {
    "$date": "1999-07-06T00:00:00.000Z"
  },
  "email": "deanjacob@yahoo.com",
  "bio": "Music conference able doctor degree debate. Participant usually above relate.",
  "user_id": "4",
  "follower_count": 59,
  "followers": [
    {
      "_id": {
        "$oid": "657072b66731c9e580e9dda6"
      },
      "user_id": "58",
      "user_name": "@rduncan",
      "bio": "Rich beautiful color life. Relationship instead win join enough board successful."
    },
    {
      "bio": "Picture day couple democratic morning. Environment manage opportunity option star food she. Occur imagine population single avoid.",
      "_id": {
        "$oid": "657072b66731c9e580e9dd99"
      },
      "user_id": "45",
      "user_name": "@paynericky"
    },
    {
      "_id": {
        "$oid": "657072b76731c9e580e9ddba"
      },
      "user_id": "78",
      "user_name": "@tiffanyhicks",
      "bio": "Sign writer win. Look television official information laugh. Lay plan effect break expert message during firm."
    },
   . . .
  ],
  "_id": {
    "$oid": "657072b56731c9e580e9dd70"
  }
}

Step 2: Yield a document for each follower — the $unwind stage can do exactly this:

Code Snippet
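The stage itself is tiny. To show its effect without a database, the sketch below pairs it with a hypothetical pure-Python `unwind` helper that approximates the behavior for array fields (documents where the array is missing or empty produce no output, which matches the stage's default).

```python
# Stage 2: emit one document per item in the "followers" array.
unwind_stage = {"$unwind": "$followers"}


def unwind(documents, field):
    """Pure-Python approximation of $unwind for array fields."""
    for document in documents:
        for item in document.get(field, []):
            # Copy the document, replacing the array with a single item.
            yield {**document, field: item}


bucket = {
    "user_id": "4",
    "followers": [{"user_name": "@rduncan"}, {"user_name": "@paynericky"}],
}

unwound = list(unwind([bucket], "followers"))
for document in unwound:
    print(document)
```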

This instructs MongoDB to return one document for each item in the "followers" array. All of the document contents will be included, but the followers array will be replaced with the single follower subdocument each time. This outputs several documents, each containing a single follower in the "followers" field:

Code Snippet

# First document:
{
  "bio": "Music conference able doctor degree debate. Participant usually above relate.",
  "follower_count": 59,
  "followers": {
    "_id": {
      "$oid": "657072b66731c9e580e9dda6"
    },
    "user_id": "58",
    "user_name": "@rduncan",
    "bio": "Rich beautiful color life. Relationship instead win join enough board successful."
  },
  "user_id": "4",
  "user_name": "@tanya15",
  "full_name": "Deborah White",
  "birth_date": {
    "$date": "1999-07-06T00:00:00.000Z"
  },
  "email": "deanjacob@yahoo.com",
  "_id": {
    "$oid": "657072b56731c9e580e9dd70"
  }
}

# Second document
{
  "_id": {
    "$oid": "657072b56731c9e580e9dd70"
  },
  "full_name": "Deborah White",
  "email": "deanjacob@yahoo.com",
  "bio": "Music conference able doctor degree debate. Participant usually above relate.",
  "follower_count": 59,
  "user_id": "4",
  "user_name": "@tanya15",
  "birth_date": {
    "$date": "1999-07-06T00:00:00.000Z"
  },
  "followers": {
    "_id": {
      "$oid": "657072b66731c9e580e9dd99"
    },
    "user_id": "45",
    "user_name": "@paynericky",
    "bio": "Picture day couple democratic morning. Environment manage opportunity option star food she. Occur imagine population single avoid."
  }
}

# . . . More documents follow

Step 3: Restructure the document, pulling the "followers" value up to the top level of the document. There's a special stage for doing this, $replaceRoot:

Code Snippet
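As a sketch, the stage can be written like this in Python. The `replace_root` helper is a hypothetical pure-Python approximation that only handles the simple case of promoting one field to be the whole document.

```python
# Stage 3: make the "followers" subdocument the new top-level document.
replace_root_stage = {"$replaceRoot": {"newRoot": "$followers"}}


def replace_root(documents, field):
    """Pure-Python approximation: each output is the value of one field."""
    for document in documents:
        yield document[field]


unwound = [
    {"user_id": "4", "followers": {"user_id": "58", "user_name": "@rduncan"}},
    {"user_id": "4", "followers": {"user_id": "45", "user_name": "@paynericky"}},
]

followers = list(replace_root(unwound, "followers"))
print(followers)
```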

Adding the stage above results in each document containing a single follower, at the top level:

Code Snippet

Putting it all together, the query looks like this:

Code Snippet
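Assembled from the three steps described above, the pipeline could be sketched as a Python list. Wrapping it in a function (the name `followers_pipeline` is my own for this example) shows one way to avoid hard-coding the "user_id" value.

```python
def followers_pipeline(user_id):
    """Build the pipeline that returns one document per overflow follower."""
    return [
        # 1. Find the bucket documents for this user.
        {"$match": {"user_id": user_id}},
        # 2. Emit one document per follower in each bucket.
        {"$unwind": "$followers"},
        # 3. Promote each follower subdocument to the top level.
        {"$replaceRoot": {"newRoot": "$followers"}},
    ]


pipeline = followers_pipeline("4")
print(pipeline)

# With PyMongo, and assuming `db` is a Database containing the "followers"
# collection, the pipeline would be run with something like:
#     db.get_collection("followers").aggregate(pipeline)
```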

I've explained the query that I want to be run each time I iterate through the followers field in my data abstraction library. Now, I'll show you how to hide this query (or whatever query is required) away in the SequenceField implementation.

Abstracting out the Lookup

Now, I would like to change the behavior of the SequenceField so that it does the following:

Iterate through the embedded subdocuments and yield each one, wrapped by type (the callable that wraps each subdocument.)
If the user gets to the end of the embedded array, make a query to look up the rest of the followers and yield them one by one, also wrapped by type.

First, I'll change the __init__ method so that the user can provide two extra parameters:

The collection that contains the extra documents, superset_collection
The query to run against that collection to return individual documents, superset_query

The result looks like this:

Code Snippet

The query will have to be provided as a callable, i.e., a function, lambda expression, or method. That's because generating the query will usually need access to some of the document's state (in this case, the user_id, to construct the query that looks up the correct follower documents). The callable is stored in the SequenceField instance. When the lookup is needed, the field calls the callable, passing it the Document that contains the field, so the callable can look up the "user_id" in the wrapped _doc dictionary.
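Here is a minimal sketch of that arrangement, showing only the constructor and the callable. The `FakeDocument` class is a hypothetical stand-in that exists just to carry a `_doc` dictionary, and `type=dict` is a placeholder where a Document subclass such as Follower would normally go.

```python
class SequenceField:
    """Constructor only: stores the extra lookup configuration."""

    def __init__(
        self,
        type,
        field_name=None,
        superset_collection=None,
        superset_query=None,
    ):
        self._type = type
        self.field_name = field_name
        self.superset_collection = superset_collection
        self.superset_query = superset_query


class FakeDocument:
    """Hypothetical stand-in for a Document instance."""

    def __init__(self, doc):
        self._doc = doc


followers_field = SequenceField(
    type=dict,  # Placeholder for a Document subclass such as Follower.
    superset_collection="followers",
    # The callable receives the enclosing document and builds the query.
    superset_query=lambda ob: [
        {"$match": {"user_id": ob._doc["user_id"]}},
        {"$unwind": "$followers"},
        {"$replaceRoot": {"newRoot": "$followers"}},
    ],
)

profile = FakeDocument({"user_id": "4"})
query = followers_field.superset_query(profile)
print(query[0])  # {'$match': {'user_id': '4'}}
```

Because the query is only generated when the callable is invoked, the same field definition works for every profile, whatever its "user_id".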

Now that the user can provide enough information to look up the extra followers (the superset), I changed the __get__ method to perform the lookup when it runs out of embedded followers. To make this simpler to write, I took advantage of laziness. Twice! Here's how:

Laziness Part 1: When you execute a query by calling find or aggregate, the query is not executed immediately. Instead, the method immediately returns a cursor. Cursors are lazy — which means they don't do anything until you start to use them, by iterating over their contents. As soon as you start to iterate, or loop, over the cursor, it then queries the database and starts to yield results.

Laziness Part 2: Most of the functions in Python's standard itertools module are also lazy, including the chain function. chain is called with one or more iterables as arguments, and it only starts to loop through the later arguments once the earlier iterables are exhausted (meaning the code has looped through all of the contents of the iterable).

These can be combined to create a single iterable that will never request any extra followers from the database, unless the code specifically requests more items after looping through the embedded items:

Code Snippet
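The combined effect can be demonstrated without a database. In this sketch, `fake_cursor` is a hypothetical generator standing in for a PyMongo cursor: it records when it actually starts doing work, so you can see that nothing is "queried" until the embedded items have been consumed.

```python
from itertools import chain

queries_run = []


def fake_cursor():
    """Stand-in for a cursor: does no work until iteration starts."""
    queries_run.append("followers query")
    yield from [{"user_name": "@tiffanyhicks"}]


embedded = [{"user_name": "@rduncan"}, {"user_name": "@paynericky"}]

# Calling fake_cursor() only creates the generator; its body has not run yet.
followers = chain(embedded, fake_cursor())

first_two = [next(followers), next(followers)]
print(queries_run)  # [] - only embedded items were used, so no query ran.

third = next(followers)
print(queries_run)  # ['followers query'] - the lookup ran on demand.
```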

The real code is a bit more flexible, because it supports both find and aggregate queries. It distinguishes between them by type: find queries are provided as dicts, and aggregation pipelines are provided as lists.

Code Snippet

# Imports needed at the top of the module:
from collections.abc import Iterable, Mapping
from itertools import chain


# Method of SequenceField:
def __get__(self, ob, cls):
    if self.superset_query is None:
        # Use an empty sequence if there are no extra items.
        # It's still iterable, like a cursor, but immediately exits.
        superset = []
    else:
        # Call the superset_query callable to obtain the generated query:
        query = self.superset_query(ob)

        # If the query is a mapping, it's a find query, otherwise it's an
        # aggregation pipeline.
        if isinstance(query, Mapping):
            superset = ob._db.get_collection(self.superset_collection).find(query)
        elif isinstance(query, Iterable):
            superset = ob._db.get_collection(self.superset_collection).aggregate(
                query
            )
        else:
            raise Exception("Returned query was not a mapping or iterable.")

    try:
        # Return an iterable that first yields all the embedded items, and
        # then yields the items returned by the superset query:
        return chain(
            [self._type(item, ob._db) for item in ob._doc[self.field_name]],
            (self._type(item, ob._db) for item in superset),
        )
    except KeyError as ke:
        raise ValueError(
            f"Attribute {self.name!r} is mapped to missing document property {self.field_name!r}."
        ) from ke

I've added quite a few comments to the code above, so hopefully you can see the relationship between the simplified code above it and the real code here.

Using the SequenceField to declare relationships

Implementing Profile and Follower is now a matter of providing the query (wrapped in a lambda expression) and the collection that should be queried.

Code Snippet

An application that used the above Profile definition could look up the Profile with "user_id" of "4" and then print out the user names of all their followers with some code like this:

Code Snippet
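To show the whole thing working end to end without a running MongoDB instance, here is a self-contained sketch. The `Document` and `Field` classes are simplified stand-ins, the `SequenceField` is condensed from the code above, and `FakeDatabase` and `FakeCollection` are hypothetical in-memory substitutes that only understand the one pipeline used here. None of these are docbridge's real classes.

```python
from collections.abc import Mapping
from itertools import chain


class Document:
    def __init__(self, doc, db):
        self._doc = doc
        self._db = db


class Field:
    def __init__(self, field_name=None):
        self.field_name = field_name

    def __set_name__(self, owner, name):
        if self.field_name is None:
            self.field_name = name

    def __get__(self, ob, cls):
        return ob._doc[self.field_name]


class SequenceField(Field):
    def __init__(self, type, field_name=None,
                 superset_collection=None, superset_query=None):
        super().__init__(field_name)
        self._type = type
        self.superset_collection = superset_collection
        self.superset_query = superset_query

    def __get__(self, ob, cls):
        if self.superset_query is None:
            superset = []
        else:
            query = self.superset_query(ob)
            collection = ob._db.get_collection(self.superset_collection)
            if isinstance(query, Mapping):
                superset = collection.find(query)
            else:
                superset = collection.aggregate(query)
        return chain(
            [self._type(item, ob._db) for item in ob._doc[self.field_name]],
            (self._type(item, ob._db) for item in superset),
        )


class FakeCollection:
    """In-memory stand-in: only handles the match/unwind/replaceRoot pipeline."""

    def __init__(self, buckets):
        self._buckets = buckets

    def aggregate(self, pipeline):
        user_id = pipeline[0]["$match"]["user_id"]
        for bucket in self._buckets:
            if bucket["user_id"] == user_id:
                yield from bucket["followers"]


class FakeDatabase:
    def __init__(self, collections):
        self._collections = collections

    def get_collection(self, name):
        return self._collections[name]


class Follower(Document):
    user_name = Field()


class Profile(Document):
    user_id = Field()
    followers = SequenceField(
        type=Follower,
        superset_collection="followers",
        superset_query=lambda ob: [
            {"$match": {"user_id": ob.user_id}},
            {"$unwind": "$followers"},
            {"$replaceRoot": {"newRoot": "$followers"}},
        ],
    )


db = FakeDatabase({"followers": FakeCollection([
    {"user_id": "4", "followers": [{"user_name": "@tiffanyhicks"}]},
    {"user_id": "5", "followers": [{"user_name": "@someone_else"}]},
])})
profile = Profile(
    {"user_id": "4", "followers": [{"user_name": "@rduncan"}]}, db
)

# Embedded followers come first, then the ones looked up from the buckets.
for follower in profile.followers:
    print(follower.user_name)
```

With a real database, `db` would be a PyMongo Database object, and the profile document would come from a query such as a find_one call on the profiles collection (the collection name is an assumption here).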

See how the extra query is now part of the type's mapping definition and not the code dealing with the data? That's the kind of abstraction I wanted to provide when I started building this experimental library. I have more plans, so stick with me! But before I implement more data abstractions, I first need to implement updates — that's something I'll describe in my next tutorial.

Conclusion

This is now the third tutorial in my Python data abstraction series, and I'll admit that this was the code I envisioned when I first came up with the idea of the docbridge library. It's been super satisfying to get to this point, and because I've been developing the whole thing with test-driven development practices, there's already good code coverage.

If you're looking for more information on aggregation pipelines, you should have a look at Practical MongoDB Aggregations — or now, you can buy an expanded version of the book in paperback.

If you're interested in the abstraction topics and Python code architecture in general, you can buy the Architecture Patterns with Python book, or read it online at CosmicPython.com.

I livestream most weeks, usually at 2 p.m. UTC on Wednesdays. If that sounds interesting, check out the MongoDB YouTube channel. I look forward to seeing you there!
