Paging with the Bucket Pattern - Part 2

In the previous part of Paging with the Bucket Pattern, we reviewed the powerful bucket pattern and ways of leveraging it for easy and efficient paging. This part will dive deeper into paging and target easy ways of creating buckets.

We left off discussing ways to quickly generate a list of stock trades using a single document with multiple trades in an array. A page displaying 20 trades per page can store 1000 trades in 50 documents, simply fetching the document that corresponds to page being displayed. Displaying the 21st trade involves fetching the 2nd document.

The diligent reader may notice skip appeared again. We already discussed the drawbacks of using skip. But recall that the slowness associated with skipping documents only manifests when the server must find a large number of documents to return the requested limit. Using the bucket pattern reduces the number of documents that exist for a single customer significantly. Skipping significantly fewer documents containing multiple trades will not experience the same performance issues as skipping a large number of single trade documents.

But what if the user wants to see 100 trades per page? Storing only 20 objects per bucket is fairly limiting. The solution is simple. Store ten buckets of 100 objects instead. Nothing says each bucket can't contain more objects than necessary! In fact, store all 1000 documents in a single bucket. That way, the exact same bucket can be used forty times for displays of 25 trades, twenty times for displays of 50 trades, ten times for displays of 100 trades, etc.

The next logical question is, "why not simply store everything in a single document?" There are several reasons. First, having a single massive document takes up a lot of memory. That's memory that must be allocated on the server to fetch the document and on the client when the document is received. Second, large documents are slower going across the network. Third, MongoDB imposes a (quite intentional) 16MB document size cap. Pick a logical document size and stick with it. Use buckets of 100, 500, or 1000. Larger buckets like 10,000 may make sense for some use cases with small objects, but anything larger may be excessive.

Great! We've found a reasonable bucket size of 1000 trades and can display a history from any page quickly and efficiently.

But how is data inserted into these buckets? It's a lot less work than you may think. Exploring an example will make this easy.

Let's say a new stock purchase occurred and we want to record a history of that trade. Here is the insert statement:

db.history.updateOne({ "_id": /^7000000_/, "count": { $lt: 1000 } },
    { 
    	"$push": { 
            "history": {
            	"type": "buy",
            	"ticker": "MDB",
            	"qty": 25,
            	"date": ISODate("2018-11-02T11:43:10")
    		} },
    	"$inc": { "count": 1 },
    	"$setOnInsert": { "_id": "7000000_1541184190" }
    },
    { upsert: true })

There's a lot to unpack in this statement, so let's work through it one piece at a time.

> db.history.updateOne({ "_id": /^7000000_/, "count": { $lt: 1000 } },

The history collection contains each of our buckets. We're using updateOne on the history collection. It's important to stress that we are using an update statement, not an insert statement, and we're only updating one document. This is because one or more buckets may already exist for customerId 7000000. We want to insert a new trade into a single existing bucket. We'll discuss what happens if the customer does not have an existing bucket in just a moment.

The count field insures that a single bucket never exceeds 1000 trades. The first bucket found that contains 999 objects or fewer will be updated. Buckets that contain 1000 or more trades will be ignored because they do not satisfy the query criteria.

"$push": { 
"history": {
        "type": "buy",
        "ticker": "MDB",
        "qty": 25,
        "date": ISODate("2018-11-02T11:43:10")
} }

The $push operator adds an element to the end of an array. Trade history is represented in the history field so appending a new trade to this field is the correct action to take.

"$inc": { "count": 1 }

The count field must be maintained atomically with an accurate trade count. By using the $inc operator, the bucket count field increases by one for every trade that gets inserted.

    "$setOnInsert": { "_id": "7000000_1541184190" } },
{ upsert: true })

Finally, the upsert modifier and $setOnInsert operator play in concert to cover "does not exist" scenarios.

The query criteria will prevent a document from updating under two conditions: no bucket exists for the customerId (the first part of _id); and no existing buckets contain less than 1000 trades. Using upsert creates a new document when the query criteria fails to update an existing document. In other words, the updateOne statement will create a new bucket if no bucket exists to be updated.

The $setOnInsert operator sets _id when a new bucket gets created. Recall that _id is a compound/concatenated string containing the customerId and time of the first trade in the bucket. The application constructs this field as part of the updateOne operation. First, customerId is already known since it's part of the search criteria. Second, the first trade time of a new bucket is always the trade being inserted. Therefore, constructing _id given two knowns is simple. Take the customerId and date from the current trade and concatenate them to construct the $setOnInsert value. The operator (and string) is ignored if no new bucket needs creating.

The final document looks exactly like the one presented above for display. Voila! Paging using a single document and single update statement.

There are limitations with this approach. For example, what happens if you need to remove a trade from one of the buckets? There will be a gap, which may cause the next updateOne statement to update the wrong document. MongoDB version 4.2 avoids this restriction by allowing expressive update instructions (see the recent Coming in MongoDB 4.2 article on the subject). This methodology also dictates sort order. New trades are always appended to the end of sort order (in this case time order). Sorting on a different field requires using aggregation with the $unwind operator. But even with these limitations, a large number of use cases can be solved with this efficient use of the bucket pattern.

We've explored two options in this pair of articles. Both options provide pros and cons, like all engineering challenges. We discussed modifying the query portion of a find statement to "skip" documents, and we discussed using the bucket pattern to pre-page results. Individual use cases will determine which method works best.

Remember, it's OK to limit the user interface for a better user experience. Pick a method based on the user's needs and design appropriately. Your options are many and with either approach, performance will be excellent and time to release quick-and-easy!