By André Spiegel, Consulting engineer for MongoDB
As a recently hired engineer at MongoDB, part of my ramping-up training is to create a number of small projects with our software to get a feel for how it works, how it performs, and how to get the most out of it. I decided to try it on Twitter. It’s the age-old question that plagues every Twitter user: who just unfollowed me? Surprising or not, Twitter won’t tell you that. You can see who’s currently following you, and you get notified when somebody new shows up. But when your follower count drops, it takes some investigation to figure out who you just lost.
I’m aware there’s a number of services that will answer that question for you. Well, I wanted to try this myself.
The Idea and Execution
The basic idea is simple: You have to make calls to Twitter’s REST API to retrieve, periodically, the follower lists of the accounts you want to monitor. Find changes in these lists to figure out who started or stopped following the user in question. There are two challenging parts:
- When you talk to Twitter, talk slowly, lest you hit the rate limit.
- This can get big. Accounts can have millions of followers. If the service is nicely done, millions of users might want to use it.
The second requirement makes this a nice fit for MongoDB.
The program, which I called “followt” and wrote in Java, can be found on github. For this article, let me just summarize the overall structure:
The scribe library proved to be a great way to handle Twitter’s OAuth authentication mechanism.
Using GET followers/ids, we can retrieve the numeric ids of 5,000 followers of a given account per minute. For large accounts, we need to retrieve the full list in batches, potentially thousands of batches in a row.
The numeric ids are fine for determining whether an account started or stopped following another. But if we want to display the actual user names, we need to translate those ids to screen names, using GET users/lookup. We can make 180 of these calls per 15 minute window, and up to 100 numeric ids can be translated in each call. In order to make good use of the 180 calls we’re allowed, we have to make sure not to waste them for individual user ids, but to batch as many requests into each of these as we can. The class net.followt.UserDB in the application implements this mechanism, using a BlockingQueue for user ids.
Storing Twitter Data in MongoDB: A Lesson in Schema Design
So how do we store the information in MongoDB? I came up with the following, simple schema:
This document means that the Twitter user with numeric id 12345 has been followed by user 54321 since August 21. The last scan of user 12345, when user 54321 still showed up in the followers list, happened at 6am on August 23rd. At 7:50am that day, user 54321 was no longer following 12345. A document with no “end” field means that the “follower” is still following the “followee”.
This simple schema won over more complex designs, for example the idea to store all of a user’s followers as an array inside a single document. This approach wouldn’t scale to millions of followers, since the size limit for a single document in MongoDB is 16MB. But more importantly, we usually neither need nor want all of a user’s followers in our application’s memory at the same time. This is because a) the memory usage might quickly get out of hand (think millions of users), and b) the algorithms we need to run on these follower lists can indeed work without having it all in memory (more on these algorithms below). Therefore, having it all in a single document would actually be disadvantageous to our application. Two important lessons here:
- Just because you can have complex documents in MongoDB, doesn’t mean you actually need them all the time. Sometimes something very simple and “relational” is just fine.
- Schema design is influenced, more than anything else, by your data access patterns. Make it so that your typical operations are efficient.
As we retrieve a user’s follower list from Twitter, we update the “last” timestamp in the corresponding documents. If we don’t find a document for that followee/follower pair, or all these documents have an “end” field, then that is a new follower and we create a new document for that pair. It can actually take plenty of time until we are through with all of a user’s followers, since we can only retrieve 5,000 of them per minute. After all the followers have been retrieved, the next step is to figure out which users have unfollowed: in other words, which users that we saw in a previous iteration are no longer in the follower list.
Set Intersection and Adaptive Merging
This is a set intersection problem. We have two potentially very large sets (the previous and the current follower lists), and we want to find out which members are only in one of the sets. (Actually, this is the inverse of computing the intersection and hence the same problem.) There are a number of ways we can do this.
The naive approach is to iterate over one of the sets and search each value in the other set. This is also called multisearch. We could do it with an individual database query for each value, but the performance is obviously unacceptable: Although each query is actually quite fast (100µs on my laptop), that adds up to more than 100 seconds for a million followers.
If we store at least the second set in the application’s main memory, multisearch gets about two orders of magnitude faster. But if we need to do this for many follower lists at the same time, the memory consumption may be prohibitive. And we can also do better algorithmically.
If both sets are sorted (something that MongoDB’s indexes give us essentially for free), then we can perform the equivalent of a merge operation: Iterate over both sets at the same time and advance the iterators according to whether the current set elements are the same or different. This is very fast and uses constant memory.
An even better solution is to exploit the “last” timestamp in our documents. Any follower whose “last” timestamp wasn’t updated in the first phase of the scan must have unfollowed. By simply comparing the time stamps to the start time of the scan we instantly get all unfollowers, and we can actually update their “end” time stamp as part of the same find-and-modify operation. This solution, which we could call “mark-and-sweep”, turns out to be the fastest – although it isn’t any better than the previous ones algorithmically, it wins by offloading all the work into the database.
There is another approach based on more recent research which would require a little more implementation work, and we therefore couldn’t try it as part of this small project. It is a variant of the merge algorithm called adaptive merging. It exploits the idea that the two sets are very similar. So rather than comparing the elements one by one, we take our chances and “leap forward” within the two sets and check whether we are still in sync (i.e. the elements from set A and set B at that index are still the same). If they are, we take an even larger leap forward (this is also called “galloping”). However, if the elements we find at the end of the leap are not the same, we do a binary search in the section we just leapt over to find the exact place where the difference occured.
If we assume that we cannot keep both sets in memory, then this algorithm requires some careful buffering of the results as they come in from the database, which was a little beyond the scope of this project. It would however be an interesting exercise by itself, not least because the open source nature of MongoDB would allow us to use and modify buffering internals deep down in the driver code to do exactly what we want.
Performance and Results
The table below shows execution times for the algorithms we did implement for follower lists of 100k, 500k, and 1m entries, and 100 missing followers between the two lists. All times are in seconds, and were measured locally on my laptop.
| algorithm/list size
We let the application run for a couple of days and tracked the development of the @MongoDB account and the accounts of a few other databases and vendors. The results, with the other vendors anonymized, are shown in the charts below. These charts echo the findings of the current database popularity ranking
, which shows MongoDB as the most popular NoSQL database.
But the Twitter numbers add a not-so-obvious twist to it: Although @MongoDB is still a small account as compared to some of the big players, its follower growth rate is unusually high for an account of this size, while the loss rate is just about average. (This can be seen most clearly in the normalized chart of follower gain/loss per day and 1000 followers.)
We’re excited to see those results.