How should a list of followers schema be designed for scalability?

Ajay_Pillay · July 3, 2021, 9:56am

Hello, so I have the following document in a collection for each user.

{
   _id: "userID",
   ... bunch of other data ...,
   follower_list: ["userID1", "userID2", ...],
   following_list: ["userID1", "userID2", ...]
}

Every user can follow or be followed by another user. So within each user’s document, this array of userID’s is stored. I think this is a logical way of representing the data. However, I have a few concerns:

While I feel this is logical, is there a better way to do this? It has to be able to support efficient creation and find operations. This format is very straightforward as whenever a new follow occurs, I just have to push to these arrays, and unfollowing is just removing the element using $pull.
If my array is very large, would removal become less efficient? How does Mongo deal with arrays at a lower level? For example in C++, vectors are contiguous in memory and thus deleting an element at a specific index would mean that previous elements have to be shifted up, and this is an O(n) operation. Is that what happens with $pull?
What happens if my list grows large, such that the user’s document exceeds 16MB because of these 2 arrays? Is this why I would need to implement sharding; horizontal scaling?

Appreciate any help, thank you.

Gabriel_Betancourt · November 25, 2021, 4:53pm

I have same issue. Following this post to see if somebody came with an approach…

Jason_Tulloch · February 4, 2022, 9:36pm

I can try to take a crack at this based on other discussions I have had and what I’ve read. A solution that has worked well for myself is capping the size of the following and follower lists (say 100 user IDs per list). In the event that the arrays exceed 100 user IDs, the remaining IDs are saved and stored in a separate collection called OutlierFollowingList and OutlierFollowerList.

Simple Approach
Taking this a step further, I would consider adding fields named countFollowing, countFollowers, hasOutlierFollowing, and hasOutlierFollowers. Every time the user follows another user or has a new follower, this count will increment by one. When the count exceeds 100 in either list you can toggle the hasOutlier lists from false to true (and vice versa if they go below 101).

This approach should work well as the size of the arrays is small and would never exceed the 16MB document concern. In the off event that your outlier document is about to exceed 16MB you could add further logic to save the data to a new document in the collection.

Taking it a Step Further
What I would also think about is the frontend, specifically for each user you follow and each follower you have, what does the user need to see. For example, when you list your followers, you may want to view their name and their profile picture. By duplicating this data as an object in the array, you will not have to query each name and profile picture per follower in your list. The downside is that if the user’s name or profile picture changes you will need to make updates throughout the database, in multiple locations. But you can also be clever here, specifically, you can save the profile picture as a string that never changes (the string name itself) and if you really need it, can limit the user’s ability to change their name (as this never really happens - if you open some mobile applications you will see what I mean, a lot do not allow this).

The array would look something like this (note that the _id below the user’s ID that matches their user document _id…

followerList = [
   { _id: ObjectID("User's ObjectID here"), name: "string", profilePicture: '/aws.s3bucket.name.jpg' },
   ...
]

Date Field
If you follow this approach, I would also recommend saving the date field to each followerList object (or any data point to reference). By adding the date field, I keep the 100 most recent followers in the followerList while 101+ are in the outlier document. You don’t have to use date, but for my applications it typically has made sense to present the most recent followers first. You also only need to call sort on 100 followers (max), versus the entire count. The date will never need to change either which is one less thing to worry about. Lastly, I would recommend saving the date as a variable, before inserting it into any documents.

const date = new Date()

This way, if you need to save the date you followed a user and the date that user was followed by you, the date is exactly the same down to the millisecond (it will be slightly off otherwise).

Final Considerations

Think about the pages you are presenting the followers and following lists. Normally, how many users will your application need to display. For example, you could display the most recent 20 and have a button to view more followers/following which runs a separate query to get the data in the outlier document.
At what number of followers/followings is the count of users pretty rare? 100, 250, 500, 1000? You can use this information to decide when to move user data to the outlier collection as well.
Size of each user document in general. If the user document has a few fields such a profilePicture, name, birthday, followingList, and followerList you can store a lot more users in the main documents array vs. the outlier document. On the contrary, if you have 20 other arrays that may hold 100s of objects, limiting the number of objects in each array makes more sense to avoid getting close to 16MB. You can easily test this for your particular application with dummy date and see how large they can become as a worst case scenario. When I did this on my own application with ~20 arrays capped at 100 I don’t think I even got to 1MB (but don’t quote me on that).
Triggers - Something I have explored recently are database triggers to move data between my main documents and outlier documents. This approach would allow you to call a more efficient function for the user on the application (you can just exclude the outlier data logic) and then call a trigger if a condition is met (e.g. countFollowers > 100). Basically, what I mean is that you can save the 101th user in the followerList to make the application user’s experience better (i.e., faster) but then immediately call a trigger to move the oldest user to the outlier collection to get the document back down to 100 total users.

Hope this helps, happy to explain further if its needed as well!