Efficient Structure for Social Media Feeds (fan-out on write)?

Mike_Scornavacca · January 14, 2022, 5:35am

Hello everyone,

I am currently developing a social media app, and I’m stuck on finding an efficient way to structure individual users’ social media feeds, sorted by most recently posted. Here are the relevant structures (both of which reference a users collection that isn’t shown here).

followers

{
    _id: <ObjectID>
    user_id: <ObjectID>
    following_id: <ObjectID>
    timestamp: <datetime>
    status: <string>
}

posts

{
    _id: <ObjectID>
    creator_id: <ObjectID>
    content: <string (to S3 or something)>
    ...
}

My current solution (which I am aware is suboptimal) is to run an aggregation pipeline that consists of two stages: for each post, $lookup all the followers of the creator, and then $match the relevant relationships for the viewing user. This seems very inefficient, as I essentially have to do a $lookup on every post.

I’ve read about “fan-out on write” solutions, where each user has a “timeline” of sorts, and when someone they follow makes a post, it gets pushed onto their timeline. The timeline would be capped so that it does not exceed the document size limit. This seems like a good possibility, but I’m very confused about the logistics:

If a user scrolls through their entire timeline, am I supposed to run my inefficient pipeline to populate it with new posts? Should the size of the timeline expand to support users who scroll very far down on their feed?
If a user decides to follow someone new, and they have recent posts, should I be taking their posts and carefully inserting them into the user’s timeline such that it remains in chronological order?

The data is structured in such a way that users with millions of followers would still maintain efficiency, and it is really easy to query a user’s followers/following. But I’m still not sure: is there a completely different way I should be structuring this data?

This seems like a problem that I’m sure many other people have run into, but I’m struggling to find answers to some of these questions. Any advice would be very much appreciated.

Jason_Tulloch · February 28, 2022, 5:50pm

@Mike_Scornavacca First, thanks for sharing your proposed solution. Despite appearing to be a great option for building a social media style feed with a non-relational database, there does not seem to be much information available about this approach. I would be curious to hear what someone at MongoDB thinks, but here are my thoughts.

Answering your Two Specific Questions

If a user scrolls through their entire timeline, am I supposed to run my inefficient pipeline to populate it with new posts? Should the size of the timeline expand to support users who scroll very far down on their feed?

Can you elaborate on your use case here? Depending on need, the easiest solution is to just have the feed end. I am not sure if they do this anymore, but I know Facebook did exactly that for a number of years: at some point there was just a message at the bottom of the feed that said “No more content available.” If you are trying to populate the feed with new posts, do new posts exist? Why weren’t they in the feed in the first place? Long story short, I think this depends on your application’s needs. If each document in the Timeline collection is unique to a user, I would imagine that each post inserted there would only take up a few KB of space, so you can easily store hundreds or thousands of posts in a user’s timeline before getting close to the 16MB document size cap (not suggesting you need to fill each timeline to 16MB).
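To put rough numbers on that, here is a small back-of-the-envelope calculation. The per-entry sizes are assumptions for illustration only (they are not measured from a real schema); the 16MB figure is the document size cap mentioned above.

```python
# 16MB document cap, expressed in bytes.
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024


def max_timeline_entries(entry_bytes: int) -> int:
    """Upper bound on entries per timeline document, ignoring document overhead."""
    return MAX_DOCUMENT_BYTES // entry_bytes


# Assumed sizes: ~2 KB if post content is embedded in the timeline entry,
# ~128 bytes if the entry only holds ids and a timestamp.
embedded_entries = max_timeline_entries(2 * 1024)
reference_entries = max_timeline_entries(128)

print(embedded_entries, reference_entries)
```

Either way, a cap of a few hundred or a few thousand entries sits far below the limit, so the cap can be chosen for product reasons rather than storage reasons.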

If a user decides to follow someone new, and they have recent posts, should I be taking their posts and carefully inserting them into the user’s timeline such that it remains in chronological order?

I think there are a number of solutions here, depending on your goals. The option I would prefer is to just sort the posts whenever you read the user’s timeline, rather than inserting posts in the correct order every time there is a new follower, multiple posts are made at the same time, etc.
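A minimal in-memory sketch of that idea, with plain Python dicts standing in for timeline documents. The field names (`posts`, `post_id`, `created_at`) and both helper functions are hypothetical, and `created_at` is assumed to be a Unix timestamp.

```python
def backfill_on_follow(timeline, recent_posts):
    """Append a newly followed user's recent posts in whatever order they arrive."""
    known_ids = {entry["post_id"] for entry in timeline["posts"]}
    for post in recent_posts:
        if post["post_id"] not in known_ids:  # skip entries already on the timeline
            timeline["posts"].append(dict(post))
            known_ids.add(post["post_id"])


def read_timeline(timeline):
    """Return entries newest first; ordering is decided at read time, not on insert."""
    return sorted(timeline["posts"], key=lambda e: e["created_at"], reverse=True)


timeline = {
    "_id": "viewer",
    "posts": [
        {"post_id": "p3", "creator_id": "alice", "created_at": 300},
        {"post_id": "p1", "creator_id": "alice", "created_at": 100},
    ],
}

# The viewer follows bob, who posted between alice's two posts.
backfill_on_follow(timeline, [{"post_id": "p2", "creator_id": "bob", "created_at": 200}])
feed_ids = [entry["post_id"] for entry in read_timeline(timeline)]
print(feed_ids)
```

Because the sort happens on read, the backfill step never has to find the right insertion position.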

Separate Collections (Joining with Lookup)
I am confident you are spot on here: using $lookup to join the Posts and Followers collections will not work effectively as the number of documents in each collection grows. Many people have reported performance problems with this pattern online. Although it looks easy (and is easy) to implement, it is not the right solution here, because the $lookup would run on every feed load. I can imagine scenarios where the aggregation pipeline takes several seconds (or even minutes) while the user just watches a loader spin, which is obviously not ideal for a social media application.

Fan-Out on Write
This approach should work well and, after some research, appears to be what Twitter does. I really like it because loading the feed only requires a simple get request: you can set each Timeline _id to match the user’s id (indexed by default) and fetch the user’s timeline document directly. Loading the feed would be very quick.
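Here is a rough sketch of the write path, using an in-memory dict keyed by user id in place of the Timeline collection. The names (`fan_out_post`, `posts`, `created_at`) and the tiny cap are assumptions for the demo, not a tested design. The second function only builds the update document I would expect to send to MongoDB, using the `$push` modifiers `$each`, `$sort` and `$slice`; it does not talk to a database.

```python
TIMELINE_CAP = 3  # deliberately tiny for the demo; a real cap would be much larger


def fan_out_post(timelines, follower_ids, entry, cap=TIMELINE_CAP):
    """Push one timeline entry onto every follower's timeline, newest first, capped."""
    for follower_id in follower_ids:
        # The timeline _id matches the user id, so reading a feed is a single key lookup.
        timeline = timelines.setdefault(follower_id, {"_id": follower_id, "posts": []})
        timeline["posts"].append(dict(entry))
        timeline["posts"].sort(key=lambda e: e["created_at"], reverse=True)
        del timeline["posts"][cap:]  # drop everything beyond the cap (the oldest entries)


def fanout_update_spec(entry, cap=TIMELINE_CAP):
    """Update document expressing the same push, sort and cap with $push modifiers."""
    return {
        "$push": {
            "posts": {"$each": [entry], "$sort": {"created_at": -1}, "$slice": cap}
        }
    }


timelines = {}
followers = ["u1", "u2"]
for n in range(1, 5):  # four posts with increasing timestamps
    fan_out_post(timelines, followers, {"post_id": f"p{n}", "created_at": n})

u1_feed = [e["post_id"] for e in timelines["u1"]["posts"]]
print(u1_feed)
```

With a real collection, the update document could be applied to all followers at once with an `update_many` call filtered on `{"_id": {"$in": follower_ids}}`, ideally from a background job rather than the request that created the post.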

Things to consider with the fan-out on write approach:

Consider what data is duplicated between the Posts and Timeline collections. Even if a post is recorded to just a few user feeds, any change to the original Post document (in the Posts collection) would require updating the copy of that post in each Timeline document. I would be thoughtful about which fields go into the Timeline documents, and which of those fields can change (if any), to avoid headaches here.
A post could be added to one user’s timeline at a slightly different time than to another’s. For almost all social media applications this is okay, since a post reaching someone else’s feed a few seconds earlier does not cause problems and normally goes unnoticed.
Using a trigger helps. If a user creates a post and it needs to be inserted into a number of timelines, I would push that work away from the client so they can continue to use the application.
Be mindful of deleting posts. Similar to my first bullet, consider whether or not a user can delete a post. If yes, you will need to remove the post from all timelines.
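To illustrate the first and last bullets together, here is a small sketch where timeline entries carry only reference fields, so edits to a post’s content never touch the timelines, and a delete just removes the entry everywhere. It uses in-memory dicts, and all names are hypothetical, including the `created_at` field, which is not in the posts schema shown above.

```python
def make_timeline_entry(post):
    """Copy only fields that never change, so post edits need no timeline updates."""
    return {
        "post_id": post["_id"],
        "creator_id": post["creator_id"],
        "created_at": post["created_at"],
    }


def remove_post_everywhere(timelines, post_id):
    """Remove a deleted post from every timeline; returns how many timelines changed."""
    touched = 0
    for timeline in timelines.values():
        kept = [e for e in timeline["posts"] if e["post_id"] != post_id]
        if len(kept) != len(timeline["posts"]):
            timeline["posts"] = kept
            touched += 1
    return touched


post_a = {"_id": "p1", "creator_id": "alice", "content": "hello", "created_at": 1}
post_b = {"_id": "p2", "creator_id": "alice", "content": "again", "created_at": 2}
timelines = {
    "u1": {"_id": "u1", "posts": [make_timeline_entry(post_b), make_timeline_entry(post_a)]},
    "u2": {"_id": "u2", "posts": [make_timeline_entry(post_b)]},
}

changed = remove_post_everywhere(timelines, "p1")
print(changed, [e["post_id"] for e in timelines["u1"]["posts"]])
```

In MongoDB, I would expect the delete to map to a single `update_many` with a `$pull` on the `posts` array matching `{"post_id": post_id}`, again assuming this entry shape.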

Out of curiosity, have you started using the fan-out on write approach? How are you handling inserting posts to the timelines of all the user’s followers?

Asya_Kamsky · March 1, 2022, 4:55pm

I’m sorry I missed this question the first time around. There is a reference implementation for a social platform called Socialite that we wrote back in 2014, and all the principles it demonstrates are still applicable. Take a look at its documentation here: GitHub - mongodb-labs/socialite: Social Data Reference Architecture (this repository is NOT a supported MongoDB product). There are also a few recordings talking about the various trade-offs and benchmarking, though I’m not sure I was able to find all of them (the original presentation was in three parts): Industries | MongoDB. I’m still looking for parts two (how to store the user graph) and three (how to cache the timeline efficiently).

Asya

Verten_Saltan · September 20, 2023, 3:03am

When users scroll through their entire timeline, you can implement a pagination system. Load a batch of posts as they scroll, and fetch more when they reach the end. This way, you avoid reloading the entire timeline, which can be resource-intensive.
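A minimal sketch of that kind of pagination, done cursor-style: the client remembers the last entry it saw and asks for the next batch older than it. It runs over an in-memory list, and the names (`fetch_page`, `post_id`, `created_at`) are hypothetical.

```python
def fetch_page(entries, page_size, before=None):
    """Return (page, next_cursor) with entries newest first.

    `before` is the (created_at, post_id) pair of the last entry already shown;
    only strictly older entries are returned.
    """
    ordered = sorted(entries, key=lambda e: (e["created_at"], e["post_id"]), reverse=True)
    if before is not None:
        ordered = [e for e in ordered if (e["created_at"], e["post_id"]) < before]
    page = ordered[:page_size]
    # A short page means the end of the feed was reached, so there is no next cursor.
    if len(page) == page_size:
        next_cursor = (page[-1]["created_at"], page[-1]["post_id"])
    else:
        next_cursor = None
    return page, next_cursor


entries = [{"post_id": f"p{n}", "created_at": n} for n in range(1, 6)]

page1, cursor1 = fetch_page(entries, page_size=2)
page2, cursor2 = fetch_page(entries, page_size=2, before=cursor1)
page3, cursor3 = fetch_page(entries, page_size=2, before=cursor2)
print([e["post_id"] for e in page1], [e["post_id"] for e in page2], [e["post_id"] for e in page3])
```

Using the last seen entry as the cursor, rather than a numeric offset, keeps pages stable when new posts arrive at the top of the feed while the user is scrolling.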