Hello everyone,
I am currently developing a social media app, and I’m stuck on finding an efficient way to structure each user’s feed, sorted by most recently posted. Here are the relevant collections (both of which reference a users collection that isn’t shown here).
followers
{
_id: <ObjectID>
user_id: <ObjectID>
following_id: <ObjectID>
timestamp: <datetime>
status: <string>
}
posts
{
_id: <ObjectID>
creator_id: <ObjectID>
content: <string (to S3 or something)>
...
}
My current solution (which I am aware is suboptimal) is to run an aggregation pipeline that consists of two stages: for each post, $lookup all the followers of the creator, and then $match the relationships that involve the viewing user. This seems very inefficient, as I essentially have to do a $lookup on every post.
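For concreteness, here is a rough sketch of that pipeline, built as a plain object so it can be read without a database connection. The helper name buildCurrentFeedPipeline and the "creator_followers" field are made up for illustration. The $sort, $limit, and $project stages are additions beyond the two stages described above, and sorting on _id is only an assumption that ObjectID order is close enough to "most recently posted", since the posts schema above shows no timestamp. The status field is ignored here.

```javascript
// Sketch only: the helper name and the "creator_followers" field are hypothetical.
function buildCurrentFeedPipeline(viewerId, limit = 20) {
  return [
    // Stage 1: for every post, pull in all follower documents of its creator.
    {
      $lookup: {
        from: "followers",
        localField: "creator_id",
        foreignField: "following_id",
        as: "creator_followers",
      },
    },
    // Stage 2: keep only posts whose creator is followed by the viewing user.
    { $match: { "creator_followers.user_id": viewerId } },
    // Assumption: ObjectIDs embed a creation time, so _id order approximates post time.
    { $sort: { _id: -1 } },
    { $limit: limit },
    // Drop the joined array before returning the posts.
    { $project: { creator_followers: 0 } },
  ];
}
```

It would be run as something like db.posts.aggregate(buildCurrentFeedPipeline(viewerId)). Because the pipeline starts from posts, the $lookup has to run for every post before anything can be filtered out, which is the cost I am worried about.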
I’ve read about “fan-out on write” solutions, where each user has a “timeline” of sorts, and when someone they follow makes a post, it gets pushed onto their timeline. The timeline would be capped so as not to exceed the document size limit. This seems like a good possibility, but I’m very confused about the logistics:
- If a user scrolls through their entire timeline, am I supposed to run my inefficient pipeline to populate it with more posts? Should the timeline grow to support users who scroll very far down their feed?
- If a user follows someone new who has recent posts, should I take those posts and carefully insert them into the user’s timeline so that it remains in chronological order?
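To make these questions concrete, this is roughly how I picture the write path and the follow case. It is a sketch under several assumptions, not something I have running: the "timelines" collection, its "entries" array, the cap of 500, and all helper names are hypothetical, and ordering by post_id assumes ObjectID order approximates posting time. It relies on the $each, $sort, and $slice modifiers of $push to merge, reorder, and trim in one update.

```javascript
// Hypothetical "timelines" collection (an assumption, not part of the schema above):
//   { user_id: <ObjectID>, entries: [ { post_id: <ObjectID>, creator_id: <ObjectID> } ] }
const TIMELINE_CAP = 500; // arbitrary cap, chosen only for illustration

// Update document that merges entries into a timeline, keeps it sorted
// newest-first by post_id, and trims it to the cap in a single operation.
function buildTimelinePush(entries, cap = TIMELINE_CAP) {
  return {
    $push: {
      entries: {
        $each: entries,
        $sort: { post_id: -1 }, // assumes ObjectID order approximates post time
        $slice: cap, // a positive $slice keeps the first `cap` elements
      },
    },
  };
}

// Fan-out on write: one filter/update pair covering every follower's timeline.
function buildFanOut(post, followerIds) {
  return {
    filter: { user_id: { $in: followerIds } },
    update: buildTimelinePush([
      { post_id: post._id, creator_id: post.creator_id },
    ]),
  };
}

// Backfill on follow: merge the new followee's recent posts into one timeline.
function buildFollowBackfill(viewerId, recentPosts) {
  return {
    filter: { user_id: viewerId },
    update: buildTimelinePush(
      recentPosts.map((p) => ({ post_id: p._id, creator_id: p.creator_id }))
    ),
  };
}
```

The results would be passed to something like db.timelines.updateMany(fanOut.filter, fanOut.update) and db.timelines.updateOne(backfill.filter, backfill.update). If this is a reasonable direction, the $sort modifier would take care of chronological order when backfilling, but a creator with a very large follower list would presumably need the $in list split into batches, and the sketch still leaves open what to do once a user scrolls past the cap.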
The data is structured so that queries would stay efficient even for users with millions of followers, and it is really easy to query a user’s followers/following. But I’m still not sure: is there a completely different way I should be structuring this data?
I’m sure many other people have run into this problem, but I’m struggling to find answers to some of these questions. Any advice would be very much appreciated.