Complex Schema design, (flexible-) sync compatibility

This will be a rather long and broad question; if this is not the right place to ask, please point me to a better one. I’m not sure whether flexible sync, or sync in general, is the right option for our application:

Requirements (simplified):
We are building a social media application with complex requirements and a complex data structure. The core functionality is posts, which are displayed to users dynamically. Users periodically post location data, based on which some posts from a Post table are displayed to them. In addition, some of the posts are displayed independently of location, but only to some of the users.
Each user will have an “infinite” feed of these posts (which is already not quite ideal for a database approach but rather caching I suppose?)
Users can also add each other, having friends lists and so on. There are a lot more functionalities of the app, but already with the things listed, I have my doubts.

Thoughts and doubts
After working with partition-based sync for a few weeks and coming up with our partition strategy, we have run into a few issues. First of all, it all feels far from ideal, since we have a ton of duplicated data that is generated and deleted as the user interacts with the app (sends posts, adds users) or moves (gets some of the posts depending on location). Posts get duplicated when a user is supposed to see them, and users get duplicated when another user adds them. It all works, but it is rather inefficient, slow and tedious, since all the data has to be kept up to date and has to be “distributed” to the users. It is also not quite clear to me how the infinite feed should be implemented on the client side (Swift SDK). With thousands of users and even more posts, duplicating data and keeping the copies consistent will be very painful, or at least inefficient.
I have now started reading into flexible-sync, but since the examples are very simple I have no idea whether it would work well for our use-case.

Question(s)

  1. From the (probably too little) information provided, could flexible sync be a good solution to our use-case?
  2. Is flexible sync really production ready? And for a complex case such as ours? (I saw it is no longer in preview?)
  3. (Or would partition based still be a better option, maybe without all the duplication? If not, any other thoughts / options?)

If anything is unclear I can always elaborate or show our models or data structure this far.

Hi David,

To answer your questions:

  1. I believe flexible sync is a good fit for your design. Partition Sync is powerful if it fits your model (one device, one store type of syncing), but for more complex querying, permissioning, lower long-term storage cost, and better stability, I think that Flexible Sync will be better for you.
  2. Yes, we believe it is production ready and have been excited to see people use it.
  3. Managing duplicate objects is obviously not ideal, so like I said above, I think it is worth testing out how flexible sync works for you.

One word of caution (since I assume your model is very link-heavy) is that in Flexible sync you query on each collection, so if you query on collection A and that has links to collection B, then you will not get the objects/documents in collection B unless you are querying on those too. There are ways to design your schema to avoid this problem, but your explanation made me think that you may run into this issue. We are working on this issue and trying to automatically pull in all linking objects, but we still would suggest designing your schema such that you can do it naturally.

Best,
Tyler

Thanks for the reply, that’s great news. I saw that in the Atlas UI it no longer says “Preview” for flexible sync, but I can’t see any new article with an update on the status of flexible sync. When did it go out of preview?

Sure, that makes sense. You are right about the linking. For example, we have a field “creator” on each duplicate of a post, linking to a user object… with partition sync we had to duplicate the user so that it had the same partition as the duplicate post, otherwise the user could obviously not be accessed. So with flexible sync we would have just one instance of the post and could link to a single user, since its access is no longer managed by partition. But when retrieving the posts we would not get the user object because of how flexible sync works, right?

Could you elaborate on how that can be done or where I can read more about this? And what do you mean by:

My first intuition would be to just open a new subscription for the user, so once we have a post, we open a subscription using the id of the user in the “creator” field. That seems quite inconvenient… But I guess that would also not be “natural” and doesn’t involve adjusting the schema, so I believe it’s not what you meant.

Let’s say in my UI I want to display a list of posts, and in each one the name of the creator, so a property of an object linked from that post…

Also, could you give a timeline for when the “automatically pull in all linking objects” feature could be ready?

cc @Ian_Ward @Tyler_Kaye

Hi,

When did it go out of preview?

We announced the general availability of Flexible Sync at MongoDB World. See here: https://webassets.mongodb.com/MongoDB-World-2022-Datasheet.pdf?_ga=2.91648025.241869671.1654536218-105395967.1654279900

As for your second point, I am not totally sure what your partitioning schema was, but the general gist is that you can open a “subscription” on the Posts table and that will send all posts (and all embedded objects) that match your query, but it will not send the linked objects (it will send the links, they will just be implicitly null since the client-side realm doesn’t have the underlying “User” objects). Therefore, you would just want to also add a subscription on the “Users” collection.
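To make the behaviour concrete, here is a toy in-memory model of it in Python. This is only a sketch of the idea, not the Realm SDK: the `server` data, `sync` and `creator_of` are all made-up names for illustration, and a "subscription" is modelled as nothing more than a filter on one collection.

```python
# Toy model of per-collection subscriptions. NOT the Realm SDK; all names are hypothetical.
server = {
    "Post": [
        {"_id": "p1", "message": "hi", "creator_id": "u1"},
        {"_id": "p2", "message": "hello", "creator_id": "u2"},
    ],
    "User": [
        {"_id": "u1", "user_name": "Tyler"},
        {"_id": "u2", "user_name": "David"},
    ],
}


def sync(subscriptions):
    """Build the local store: each collection only holds objects matching its own query."""
    local = {name: {} for name in server}
    for collection, predicate in subscriptions:
        for doc in server[collection]:
            if predicate(doc):
                local[collection][doc["_id"]] = doc
    return local


def creator_of(local, post):
    """Follow the link; the result is None when the target object was not synced."""
    return local["User"].get(post["creator_id"])


# Subscribing to posts only: the link value arrives, but its target does not.
posts_only = sync([("Post", lambda d: True)])
unresolved = creator_of(posts_only, posts_only["Post"]["p1"])

# Adding a subscription on users (here: just the creators we need) resolves the link.
creator_ids = {p["creator_id"] for p in posts_only["Post"].values()}
with_users = sync([
    ("Post", lambda d: True),
    ("User", lambda d: d["_id"] in creator_ids),
])
resolved = creator_of(with_users, with_users["Post"]["p1"])
```

In this model `unresolved` is `None` while `resolved` is the user document, which mirrors the "links are implicitly null" behaviour described above.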

I think the ideal way to do this is to have some data duplication (this is a MongoDB concept in general). You do not actually want the entire User object for all of these posts (in fact, it’s probably a security risk to do so), so you can instead model your schema like this:

{
    _id: ObjectId,
    message: "hi",
    creator: {
        user_name: "Tyler",
        user_id: ObjectId("5dfa7b09d5ec134c607cc57e"),
    },
}

This way the document has all of the information you want to show but you still have the linking information if you want to “navigate” to the user or if you want to actually download the whole user document. I think in this case an interesting question is “do you really want to download the user object (which might be big) for all posts that a user sees?”

This is a pretty normal situation to find yourself in, and MongoDB normally suggests denormalizing the data for this; see “6 Rules of Thumb for MongoDB Schema Design: Part 1” on the MongoDB Blog.

That way you keep all of the information relevant to the “post” in a single document and retain the ability to link to other documents (user), while still embedding some of the more relevant fields from the user document within the post document.
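As a rough sketch of that denormalization, assuming plain Python dictionaries stand in for the documents: `make_post` and the `email` field are hypothetical, and only the `message`, `creator.user_name` and `creator.user_id` fields come from the schema above.

```python
def make_post(message, user):
    """Build a post that embeds only the creator fields the feed needs to display."""
    return {
        "message": message,
        "creator": {
            "user_name": user["user_name"],
            # Kept so the full user document can still be looked up on demand.
            "user_id": user["_id"],
        },
    }


# Hypothetical full user document; the email is an example of a field
# that should not be copied into every post.
user = {
    "_id": "5dfa7b09d5ec134c607cc57e",
    "user_name": "Tyler",
    "email": "tyler@example.com",
}
post = make_post("hi", user)
```

The post carries the display name and the id, and nothing else from the user, so the feed can render without a second lookup and sensitive user fields stay out of other clients' data.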

Thank you so much for the detailed response, that’s very helpful.

It makes sense to duplicate some of the data into the posts, though it’s not ideal having to keep that data up to date when it gets modified. But by using triggers I think that should be no problem (or at least it’s still better than the duplication in the partition-based approach). For the subscription on the “User” I suppose one would have to use the links retrieved from the Posts as the query? Or query the “User” table again with other parameters?
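What I have in mind for the trigger is roughly the following, sketched here in plain Python over an in-memory list rather than as a real trigger function. `propagate_user_name` is a hypothetical helper; in an actual database trigger I assume this would be a single multi-document update filtered on `creator.user_id`.

```python
def propagate_user_name(posts, user_id, new_name):
    """Rewrite the embedded creator name on every post by this user; return how many changed."""
    updated = 0
    for post in posts:
        creator = post["creator"]
        if creator["user_id"] == user_id and creator["user_name"] != new_name:
            creator["user_name"] = new_name
            updated += 1
    return updated


# Hypothetical data: two posts by user "u1", one by user "u2".
posts = [
    {"message": "hi", "creator": {"user_name": "Tyler", "user_id": "u1"}},
    {"message": "again", "creator": {"user_name": "Tyler", "user_id": "u1"}},
    {"message": "other", "creator": {"user_name": "David", "user_id": "u2"}},
]
changed = propagate_user_name(posts, "u1", "Tyler K.")
```

Running it a second time with the same name changes nothing, so the update is safe to retry if the trigger fires more than once.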

One last question: in the linked article about flexible sync from the datasheet you sent, there is this paragraph:
"The new Flexible Sync feature in Atlas Device Sync elegantly solves geo-partitioning issues by allowing us to only synchronize nearby spatialized content relevant to each user… " which sounds exactly like something we would need as well for some parts of our app.
We used the $geoNear aggregation stage with the partition-based solution, and I was planning on doing something similar (like using $geoNear in a trigger based on movement to enter userIds into an array “visibleTo” on each post). But it sounds like there should be a much more elegant solution using flexible sync; do you know how that could be done?

I should also add that the documents that are synchronised because they are nearby are displayed with their distance from the user, so a simple query would not suffice…
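One option I can think of for the distance part is to compute it on the client after the nearby posts have been synced, for example with the haversine formula. A minimal Python sketch, assuming each post stores its coordinates in hypothetical `lat` and `lon` fields in degrees:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def with_distance(posts, user_lat, user_lon):
    """Return copies of the posts with a distance_km field, nearest first."""
    annotated = [
        {**p, "distance_km": haversine_km(user_lat, user_lon, p["lat"], p["lon"])}
        for p in posts
    ]
    return sorted(annotated, key=lambda p: p["distance_km"])


# Hypothetical synced posts and a user standing at (0, 0).
nearby_posts = [
    {"_id": "p1", "message": "far", "lat": 0.0, "lon": 1.0},
    {"_id": "p2", "message": "near", "lat": 0.0, "lon": 0.5},
]
feed = with_distance(nearby_posts, 0.0, 0.0)
```

That would keep the synced documents free of per-user distance values, at the cost of recomputing the distances whenever the user moves.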