Private and public documents

Green · May 5, 2021, 3:48pm

Hello everyone, Im designing an app where there are public and private documents. The public documents are generated by the owner of the app, while the private data is generated by the user. These documents reside in the same collection with the following structure

{
   _id,
   isPublic,
   userId,
   other fields ... such as name
}

I have some doubts about the queries against such collection. My requirements are to show to the user both the public and his private documents, without having to select public, private or the combination of the two. So, lets say that I have a query that matches three documents on the name, one for public, one private for the user that issued the query and another one of another user. At this point I have to return only the first two documents. My query will then look like: "give me all document matching the [name] where userId = [userId] OR isPublic=true.
I was wondering what would be the best way to set an index to support such query. In general, the where clause should always be present, in each and every query.

Thank you!

Pavel_Duchovny · May 6, 2021, 7:53am

Hi @Green,

Nice to hear from you again.

This exact problem is what Realm and Realm Rules are coming to solve where you can define your sync and read/write rules. However, to leverage this you need to use MongoDB Atlas and Ream applications which I very much encourage you.

https://www.mongodb.com/how-to/realm-partitioning-strategies/

However, if you wish to use MongoDB you should make sure that the queries add the correct filtering criteria.

What I think you can do is to use a $unionWith or 2 queries/aggregations to get the data. You will index the {userId 1 , isPublic : 1} of course and isPublic separately (or even hold public data in antoher collection).

Instead of making one query with an OR maybe run 2 or a union of:

The user private data filtering only on user and isPublic=false.
The public data.

If you wish the result set to come back in a natural or _id order you can add a sort stage in the end (adding the sorting field _id to the index last part)

Let me know what you think,

Thanks
Pavel

Green · May 6, 2021, 11:08am

Hi @Pavel_Duchovny, thank you again for your nice answers! I would like to go for atlas, but Im not familiar with cloud services and Im a little bit scared of the costs. I like the solution with $unionWith, but can this be used to union the same collection? Is there a way to union the result of two queries/aggregations on the db? Also, if I have to query a single document by id, I should still do the union right?
Another question, this index is the main index I use to select private and public data. In addition to that, there would be queries where other fields are specified (name, tags, category, etc…). In that case I suppose I need to create a single index on the field I want to query and then mongo will combine the result of the two index scans?

Thank you

Pavel_Duchovny · May 7, 2021, 5:19am

Hi @Green,

First Atlas provide a nice free 4 ever tier to experience this involves realm as well, you should try it.

You can union same collection data with different pipeline filters.

Well the union is for public and private data so even if one is filtered and public is not a union needs to be used to bring the additional data set.

Now if you just query by id the additional fields could be fetched from the pointed document by the index. But if you need to filter by more complex or other criteria you should build additional indexes.

We have a notion of covered queries in MongoDB where if index also have project fields it can avoid accessing the doc but if you need to fetch arbitrary or large amount of fields maintain a large index for that is unadvaised…

MongoDB can use index intersection in $or but it considered a non selective pattern and should be avoided if possible…

Thanks
Pavel

Green · May 7, 2021, 10:09am

Hi @Pavel_Duchovny thank you again! Well things are a little bit more complicated. First, I have languages for public documents that comes in. Public documents have a uniqueId and a language, which makes them uniquely identifiable (I had to add a uniqueId because I need to reference these documents somewhere else and if the user change the language, I should be able to still have a valid reference). In general, I provide documents with uniqueId and language (and let mongo generate the unique _id), while user provide documents with _id generated by mongo and language = their chosen language (not actually used). So in general my queries are:

Search:

Find all public documents with the given language or private documents matching a category, a tag, a name or gte and lte conditions on 4 other fields. The user can specify either zero or all matching criteria.
Given an idd, a userId, and a language give me the details of that document. At this point I dont know if the document is public or private, so I will say: give me the document with _id = given id and userId = given userId OR uniqueId = given id and language = given language.

It looks like I need a separate collection for public documents, but then how do I query these two collection with sort and pagination?

Thank you

Green

Pavel_Duchovny · May 9, 2021, 7:40am

Hi @Green,

So it sounds like the first section can be designed of 2 queries or one unionWith :

db.documents.aggregate( [{$match : { "language " : "en", "isPublic" : true } }, { $unionWith: { coll: "documents", pipeline: [ {$match : { "userId" : "xxx", "name" :  ... } } ]} },
{$sort  : {"_id" : 1}}, {$limit : ...}])

Now you can do $sort and $limit and add the _id of the last page to the next fetch match of each stage…

The second section sounds like possible union:

db.documents.aggregate( [{$match : { "_id " : "...", "userId" : ... } }, { $unionWith: { coll: "documents", pipeline: [ {$match : { "_id" : "xxx", "langue" :  ... } } ]} },
{$sort  : {"_id" : 1}}, {$limit : ...}])

Same foe second query.

The indexes here should be:

{ languege : 1, isPublic : 1, _id}, {userId : 1, _id : 1}
{ languege : 1, _id : 1}

Let me know if that make sense…

Thanks
Pavel

Green · May 10, 2021, 6:46pm

Hi @Pavel_Duchovny, really thank you this makes sense! Unfortunately, I have introduce a uniqueId instead of the _id, so I have to create the index like

{language : 1, uniqueId : 1}

Thank you! Just another question, I see there are collation that I can use, but this still implies that I need to use two different documents for the languages (unfortunately embedding is not a choice as the documents could get too big)? I’m asking because I hate the fact that I had to introduce this uniqueId, but I still need a way to uniquely identify the documents also by a general id and the language to select the correct document.

Thank you!

Pavel_Duchovny · May 11, 2021, 5:16am

How many different languages can be per document?

Not sure what you mean by two documents? Collation can be either on a collection or query/operation … If collection level than its default cross collection.

Theoretically you can have a collection per language with language prefix…

Can you show me an example?

Using your own id is also good, but _id is there by default for every doc…

Thanks,
Pavel

Green · May 11, 2021, 8:12am

Not so many languages, for the moment only two. The problem is that these documents refers to other documents, of course with the same language. In general I have public documents with uniqueId and language and then private documents with only the id. At the moment, all my documents are in the same collection and I can distinguish them using either the:

id or userId (for private find one queries → {"_id" : <id>} and private find all queries {userId : <userId>})
id or userId - language (for public find one queries → {"_id" : <id>} and public find all queries → {"userId:" "publicUserId", "language" : "en"})

I was thinking to embed the languages in the same document as subdocuments, and it works for two types of documents, maybe 3, but then I have other documents where this does not work given the complexity of the document itself. Its not just a single field like name description etc that changes, but these documents have a nested structure and inside that structure I reference other documents and it is exactly here where I need to have the changes for language. Example:

{
    _id,
    name,
    description,
    tags,
    language,
    days: [
        {
            day: 1,
            data: [
                {
                    nb: 0,
                    subData: [
                        {
                            nb: 0,
                            refDoc1: reference to doc with same language,
                            name: name of the refDoc1 (as extended reference)
                        },
                        {
                            nb: 1,
                            refDoc2: reference to doc with same language,
                            name: name of the refDoc2 (as extended reference)
                        }]
               }]
        }]
}

Using your own id is also good, but _id is there by default for every doc…

I would have prefer to use the _id, but then since I have everything in one document I have to use the uniqueId to be able to reference back the documents. For example, I have Document with uniqueId : 1. I have 2 Documents for that uniqueId, one for language “en” and the other for “de”. There are documents that reference one of these two documents, depending on the language, so I will keep the uniqueId and when I have to show the details of the references documents I will do → {uniqueId: 1, language: en}.

Not sure what you mean by two documents? Collation can be either on a collection or query/operation … If collection level than its default cross collection.

What I mean is, how the collation selects the correct documents? Is it creating a separate collection on its own? Or you have to specify the language on the documents and then the collation selects the correct documents by index?

Thank you

Green

Pavel_Duchovny · May 11, 2021, 9:12am

Hi @Green,

I think we need to seperate this discussion into two.

General schema advises
Queries and language collation considerations.

Queries and language collation considerations.

So a collation can be specified in a few places:

You can specify collation for a collection or a view, an index, or specific operations that support collation

Depand on your use case you need to see if you have a default collation for the whole collection or you do a per field one by using an index with the collation.

Than in your queries you have to specify the specific collation to match the index.

General schema advises

So I am not sure I understand the whole design, same collection reference documents within the same collection?

Why can’t the _id of a unique document be placed as a refDoc : <_id value> ?

If there is a small amount of languages why not to use a collection per language and than have unique locale indexes per this collection?

You can reference id’s from one language to another if needed. Moreover, the same _id can be used in two different collections as well

Let me know if that make sense…

Thanks,
Pavel

Green · May 11, 2021, 10:35am

Hi @Pavel_Duchovny you are helping me alot! Thank you.
So for the schema, lets take an example: I have a Recipe which references one or more Aliments. Both the Recipes and the Aliments have one version for each language. The Aliments are easy, they do not reference any document and are small, so in this case I could embed all the language in a single document (only name and description are going to change between one language and the other. However, I would like to have the same structure for all the multilanguage documents). Now, the Recipe has also its language and of course the Aliments it references should be of the same language too. This is an example of Recipe:

{
    _id,
    name,
    steps,
    tags,
    ingerdients: [
        {
            uniqueId,
            name
        }
    ]
}

Next, this Recipe is kept in another collection, lets call it DailyRecipes. The user can move one or more recipe into this collection. The user is currently using language “en”, so it will add to the DailyRecipes the Recipe [uniqueId: 1, language: en]. An entry in this DailyRecipes would be:

{
    _id,
    date,
    recipeUniqueId
}

Now, the user can travel from the DailyRecipes to the Recipe using the uniqueId and the current language (en) with the query {“uniqueId”: , “language”: “en”}. When the user changes the language, the navigation will use the query {“uniqueId”: , “language”: “de”}. What I dont like about that is this uniqueId, because it is only present for the public documents and not for the private and I would like to be able to treat the public and private documents in the same way in the API. What would be nice is to use the primary id and the language to select all the documents, instead of using the uniqueId and language for public and id for private.

Thank you!

Pavel_Duchovny · May 11, 2021, 11:52am

Hi @Green,

So this “UniqueId” is more a recipeId as it is not unique but can have few instances.

Why each document can’t have a recipeId ?

Additionally, why a document can’t have a “type” field defining if it is a “regular” recipe or a “DailyRecipe” ?

Why not to have the following schema:

{
    _id,
   recipeId, 
    name,
   isPublic : true/false,
   type : [ "regular", "daily"],
    steps,
  language,
    tags,
    ingerdients: [
        {
            uniqueId,
            name
        }
    ]
}

Now you search for user recipes and public from the same collection.

Or I am totally off?

Thanks,
Pavel

Green · May 11, 2021, 12:33pm

Hi @Pavel_Duchovny, actually no. The Recipes may be public or not, while the DailyRecipes collection is kind of a collection of today recipes. (Sorry the json I posted for the DailyRecipes was wrong):

{
    _id,
    date,
    recipes: [
        <recipeUniqueId>,
        <recipeUniqueId>,
        ...
    ]
}

When the user switches languages, the DailyRecipe document should point to the correct recipe with the current language, thats why I have this uniqueId (which is actually a recipe Id created by me). I know that this use case is really corner case for the regular users, but in the future there will be other kind of users that would use this feature to create the same document in multiple language. To me it would me much easier and clear to get rid of the additional uniqueId, but what I did not understand if it is possible with collation or some other mechanism (without having multiple collections) to have:

Document “en”:

{
    _id: 1,
   language: en,
   name: Fish Recipe
}

Document in “it”:

{
   _id: 1,
   language: it,
   name: Ricetta pesce
}

Having multiple collections could pose a problem when I have to aggregate the data from a public-language specific collection and user specific collection, or am I wrong? (Especially for search queries with sort and limit) (And not for now, but for the future if I have to shard the db I was wondering how these search queries would behave when the data is on different shards).

Thanks

Green

Pavel_Duchovny · May 11, 2021, 1:10pm

Hi @Green,

Why not to have the same recipe in one document having a language entry:

 {
    _id: 1,
   "en" : {
         name: Fish Recipe,
         Ingredients : [ ] ,
         ....
},
"it" : {
name: Ricetta pesce
 Ingredients : [ ] ,
}
...
}

You project the language you need based on user current preference. Using collation you can create an index per language field with its own collation

db.recipes.createIndex({_id : 1 , it : 1} , {collation ... } );

Thanks
Pavel

Green · May 11, 2021, 1:22pm

Hi @Pavel_Duchovny, that’s what I can do for Aliments and maybe Recipes, but not for other documents as they are probably getting too big … but I have to do some tries on that. Being able to put the stuff in the same document would be great, it will make the API and the code cleaner and partially simplify my indexes.

For the moment thank you so much! You helped me alot!

Green

Green · May 11, 2021, 1:34pm

Hi @Pavel_Duchovny just one last thing, when referencing another documnet, should the field holding the id be a string or a ObjectId?

Thank you!

Pavel_Duchovny · May 11, 2021, 2:06pm

@Green,

You should use object id of the reference to objectId and string to string for both code simplicity and potential $ lookup ability …