Documents with localization schema

I am trying to find a suitable structure for my MongoDB documents that contain localization support. To simplify my case, let’s assume that we have 10,000 documents, where each document in the collection consists of properties that are localized into over 150 languages, keyed by language code. In the real scenario, each document would also contain multiple sub-documents.

Example of a document where all properties are localized:

{
    "_id": "61a8cbb19a398cd2531b515e",
    "property1": {
        "eng-us": "translation text",
        "eng-uk": "translation text",
        "ekk": "translation text",
        "fra-fr": "translation text",
        "fra-ca": "translation text"
        ... +150 other languages 
    },
    "property2": {
        "eng-us": "translation text",
        "eng-uk": "translation text",
        "fra-fr": "translation text",
        "fra-ca": "translation text"
        ...+150 other languages
    },
    .. +20 other properties
}

Now let’s assume that every time I fetch this document, I only need one or two languages for each property. For instance, I only want “eng-us” for all the properties that have localization support. That would mean a projection like the following:

db.collection.find( { _id: "61a8cbb19a398cd2531b515e" }, { "property1.eng-us": 1, "property2.eng-us": 1, ...+20 other properties } )

This would return the following document:

{
    "_id": "61a8cbb19a398cd2531b515e",
    "property1": {
        "eng-us": "translation text"
    },
    "property2": {
        "eng-us": "translation text"
    },
    .. +20 other properties with only "eng-us" property
}
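
Since the projection needs one dotted path per property, it could be generated rather than typed out. A minimal sketch, assuming the list of localized property names is known up front; `buildProjection` is a hypothetical helper, not part of any MongoDB driver:

```javascript
// Hypothetical helper: build a projection that keeps only the requested
// locales for each localized property (all names are illustrative).
function buildProjection(propertyNames, locales) {
  const projection = {};
  for (const prop of propertyNames) {
    for (const locale of locales) {
      projection[`${prop}.${locale}`] = 1; // dotted path into the sub-document
    }
  }
  return projection;
}

const projection = buildProjection(["property1", "property2"], ["eng-us"]);
console.log(projection);
// The result would be passed as the second argument of find(), e.g.
// db.collection.find({ _id: "61a8cbb19a398cd2531b515e" }, projection)
```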

Question:

  • If I executed multiple projections in this way, how would that affect the performance and memory usage of the MongoDB server, compared with just returning the document without projections?
  • Are there any other methods I should consider that could significantly improve the performance and memory usage of fetching documents with such a structure?

Hi,
That’s a great question, quite thought-provoking, and it’s good to help someone think about exactly how MongoDB will handle this. There are two, or maybe three, parts to the question: projection, caching and ease of use. It’s also a good example of understanding your access pattern - i.e. you want all properties for a given language more often than all languages for a given property.

Projection-wise - inside MongoDB, when not using aggregation, data is manipulated as BSON, and BSON is just a list of Fieldname:Type:Value tuples, so the CPU cost of accessing and projecting a field is proportional to the number of fields at the same level that have to be walked. As it walks through the list it can skip over any field it doesn’t need, so the time to find your data will be proportional to the total number of properties, and inside each projected property the time to find a language will be proportional to the total number of languages, so in the worst case the entire document will need to be processed. In effect the number of string comparisons will be nTotalProperties + (nProjectedProperties * nTotalLanguages).

If you flip your schema round and have language.property instead, then the number of comparisons will be nTotalLanguages + (nProjectedLanguages * nTotalProperties) - I suspect that will be faster, but only you know the maths.
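
To put rough numbers on those two formulas, here is a small sketch. The sizes (22 properties, 155 languages, one projected language) are assumptions read off the example in the question, and the function names are made up for illustration. It only models string comparisons, not real server timings.

```javascript
// Comparison-count model for the property-first layout (property.language).
function propertyFirstCost(nTotalProperties, nProjectedProperties, nTotalLanguages) {
  return nTotalProperties + nProjectedProperties * nTotalLanguages;
}

// Comparison-count model for the language-first layout (language.property).
function languageFirstCost(nTotalLanguages, nProjectedLanguages, nTotalProperties) {
  return nTotalLanguages + nProjectedLanguages * nTotalProperties;
}

// Assumed sizes: 22 properties, 155 languages, one language for all properties.
const nProperties = 22;
const nLanguages = 155;

console.log(propertyFirstCost(nProperties, nProperties, nLanguages)); // 3432
console.log(languageFirstCost(nLanguages, 1, nProperties));           // 177
```

Under these assumptions the language-first layout needs roughly 19 times fewer comparisons, but the ratio depends entirely on how many properties and languages you actually project.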

There is a further optimisation you can do if you really want this schema, which is to sort both properties and languages by most frequently accessed. Once MongoDB finds a field in BSON it stops looking for it, so once it has found everything you are projecting it can skip the rest of the document - by sorting you can make the most frequent requests faster and use less CPU.
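
The effect of field order can be illustrated with a toy model of that early exit. This is a sketch under the assumption that a scan stops once every wanted field has been seen; `comparisonsUntilFound` and the two orderings are hypothetical and do not reflect MongoDB internals.

```javascript
// Toy model: scan an ordered list of field names and stop as soon as every
// wanted name has been found. Returns the number of name comparisons made.
function comparisonsUntilFound(orderedFieldNames, wantedNames) {
  const remaining = new Set(wantedNames);
  let comparisons = 0;
  for (const name of orderedFieldNames) {
    if (remaining.size === 0) break; // everything found: skip the rest
    comparisons++;
    remaining.delete(name);
  }
  return comparisons;
}

// Hypothetical language orders: alphabetical versus most-requested first.
const alphabetical = ["ekk", "eng-uk", "eng-us", "fra-ca", "fra-fr"];
const byPopularity = ["eng-us", "eng-uk", "fra-fr", "fra-ca", "ekk"];

console.log(comparisonsUntilFound(alphabetical, ["eng-us"])); // 3
console.log(comparisonsUntilFound(byPopularity, ["eng-us"])); // 1
```

With 150+ languages per property the gap between the two orderings would be correspondingly larger for a frequently requested locale.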

Personally - given the data volumes involved, I would group by language, not property, and sort the languages by predicted use (this is if you really feel this is worth optimising), then just project with { "eng-us": 1 } and pull the whole language object back rather than enumerating properties on the server. It’s a bit more network, but you say there are just 20 or so properties in total, and this way your code is easier and more resilient to change.
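
As a rough sketch of what grouping by language could look like, the function below pivots a property-first document into a language-first one. It assumes every field other than _id is a localized sub-document; `pivotToLanguageFirst` and the sample data are hypothetical.

```javascript
// Pivot { property: { lang: text } } into { lang: { property: text } }.
// Assumes every field except _id is a localized sub-document.
function pivotToLanguageFirst(doc) {
  const pivoted = { _id: doc._id };
  for (const [prop, translations] of Object.entries(doc)) {
    if (prop === "_id") continue;
    for (const [lang, text] of Object.entries(translations)) {
      if (!Object.prototype.hasOwnProperty.call(pivoted, lang)) {
        pivoted[lang] = {};
      }
      pivoted[lang][prop] = text;
    }
  }
  return pivoted;
}

const sample = {
  _id: "61a8cbb19a398cd2531b515e",
  property1: { "eng-us": "translation text", "fra-fr": "translation text" },
  property2: { "eng-us": "translation text", "fra-fr": "translation text" },
};

console.log(pivotToLanguageFirst(sample));
// A single-language fetch then needs only one projected path, e.g.
// db.collection.find({ _id: "61a8cbb19a398cd2531b515e" }, { "eng-us": 1 })
```

New properties then show up in the result without any change to the projection, which is the resilience mentioned above.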

If you care about cache, then break it up further and put each language in a separate document. MongoDB will always load the whole document into cache, so if many languages are seldom used, breaking this into a document per locale will allow you to cache only what you need.
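
A document-per-locale layout might be produced along these lines. This is only a sketch: the compound _id shape { docId, lang } is an assumption chosen for illustration, and `splitByLocale` is a hypothetical helper.

```javascript
// Split one property-first document into one document per locale.
// The compound _id shape { docId, lang } is an assumption for illustration.
function splitByLocale(doc) {
  const byLang = new Map();
  for (const [prop, translations] of Object.entries(doc)) {
    if (prop === "_id") continue;
    for (const [lang, text] of Object.entries(translations)) {
      if (!byLang.has(lang)) {
        byLang.set(lang, { _id: { docId: doc._id, lang } });
      }
      byLang.get(lang)[prop] = text;
    }
  }
  return [...byLang.values()];
}

const localized = {
  _id: "61a8cbb19a398cd2531b515e",
  property1: { "eng-us": "translation text", "ekk": "translation text" },
  property2: { "eng-us": "translation text" },
};

console.log(splitByLocale(localized));
// Each locale could then be fetched on its own, e.g.
// db.collection.findOne({ _id: { docId: "61a8cbb19a398cd2531b515e", lang: "eng-us" } })
```

The trade-off is that reading every language of one item now takes several documents instead of one, which fits the stated access pattern where that is the rare case.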
