Removing map entry with minimum key

Hi mongodb community,

I would be very grateful for help with the following issue.
It is about a document containing yearly activities for a user:

{
  "_id": {
    "$oid": "6437db5e791549e390d8a0f0"
  },
  "idUser": "testUser1",
  "numYear": 2023,
  "activities": {
      "63ff26237d68b622e28852b3": {
        "_id": { "$oid": "63ff26237d68b622e28852b3" },
        "tstActivity": { "$date": "2023-03-01T11:17:06.000Z" },
        "txtNameActivity": "testActivity1"
      },
      "63ff2fff7d68b622e2886288": {
        "_id": { "$oid": "63ff2fff7d68b622e2886288" },
        "tstActivity": { "$date": "2023-03-01T11:59:11.000Z" },
        "txtNameActivity": "testActivity2"
      }
  }
}

We have some edge cases where such a document grows too big, causing:

WriteError{code=17419, message='Resulting document after update is larger than 16777216', details={}}

All MongoDB access happens through Java services (using only the native MongoDB Java driver).
So the basic idea to cope with the issue is:

  1. On upsert of a new activity, catch the exception for this specific error (MongoWriteException.getError().getCode() == 17419)
  2. If that happens, delete the activity with the minimum key (hex string representation of _id) and try again until success (see the rough sketch below)
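
In (pseudo-)Java the retry loop would look roughly like this - upsertActivity and removeMinKeyActivity are just placeholders for our service methods, and removeMinKeyActivity is exactly the open point (2):

void upsertWithEviction(Document activity, String idUser, int numYear) {
  while (true) {
    try {
      upsertActivity(activity, idUser, numYear);   // our existing upsert (placeholder)
      return;
    } catch (MongoWriteException e) {
      if (e.getError().getCode() != 17419) {
        throw e;                                   // unrelated write error
      }
      removeMinKeyActivity(idUser, numYear);       // point (2): evict the entry with the minimum key
    }
  }
}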

The question is how point (2) can be solved efficiently.
This is the approach I have so far:

Bson filter = Filters.and(
  Filters.eq("numYear", 2023),
  Filters.eq("idUser", "testUser1"));

Bson match = Aggregates.match(filter);

Bson addFields = Aggregates.addFields(new Field("activityArray",
  new Document("$objectToArray", "$activities")));

Bson project = Aggregates.project(Projections.fields(
  Projections.excludeId(),
  Projections.computed("minActivityKey",
    new Document("$min", "$activityArray.k"))));
	
MongoCollection<Activities> mongoCollection = mongoDatabase
  .getCollection("activities", Activities.class);

AggregateIterable<Activities> aggregate = mongoCollection.aggregate(
  Arrays.asList(match, addFields, project));

Bson unset = Updates.unset("activities." + 
  aggregate.first().getMinActivityKey());

mongoCollection.updateOne(filter, unset);

The problem with this is that the “Activities” class represents the document schema and consequently does not contain the aggregated field “minActivityKey”.

So my question is whether there is a way to access the result of an aggregation pipeline without mapping it to the class the MongoCollection is associated with.

In case anyone knows of better options or best practices for this possibly not-so-uncommon issue, I would also be very glad for input.

Thank you in advance and br,
Jan

Maybe, just maybe, you could avoid hitting the limit by having a model with less redundancy.

The string representation of _id takes more space and is slower to compare than an $oid. And in your case you also store the _id as an $oid, which is very wasteful.

You should not have dynamic key names.

Your activities object should be an array that uses the attribute pattern. This way you could index, sort, slice, filter or map it. The fact that your implementation needs $objectToArray is an indication that you should have an array: what $objectToArray does is take an object and convert it to the attribute pattern.

So I would convert the schema from

 "activities": {
      "63ff26237d68b622e28852b3": {
        "_id": { "$oid": "63ff26237d68b622e28852b3"  },
        "tstActivity": { "$date": "2023-03-01T11:17:06.000Z"  },
        "txtNameActivity": "testActivity1"
      },
      "63ff2fff7d68b622e2886288": {
        "_id": { "$oid": "63ff2fff7d68b622e2886288"  },
        "tstActivity": { "$date": "2023-03-01T11:59:11.000Z"  },
        "txtNameActivity": "testActivity2"
      }
  }

to

 "activities": [
      {
        "_id": { "$oid": "63ff26237d68b622e28852b3"  },
        "tstActivity": { "$date": "2023-03-01T11:17:06.000Z"  },
        "txtNameActivity": "testActivity1"
      },
      {
        "_id": { "$oid": "63ff2fff7d68b622e2886288"  },
        "tstActivity": { "$date": "2023-03-01T11:59:11.000Z"  },
        "txtNameActivity": "testActivity2"
      }
  ]

This will give you more compact data and more efficient code. More compact data will give you better performance because you will hit your special case less often.

Thanks a lot for your input!

I’m a bit puzzled, because a MongoDB consultant told us to rather avoid arrays, since MongoDB internally stores them as object structures anyway (with keys 0, 1, 2, …), and that it is often a good idea to use a unique identifier as the key of an object structure instead, especially to make update operations simpler.

Also, we rely heavily on $set and $unset operations in upserts, using these manually set keys to identify the entries to insert/update/remove efficiently and to avoid data duplication - I have no idea yet how that would work with arrays.

Obviously, storing the id twice is not optimal in terms of data efficiency, but in reality these objects are quite a bit larger, so the ids do not really make that much of a difference. And even with the most efficient data model we could not avoid reaching the document size cap in some edge cases (some test systems produce unrealistically large amounts of activities for always the same test user; this is not a problem in production).

However, even if we should optimize the data model, I would still need a solution that works with the current one, and I am hoping for helpful input on that.

A consultant that knows MongoDB, or a MongoDB employee that you hired as a consultant? As an independent consultant myself, my recommendation is to not blindly listen to consultants. If the recommendation to avoid arrays came from a MongoDB employee, I would like MongoDB staff in this forum to comment on it.

I just made 2 simple collections, one using the map way and the other using the array way, with the 2 documents from my previous post, and I got:

db.as_array.stats().size
315
db.as_object.stats().size
361

Yes, 46 bytes for 2 activities is not that much. But to reach the 16MB limit you must have a lot of activities. End of my argument about size.

The update code in your post is not that simple. With an array you could easily implement the bucket pattern, using $size in the query part to limit the array size and completely avoid the need to handle the 16MB error. With the bucket pattern you can cap how many activities end up in a single document.

Are you sure about the “efficiently” part?

Are those dynamic keys sorted/indexed? I do not think they are. With an array you could easily index activities._id and make queries an order of magnitude faster.
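
Creating that index with the Java driver would be a one-liner (sketch, assuming the array schema from my previous post):

// Multikey index on the embedded _id of each activities element
mongoCollection.createIndex(Indexes.ascending("activities._id"));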

I am curious about how you do queries like “Which user did txtNameActivity:testActivity2?” or “What activity was done on a given tstActivity date?”

The consultant should have shown you.

For example

key = "63ff26237d68b622e28852b3"
{ "$unset" : [ "activities." + key ] }

would be

key = "63ff26237d68b622e28852b3"
{ "$pull" : { "activities" : { "_id" : new ObjectId( key ) } } }

To avoid duplicates you would

activity = {
        "_id": { "$oid": "63ff26237d68b622e28852b3"  },
        "tstActivity": { "$date": "2023-03-01T11:17:06.000Z"  },
        "txtNameActivity": "testActivity1"
}
... { "$addToSet" : { "activities" : activity } }

My point is that what we can do with dynamic keys can also be done with arrays.

With dynamic keys I have no idea how this could be done without $objectToArray, like you do.

You could use MongoCollection<Document> rather than MongoCollection<Activities> to aggregate in order to get the $min.
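
For example, something along these lines (untested sketch; it reuses the filter and pipeline from your snippet):

Bson filter = Filters.and(
  Filters.eq("numYear", 2023),
  Filters.eq("idUser", "testUser1"));

// Aggregate against a Document-typed view of the same collection so the
// computed "minActivityKey" field is reachable without touching the POJO.
Document result = mongoCollection
  .withDocumentClass(Document.class)
  .aggregate(Arrays.asList(
    Aggregates.match(filter),
    Aggregates.addFields(new Field<>("activityArray",
      new Document("$objectToArray", "$activities"))),
    Aggregates.project(Projections.fields(
      Projections.excludeId(),
      Projections.computed("minActivityKey",
        new Document("$min", "$activityArray.k"))))))
  .first();

if (result != null) {
  mongoCollection.updateOne(filter,
    Updates.unset("activities." + result.getString("minActivityKey")));
}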

Wow, thanks so much for your comprehensive advice!

It was a MongoDB employee who explicitly stated:

  • arrays are internally stored as object structures
  • array update operations are syntactically more difficult than when using the object notation
  • because of the easier syntax it is often a good option to address the “array” elements by their
    field names - if the “array” contains elements that can be uniquely addressed by something in the
    element, consider taking this as a field name

So that’s what we did, but now I’m starting to consider redesigning the whole solution with arrays instead of object structures.

I am curious about how you do queries like “Which user did txtNameActivity:testActivity2?” or “What activity was done on a given tstActivity date?”

The efficiency of queries has not been the main criterion so far because of the bigger picture of the solution. There are in fact multiple collections with a similar structure, and the main idea was to upsert new activities into the first collection, then read the upserted activity out of a change stream document and perform follow-up upserts into the next downstream collection, and so on.

So the queries you suggested would only be performed on the very last of these collections. But there, too, we currently have these object structures instead of arrays, which I suppose will hinder efficient querying to answer that kind of question (which most probably will be asked at some point).

With array you could easily implement the bucket pattern using $size in the query part to limit the array size completely avoiding the need to handle a 16M error.

Using the bucket pattern sounds like an approach that possibly should have been followed too. It would just add quite a bit of complexity, especially to the already complex recalculation logic that rebuilds or updates all those collections for changed logic and/or migrated/deleted data.

Also, I guess you could call the current implementation a sort of static (yearly) bucket pattern already, and I already had the “fun” of manually implementing a dynamic bucket pattern in Cassandra - size dependent, with dynamic start/end timestamps of the point in time when the cap was reached - and I was praying that I wouldn’t have to do that again. I guess this goes beyond the current issue, but it is good to know that arrays could help with that too.

Are those dynamic keys sorted/indexed? I do not think they are. With an array you could easily index activities._id and make query an order of magnitude faster.

Especially the arguments about the better performance with an index on activities._id (the dynamic keys are not indexed) and about no longer needing $objectToArray to find the minimum _id sound convincing to me.

The first questions that come to my mind about the migration from object structure to arrays would be:

  1. $addToSet seems to be like “insert or ignore”, whereas we would always need “insert or replace” like $set does in object structures. So is this also possible with arrays? (One possible workaround using a pipeline update is sketched after this list.)
    EDIT: This seems not to be a trivial issue and may be a counter argument against using arrays:
    Use map vs array of embedded documents? - Working with Data - MongoDB Developer Community Forums
  2. In the ChangeStreamDocument we only get the upserted activities in the updatedFields. If there is then only one activities field containing an array, would we still be able to retrieve only the upserted activities this way?
    EDIT: This may be trivial - it seems to report the upserted array elements just like added embedded fields, only with the array index instead of the hex string as key (e.g. activities.0 instead of activities.63ff26237d68b622e28852b3)
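
Regarding point 1, one possible workaround I came across would be an update with an aggregation pipeline (MongoDB 4.2+, untested for our case): filter any existing element with the same _id out of the array, then append the new version.

Document newActivity = new Document("_id", new ObjectId("63ff26237d68b622e28852b3"))
  .append("tstActivity", Date.from(Instant.parse("2023-03-01T11:17:06Z")))
  .append("txtNameActivity", "testActivity1");

// { $set: { activities: { $concatArrays: [ <existing without same _id>, [ newActivity ] ] } } }
Bson replaceInArray = new Document("$set",
  new Document("activities",
    new Document("$concatArrays", Arrays.asList(
      new Document("$filter", new Document()
        .append("input", new Document("$ifNull",
          Arrays.asList("$activities", Collections.emptyList())))
        .append("cond", new Document("$ne",
          Arrays.asList("$$this._id", newActivity.getObjectId("_id"))))),
      Collections.singletonList(newActivity)))));

mongoCollection.updateOne(
  Filters.and(Filters.eq("idUser", "testUser1"), Filters.eq("numYear", 2023)),
  Collections.singletonList(replaceInArray));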

I guess this exceeds the scope of the initial question (and probably of appropriate questions in this forum in general), and with some time, research and perhaps more consulting I’ll find the missing pieces to reevaluate the schema design.


One more issue that applies to both the map and the array implementation.

Are the activities within a single document updated often? By this I mean: do you add one activity at a time, or do you add most of the activities at the same time?

The reason I ask is that when a document is updated, the whole document is written back to storage. So if a document is updated once per activity, then a document with N activities will be written N times. In that case it would probably be more efficient (update-wise) to have the activities in a separate collection. This way only the small activity documents are written, rather than a big document over and over again.
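
For example, each activity becomes one small document in its own collection (the collection name here is just a placeholder):

MongoCollection<Document> activityCollection =
  mongoDatabase.getCollection("activity_events");   // hypothetical collection name

// supports the per-user/per-year queries without rewriting a big yearly document
activityCollection.createIndex(Indexes.ascending("idUser", "numYear", "tstActivity"));

activityCollection.insertOne(new Document("_id", new ObjectId("63ff26237d68b622e28852b3"))
  .append("idUser", "testUser1")
  .append("numYear", 2023)
  .append("tstActivity", Date.from(Instant.parse("2023-03-01T11:17:06Z")))
  .append("txtNameActivity", "testActivity1"));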

Yes, it is part of a real-time aggregation stream, and activities are always upserted one at a time, each time a relevant event arrives.

This makes me think of my previous idea of having a flat data model with one document per activity. I was then confronted with the argument that such an approach would be a result of still thinking in the relational world, and that with MongoDB there should rather be one document for what belongs together. But it would solve any potential problems with document size or embedded documents.

I am currently thinking in the direction that we should perhaps go for such documents per idUser, with indexed arrays(?) of activities, in a last collection that will contain a subset of relevant data and allow flexible querying. But for the upstream collections that are very frequently updated, one activity at a time, it indeed seems more reasonable to me at the moment to have one activity per document.

Thanks for pointing that out!


Look at GridFS-Can we use aggregation query from GridFS specification - #4 by steevej.

I wrote

In some cases, plain old normalization is still a valid solution.

And we could add: in others it is the best solution.


Hi Jan and Steeve,

In addition to what Steeve has already mentioned I went through this topic and in summary:

  • With the current document design there are edge cases where some documents are hitting the 16MB limit.
  • You have mentioned / are considering a redesign - this opens up a few options, some of which you have already mentioned, including flattening the structure and creating a new collection for activities.
  • One idea mentioned was to change the activities field to an array. This would still have the possibility of hitting the same limit as the current design. In short, it could ease the workflow when you’re hitting the limit, but will not reduce the chance of hitting it, as this is basically still the same design. Of course there are tradeoffs with this change as well, some being:
    • Ambiguous querying (as opposed to the current design)
    • I note “reduce the chance” as opposed to completely removing the chance, as I am thinking of a case where a single element is removed whilst another is pushed in, and the newly added element is slightly larger than the element removed. This is probably an “edge edge” case, but I wanted to clarify my statement.

This makes me think of my previous idea of having a flat data model with one document per activity

My conclusion is that the current schema design works fine for most of your use cases, but there are edge cases that require you to work around the design’s limitations. Changing this structure into an array would not help you avoid this situation, but might make it easier to handle. As per Steeve’s suggestion, although denormalization is usually how people design their schemas in MongoDB, perhaps for this use case you can consider normalization, as long as it doesn’t make your other existing workflows harder.

