Wow, thanks so much for your comprehensive advice!
It was a MongoDB employee who explicitly stated:
- arrays are internally stored as object structures
- array update operations are syntactically more difficult than with the object notation
- because of the easier syntax it is often a good option to address the “array” elements by their field names - if the “array” contains elements that can be addressed by something unique in the element, consider taking that as the field name
So that’s what we did, but now I’m starting to consider redesigning the whole solution with arrays instead of object structures.
> I am curious about how you do queries like “Which user did `txtNameActivity: testActivity2`?” or “What activity was done on a given `tstActivity` date?”
The efficiency of queries has not been the main criterion so far because of the bigger picture of the solution. There are in fact multiple collections with a similar structure, and the main idea is to upsert new activities into the first collection, read the upserted activity out of a change stream document, perform follow-up upserts into the next downstream collection, and so on.
So the queries you suggested would then only be performed on the very last of these collections. But there, too, we currently have these object structures instead of arrays, which I suppose will hinder efficient querying for that kind of question (which most probably will be asked at some point).
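For what it's worth, here is roughly how the first of those questions would look in both shapes - a sketch only, shown as plain query documents; the collection name `events`, the `userId` field and the example date are assumptions, while `activities`, `txtNameActivity` and `tstActivity` are from our schema:

```javascript
// Sketch - collection name "events", field "userId" and the date are assumptions.

// Array form: plain dot-notation filters that a multikey index can serve.
const byName = { "activities.txtNameActivity": "testActivity2" };
const byDate = { "activities.tstActivity": new Date("2023-03-01") };
// db.events.find(byName, { userId: 1 })

// Object form: the activity ids are dynamic keys, so the values are only
// reachable after $objectToArray - a full collection scan, no index help.
const byNamePipeline = [
  { $match: { $expr: { $in: [
      "testActivity2",
      { $map: {
          input: { $objectToArray: { $ifNull: ["$activities", {}] } },
          as: "a",
          in: "$$a.v.txtNameActivity" } }
  ] } } },
  { $project: { userId: 1 } }
];
// db.events.aggregate(byNamePipeline)
```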
> With arrays you could easily implement the bucket pattern, using `$size` in the query part to limit the array size, completely avoiding the need to handle a 16 MB error.
Using the bucket pattern sounds like an approach that possibly should have been followed too. It would just add quite a bit of complexity, especially to the already complex recalculation logic that rebuilds or updates all those collections after changed logic and/or migrated/deleted data.
Also, I guess you could call the current implementation a sort of static (yearly) bucket pattern already, and I already had the “fun” of manually implementing a dynamic bucket pattern in Cassandra - size-dependent, with dynamic start/end timestamps taken from the point in time when the cap was reached - and I was praying I wouldn't have to do that again. I guess this issue goes too far, but it is good to know that arrays could help with that too.
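For reference, a minimal sketch of such a capped-bucket upsert. The collection name `buckets`, the `userId` field and the cap of 500 are assumptions; also note that the query operator `$size` only matches exact lengths, so a common way to express “fewer than N elements” in the query part is an `$exists` check on index N-1:

```javascript
// Sketch - "buckets", "userId" and CAP are assumptions for illustration.
// "activities.499" exists only once the array holds 500 elements, so the
// filter matches only buckets that still have room.
const CAP = 500;
const newActivity = { _id: "63ff26237d68b622e28852b3", txtNameActivity: "testActivity2" };

const filter = {
  userId: "user1",                               // equality field, copied into a fresh bucket on upsert
  [`activities.${CAP - 1}`]: { $exists: false }  // bucket not yet full
};
const update = { $push: { activities: newActivity } };
// db.buckets.updateOne(filter, update, { upsert: true })
// When every bucket of user1 is full, nothing matches and the upsert starts a
// new bucket - no array ever grows towards the 16 MB document limit.
```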
> Are those dynamic keys sorted/indexed? I do not think they are. With an array you could easily index `activities._id` and make the query an order of magnitude faster.
Especially the arguments about better performance with an index on `activities._id` (the dynamic keys are indeed not indexed) and about no longer needing `$objectToArray` to find the minimum `_id` sound convincing to me.
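For completeness, a sketch of what that part would look like with arrays (the collection name `events` is an assumption):

```javascript
// Sketch - collection name "events" is an assumption.
// A regular multikey index covers lookups into the array:
//   db.events.createIndex({ "activities._id": 1 })
const byId = { "activities._id": "63ff26237d68b622e28852b3" };
// db.events.find(byId)

// Minimum _id per document: $min works directly on the array path,
// no $objectToArray over dynamic keys needed anymore.
const minIdPipeline = [
  { $project: { minActivityId: { $min: "$activities._id" } } }
];
// db.events.aggregate(minIdPipeline)
```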
The first questions that come to my mind about the migration from object structure to arrays would be:
- `$addToSet` seems to be like “insert or ignore”, whereas we would always need “insert or replace”, like `$set` does in the object structure. So is this also possible with arrays?
  EDIT: This seems not to be a trivial issue and may be a counter-argument against using arrays:
  Use map vs array of embedded documents? - Working with Data - MongoDB Developer Community Forums
- In the `ChangeStreamDocument` we get only the upserted activities in the `updatedFields`. If there is then only one `activities` field containing an array, would we still be able to retrieve only the upserted activities this way?
  EDIT: This may be trivial - it seems to report upserted array elements just like added embedded fields, only with the array index as the key instead of the hex string (e.g. `activities.0` instead of `activities.63ff26237d68b622e28852b3`).
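Regarding the first question, one thing I found while digging: since MongoDB 4.2 the update part of `updateOne` can itself be an aggregation pipeline, which would allow an atomic “insert or replace” of a single array element. A sketch with assumed collection/field names:

```javascript
// Atomic "insert or replace" of one array element via a pipeline update
// (MongoDB 4.2+). Sketch only - collection and field names are assumptions.
const newActivity = { _id: "63ff26237d68b622e28852b3", txtNameActivity: "testActivity2" };

const pipelineUpdate = [
  { $set: { activities: { $concatArrays: [
      // drop a previous version of this element, if any ...
      { $filter: {
          input: { $ifNull: ["$activities", []] },
          as: "a",
          cond: { $ne: ["$$a._id", newActivity._id] } } },
      // ... then append the new version
      [ newActivity ]
  ] } } }
];
// db.events.updateOne({ userId: "user1" }, pipelineUpdate, { upsert: true })
```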
I guess this exceeds the scope of the initial question (and probably of appropriate questions in this forum in general), and with some time, research, and perhaps more consulting I'll find the missing pieces to reevaluate the schema design.