Toying around, is this still a "collection" Coll3 = client[DB][Coll][Coll2][Coll3]

I’m noob to Mongo, so I was toying around with how to model some simple flat table data I want on Mongo: <report_name>_data = {"ID": ..., "DATE":..., "NAME": ..., etc.}

What tricky is structuring the collections and database connections. I made some tests and got something to work, but I don’t know what I did…

import pymongo
client = pymongo.MongoClient('', ...)

## started with the normal Mongo docs structure ...
APP_NAME = 'test'                    # DB NAME
CLIENT_NAME = 'hanna_barbera'        # COLLECTION NAME
#db = client[APP_NAME]          
#db_collection = db[CLIENT_NAME]  

## then I just kept going!
VENDOR_NAME = 'spacely_sprockets'    # ??
REPORT_NAME = 'contact_list'         # ??
#coll2 = db_collection[VENDOR_NAME]
#coll3 = coll1[REPORT_NAME]

## Actually, I don't need these other objects, so this works too...

Looking in the documentation and online for things like, nested, array, collection, I don’t find any similar examples. Everything I read says, db at the first level, then collections, then documents.

Are VENDOR_NAME and REPORT_NAME still “collections”? Or, does something else happen?

Yes those are collections. The name of a collection is provided via the full_name attribute:

>>> client['test']['hanna_barbera'].full_name
>>> client['test']['hanna_barbera']['spacely_sprockets']['contact_list'].full_name
>>> client['test']['hanna_barbera.spacely_sprockets.contact_list'].full_name

Note that different collections in the same database are completely unrelated no matter the naming convention.

It might make sense to put all the reports in a single collection, eg ‘test.reports’. You can add a “vendor” and “report” field to every document. Eg:

data = {"ID": ..., "DATE":..., "NAME": ..., etc.}
data["vendor"] = "spacely_sprockets"
data["report"] = "contact_list"

So is it half-dozen of one and 6 of the other? If it’s simple to insert and recall the collection with the name example I made. Or, insert & filter & query by new fields added to the input data.

I noticed the post was moved to Drivers & ODMs, because of pymongo–for sure. I take it this name example is not common in Mongo Syntax?

No they are not the same. It’s much more efficient and ergonomic to store all similar data in one collection. Splitting similar data across multiple collections is definitely the exception rather than the rule, so to speak.

For example, if you wanted to count all the reports by a single vendor, a very simple and reasonable request. With multiple collections this would be very painful (listCollections, string search/regex to match the vendor’s collection name(s), then count each matching collection). With a single collection it would be simply:

collection.count_documents({'vendor': '<vendor name>'}) 

Ah. Analytics. Not to bore you with the details of my project, but I really am trying to orient myself to Mongo.

  • I need a robust way to process report files. Report ratios: 1 client : ~50-75 vendors : 1-3 reports . ~50 clients. Clients may have 30-60% unique vendors. The low estimate is 800 different reports. Schema-less lets me continue to process documents, and focus on responding to schema changes in later analysis phase. Put it another way, staff and bots continue to upload files, only the reports are late.

  • Clients are independent entities, and serviced as independent entities. Cross-section analysis between clients could be interesting as internal business insights, but it’s not what we’re paid to do.

The peas and carrots don’t mix, but I take your point.

All things being equal, there is no comparison for ease of analysis the flat organization provides. It’s an important lesson for me to understand Mongo, that I don’t need to keep similar-schema reports together. If I add features into the document-level records (client, vendor, report) I can–what I’m trying to wrap my head around–have secure, stable, reliable data storage AND simple analysis.