I have a use case of storing large JSON files (200+GB) and have chosen MongoDB as the data management platform. I can store them using GridFS into its two collections (files and chunks), and I can query all chunks for a given file as well, but the problem is that the actual data is stored in binary (Binary.createFromBase64) format. However, I want to apply aggregations and projections using MongoDB queries. Is there a way to convert “Binary.createFromBase64” to “string” when I’m projecting and applying aggregations?
GridFS is probably not the best way to store a 200GB JSON object. What’s the schema of these objects? It would likely be better to deconstruct the object into smaller pieces and store them in a regular collection so that they can be queried and updated. GridFS files are immutable and the contents are not queryable.
Getting to the appropriate schema will also depend on the read/write patterns of your app. Apologies if this is not as helpful as you expected. Perhaps someone else can offer some more specific advice?
Unfortunately, I don’t have control over how this single large (200+GB) JSON document is constructed, as we are consumers of these files and need to find a way to extract the data out of them. Moving this computation to Spark and storing the extracted objects in SQL format would probably make sense.
But I would like to hear from the community to see if anyone proposes other ideas for this use case.
As Steevej said, why not either pre-process the file into small files and then process those, or process the large file with JSONPath or something similar? I’ve not tested any of the JSONPath implementations against large files, but you should be able to do that.
You could also always pre-process the data into a series of collections and then have another script to combine them as needed into final-form documents.
So extract all the provider groups, in_network, and provider_reference data, and then, depending on how you want the final document to look, re-process them in Mongo; with indexes in place this could be pretty fast.
This would also allow you to discard any “fluff” that you don’t need that’s taking up space in that monster file.
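The re-combination step could be an aggregation that joins the split collections back together and writes the result out. A rough sketch (the collection and field names `in_network`, `provider_references`, `provider_group_id`, and `combined_rates` are hypothetical — substitute your own):

```python
# Hypothetical pipeline: join each in_network record to its
# provider_references documents, then write the combined
# documents to a new collection via $merge.
pipeline = [
    {"$lookup": {
        "from": "provider_references",       # assumed collection name
        "localField": "provider_group_id",   # assumed join key
        "foreignField": "_id",
        "as": "providers",
    }},
    {"$merge": {"into": "combined_rates"}},  # assumed target collection
]

# With PyMongo this would run as:
#   db.in_network.aggregate(pipeline)
```

An index on the `foreignField` side of the `$lookup` is what makes this fast on large collections.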
Another thought was to convert the JSON to XML and then use an XPath processor if some of those are better at handling large file sizes than the JSONPath ones.
You could also then look at a streaming SAX Parser or something.
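If you do go the XML route, Python’s standard library already ships a streaming SAX parser. A minimal sketch (the `<provider>` element name is hypothetical — the handler just collects the text of matching elements without ever holding the whole document in memory):

```python
import io
import xml.sax

class ProviderHandler(xml.sax.ContentHandler):
    """Collect the text of <provider> elements while streaming,
    so memory use stays flat regardless of file size."""
    def __init__(self):
        super().__init__()
        self.providers = []
        self._in_provider = False
        self._chars = []

    def startElement(self, name, attrs):
        if name == "provider":  # hypothetical element name
            self._in_provider = True
            self._chars = []

    def characters(self, content):
        if self._in_provider:
            self._chars.append(content)

    def endElement(self, name):
        if name == "provider":
            self.providers.append("".join(self._chars))
            self._in_provider = False

# Tiny in-memory document for illustration; pass a file path
# (or open file object) for a real multi-gigabyte file.
handler = ProviderHandler()
xml.sax.parse(io.BytesIO(b"<root><provider>A</provider>"
                         b"<provider>B</provider></root>"), handler)
print(handler.providers)  # -> ['A', 'B']
```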
While reading John_Sewell’s reply, which mentioned extracting in_network and provider_reference, it made me think that I might have skipped too quickly over the screenshot of the document.
And I did.
I just noticed the presence of the field _id as an ObjectId. This is very MongoDB-specific, so I suspect that the 200GB JSON file you are receiving was generated with mongoexport from another mongod instance. This means you should be able to mongoimport it directly. You might need to play around with the option --jsonArray.
So try mongoimport and report any issues or success.
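For reference, the invocation might look something like this (the URI, database, and collection names here are placeholders — adjust them to your environment):

```shell
# Import a mongoexport-style JSON array into a regular collection.
mongoimport --uri "mongodb://localhost:27017" \
  --db claims --collection in_network \
  --file export.json --jsonArray
```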
I used mongofiles to import this large document, and it got stored in binary format, as shown in the screenshot in the original post. I took a smaller document (< 16MB) as a schema reference for the community. Based on the documentation, --jsonArray works only for smaller documents.
I tried with a smaller file (150MB) to test mongoimport, and it failed with a “document is too large” message. Is there any other Mongo tool that can split the file into smaller pieces while uploading to MongoDB, or do we need to use programming tools?