I have a use case of storing large JSON files (200+GB) and have chosen MongoDB as the data management platform. I can store them using GridFS into its two collections (files and chunks), and I can query all chunks for a given file as well, but the problem is that the actual data is stored in binary (Binary.createFromBase64) format. However, I want to apply aggregations and projections using MongoDB queries. Is there a way to convert “Binary.createFromBase64” to “string” when I’m projecting and applying aggregations?
GridFS is probably not the best way to store a 200GB JSON object. What’s the schema of these objects? It would likely be better to deconstruct the object into smaller pieces and store them in a regular collection so that they can be queried and updated. GridFS files are immutable and the contents are not queryable.
Getting to the appropriate schema will also depend on the read/write patterns of your app. Apologies if this is not as helpful as you expected. Perhaps someone else can offer some more specific advice?
Unfortunately, I don’t have control over how this single large (200+GB) JSON document is constructed, as we are consumers of these files and need to find a way to extract the data out of them. Moving this computation to Spark and storing the extracted objects in SQL format would probably make sense.
But I would like to hear from the community to see if anyone proposes other ideas for this use case.
As Steevej said, why not either pre-process the file into small files and then process those, or process the large file with JSONPath or something similar? I’ve not tested any of the JSONPath implementations against large files, but you should be able to do that.
You could also always pre-process the data into a series of collections and then have another script to combine them as needed into final-form documents.
So extract all the provider groups, in_network, and provider_reference data, and then, depending on how you want the final document to look, re-process them in Mongo; with indexes in place this could be pretty fast.
This would also allow you to discard any “fluff” that you don’t need that’s taking up space in that monster file.
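The re-combination step could be an aggregation that joins the split collections back together and writes the result out. A rough sketch (the collection and field names `in_network`, `provider_references`, `provider_group_id`, and `combined_rates` are hypothetical — substitute your own):

```python
# Hypothetical pipeline: join each in_network record to its
# provider_references documents, then write the combined
# documents to a new collection via $merge.
pipeline = [
    {"$lookup": {
        "from": "provider_references",       # assumed collection name
        "localField": "provider_group_id",   # assumed join key
        "foreignField": "_id",
        "as": "providers",
    }},
    {"$merge": {"into": "combined_rates"}},  # assumed target collection
]

# With PyMongo this would run as:
#   db.in_network.aggregate(pipeline)
```

An index on the `foreignField` side of the `$lookup` is what makes this fast on large collections.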
Another thought was to convert the JSON to XML and then use an XPath processor if some of those are better at handling large file sizes than the JSONPath ones.
You could also then look at a streaming SAX Parser or something.
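If you do go the XML route, Python’s standard library already ships a streaming SAX parser. A minimal sketch (the `<provider>` element name is hypothetical — the handler just collects the text of matching elements without ever holding the whole document in memory):

```python
import io
import xml.sax

class ProviderHandler(xml.sax.ContentHandler):
    """Collect the text of <provider> elements while streaming,
    so memory use stays flat regardless of file size."""
    def __init__(self):
        super().__init__()
        self.providers = []
        self._in_provider = False
        self._chars = []

    def startElement(self, name, attrs):
        if name == "provider":  # hypothetical element name
            self._in_provider = True
            self._chars = []

    def characters(self, content):
        if self._in_provider:
            self._chars.append(content)

    def endElement(self, name):
        if name == "provider":
            self.providers.append("".join(self._chars))
            self._in_provider = False

# Tiny in-memory document for illustration; pass a file path
# (or open file object) for a real multi-gigabyte file.
handler = ProviderHandler()
xml.sax.parse(io.BytesIO(b"<root><provider>A</provider>"
                         b"<provider>B</provider></root>"), handler)
print(handler.providers)  # -> ['A', 'B']
```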
While reading John_Sewell’s reply, which mentioned extracting in_network and provider_reference, it made me think that I might have skipped too quickly over the screenshot of the document.
And I did.
I just noticed the presence of the field _id as an ObjectId. This is very MongoDB-specific, so I suspect that the 200GB JSON file you are receiving was generated with mongoexport from another mongod instance. This means you should be able to mongoimport it directly. You might need to play around with the option --jsonArray.
So try mongoimport and report any issues or success.
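For reference, the invocation might look something like this (the URI, database, and collection names here are placeholders — adjust them to your environment):

```shell
# Import a mongoexport-style JSON array into a regular collection.
mongoimport --uri "mongodb://localhost:27017" \
  --db claims --collection in_network \
  --file export.json --jsonArray
```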
I used mongofiles to import this large document, and it got stored in binary format, as shown in the screenshot in the original post. I took a smaller document (< 16MB) as a schema reference for the community. Based on the documentation, --jsonArray works only for smaller documents.
I tried with a smaller file (150MB) to test mongoimport, and it failed with a “document is too large” message. Is there any other Mongo tool that can split the file into smaller pieces while uploading to MongoDB, or do we need to use programming tools?