Importing 917MB of JSON data

Hello Admins,
I’d like to ask the same question because I have a hunch that this is the issue I am facing. I am a student who is fairly new to MongoDB, and I am attempting to import a Twitter JSON file which is currently 917 MB.

The file looks like this:

There are 18M lines of tweet data which only contain the date of the tweet and the text itself.

Could anyone please help me get around this issue?

Thank you for your attention!
Jim

The error I am having is this:

When your documents are inside an array you have to use the --jsonArray option.

However, in your case I am not sure it will work because you will exceed the 16 MB size limit that --jsonArray imports are restricted to.
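
For smaller files, the invocation would look something like this (the connection string, database, and file names here are only placeholders):

# import a file whose top level is one JSON array
mongoimport --uri "mongodb+srv://user:password@cluster0.example.mongodb.net/mydb" --collection tweets --file tweets.json --jsonArray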

I suggest that you use jq to change the format of your file to something like:

{
	"date" : 1 ,
	"tweet" : "hello"
}
{
	"date" : 2 ,
	"tweet" : "world"
}

That is, remove the opening and closing square brackets and the commas separating each tweet. The document structure is simple, so you could also use a text editor to reformat.
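
As a rough sketch, assuming the top level of your file is one big JSON array, jq can stream each element out as a single-line document, and the result can then be imported without --jsonArray (file names and the connection string are placeholders):

# write each array element as one compact JSON document per line
jq -c '.[]' tweets_array.json > tweets.ndjson

# import the newline-delimited file; no --jsonArray needed
mongoimport --uri "mongodb+srv://user:password@cluster0.example.mongodb.net/mydb" --collection tweets --file tweets.ndjson

Note that jq holds the whole input in memory for this, which should still be workable for a file of this size.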

@steevej,
Thank you for your reply. Appreciate it.
I’ve done as you recommended and removed the ‘[ ]’ at the beginning and end of the file.

I now have a different error saying "Failed: error processing document #2: invalid character ‘,’ looking for beginning of value".

Any thoughts?

Jim

More info:
I’m following a thread about this error. Here is a verbose output of the error

Let me know if this helps.

J

You missed the following part of my post: the commas separating each tweet also have to be removed.

Thank you!
From the thread that I was reading, the poster mentioned the same thing. When I came back here, I felt a bit silly because the answer had already been given and I had overlooked it.

@steevej
After removing the commas (using jq), I am now faced with another error:

“{Failed: error processing document #1: invalid character ‘\x1b’ looking for beginning of value}”

I tried to reformat my file with jq using the --ascii-output switch to force ASCII-only output, but I still get this error. A quick Google search says it’s an escape character… but I really do not understand, because when I grep for this exact character (to hopefully remove it) I don’t see any output.
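
For what it’s worth, grepping for the literal two characters "\x1b" will not match the raw escape byte; something along these lines (the file name is just an example) should find it or strip it out:

# count lines containing the raw ESC byte (hex 1b)
grep -c $'\x1b' tweets.ndjson

# delete the byte everywhere (octal 033 is the same ESC byte)
tr -d '\033' < tweets.ndjson > tweets_clean.ndjson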

Can you think of something I may be doing wrong?

J

Hi @Jim_Labilles welcome to MongoDB and the community!

This error “{Failed: error processing document #1: invalid character ‘\x1b’ looking for beginning of value}” seems to correspond to the screenshot you posted earlier, where the field Tweet contains an emoji:

Could you copy that actual document and post it here (not the screenshot) so we can reproduce what you’re seeing?

On another note, it’s typically best not to post screenshots, since they’re not searchable, and on certain screen sizes they may be very difficult to read. Thanks!

Best regards
Kevin

Hello @kevinadi ,
Thank you for your reply.

Apologies for the screenshot. I’ll do as requested from here on.

Here is the exact snippet of the first two JSON tweets (lines 1-8):

{
  "Date": "2022-03-21T21:58:35.000Z",
  "Tweet": "#HomeGrowAmnesty\ud83d\udc68\u200d\ud83c\udf3e \n#SAPARLI #SAVotes #LegalLikeTomatoes https://t.co/KD9W1iove8"
}
{
  "Date": "2022-03-21T21:58:35.000Z",
  "Tweet": "RT @VitalSparkCapt: @SBSNews Squatters residing in Australia's Kirribilli House, complain that they are living in a dump, and intend to 'Lo\u2026"                                                                                                                                                        
}

See, when I use the --ascii-output switch in jq, I thought that would have corrected these emojis… I guess it didn’t.

Hi @Jim_Labilles

I tried creating an example document based on your posted example with different kinds of emojis, and was able to import it correctly using mongoimport.
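
For instance, the escaped form that --ascii-output produces is still valid JSON, so the escaped emojis themselves shouldn’t be what mongoimport is rejecting. A quick check (not specific to your data):

# jq escapes non-ASCII characters rather than removing them
echo '"👍"' | jq --ascii-output .
# prints "\ud83d\udc4d"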

I assume you tried to import this using mongoimport. Could you try updating the MongoDB tools by downloading them from the MongoDB Database Tools download page and try again?

If this is still not working, please post:

  • Your MongoDB version and your OS
  • The output of mongoimport --version

Best regards
Kevin

Hi @kevinadi
Yes, I did it using mongoimport in a Linux terminal.

mongoimport version: 100.5.2
git version: e2842eb549308c16e070fb2e197d4631cfd085cb
Go version: go1.16.7
   os: linux
   arch: amd64
   compiler: gc

Could it be because of the size?

J

@kevinadi
I’ve actually just tried importing that two-document snippet myself and I can confirm that it did indeed go through.

2022-03-23T19:39:36.317-0400    using write concern: &{majority false 0}
2022-03-23T19:39:36.317-0400    using 4 decoding workers
2022-03-23T19:39:36.317-0400    using 1 insert workers
2022-03-23T19:39:36.318-0400    will listen for SIGTERM, SIGINT, and SIGKILL
2022-03-23T19:39:37.044-0400    filesize: 520 bytes
2022-03-23T19:39:37.044-0400    using fields: 
2022-03-23T19:39:37.044-0400    connected to: mongodb+srv://[**REDACTED**]@cluster0.1zfo5.mongodb.net/election
2022-03-23T19:39:37.044-0400    ns: election.tweets_try
2022-03-23T19:39:37.078-0400    connected to node type: replset
2022-03-23T19:39:37.150-0400    2 document(s) imported successfully. 0 document(s) failed to import.

So I’m guessing it might be a size issue, because the only difference now is that the main file has approximately 3.7M objects in it, versus this one which has only two…

The main subject of this post has been resolved by @steevej, for which I am truly grateful, and thanks to @kevinadi for putting me on the right track in finally getting this solved.

Hi @Jim_Labilles

Size of the file should have nothing to do with invalid character errors. I created some example documents totalling 1.4 GB with some emojis sprinkled in and was able to import them all without errors.

If you’re using jq, I’m guessing that jq messed up the emoji codes somehow. That, or the program you used to modify this file mangled the emojis. Could you try to convert your file as per @steevej’s suggestion (remove commas, remove the [ and ]) without using jq?
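
For example, if the original file is pretty-printed with "[" alone on the first line, "]" alone on the last, and each tweet object closing with "}," on its own line, a sed one-liner along these lines should do the same conversion (file names are placeholders):

# drop the surrounding brackets and turn the "}," object separators into "}"
sed -e '1d' -e '$d' -e 's/^[[:space:]]*},[[:space:]]*$/}/' tweets_array.json > tweets.ndjson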

Best regards
Kevin

Hello @kevinadi,
To be honest, jq is the only tool I know how to use confidently at this stage… :sweat_smile:

But now that I’ve finally finished the whole process, I can definitely say that jq is the culprit (even after doing exactly as @steevej suggested). It is doing something in the background while converting the files. To be fair, maybe my methods are also wrong.

What worked for me is:

  1. I filtered the large file into 5 JSON files that contain only 1 key:value pair per file (a sketch of that kind of filter follows this list).
  2. Then I uploaded each file to the same collection, one by one.
  3. I successfully uploaded 46,900 documents into MongoDB, ready for analysis.
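
A jq invocation for that kind of per-field filtering might look roughly like this (field and file names are only illustrative, it assumes the input is already one document per line, and it would be run once per key):

# keep only the Date field of each document
jq -c '{Date}' tweets.ndjson > tweets_date.ndjson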

The JSON filtering with jq was done in a Kali terminal (my Ubuntu is in a login loop and needs to be fixed lol). I really think this process is making the mess. A few things I noticed:

1. If I filter out the key:value pairs (5 total) that I need from the main JSON file and save them all to one file, it saves to a very large file, 1.7 GB, while the original file was only 120 MB and contains at least 20 key:value pairs per object.
2. If I take just one key:value pair and save it to a file (separately), it only saves as a 1 MB file, so with 5 files they are only 5 MB in total.

Anyway… I hope this helps someone out who is probably as much of a ‘noob’ as me and is having the same issue.

Thanks again @kevinadi… the things I’ve learned along the way are so gratifying… they made the whole frustration in the beginning so… worth it!
