Upload excel file to MongoDB collection

Hi everybody!

I’m trying to figure out how to upload an Excel file with data regarding, say, customers (i.e., name, surname, country and phone) and convert those Excel rows into MongoDB documents in the collection Customers.

I used the xlsx library to transform the Excel file to JSON on the client side, sent the JSON with the Fetch API, and then, on the server side, went from JSON to MongoDB. The problem is that when uploading big files (I think problems started at around 50k+ rows) the app just crashes and/or some objects don’t get saved. It also blocked everything: the user couldn’t do anything while the upload was running, and in the end the upload didn’t even finish.

Any ideas on how I should approach the issue?

I think the solution must involve Node.js streams in some way, but I don’t know… Maybe GridFS…
I work with Node.js on the backend (plain JS on the frontend) and Express, although a solution without Express would be better because at some point I’d like to leave it out.

Thanks!

Hi @kevin_Morte_i_Piferrer ,

I think the best way is to export the Excel file into CSV format with a header line and use the mongoimport tool to load it.

Thanks
Pavel

And in my code, should I use the child_process module of Node.js in order to execute command-line tools from the code?
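Something like this, for example (just a rough sketch; the connection string, database/collection names and file path are placeholders)?

import { execFile } from 'child_process';

// Rough sketch: run mongoimport from Node.js; all values below are placeholders.
const args = [
    '--uri', 'mongodb://localhost:27017',
    '--db', 'erp',
    '--collection', 'Customers',
    '--type', 'csv',
    '--headerline',
    '--file', './uploads/customers.csv'
];

execFile('mongoimport', args, (err, stdout, stderr) => {
    if (err) return console.error('mongoimport failed:', err);
    console.log(stdout || stderr);
});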

Hi @kevin_Morte_i_Piferrer , and welcome to the Community Forums!

Let me start by saying that there are many ways to perform this task; the variations would depend on the use case requirements.

As mentioned above, you could use mongoimport to upload JSON, CSV, or TSV to MongoDB. You could do this from the client side. If your use case requires the server to perform the upload, then you’d probably be better off performing this operation on your Node.js backend.

The issue that you are encountering seems to be related to uploading large files from the client app to the server app. Perhaps you need to look into a REST multipart/form-data upload instead.

GridFS is useful for storing files as blobs, not as documents. As your use case is to transform the content of files into MongoDB documents, GridFS is not going to be useful for you in this case.

Regards,
Wan.


Hi Wan, thanks for your comments.

I’m going to try to lay out the use case.

I’m building an app to manage company operations (purchases, sales, bills of materials, inventories, accountancy, etc.), also known as an ERP (enterprise resource planning) system.

Although the app provides forms to enter this data, it also offers the option of uploading data from an Excel or Open XML file (a spreadsheet).

My intention is not only to upload the data, but also to store the Excel file in an AWS S3 bucket (this is already done) so the user can keep track of the uploads they have performed. I also want to rewrite this Excel file, adding a timestamp to its name plus a column with the generated id and a comment if an item could not be written to the DB for any reason. That way the user only needs to download this Excel file to get that info.

That’s why I’m trying to perform this operation from the backend. Also, maybe I’m mistaken, but this way I don’t use too much of the client computer’s power, and I make sure that machines with little memory or old computers don’t affect the upload operation. But I may be wrong about this belief.

Now, I’m sending the file from the front end to the server like this:

this.sendBtn.addEventListener('click', async () => {

    const formData = new FormData();

    // APPEND EACH FILE TO THE FORMDATA
    this.files.forEach(f => formData.append(f.name, f));

    // PROVIDE JSON WITH PARAMS
    formData.append('module', this.module);

    const daoRes = await DaoUpdate.uploadElements(formData);

    if (!daoRes.success) {
        // REQUIRED: MANAGE EVENTUALITIES
        return;
    }
});

The static method of the class Dao is quite simple:

    // UPLOAD
    static uploadElements = async obj => {

        const res = await fetch(`${URL}/api/cloud/upload`, {
            method: 'POST',
            body: obj
        });

        return this._sendObj(res);
    };

And it works. I tried with a 140 MB CSV file: it doesn’t block the thread and the file gets to the S3 bucket. I think this is the proper way to tackle the issue, because, if I’m not mistaken, the FormData object already handles splitting the upload into digestible chunks of data, and that’s why the thread isn’t blocked. (Maybe I should test with bigger files, but the ~1M rows I’m uploading is, I think, the maximum.)

Now, the controller uses the formidable module, because the Node.js project seems to recommend it. See: How to Handle Multipart Form Data (official Node.js docs).

The whole function body looks like this:

// Assumes formidable, path, fs, unlink (from fs/promises), s3 and PutObjectCommand (AWS SDK v3) are imported above.
try {

    const user = req.user._id;

    // UPLOAD FILES TO APP
    const uploadsDir = path.join(path.resolve(), 'src/base/uploads');

    const options = {
        keepExtensions: true,
        maxFileSize: 200 * 1024 * 1024,
        uploadDir: uploadsDir,
        multiples: true
    };

    const data = formidable(options);

    // INFO:
    // data.parse rebuilds the file in the passed folder and, once it is done,
    // calls the callback function.
    return data.parse(req, async (err, fields, files) => {

        if (err) return console.info('err! ', err);

        // UPLOAD OBJECT
        const module = fields.module;

        const uA = Object.keys(files).map(async fName => {

            const fileRoute = files[fName].filepath;
            const fileStream = fs.createReadStream(fileRoute);

            const uploaded = await s3.send(new PutObjectCommand({
                Bucket: `ek-knote-${module}`,
                Key: `${user}/${fName}`,
                Body: fileStream
            }));

            // REMOVE FILE FROM APP
            if (uploaded.$metadata.httpStatusCode === 200) {
                try {
                    await unlink(fileRoute);
                } catch (err) {
                    console.info('Error unlinking uploaded file --> ', err);
                }
            }

            const obj = {
                status: uploaded.$metadata.httpStatusCode,
                fN: fName,
                extension: fName.match(/[^.]+$/g).pop()
            };

            return obj;
        });

        return res.status(201).send(await Promise.all(uA));
    });

} catch (err) {

    console.info('uploadElement (upload.js) ERROR --> ', err);
    return res.status(500).send();
}

Where I have problems is in parsing the file before unlinking it and starting to send JSON objects to MongoDB. This is where I’m stuck. The problem with CSV files is that the user would have to feed the app several files, because CSV is plain text and knows nothing about workbooks.
Also, it would be a better approach not to hold files on the server disk, because that scales poorly.
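For the CSV case, what I have in mind is roughly something like this (just a sketch: the batch size is a placeholder, the line split is naive and a real CSV parser should handle quoted fields, and "collection" is assumed to be an already-connected MongoDB collection):

import fs from 'fs';
import readline from 'readline';

// Rough sketch: stream a CSV file line by line and insert the rows in batches.
const insertCsvInBatches = async (filePath, collection, batchSize = 1000) => {

    const rl = readline.createInterface({
        input: fs.createReadStream(filePath),
        crlfDelay: Infinity
    });

    let headers = null;
    let batch = [];

    for await (const line of rl) {

        if (!headers) {
            // first line is the header row
            headers = line.split(',');
            continue;
        }

        // naive split; quoted fields containing commas need a proper CSV parser
        const values = line.split(',');
        const doc = Object.fromEntries(headers.map((h, i) => [h.trim(), values[i]]));
        batch.push(doc);

        if (batch.length >= batchSize) {
            await collection.insertMany(batch, { ordered: false });
            batch = [];
        }
    }

    if (batch.length) await collection.insertMany(batch, { ordered: false });
};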

Thanks!

Hi @kevin_Morte_i_Piferrer,

It seems to me that the issue is multiple CSV files that should be grouped as one?
If that’s the case, then perhaps the system needs a mechanism to keep track of multiple files as a single upload, i.e. metadata. This way, the files could be retrieved easily from S3 as a group.

Depending on your use case, architecture-wise you could try to separate the web app server and the processing server. Each uploaded file would be queued to be processed (parsed).
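For example, a rough sketch of that idea (the collection and field names are placeholders, and "db" is assumed to be an already-connected MongoDB Db instance):

// 1) The web app server only records that an uploaded file is waiting to be processed.
const enqueueUpload = (db, bucket, key) =>
    db.collection('uploadJobs').insertOne({
        bucket,
        key,
        status: 'pending',
        createdAt: new Date()
    });

// 2) A separate worker process claims pending jobs one at a time and parses them.
const claimNextUpload = db =>
    db.collection('uploadJobs').findOneAndUpdate(
        { status: 'pending' },
        { $set: { status: 'processing', startedAt: new Date() } },
        { returnDocument: 'after' }
    );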

Regards,
Wan.

No, that wasn’t the issue. The issue is rather that the user should not have to upload several files, only one file.

I’m not sure I understand your idea properly. But if I do, wouldn’t that mean splitting the server in two and letting the web app run on one and the Excel processing on the other?
If that’s the case, it wouldn’t change the fact that the Excel file is, indeed, loaded on a server. Besides, to keep things separated, I already process the Excel file in another thread to allow user interaction while processing big files.

Regardless of everything, is it advisable to perform insertOne in a loop? Because it seems to me that, in terms of Big O complexity, it is linear, which makes a million DB hits unmanageable.

Thank you for your insights!
Kevin

Hi @kevin_Morte_i_Piferrer

For your use case, it would be recommended to perform bulk operations instead. Performing batch/bulk write operations significantly reduces the number of network round trips from your application to the database server.
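For example, a minimal sketch with the Node.js driver ("collection" and "docs" are placeholders for a connected collection and the array of parsed rows):

// One bulkWrite call instead of one insertOne round trip per document.
const bulkInsert = (collection, docs) =>
    collection.bulkWrite(
        docs.map(doc => ({ insertOne: { document: doc } })),
        { ordered: false }   // unordered: the server keeps going past individual failures
    );

// usage: const result = await bulkInsert(customersCollection, parsedRows);
// result.insertedCount tells you how many documents made it in.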

Regards,
Wan.
