Document Field-by-Field Comparison -> Insert/Update

Hey everyone!

I’m not very experienced with MongoDB, and I would like a suggestion.

I have a collection, it will be updated daily by automatically uploading an XLSX file in which each row represents a document of the collection and they will have some sort of unique identifier, imagine that the document “person” has the field “username” and at any given time in the whole collection there can be only one document with that username and the field “active” = true.

Imagine that the first XLSX is uploaded, the collection will be empty so all the documents will be inserted with “active” = true.

[
    {
        "userName": "User Name 1",
        "email": "email1@email.com",
        "active": true
    },
    {
        "userName": "User Name 2",
        "email": "email2@email.com",
        "active": true
    },
    {
        "userName": "User Name 3",
        "email": "email3@email.com",
        "active": true
    }
]

Collection after first upload.

The next day a second XLSX is uploaded containing the following data:

[
    {
        "userName": "User Name 1",
        "email": "email1@email.com",
        "active": null
    },
    {
        "userName": "User Name 2",
        "email": "new_email2@email.com",
        "active": null
    },
    {
        "userName": "User Name 4",
        "email": "email4@email.com",
        "active": null
    }
]

At this point, the first document (User 1) of the new XLSX file is identical to the existing document in the collection, so nothing needs to be done and I ignore it… The second document (User 2) changed his “email” so I need to change the “active” field from the existing document in the collection to false and insert a new document that will completely replace the existing information relative to User 2 with the new data in the XLSX file and “active” = true… The third document is new to the collection so it will be inserted normally as a new document with “active” = true.

The Resulting collection will be as follow:

[
    {
        "userName": "User Name 1",
        "email": "email1@email.com",
        "active": true
    },
    {
        "userName": "User Name 2",
        "email": "email2@email.com",
        "active": false
    },
    {
        "userName": "User Name 3",
        "email": "email3@email.com",
        "active": true
    },
    {
        "userName": "User Name 2",
        "email": "new_email2@email.com",
        "active": true
    },
    {
        "userName": "User Name 4",
        "email": "email4@email.com",
        "active": true
    }
]

I cannot update the previous document if something changed, is a project requirement, so I keep a sort of “history” of every single variation.

The “real” user information for a given username will be defined by the “active” flag.

What will be the best way to do this using MongoDB? Eficiency wise.

Any ideas about how to face face this situation?
Preferably MongoDB side, not by fetching all the data and comparing all fields backend side.

Thanks in advance for the help <3

To me

is not

Because active:true is not the same as active:false.

My bad, I thought that it was clear enough. When the data is imported is always active = false / null since active is defined inside the collection, not in the XLSX file. It will be active once if inserted, if the other fields change.

Initial file data:

[
    {
        "userName": "User Name 1",
        "email": "email1@email.com",
        "active": null
    },
    {
        "userName": "User Name 2",
        "email": "email2@email.com",
        "active": null
    },
    {
        "userName": "User Name 3",
        "email": "email3@email.com",
        "active": null
    }
]

Imagine that the first XLSX is uploaded, the collection will be empty so all the documents will be inserted with “active” = true.

Once inserted in the collection:

[
    {
        "userName": "User Name 1",
        "email": "email1@email.com",
        "active": true
    },
    {
        "userName": "User Name 2",
        "email": "email2@email.com",
        "active": true
    },
    {
        "userName": "User Name 3",
        "email": "email3@email.com",
        "active": true
    }
]

I understand the confusion, I intended identical except for the “active” field, I’ll correct it now.

At this point, the first document (User 1) of the new XLSX file is identical to the existing document in the collection, so nothing needs to be done and I ignore it… The second document (User 2) changed his “email” so I need to change the “active” field from the existing document in the collection to false and insert a new document that will completely replace the existing information relative to User 2 with the new data in the XLSX file and “active” = true… The third document is new to the collection so it will be inserted normally as a new document with “active” = true.

Any advice about the best way of optimizing a field-by-field comparison and depending on the result insert and/or update?

Your problem is not easy because you are not leveraging the flexibility of MongoDB.

You problem would be almost trivial if you would keep 1 document per userName and store the email in an array.

When a new email comes in, you simply $push it into the array. The active email is always the $last.

After initial insert you have the following documents.

{
        "userName": "User Name 1",
        "email": [ "email1@email.com" ]
} 
{
        "userName": "User Name 2",
        "email": [ "email2@email.com" ] 
}
{
        "userName": "User Name 3",
        "email": "email3@email.com"
}

The on the next day. You can easily end up with.

{       /* this document is unchanged */
        "userName": "User Name 1",
        "email": [ "email1@email.com" ]
} 
{       /* here a new email is pushed, new_email2@email.com
           is active because it is last and other one is not active 
           because it is not last. */
        "userName": "User Name 2",
        "email": [ "email2@email.com" , "new_email2@email.com" ] 
}
{       /* it is not clear what happen to an existing userName when
           it is absent from an update. */
        "userName": "User Name 3",
        "email": [ "email3@email.com" ]
}
{       /* new document */
        "userName": "User Name 4",
        "email": [ "email4@email.com" ]
}

As I write, I experiment with Compass and may be the following is easier to implement.
First upload:

{
        "userName": "User Name 1",
        "current_email": "email1@email.com",
        "email_history": [ "email1@email.com" ]
} 
{
        "userName": "User Name 2",
        "current_email": "email2@email.com",
        "email_history": [ "email2@email.com" ] 
}
{
        "userName": "User Name 3",
        "current_email": "email3@email.com"
        "email_history": [ "email3@email.com" ]
}

after 1st update you end up with

{       /* this document is unchanged */
        "userName": "User Name 1",
        "current_email": "email1@email.com",
        "email_history": [ "email1@email.com" ]
} 
{       /* here a new email is pushed, new_email2@email.com
           and becomes current/active email */
        "userName": "User Name 2",
        "current_email": "new_email2@email.com",
        "email_history": [ "email2@email.com" , "new_email2@email.com" ] 
}
{       /* it is not clear what happen to an existing userName when
           it is absent from an update so I assume a no-op. */
        "userName": "User Name 3",
        "current_email": "email3@email.com",
        "email_history": [ "email3@email.com" ]
}
{       /* new document */
        "userName": "User Name 4",
        "current_email": "email4@email.com",
        "email_history": [ "email4@email.com" ]
}

Despite repeating current_email as $last of email_history, I am pretty sure that model is efficient space wise and performance wise as your original. In the original, the old document has to be written with active false at the same time the new document is written. Using the same document with history, only 1 document has to be written back.

I am still thinking about how to implement your original requirement because it is hard and hard problems are more interesting but I think a schema change would be better.

I can’t modify the same document; the requirement was having a single document for each modification. They wanted a full history of everything.

Also, the document is not that small, it’s not enormous but it has like 10 fields, I would end up with who knows how many arrays, but mostly, the imported documents could be out of order, let me explain that…

In this project, X times a month someone (or an automation script) will upload a document that will be read, each row will be a document and need to be inserted in the collection.

  • The fields may change and every time that something changes, I have to flag the existing document as “active” and the new document will take his place.

  • The uploaded file has a “validFrom” general to the document, representing from which point in time that document will be valid, so this modification even if is uploaded today, shouldn’t be considered until we pass “validFrom” in time.

{
“userName”: “User Name 1”,
“validFrom”: “2022/01/01”,
“email”: “email1@email. com”,
“active”: false
}
{
“userName”: “User Name 1”,
“validFrom”: “2022/01/01”,
“email”: “email12345@email. com”,
“active”: true
}
{
“userName”: “User Name 1”,
“validFrom”: “2022/03/01”,
“email”: “email2@email. com”,
“active”: true
}
{
“userName”: “User Name 1”,
“validFrom”: “2022/03/10”,
“email”: “email3@email. com”,
“active”: true
}

At a given time you could have 10 documents active for a single “userName” as long as the “validFrom” is different. And doing this inside an array in a single document would mean having for each field an array of objects each of them with different “validFrom” and “active” properties also loosing the information related to the “uploadProperties”, one of the fields is an object with information about the document upload, like date, file, user and so, so that they can recreate if something goes wrong which upload or filed was wrong, when was it uploaded, all the documents of that single upload, etc.

I know that’s not an easy task and like you, I like hard questions :wink: that’s why I keep asking.