E11000 duplicate key error collection - is my bulk_write usage correct?

Hello,

I’m working with a match string to make sure I avoid duplicates. I assumed it would be a good idea to enforce that uniqueness in the DB, so I created a unique index on my match string field.
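
The index was created along these lines (sketch, not my exact code; the field and index names are the ones that show up in the error output further down):

    import pymongo

    # Unique index on the match string field; the server rejects any second
    # document with the same match_zip_street_date_time value.
    events_golden_collection.create_index(
        [("match_zip_street_date_time", pymongo.ASCENDING)],
        unique=True,
        name="match_zip_street_date_time_index",
    )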

Now, for data loads, I sometimes want to do updates via bulk writes (PyMongo).

    from datetime import datetime
    import pymongo

    # Inside the loop over events: build one upsert per event and collect them.
    event_set_upsert = pymongo.UpdateOne(
        {"events_id": event.get("_id")},                     # match on the source event's _id
        {
            "$setOnInsert": {"created_at": datetime.now()},  # only set when a new document is inserted
            "$set": event_set,
        },
        upsert=True,
    )
    event_golden_upsert_list.append(event_set_upsert)

After adding all the events that I’d like to update to the list, I attempt a bulk write:

    from pymongo.errors import BulkWriteError

    try:
        events_golden_collection.bulk_write(event_golden_upsert_list, ordered=False)
    except BulkWriteError as bwe:
        print("ERROR: Bulk Write Error")
        print(bwe)

Now it seems as if none of my updates are executed when there is a single bulk write error.
How can I make sure that the updates are still executed for the matching IDs, and just avoid inserting new documents with the same match string?

Hi @Chris_Haus, can you please share the traceback you are getting, and also a representative event and event_set?

----- Loop: 47 Event: 63dfd2e5b162cd1e47e16338
UPSERT: 63dfd2e5b162cd1e47e16338 - 18059_2023-02-18_00-00

----- Loop: 48 Event: 63dfd2e5b162cd1e47e16339
UPSERT: 63dfd2e5b162cd1e47e16339 - 18057_2023-02-23_00-00

----- Loop: 49 Event: 63dfd2e5b162cd1e47e1633a
UPSERT: 63dfd2e5b162cd1e47e1633a - 18059_2023-02-24_00-00

----- Loop: 50 Event: 63dfd2e5b162cd1e47e1633b
UPSERT: 63dfd2e5b162cd1e47e1633b - 18057_2023-02-24_00-00


ERROR: Bulk Write Error

E11000 duplicate key error collection: events_golden index: match_zip_street_date_time_index dup key: { match_zip_street_date_time: "18057_2023-02-24_00-00" }

E11000 duplicate key error collection: events_golden index: match_zip_street_date_time_index dup key: { match_zip_street_date_time: "18057_2023-02-24_00-00" }

E11000 duplicate key error collection: events_golden index: match_zip_street_date_time_index dup key: { match_zip_street_date_time: "18057_2023-02-24_00-00" }

E11000 duplicate key error collection: events_golden index: match_zip_street_date_time_index dup key: { match_zip_street_date_time: "18057_2023-02-24_00-00" }

Just some excerpts from the data structure and the print output.
I’m looping through the events and checking whether to put them into the golden records collection.
So for every event I check (roughly sketched after this list):

  • does the ID exist in the target collection? → aim for an update
  • does the match string exist in the target collection? → skip this record
  • if both are false → aim for an insert
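
That check looks roughly like this (simplified sketch, not my exact code; build_update and build_insert stand in for the code that builds the actual operations):

    for event in events:
        match_string = event.get("match_zip_street_date_time")

        if events_golden_collection.find_one({"events_id": event.get("_id")}):
            # ID exists in the target collection -> aim for an update
            event_golden_upsert_list.append(build_update(event))    # hypothetical helper
        elif events_golden_collection.find_one({"match_zip_street_date_time": match_string}):
            # match string already exists -> skip this record
            continue
        else:
            # neither exists -> aim for an insert
            event_golden_upsert_list.append(build_insert(event))    # hypothetical helper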

When I print the errors, I iterate through all of them, and I expected an individual error per upsert element in the list.
But it seems as if one upsert failed and the same error is printed for every upsert element in the list.
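
For reference, the per-operation results can be read from bwe.details, which holds the raw bulk write result; something along these lines lists the failing operations individually (sketch, not my exact code):

    from pymongo.errors import BulkWriteError

    try:
        events_golden_collection.bulk_write(event_golden_upsert_list, ordered=False)
    except BulkWriteError as bwe:
        details = bwe.details
        # Counters for the operations that did go through.
        print("upserted:", details.get("nUpserted"), "modified:", details.get("nModified"))
        # One entry per *failed* operation; 'index' is the position in
        # event_golden_upsert_list and 'errmsg' carries the E11000 message.
        for err in details.get("writeErrors", []):
            print(err.get("index"), err.get("errmsg"))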

[UPDATE]

When I restructure the code to use individual inserts and updates, it works fine.
Only the bulk write seems not to apply the updates when there is a duplicate index value for any of the records to upsert.
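
The restructured version looks roughly like this (simplified sketch; event_set is the per-event update document built in the loop, and DuplicateKeyError comes from pymongo.errors):

    from datetime import datetime
    from pymongo.errors import DuplicateKeyError

    for event in events:
        try:
            # Same upsert as before, just one document at a time.
            events_golden_collection.update_one(
                {"events_id": event.get("_id")},
                {"$setOnInsert": {"created_at": datetime.now()}, "$set": event_set},
                upsert=True,
            )
        except DuplicateKeyError as dke:
            # Only the conflicting document is skipped; all other writes go through.
            print("Skipped duplicate:", dke)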

I think the difference when you’re doing the individual inserts and updates is that the datetime string ends up being different for each entry, avoiding the duplicates, because there is a delay for each write.

By default bulk_write operations are ordered and stop after the first error. To continue after an error you can use an unordered bulk_write by passing the ordered=False argument: Bulk Write Operations — PyMongo 4.3.3 documentation


OP’s code is already using ordered=False