I have two collections of around 100K documents each on a self-hosted MongoDB server (version 4.4, running on an old Ubuntu 16.04 machine).
I need to correlate the two collections using a “title” field, which is a string and is present in both.
Unfortunately, I have no guarantee that the two strings will be identical. There can be extra punctuation (e.g. quotes) in one but not the other, an accented character in one (e.g. “à”) but a plain “a” in the other, and so on.
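To make the kind of variation concrete, here is a minimal sketch of how such differences could be normalized away before any comparison. It uses only the Python standard library; `normalize_title` is a hypothetical helper name, and the choices to drop all punctuation and to ignore case are assumptions that may not suit every dataset.

```python
import unicodedata


def normalize_title(title: str) -> str:
    """Hypothetical helper: reduce a title to a comparison-friendly form."""
    # Decompose accented characters, e.g. "à" becomes "a" + a combining accent.
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop combining marks (nonzero combining class) and any character
    # whose Unicode category is a punctuation category (starts with "P").
    kept = [
        ch
        for ch in decomposed
        if not unicodedata.combining(ch)
        and not unicodedata.category(ch).startswith("P")
    ]
    # Ignore case and collapse runs of whitespace into single spaces.
    return " ".join("".join(kept).casefold().split())
```

With this sketch, `normalize_title('“Voilà”, le  Film!')` and `normalize_title('voila le film')` produce the same string, so titles that differ only in quotes, accents, or case would compare as equal.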
In Python I usually use the rapidfuzz library (similar to fuzzywuzzy) to compute a similarity score between two strings, then apply a threshold (say 85 out of a maximum of 100) to decide whether the strings are likely the same despite some variation.
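A minimal sketch of that threshold approach, for illustration only. It assumes `rapidfuzz.fuzz.ratio` as the scorer and falls back to the standard library's `difflib` (a different algorithm, so scores will not be identical) when rapidfuzz is not installed. `similarity`, `best_match`, and `THRESHOLD` are hypothetical names.

```python
import difflib

try:
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        # rapidfuzz score in the range 0..100.
        return float(fuzz.ratio(a, b))

except ImportError:

    def similarity(a: str, b: str) -> float:
        # Stand-in scorer: difflib's ratio is 0..1, scaled here to 0..100.
        return 100.0 * difflib.SequenceMatcher(None, a, b).ratio()


THRESHOLD = 85  # assumed cutoff, taken from the example value above


def best_match(title, candidates, threshold=THRESHOLD):
    """Return (candidate, score) for the highest-scoring candidate,
    or None if no candidate reaches the threshold."""
    scored = ((c, similarity(title, c)) for c in candidates)
    best = max(scored, key=lambda pair: pair[1], default=None)
    if best is None or best[1] < threshold:
        return None
    return best
```

Scoring every title in one collection against every title in the other means on the order of 10^10 comparisons for two collections of 100K documents, so this brute-force version only pins down the matching rule and says nothing about how to do it efficiently.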
Please advise on how you would tackle this. Thanks a lot.