Identify duplicate names in a collection that may have been entered with spelling mistakes or synonyms

Brian_Henderson · March 27, 2023, 12:51pm

I have a list of candidates (with first name, last name and date of births) who been tested on multiple occasions and I now want to find those duplicate occasions however names may well have spelling mistakes and synonyms may have been used. Ideally I’d like to do a group by using information an Atlas Search index. Is this possible?

Jason_Tran · March 30, 2023, 11:05pm

Hi @Brian_Henderson - Welcome to the community

I’m not entirely sure of the use case details (including how to verify which documents would be duplicates) but perhaps the following documentation may help:

Would this work flow for identifying “duplicates” be used on an every now and then basis? Or would it be considered part of your application’s standard workload? Curious to know if it’s just going to be used irregular to update / remove duplicates or just for identifying duplicates.

If you could also provide some examples / demonstrations using sample documents that would be helpful as well.

Regards,
Jason

Brian_Henderson · March 31, 2023, 7:57am

Sorry I should give a bit more information. We work with schools and assess the same cohorts of students over a number of years but do not normally have a unique student identifier so we need the synonyms lookup (e.g. Elizabeth, Beth, Liz) as well as fuzzy logic matching. Ideally we would want to be able to group ‘on the fly’ to take the top table and run a query to generate the second:

Thanks

Is this possible?

Thanks

Jason_Tran · April 3, 2023, 5:46am

Thanks for providing those details Brian. Firstly, I’m not sure of all other requirements here but based purely off the sample documents and information provided, it may be possible to obtain the second table assuming the first table is a collection where the atlas search index exists.

In my example below for the student with a first name of "Elizabeth", I have just added an "assessment" field which contains the score for a particular assessment. This is not what I am recommending the schema to be and is only for demonstration purposes for Atlas search (and in particular, the compound operator, the text operator with fuzzy and synonyms options used).

Sample documents used in the test:

testdb> db.school.find({},{_id:0})
[
  {
    firstName: 'Elizabeth',
    lastName: 'Apple',
    DOB: ISODate("2000-01-01T00:00:00.000Z"),
    assessment: { a1: 70 }
  },
  {
    firstName: 'Liz',
    lastName: 'Apple',
    DOB: ISODate("2000-01-01T00:00:00.000Z"),
    assessment: { a2: 80 }
  },
  {
    firstName: 'ElizabethX',
    lastName: 'Apple',
    DOB: ISODate("2000-01-01T00:00:00.000Z"),
    assessment: { a3: 90 }
  }
]

I am using a default Atlas Search index (named the "scores" index) but have defined Synonym Mappings. For the synonym mappings, I’ve created the "nameSynonyms" mapping below:

testdb> db.schoolsynonyms.find({},{_id:0})
[ {
 mappingType: 'equivalent',
 synonyms: [ 'elizabeth', 'liz' ]
 } ]

Assigning "Elizabeth" to variable "queryFirstName" and details of the $search portion of the pipeline (assigned to var a):

var queryFirstName = 'Elizabeth'
var a = 
{
		$search: {
			index: 'scores',
			compound: {
				should:[
				{
					text: {
						query: queryFirstName,
						path: 'firstName',
						fuzzy : {}
					}
				},
				{
					text: {
						query: queryFirstName,
						path: 'firstName',
						synonyms:'nameSynonyms'
					}
				}
			]
		}
	}
}

Output (removed "_id" for brevity):

testdb> db.school.aggregate(a)
[
  {
    firstName: 'Elizabeth',
    lastName: 'Apple',
    DOB: ISODate("2000-01-01T00:00:00.000Z"),
    assessment: { a1: 70 }
  },
  {
    firstName: 'Liz',
    lastName: 'Apple',
    DOB: ISODate("2000-01-01T00:00:00.000Z"),
    assessment: { a2: 80 }
  },
  {
    firstName: 'ElizabethX',
    lastName: 'Apple',
    DOB: ISODate("2000-01-01T00:00:00.000Z"),
    assessment: { a3: 90 }
  }
]

I presume the question here was more based off identifying duplicate names using $search. How you uniquely group and identify a student is a different matter. Hopefully the above example provides a bit of help in determining if this works for your use case.

You may wish to consider cleaning duplicates prior to them being entered into MongoDB if possible. This may assist with inaccuracies that may appear with the above method (for example - consider if an entry had the first name “Lizx”).

Regards,
Jason