Hi!
I'm writing a service that looks up US addresses in our DB.
Two of the requirements for the address lookup are:
- The received address might contain spelling mistakes.
- The state may be written as a full name or as an abbreviation (New York and NY, for example).
We first get addresses from an external source and then save them in our DB for the next time they are needed.
The address we find doesn't have to be an exact match. If we find another address on the same street, for example, we are OK with that, because what we really want is the county of that address (a regional local governmental administrative division), and two addresses on the same street will have the same county.
This is an example of an address document:
city: "California City"
country: "United States"
countryCode: "USA"
county: "Kern"
state: "California"
stateCode: "CA"
street: "California City Blvd"
zip: "93505-2234"
We also have a synonyms collection with all the states' names and abbreviations (including an entry for United States that has USA and US in it).
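For readers unfamiliar with Atlas Search synonyms, the documents in a synonyms source collection generally have the shape sketched below. The entries are illustrative assumptions (the constant name `exampleSynonymDocs` is hypothetical), not a dump of our actual collection.

```javascript
// Illustrative synonym documents in the Atlas Search synonyms source
// collection format. The concrete entries are assumptions for this example.
const exampleSynonymDocs = [
  // "equivalent" means every term in the list matches every other term.
  { mappingType: "equivalent", synonyms: ["United States", "USA", "US"] },
  { mappingType: "equivalent", synonyms: ["California", "CA"] },
  { mappingType: "equivalent", synonyms: ["New York", "NY"] },
];
```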
The search query is built dynamically, so if we don't get a "street", for example, no street clause is added to the search query.
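Roughly, the dynamic construction can be sketched as follows. This is a simplified illustration rather than our production code: the helper name `buildSearchStage` and the shape of the `address` argument are assumptions, while the paths and the synonym mapping name match the query shown in this post.

```javascript
// Hypothetical helper: builds the $search stage from whichever address
// parts were supplied. Missing parts simply produce no clause.
function buildSearchStage(address, synonymMapping = "address_synonym_mapping") {
  const fuzzyText = (query, path) => ({ text: { query, path, fuzzy: {} } });
  const synonymText = (query, path) => ({
    text: { query, path, synonyms: synonymMapping },
  });

  const must = [];
  const should = [];

  // City is the only hard requirement.
  if (address.city) must.push(fuzzyText(address.city, "city"));

  // Street and zip only get a fuzzy clause.
  if (address.street) should.push(fuzzyText(address.street, "street"));
  if (address.zip) should.push(fuzzyText(address.zip, "zip"));

  // Country and state get both a fuzzy clause and a synonym clause.
  for (const path of ["country", "state"]) {
    if (address[path]) {
      should.push(fuzzyText(address[path], path));
      should.push(synonymText(address[path], path));
    }
  }

  // Only include the sections that actually have clauses.
  const compound = {};
  if (should.length) compound.should = should;
  if (must.length) compound.must = must;
  return { $search: { compound } };
}
```

For example, `buildSearchStage({ city: "California City", zip: "93505" })` yields one "must" clause on city and one "should" clause on zip. A caller would still need to handle the case where no address part is supplied at all, since the resulting compound would be empty.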
This is the search query we came up with:
[
  {
    $search: {
      compound: {
        should: [
          {
            text: {
              query: "9628 California City Blvd",
              path: "street",
              fuzzy: {},
            },
          },
          {
            text: {
              query: "93505",
              path: "zip",
              fuzzy: {},
            },
          },
          {
            text: {
              query: "USA",
              path: "country",
              fuzzy: {},
            },
          },
          {
            text: {
              query: "USA",
              path: "country",
              synonyms: "address_synonym_mapping",
            },
          },
          {
            text: {
              query: "CA",
              path: "state",
              fuzzy: {},
            },
          },
          {
            text: {
              query: "CA",
              path: "state",
              synonyms: "address_synonym_mapping",
            },
          },
        ],
        must: [
          {
            text: {
              query: "California City",
              path: "city",
              fuzzy: {},
            },
          },
        ],
      },
    },
  },
  {
    $project: {
      state: 1,
      county: 1,
      city: 1,
      country: 1,
      street: 1,
      zip: 1,
      score: 1,
    },
  },
  {
    $addFields: {
      score: { $meta: "searchScore" },
    },
  },
]
We put city in the "must" section because we want to make sure that the found address is in the same city. Since the state can be received as an abbreviation or with a typo, searching for the state in the "must" section could lead to the address not being found at all, so we keep that search in the "should" section.
This search query finds the given address document with a search score of 1.1768810749053955.
We noticed that in the "fuzzy" searches, repeating part of the address bumps up the searchScore.
For example, the following city search gives a search score of 1.699939489364624:
{
  text: {
    query: "California City City City City City",
    path: "city",
    fuzzy: {},
  },
},
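One possible way to reduce this effect is to collapse repeated tokens in the input before building the query. The sketch below is only an illustration of that idea: the helper name `dedupeTokens` is hypothetical, and it would also collapse legitimately repeated names (a city such as "Walla Walla" would become "Walla"), so it is not a complete fix.

```javascript
// Hypothetical pre-processing step: drop repeated tokens (case-insensitive),
// keeping the first occurrence of each token and the original order.
function dedupeTokens(input) {
  const seen = new Set();
  return input
    .trim()
    .split(/\s+/)
    .filter((token) => {
      const key = token.toLowerCase();
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    })
    .join(" ");
}

// dedupeTokens("California City City City City City") returns "California City"
```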
We want to receive the most accurate address, but we also want a threshold that filters out addresses that are not accurate enough.
Because we might get addresses with only partial information, we want the threshold to be dynamic.
What should the threshold searchScore be in each case? Can we build it dynamically based on the input address we received, not only on whether the address is full or partial, but also on the address strings themselves?
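To make the question more concrete, this is the kind of thing we have in mind for the "full or partial" part. Everything in it is a placeholder: the function name `dynamicThreshold`, the per-field weights, and the `ratio` are made-up values, not measured scores, and it does not yet account for the address strings themselves.

```javascript
// Placeholder weights: a rough guess at how much each supplied field
// could contribute to the searchScore. These are NOT measured values.
const FIELD_WEIGHTS = {
  city: 1.0,
  street: 1.0,
  zip: 0.5,
  state: 0.25,
  country: 0.25,
};

// Hypothetical helper: the threshold grows with the number of supplied
// fields, so a partial address is held to a lower bar than a full one.
function dynamicThreshold(address, ratio = 0.6) {
  let expected = 0;
  for (const [field, weight] of Object.entries(FIELD_WEIGHTS)) {
    if (address[field]) expected += weight;
  }
  return expected * ratio;
}
```

Results whose `score` falls below `dynamicThreshold(inputAddress)` would then be discarded, for example with a `$match` stage after the score has been added.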