$search aggregation and index for filenames

I’ve been following a lot of the brilliant YouTube tutorials on search aggregations and have succeeded in getting my search form to work, but I am having some difficulty with fuzzy search on filenames.

To be clearer, here are my search index and my Node code.

Search index (I am searching through an array of emails, fileObjects.fileName, and a string message):


{
  "mappings": {
    "dynamic": false,
    "fields": {
      "emailTo": {
        "dynamic": false,
        "fields": {
          "email": {
            "type": "string"
          }
        },
        "type": "document"
      },
      "fileObjects": {
        "dynamic": false,
        "fields": {
          "fileName": {
            "type": "string"
          }
        },
        "type": "document"
      },
      "message": {
        "type": "string"
      }
    }
  }
}

The search tester works in the atlas console.

But I have a problem when searching for filenames with extensions. For example, take the filename “document.docx”.

If I search for “document”, nothing is returned; if I search for “document.do”, fuzzy search finds the doc. My guess is that this has something to do with the fuzzy logic, but can I use a wildcard here without forcing the user to insert a * in their search query?

Here is my aggregation query using the Node SDK. What can I change in the fuzzy properties so that it basically ignores everything after the . in the filename? I do still want to allow the user to search for docx as well. Thanks for your help.

const agg = [
    {
        '$search': {
            'index': 'quicksend',
            'text': {
                'query': data.query,
                'path': ["emailTo.email", "message", "fileObjects.fileName"],
                'fuzzy': {
                    "maxEdits": 1,
                    "maxExpansions": 50
                }
            }
        }
    },
    {
        '$match': {
            'senderId': data.senderId
        }
    }
];

Hi @Rishi_uttam,

Yes, we do have a wildcard search, in which you can specify document.* to return all the matches for this query. Also, is it possible to merge your $match query into the $search stage? Performance Considerations for Atlas Search discusses $match and other stages placed after $search and their performance implications. That documentation also has general guidelines on Atlas Search performance.

Thanks,
Darshan

Thanks

About the performance, do you mean I should put the $match before $search? I was told that $search should be the first stage in the pipeline. How else can I filter the search for a given ID? I’m not sure how to use the compound operator as a replacement for $match.

And about the fuzzy wildcard, where do I specify document.*?

Something like this for compound? It does not work or return any results.

'compound' : {

                    'must' : [{

                        'senderId' : data.senderId

                    }]

                }

Hi Rishi,

Apologies for the delay.

$search should be the first stage in the aggregation pipeline in order to utilise the Atlas Search index. What I mean is that you can use Atlas Search itself to do the filtering that your $match performs:

[
    {
        '$search': {
            'index': 'quicksend',
            'compound': {
                'must': [
                    {
                        'text': {
                            'query': "text",
                            'path': ["emailTo.email", "message", "fileObjects.fileName"],
                            'fuzzy': {
                                "maxEdits": 1,
                                "maxExpansions": 50
                            }
                        }
                    },
                    {
                        "range": {
                            "path": 'senderId',
                            "gte": data.senderId,
                            "lte": data.senderId
                        }
                    }
                ]
            }
        }
    }
]

Further, in order to support the above query, you should edit your search index to add senderId (I believe it is a number data type).
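
As a rough illustration of merging the $match into $search, the pipeline above could be built by a small helper. This is only a sketch: `buildSearchPipeline` is a hypothetical name, senderId is assumed to be stored as a number and mapped in the index, and the sender restriction is placed in `filter` rather than `must` (an assumption on the grounds that it should narrow the results without contributing to the score).

```javascript
// Hypothetical helper: builds the aggregation pipeline for one search request.
// Assumes senderId is a number and is mapped in the "quicksend" search index.
function buildSearchPipeline(query, senderId) {
  return [
    {
      $search: {
        index: 'quicksend',
        compound: {
          // Scored clause: fuzzy text match on the user-facing fields.
          must: [
            {
              text: {
                query: query,
                path: ['emailTo.email', 'message', 'fileObjects.fileName'],
                fuzzy: { maxEdits: 1, maxExpansions: 50 },
              },
            },
          ],
          // Non-scoring clause: restrict results to a single sender by using
          // a range whose lower and upper bounds are the same value.
          filter: [
            { range: { path: 'senderId', gte: senderId, lte: senderId } },
          ],
        },
      },
    },
  ];
}
```

It would then be passed to the driver as, for example, `collection.aggregate(buildSearchPipeline(data.query, data.senderId))`, where `collection` stands for whatever collection handle the application already has.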

Let me know if the above query works after creating the search index.

Thanks,
Darshan

Hi, thanks.

I have a question: why did you use a “range” with gte and lte for my senderId? There is no range in the sender ID; it’s a unique ID. And how can I add a wildcard to this? I hope you can help me.

Currently, this is what I have. It seems to work, but I still need a wildcard, as it doesn’t seem to catch simple items unless I type the word out fully.

  1. Using $search
  2. Compounding with a must text clause on senderId
  3. Filtering with text and fuzzy

This is instead of the range you suggested.

{
                '$search': {
                    'index': 'quicksend',

                    "compound": {
                        "must": [{
                            "text": {
                                "query": data.senderId,
                                "path": "senderId"
                            },

                        }],

                        "filter": [{
                            "text": {
                                'query': data.query,
                                'path': ["emailTo.email", "message", "fileObjects.fileName"],
                                'fuzzy': {
                                    "maxEdits": 2,
                                    "maxExpansions": 50
                                }
                            }
                        }]



                    }


                }
            },

I wanted to add an update in case the above was not clear.

I have a fileName field (for which I have already created a search index).

Example:
fileName : ‘Pitch.pdf’

When searching for ‘pitch’ or ‘Pitch’ there is no result;
only when searching for ‘pitch.p’ does it find the fileName.

This is because of the maxEdits of 2, but how do I search within strings, so that searching for ‘pitch’ will find the ‘pitch.pdf’ result?

Thanks.

Hi Rishi,

As per your query, you are using text for senderId, but my understanding is that it is a number. In order to match a number in Atlas Search, we have to use range with gte and lte set to the same value. Make sure you have the entry below in your index definition:

 "senderId": {
        "representation": "int64",
        "type": "number"
      }
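
For reference, here is a sketch of how that entry might sit alongside the mappings from the first post. It is written as a JavaScript object so it can be serialized with `JSON.stringify` and pasted into the index JSON editor; the variable name is made up, and the int64 representation is only correct if senderId really is stored as an integer.

```javascript
// Sketch of the complete index definition, assuming senderId is a 64-bit integer.
// Everything except the senderId entry is copied from the original index.
const quicksendIndexDefinition = {
  mappings: {
    dynamic: false,
    fields: {
      emailTo: {
        type: 'document',
        dynamic: false,
        fields: { email: { type: 'string' } },
      },
      fileObjects: {
        type: 'document',
        dynamic: false,
        fields: { fileName: { type: 'string' } },
      },
      message: { type: 'string' },
      // New entry so that range queries on senderId can use the index.
      senderId: { type: 'number', representation: 'int64' },
    },
  },
};
```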

And how can i add wildcard to this?

To use a wildcard in the search term, you can use the wildcard operator. If you want to use a wildcard for the path, you can use something like the following:

text: {
    "query": data.senderId,
    path: [{ wildcard: 'anotherField.*' }]
}

I have tested your requirement. Please note that Atlas Search for strings is case-insensitive. To search for words containing full stops or commas, you can make use of wildcard:

MongoDB Enterprise mflix-shard-0:PRIMARY> db.movies.insert({ title:"pursuit.of.happiness",plot:"Pursuit.mov"})
WriteResult({ "nInserted" : 1 })
MongoDB Enterprise mflix-shard-0:PRIMARY> 

MongoDB Enterprise mflix-shard-0:PRIMARY> db.movies.aggregate([ {$search : { index:"staticIndex1", wildcard:{query:"pursuit*",path:"plot","allowAnalyzedField": true}}},{$project:{_id:0,plot:1}}])
{ "plot" : "Pursuit.mov" } 
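
In Node, the same idea can be applied without asking the user to type the * themselves. The sketch below uses hypothetical helper names. It escapes the characters that the wildcard operator treats specially (`*`, `?` and `\`) so that user input is matched literally, then appends the trailing `*`. Lowercasing the input is an assumption: the wildcard pattern does not appear to be run through the analyzer, so it has to match the lowercased tokens that the default analyzer stores.

```javascript
// Escape the characters the wildcard operator treats specially (*, ?, \)
// so that they are matched literally when they come from user input.
function escapeWildcard(input) {
  return input.replace(/[\\*?]/g, '\\$&');
}

// Hypothetical helper: turns raw user input into a prefix-style wildcard clause.
function buildWildcardClause(userQuery) {
  // Assumption: indexed tokens are lowercase, so the pattern is lowercased too.
  const term = escapeWildcard(userQuery.trim().toLowerCase());
  return {
    wildcard: {
      query: term + '*',
      path: ['emailTo.email', 'message', 'fileObjects.fileName'],
      allowAnalyzedField: true,
    },
  };
}
```

The returned object can be used directly as the body of `$search` (next to `index`) or as one clause inside `compound`.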

I hope this answers your question.

Thanks,
Darshan

Thanks, it works! My final aggregation is as follows. In this case, I don’t need fuzzy edits if I am using wildcard?

Can I improve on the aggregation below? I am using compound with must and filter, but I see you did not use these in your examples.

    const agg = [
    {
        '$search': {
            'index': 'quicksend',
            "compound": {
                "must": [{
                    "text": {
                        "query": data.senderId,
                        "path": "senderId"
                    },
                }],
                "filter": [{
                    "wildcard": {
                        'query': data.query+'*',
                        'path': ["emailTo.email", "message", "fileObjects.fileName"],
                        "allowAnalyzedField": true
                    }
                }]
            }
        }
    },

I noticed that this fails for the search below:

query : ‘19058.pdf’
path : ‘ITEM LABELS FOR 19058.pdf’

In this case no results are returned, but if I query ‘19058’, it works because of the * at the end of the query. I tried adding a * in front of the query but got no results either.

Hi Rishi,

Thanks for your response. Glad to know that wildcard is serving your requirement.

Can i improve on the aggregation below? i am using compound with must and filter, but i see you did not use these in your examples.

I do not think any improvement is needed, unless you want to merge them together:

const agg = [
    {
        '$search': {
            'index': 'quicksend',
            "compound": {
                "must": [{
                    "text": {
                        "query": data.senderId,
                        "path": "senderId"
                    },
                }, {
                   "wildcard": {
                        'query': data.query+'*',
                        'path': ["emailTo.email", "message", "fileObjects.fileName"],
                        "allowAnalyzedField": true
                    }
                }]
            }
        }
    },

My apologies for any bracket or indentation imbalances.

path : ‘ITEM LABELS FOR 19058.pdf’

I am confused by a path name containing whitespace. Can you please confirm? Also, if you are using query inside wildcard, you need to provide the regex. Can you please provide the query you are using?

Thanks,
Darshan

For example :

"wildcard": {
    'query': data.query+'*',
    'path': ["emailTo.email", "message", "fileObjects.fileName"],
    "allowAnalyzedField": true
}

If a fileName in the collection is ‘HL-9001 72X108 MAP.jpg’, searching for ‘9001’ returns a result, but searching for HL-9001 does not return any results.

Can you see what I need to change in the search query so that HL-9001 also returns a result?
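
One possible explanation, assuming the index uses the default standard analyzer: the file name is split into separate lowercase tokens at spaces and hyphens (roughly `hl`, `9001`, and so on), and a wildcard pattern has to match a single token, so `HL-9001*` never matches. A direction that may be worth trying is to index fileName as one lowercase token with a custom analyzer and then search with a wildcard on both sides. The sketch below is untested against this data set. The analyzer name and helper name are made up, the index fragment shows only the fileName mapping, and leading wildcards can be slow on large collections.

```javascript
// Assumed index change: analyze fileName as ONE lowercase token instead of
// splitting it on spaces and hyphens. "keywordLowercase" is a made-up name.
const fileNameIndexSketch = {
  analyzers: [
    {
      name: 'keywordLowercase',
      tokenizer: { type: 'keyword' },
      tokenFilters: [{ type: 'lowercase' }],
    },
  ],
  mappings: {
    dynamic: false,
    fields: {
      fileObjects: {
        type: 'document',
        dynamic: false,
        fields: {
          fileName: { type: 'string', analyzer: 'keywordLowercase' },
        },
      },
    },
  },
};

// Hypothetical helper: "contains" search on the whole file name.
function buildContainsClause(userQuery) {
  // Lowercase to match the lowercase token; escape *, ? and \ in user input.
  const escaped = userQuery.trim().toLowerCase().replace(/[\\*?]/g, '\\$&');
  return {
    wildcard: {
      query: '*' + escaped + '*',
      path: 'fileObjects.fileName',
      allowAnalyzedField: true,
    },
  };
}
```

With this shape, `buildContainsClause('HL-9001')` produces the pattern `*hl-9001*`, which is intended to match the single token `hl-9001 72x108 map.jpg`.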