Atlas $search how to return only matching sub-documents

sabataitis · November 15, 2022, 11:04am

I am a little stuck,

I have users collection defined as such:

 "_id" : ObjectId("..."),
 ...,
  "profiles" : [ 
        {
            "_id" : ObjectId("..."),
            "type": "type",
            "firstName" : "abc",
            "lastName" : "abc",
            "pageName": "abc",
        },
        ...
      ],

Every user can have multiple profiles, I have created an index “profile_name” to search for profiles based on firstName, lastName and pageName.

$search: {
  index: "profile_name",
  compound: {
    must: [{ 
     wildcard: {
       query: ["*someQuery*"], 
       path: ["profiles.firstName", "profiles.lastName", "profiles.pageName"], 
       allowAnalyzedField: true
    }
  }
 ]
}
}

how can I return only those matching profiles instead of the entire document… Currently the whole document is returned (if document contains a profile that matches the criteria but it has 30 profiles it returns all 30 profiles)

since I cannot use $search after unwinding, how should I proceed?

Elle_Shwer · November 15, 2022, 1:10pm

Hi there, can you please share the index you created?

Yilmaz_Durmaz · November 15, 2022, 5:42pm

A multi-stage aggregation pipeline might help.

Use the indexed search as first stage to find related documents, then use $unwind stage on selected documents, then $match on the next stage again to get the related sub-documents, add a grouping signature then group together by that signature.

This is not a perfect solution though but might at least help for a while until you get a better solution.

sabataitis · November 16, 2022, 1:12pm

Here is my index

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "profiles": {
        "fields": {
          "_id": {
            "type": "objectId"
          },
          "firstName": {
            "analyzer": "diacriticFolder",
            "searchAnalyzer": "diacriticFolder",
            "type": "string"
          },
          "lastName": {
            "analyzer": "diacriticFolder",
            "searchAnalyzer": "diacriticFolder",
            "type": "string"
          },
          "pageName": {
            "analyzer": "diacriticFolder",
            "searchAnalyzer": "diacriticFolder",
            "type": "string"
          },
          "type": {
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "diacriticFolder",
      "tokenFilters": [
        {
          "type": "icuFolding"
        }
      ],
      "tokenizer": {
        "type": "keyword"
      }
    }
  ]
}

how can I return only the matched sub documents? using multi stage aggregation is not an option as I can not match matched sub documents without running the search query, which can only be ran once.

Elle_Shwer · November 16, 2022, 4:02pm

I’m not sure it is possible to only return matched sub documents. Will poke around on this. Perhaps highlighting would help?

sabataitis · November 16, 2022, 4:28pm

is there any way to return an _id of matched sub-document together with highlights?

sabataitis · November 17, 2022, 7:00am

or perhaps is there a way to sort the matching documents so they would appear at the top?

Yilmaz_Durmaz · November 17, 2022, 10:26am

I think you misunderstood the use cases of the search indexing. Actually of any indexing. Indexes are used for the preliminary elimination of unmatched documents. When we use an index for the first time, the resulting set of documents no longer are part of the index. that is also why we try to put any query that can use indexes at the top.

Processing indexes is fast and in case the result of this first stage satisfies your needs in 1 step then you can use the result as is. otherwise, you have to go through the remaining procedure of finding what you need. Unfortunately, the remaining part is nasty without indexes but not hopeless, because the ugly part lies only in the time required to complete operations.

The following aggregation pipeline does something similar to what you too should do. search by using the search index at the first stage, then unwind the profiles array, then search again with $match (nasty part, you need to use it on fields manually), then do whatever else you need to do in the remaining stages. I chose to flatten the result by replacing the root element. you can select fields and continue only with them

[
  {
    "$search": {
      "index": "profiles_names",
      "text": {
        "query": "typea",
        "path": {
          "wildcard": "*"
        }
      }
    }
  },
  {
    "$unwind": {
      "path": "$profiles",
      "preserveNullAndEmptyArrays": false
    }
  },
  {
    "$match": {
      "profiles.type": "typea"
    }
  },
  {
    "$replaceRoot": {
      "newRoot": "$profiles"
    }
  }
]

The above query does not need, but here are the data I used and the search index:

Click here to see the Data I used

[
  {
    "username": "usera",
    "profiles": [
      { "type": "typea", "pageName": "abc" },
      { "type": "typeb", "pageName": "cde" }
    ]
  },
  {
    "username": "userb",
    "profiles": [
      { "type": "typea", "pageName": "asd" },
      { "type": "typec", "pageName": "zxc" }
    ]
  },
  {
    "username": "userc",
    "profiles": [
      { "type": "typeb", "pageName": "jkl" },
      { "type": "typec", "pageName": "bnm" }
    ]
  }
]

Click here to see Search Index I used

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "profiles": {
        "fields": {
          "pageName": {
            "analyzer": "diacriticFolder",
            "searchAnalyzer": "diacriticFolder",
            "type": "string"
          },
          "type": {
            "analyzer": "diacriticFolder",
            "searchAnalyzer": "diacriticFolder",
            "type": "string"
          }
        },
        "type": "document"
      }
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "diacriticFolder",
      "tokenFilters": [
        {
          "type": "icuFolding"
        }
      ],
      "tokenizer": {
        "type": "keyword"
      }
    }
  ]
}

sabataitis · November 17, 2022, 10:33am

Thank you for taking your time to reply. I understand your proposal, the very problem I am facing is that my search index has removal of accents, which I can not $match again after $unwinding… I’ve followed the suggestions of highlighting to retrieve any unique field that I could identify nested profile by…

I have managed to implement a solution using highlighting, I am querying by fields of pageName, firstName and lastName. After that I am highlighting unique field of slug, which lets me identify the matching profile.

The problem with that though is when I enter a query: “karasukrainoje”, it finds the pageName, but only highlights the slug “karasukrainoje”, when the whole slug is “karasukrainoje.1”, is there any way to highlight all of it?

Yilmaz_Durmaz · November 17, 2022, 10:52am

I haven’t used accents/collation before but if you haven’t configured it to store full documents in the index, you would be getting your actual documents after the search. and then you could work on the result set as usual without fancy additions. Even if you stored them, I think they are stored as is, only the keywords for the index should be changing for the accents.

Can you give us some accented documents to work with so we may at least try to see it from your viewpoint?

Yilmaz_Durmaz · November 17, 2022, 11:14am

A quick note: Another option here is to redesign your document schema to store “profiles” in a separate collection. this will increase complexity but will give the capability to have better indexing on them.

sabataitis · November 17, 2022, 11:16am

I am sorry haven’t had the time to pull some documents to test with,

is there a way to store only matched sub-documents? I’ve tried storing documents with index but again it would store all of the profiles instead of the matching ones. Am I missing something

Yilmaz_Durmaz · November 17, 2022, 11:54am

Take your time about giving sample documents.

if your question is about my quick note of storing profiles independently, it goes like this:

Click here to expand to see possible migration steps

make a migrator app, fetch a document, for each profile create a profile document in a new “profiles” collection, get its _id and use it to replace profile array in main document, and after processing whole array patch the corresponding document (profiles field) in the database. this is the migration part.

drop current search index from main collection and create new search index on this new “profiles” collection.

rewrite your application to use “profiles” collection when you do search on profiles.

it involves at least 3 different operations, one involves changing main documents. so please be careful with the implementation since it can cause unwanted data loss

sabataitis · November 17, 2022, 12:01pm

im sorry the reply was in regards to your previous comment…

ref

Yilmaz_Durmaz · November 17, 2022, 12:29pm

Ah, ok then. This is what I mentioned:
Return Stored Source Fields — MongoDB Atlas

This way index will store those fields and will return them immediately instead of returning the full documents to which these fields belong.

But it seems, either way, you will need subsequent matching stages on documents returned from index search stage: https://www.mongodb.com/docs/atlas/atlas-search/performance/index-performance/#storing-source-fields

What I meant is, by the way, even though the index works with accents differently, the returned documents from this stage should retain their accents on which you may have a match stage with collation.

PS: I am not 100% confident how stored fields works as I haven’t used them before. I just trust the documentation

PS again: I might even be mostly wrong and this one might be what you need. please check on how this storing option works thoroughly.

Elle_Shwer · November 17, 2022, 7:56pm

Potentially can use a Materialized view for this? Tutorial here

Stored Source isn’t a bad idea either… but it’s kind of for a different use case? (e.g. looking for more performant queries)