Lookup and match performance - Joining collections

Urbancsik_Gergely · April 7, 2024, 6:36pm

Hello.

I have two collections:
Annotations
Project

We are developing an application in C# where we want to provide complex filtering capabilities.
the following query is not complex, but we already run into performance problems.

So.
The c# code:

var filterResponse = await applicationRepository.AsQueryable<Domain.Annotation.Annotation>()
    .Join(
        applicationRepository.AsQueryable<Project>(),
        a => a.Project,
        p => p.Id,
        (annotation, project1) => new
        {
            annotation,
            project1,
        })
    .Join(
        applicationRepository.AsQueryable<Subject>(),
        p => p.annotation.Subject,
        s => s.Id,
        (arg1, subject) => new FilterResponse
        {
            Annotation = arg1.annotation,
            Project = arg1.project1,
            Subject = subject,
        })
    .Where(f => f.Subject.SubjectName == "adhesion assay")
    .ToListAsync();

the translated query:

 "pipeline":[
      {
         "$project":{
            "_outer":"$$ROOT",
            "_id":0
         }
      },
      {
         "$lookup":{
            "from":"projects",
            "localField":"_outer.project",
            "foreignField":"_id",
            "as":"_inner"
         }
      },
      {
         "$unwind":"$_inner"
      },
      {
         "$project":{
            "annotation":"$_outer",
            "project1":"$_inner",
            "_id":0
         }
      },
      {
         "$project":{
            "_outer":"$$ROOT",
            "_id":0
         }
      },
      {
         "$lookup":{
            "from":"subjects",
            "localField":"_outer.annotation.subject",
            "foreignField":"_id",
            "as":"_inner"
         }
      },
      {
         "$unwind":"$_inner"
      },
      {
         "$project":{
            "Annotation":"$_outer.annotation",
            "Project":"$_outer.project1",
            "Subject":"$_inner",
            "_id":0
         }
      },
      {
         "$match":{
            "Subject.Subject":"adhesion assay"
         }
      }
   ],

There are 1-1 connections between the tables.
The query is very slow.
I tried to put an index on the subject field, but it doesn’t use it.
On another site I read up on why this might be:

Is this really the case ? can the mongo db not optimize this ?

how can this query be written in a more optimal way , if I want to continue to start from the annotation table?
So the Concept is to do this 3 table concatenation , and you can write filters for practically any field.
Is it possible to implement this in a way that works fast ?

Thank you in advance for your help and reply.

steevej · April 12, 2024, 1:45am

It looks like you have 3, Annotations, projects and subjects. Please clarify. Please provide sample documents and the expected results. This will help us understand your use-case. The resulting pipeline seems very complicated and creates a very deep structure which is frankly hard to follow since it reuses some temporary fields such as _inner and _outer.

That is understandable because your $match is the last stage, when the documents reach that stage they are not the indexed document anymore. Even after your first $project, the documents are so different from the source collection that no index can be use past this. You should start with the $match and with the subjects collection first.

Any reasons why you do:

Urbancsik_Gergely:

{
         "$project":{
            "_outer":"$ROOT",
            "_id":0
         }
      },
      {
         "$lookup":{
            "from":"projects",
            "localField":"_outer.project",
            "foreignField":"_id",
            "as":"_inner"
         }
      },

rather than simply

      {
         "$lookup":{
            "from":"projects",
            "localField":"project",
            "foreignField":"_id",
            "as":"_inner"
         }
      },

These extra $project take memory, especially since you do it for all documents, that all need to be fetched.

Urbancsik_Gergely · April 14, 2024, 9:59am

Hello @steevej
First of all, thanks for your answer.
Clarifications:
Our goal is to build dynamic queries through the UI.
So the user can appear on a web interface with predefined fields for which they can define search criteria.
They can put the order of the fields on the UI as you like.
and each opreation can be defined in different ways, so very, very complex cases can arise.

The above query was generated by the C# mongo driver, based on the expression described…
I think a big part of the problem is that you can search in different collections at the same time, so you have to join them somehow ( lookup / join )
if all searchable fields were stored in a collection ( read model ), mongo would be much easier.
Also.
Since the search conditions can be for multiple tables, and as I read earlier, should start with ‘match’ in the pipeline, I see that it won’t work to join the tables involved in the search and execute the search at the end.
Instead, perhaps the solution is something like bringing one of the filtering conditions up front, and then performing the JOIN operation on the result of that.
I hope you understand the problem and what we are trying to solve.
hope you can suggest which direction might be the best in terms of performance.

Example documents:
Annotation:

{
  "_id": {
    "$oid": "5a8444f2446a2e38db07f5d2"
  },
  "consumer": "hubscience.com",
  "annotator_schema_version": "1.0",
  "updated": {
    "$date": "2018-02-14T14:17:22.104Z"
  },
  "quote": "membrane",
  "MainCategory": {
    "$oid": "5a84407e4f983008ac3aa03c"
  },
  "SubCategory": {
    "$oid": "5a84407e4f983008ac3aa041"
  },
  "wordpack": {
    "$oid": "5a46421397b19e4e7ca83f09"
  },
  "subject": {
    "$oid": "5a8444d000678f3b941549cf"
  },
  "annotationSet": {
    "$oid": "5a8440e54f983008ac3aa071"
  },
  "documentVersion": {
    "$oid": "5a8440e54f983008ac3aa072"
  },
  "document": {
    "$oid": "5a8440dc4f983008ac3aa063"
  },
  "project": {
    "$oid": "5a84407d4f983008ac3a9fed"
  },
  "context": "D1.E4(#10), and 1F5.D6 (#14) for detection of the O-GlcNAc modification on extracellular domains of membrane",
  "preannotated": true,
  "preannotatorVersion": "1.0",
  "children": [],
  "parents": [],
  "remarks": [],
  "ranges": [
    {
      "start": "/div[1]/div[1]/div[1]/div[3]/p[1]",
      "startOffset": 707,
      "end": "/div[1]/div[1]/div[1]/div[3]/p[1]",
      "endOffset": 715,
      "_id": {
        "$oid": "5a8444f2446a2e38db07f5d6"
      }
    }
  ],
  "created": {
    "$date": "2018-02-14T14:17:22.104Z"
  },
  "__v": 0

}

Subject:

{
  "_id": {
    "$oid": "5a84436700678f3b9414fbe7"
  },
  "Subject": "Method_Name_List",
  "Wordpack": {
    "$oid": "5a463f9097b19e4e7ca83e20"
  },
  "Created": {
    "$date": "2018-02-14T14:10:31.460Z"
  },
  "Properties": [],
  "Abbreviations": [],
  "Synonymes": [],
  "__v": 0
}

Project:

{
  "_id": {
    "$oid": "5a8d340cf26beb0c9cca948c"
  },
  "Name": "test2",
  "Created": {
    "$date": "2018-02-21T08:55:37.645Z"
  },
  "Annotator": [
    {
      "Name": "Biology",
      "Type": "Built_in",
      "Level": 1,
      "Affinity": 1,
      "Color": "c0c100",
      "Description": "Category for biological annotation",
      "DefaultWordPack": {
        "$oid": "5a8d340df26beb0c9cca948d"
      },
      "Children": [
        {
          "Name": "Name",
          "Level": 2,
          "Affinity": 2,
          "Color": "e5e697",
          "Description": "Name of biological entity",
          "DefaultWordPack": {
            "$oid": "5a8d340df26beb0c9cca948e"
          },
          "Children": [],
          "_id": {
            "$oid": "5a8d340ef26beb0c9cca94f8"
          },
          "WordPacks": [
            {
              "$oid": "5a463c0497b19e4e7ca83cda"
            },
            {
              "$oid": "5a8d340df26beb0c9cca948e"
            }
          ]
        },
        
  "__v": 0,
  "CleanedAt": {
    "$date": "2020-01-08T09:51:01.295Z"
  }
}

Very very easy example query:
.Where(f => f.Subject.SubjectName == “adhesion assay”)

yes, if we could know in advance what query the user will create, I can see a solution.
Starting from the subject, I run the match and then join the other collections on top of it.
but this is not the case, you can search for predefined fields in any of the 3 tables above, in any combination.
And obviously we don’t know in advance which condition will best narrow down the search space.
So the solution we are looking for is how to dynamically build queries so that they perform well.

Thanks.

steevej · April 18, 2024, 12:31pm

I do not think you can have complete flexibility and high performance within the same system.

While I understand that you cannot

you can analyze what he want and try to follow ESR rule as much as possible. There is not much more I can say.