Combine specific documents, join/merge

mc_m0ng0 · March 29, 2023, 12:04pm

I have a requirement to merge 2 to 3 documents in the same collection and produce a new document containing copies of data from fields in both. Sample data:

// Document one
{
    "jobs": [{
            "jobId": 1,
            "predictedCompletionStatus": "Late"
        },
        {
            "jobId": 2,
            "predictedCompletionStatus": "Early"
        },
        {
            "jobId": 3,
            "predictedCompletionStatus": "Early"
        }
    ]
}

// Document two 
{
    "jobs": [{
            "jobId": 1,
            "predictedCompletionStatus": "Early"
        },
        {
            "jobId": 2,
            "predictedCompletionStatus": "OnTime"
        }
    ]
}

// Merged Document
{
	"merged": [{
			"jobId": 1,
			"predictedCompletionStatus1": "Late",
			"predictedCompletionStatus2": "Early"
		},
		{
			"jobId": 2,
			"predictedCompletionStatus1": "Early",
			"predictedCompletionStatus2": "OnTime"
		}
	]
}

The documents should be joined on the jobId and then alias the columns from each document in order to differentiate them.
I’m used to doing this kind of thing in SQL where a join would get me what I need. Is it possible to do this? I’ve looked at $lookup but that seems to need the douments in different collections.

steevej · March 29, 2023, 12:45pm

You may $lookup from: the same collection you started with.

Please update your sample documents so that they are valid JSON documents we can cut-n-paste directly into our system. You are missing commas. You are also missing field names for your arrays.

You have jobId:3 in the first document but it is absent from the result. Is it because it is not present in document 2.

Having dynamic field names such as

and

is a bad idea. I suggest that you aim for something like:

// Merged Document
{ "merged" :
  [
    {
      "jobId" : 1 ,
      "predictedCompletionStatus" : [ 'Late' , 'Early' ] 
    } ,
    {
      "jobId" : 2 ,
      "predictedCompletionStatus" : [ 'Early' , 'OnTime' ]
    }
  ]    
}

Naturally, $lookup will produce an array anyway and in most programming languages it is easier to iterate over an array.

mc_m0ng0 · March 29, 2023, 1:14pm

Thanks for the reply. I’ve updated the OP to include proper JSON. Your suggested output looks better. Can we guarantee the ordering of the items in the array so that index 0 is always the first documents value and index 1 is the second documents value?

And yes jobID: 3 is abset as it doesn’t exist in the other document.

steevej · March 29, 2023, 1:21pm

In an array or document list, the order is only guaranty if we $sort. So even with

a $sort would be needed.

Do you have a field in the source document that you can sort?

mc_m0ng0 · March 29, 2023, 1:32pm

Each document has a name,created and an id field at the root level. For example:

{
	"_id": {
		"$oid": "6422918c0e9f34f2000ab941"
	},
	"created": "2023-03-29T08:04:44Z",
	"name": "Test Doc 1",
	"jobs": [{
			"jobId": 1,
			"predictedCompletionStatus": "Early"
		},
		{
			"jobId": 2,
			"predictedCompletionStatus": "OnTime"
		}
	]
}

Perhaps the created column can be used as generally the older document would be the “left hand side” and the newer the “right hand side”.

mc_m0ng0 · March 29, 2023, 2:54pm

If you have some time to provide a working aggregation I’d really appreciate it. I’m new to this so chasing my tail a bit with the documentation and trying to understand stuff when coming from a SQL background.

[
  {
    $match: {
      name: {
        $in: ["Test Doc 1", "Test Doc 2"],
      },
    },
  },
  {
    $sort:
      {
        created: 1,
      },
  },
  {
    $lookup: {
      from: "sandbox",
      //localField: "jobId",
      //foreignField: "jobId",
      let: {
        job_id: "$jobId",
      },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                {
                  $eq: ["$jobId", "$$job_id"],
                },
              ],
            },
          },
        },
        {
          $project: {
            created: 0,
          },
        },
      ],
      as: "result",
    },
  },
  {
    $group:
      {
        _id: "$jobId",
        merged: {
          $addToSet: "$result",
        },
      },
  },
]

The above mess is my attempt to get something going but it’s no where near what I want.

mc_m0ng0 · March 30, 2023, 1:24pm

Okay so I made some progress with the following aggregation. How can I filter out the item with jobID 3 with only one element in the array?

[
  {
    $match: {
      name: {
        $in: ["Test Doc 1", "Test Doc 2"],
      },
    },
  },
  {
    $sort: {
      created: 1,
    },
  },
  {
    $unwind: {
      path: "$jobs",
    },
  },
  {
    $group:

      {
        _id: "$jobs.jobId",
        predictedcompletionStatus: {
          $addToSet:
            "$jobs.predictedCompletionStatus",
        },
      },
  },
  {
    $addFields:
      {
        jobId: "$_id",
      },
  },
  {
    $project: {
      _id: 0,
    },
  }
]

steevej · March 30, 2023, 1:26pm

Any reason why you $match as follow?

This will make both matching documents as the source of a $lookup which does not seem to be what you want. From the merged document, it looks like you want to start with Document one and $lookup the matching jobId in Document 2. May be you also want to do that with Document 2 and $lookup the matching jobId in Document 3 if any? What if there is a Document 3 like

{
    "jobs": [{
            "jobId": 1,
            "predictedCompletionStatus": "Failed"
        }
    ]
}

What would you want for Merge document?

{ "merged" :
  [
    {
      "jobId" : 1 ,
      "predictedCompletionStatus" : [ 'Late' , 'Early' , 'Failed' ] 
    } ,
    {
      "jobId" : 2 ,
      "predictedCompletionStatus" : [ 'Early' , 'OnTime' ]
    }
  ]    
}

or this

{ "merged" :
  [
    {
      "jobId" : 1 ,
      "predictedCompletionStatus" : [ 'Late' , 'Early' , 'Failed' ] 
    }
    /* jobId:2 absent since there is not 3rd status */
  ]    
}

Your localField:/foreignField: (or let:) should be jobs.jobId rather than simply jobId because you are inside the array jobs:.

To order Document 2, the {$sort:{created:1}} needs to be in the $lookup: pipeline. Since you want the document that follows Document 1 in time you will need to let: document_one_created:“$created” and use $gt to $match. If you only want to consider Document 2, you then have to $limit:1.

Personally, since the source (localField) and target (foreignField) of the $lookup is an array, I think it would be simpler to $unwind before the $lookup.

Assuming that you only want to start from Document 1, I would go along the following pipeline stages.

match = { "$match" : {
  "name" : "Test Doc 1"
} }

unwind = { "$unwind" : "$jobs" }

lookup = { "$lookup" : {
    "from" : "sandbox" ,
    "as" : "_lookup" ,
    "let" : {
       "d1_jobId" : "$jobs.jobId" ,
       "d1_created" : "$created"
    } ,
    "pipeline" : [
        { "$match" : { "$expr" : { "$and" : [
          { "$eq" : [ "$jobs.jobId" , "$$d1_jobId" ] } ,
          { "$gt" : [ "$created" , "$$d1_created" ] }
        ] } } }
    ]
} }

I just saw that you were replying to the thread stop I am stopping here. I do not want to shoot on a moving target.

mc_m0ng0 · March 30, 2023, 1:34pm

The collection can have 100’s potentially thousands of documents so I want to limit the comparison to 2 or 3 specific documents which will be known at the start.

Good question. I’ll need to nail that down. Initially I was going for the second option.

I’ll have a play with that pipeline and see how it goes

Thanks for the help so far.

mc_m0ng0 · March 30, 2023, 4:31pm

I tried your pipeline but I wasn’t getting any data in the lookup output so not sure where that’s going wrong. $lookup seems like it might a cleaner approach though.

I ploughed on with my approach and I’ve got it filtered down to the ones I care about but I’m wondering what gotchas are in there. The hardcoded check on the size of the predictedcompletionStatus array is not ideal. It would be great if this could be filtered with reference to the number of documents.

I changed from using $addToSet to $push in the $group stage as when I went to 3 documents and a status was repeated [Late,Early,Late] the second late was dropped. Does push guarantee the ordering or am I just lucky that it seems to work even when I go to 3 documents?

My current pipeline:

[
  {
    $match: {
      name: {
        $in: ["Test Doc 1", "Test Doc 2"],
      },
    },
  },
  {
    $sort: {
      created: 1,
    },
  },
  {
    $unwind: {
      path: "$jobs",
    },
  },
  {
    $group: {
      _id: "$jobs.jobId",
      predictedcompletionStatus: {
        $push: "$jobs.predictedCompletionStatus",
      },
    },
  },
  {
    $sort: {
      _id: 1,
    },
  },
  {
    $project: {
      _id: 0,
      jobId: "$_id",
      predictedcompletionStatus:
        "$predictedcompletionStatus",
    },
  },
  {
    $addFields:
      {
        len: {
          $size: "$predictedcompletionStatus",
        },
      },
  },
  {
    $match: {
      len: {
        $eq: 2,
      },
    },
  },
  {
    $project:
      {
        len: 0,
      },
  },
]

steevej · March 30, 2023, 4:42pm

Like I wrote I stop right away because you were replying to the post while I was working on the pipeline. I did not wanted to waste time working on a moving target so I stop until I can see your replies.

mc_m0ng0 · March 30, 2023, 4:46pm

Thanks. I figured that’s why I got no data I’ve replied to your queries and posted where I’m at. I’m not sure if my current approach is a good one as I’m very new to this. Appreciate the input so far. I think I’ll park it for the evening now as I’m getting cross eyed looking at documentation and data.

steevej · March 31, 2023, 1:19pm

As far as I know it does add at the end but I cannot find any documentation in this effect. Hopefully someone will point us to the specification. As a solution, if the $push order cannot be guaranteed by the previous $sort stage, you could

"predictedCompletionStatus" : { "$push" : {
    "date" : "$created" ,
    "status" : "$jobs.predictedCompletionStatus"
} }

and then update predictedCompletionStatus with

{ "$set" : {
    "predictedCompletionStatus" : { "$sortArray" : {
        "input" : "predictedCompletionStatus" ,
        "sortBy" : { "date" : 1 }
    } }
} }

which might provide you with more information about the evolution as the result would look like

{
  _id: 1,
  predictedCompletionStatus: [
    {
      date: 2023-03-30T13:11:54.921Z,
      status: 'Late'
    },
    {
      date: 2023-03-30T13:12:33.269Z,
      status: 'Early'
    }
  ]
}
{
  _id: 2,
  predictedCompletionStatus: [
    {
      date: 2023-03-30T13:11:54.921Z,
      status: 'Early'
    },
    {
      date: 2023-03-30T13:12:33.269Z,
      status: 'OnTime'
    }
  ]
}

Efficiency wise, I would move

before

because why sort something that will be removed later.

You could also replace

with

  {
    $match: {
      $expr : {
        $eq : [ 2 , { $size : "$predictedcompletionStatus"  } ] ,
      },
    },
  }

then you could remove

system · May 5, 2023, 7:32am

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.