Advanced lookup using arrays

I need to join two collections using $lookup, but I have to filter some documents before joining:

collection a

doc_1 => { "join_field":  33}
doc_2 => { "join_field":  78}
...
doc_100000000  => { "join_field":  33}

collection b

doc_1 => { "filter_field": "X", "join_field":  [ 33, 65, 89], }
doc_2 => { "filter_field": "X", "join_field": [ 55, 5423, 998] } 
doc_3 => { "filter_field": "Y", "join_field": [ 33, 86] }
...
doc_100000000  => { "filter_field": "X", "join_field": [ 33, 95] }

I need to start the aggregation from collection “a”, so starting with “b” is not an option.

The b.join_field array is an indexed field.

If I make a normal $lookup between “a” and “b” (without a filter), it uses the index correctly and returns all the matches from “b” in the response:

db.a.aggregate([
  { $match: { "join_field": 33 } },
  { $lookup: {
      from: "b",
      localField: "join_field",   // REGULAR FIELD
      foreignField: "join_field", // ARRAY
      as: "lookup"
  }}
])

// response
{
  "join_field": 33,
  "lookup": [
    { "filter_field": "X", "join_field": [ 33, 65, 89] },
    { "filter_field": "Y", "join_field": [ 33, 86] },
    ...
    { "filter_field": "X", "join_field": [ 33, 95] }
  ]
}

I have changed the query to this version, expecting to get the same result. However, the $in expression written like this returns an error, which leads me to think that indexes are not being used.

db.a.aggregate([
  { $match: { "join_field": 33 } },
  { $lookup: {
      from: "b",
      let: { "jf": "$join_field" },
      pipeline: [
        { $match: {
            $expr: { $in: [ "$$jf", "$join_field" ] }
        }}
      ],
      as: "lookup"
  }}
])

// ERROR: $in requires an array as a second argument, found: missing

I need to filter collection “b” in this $lookup stage via the pipeline, but based on the above, I cannot do it if indexes are not used with this version of the query.

Is there a way to do it?

Thank you very much

The $in version should work as seen on this playground.

If it fails, then most likely some documents are missing the field join_field, or the field is not an array, which is what the error is saying with $in requires an array as a second argument.

The last part of the error message, found: missing, indicates that the field is absent, as in this missing field example at this other playground.

This last playground shows what happens when the field exists but is not an array.
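For illustration, a guarded variant of the failing pipeline (a sketch; the $cond/$isArray wrapper coalesces a missing or non-array join_field to an empty array, so $in never receives "missing" as its second argument):

db.a.aggregate([
  { $match: { "join_field": 33 } },
  { $lookup: {
      from: "b",
      let: { "jf": "$join_field" },
      pipeline: [
        { $match: {
            // if join_field is missing or not an array, compare against []
            $expr: { $in: [ "$$jf",
              { $cond: [ { $isArray: [ "$join_field" ] }, "$join_field", [] ] }
            ] }
        }}
      ],
      as: "lookup"
  }}
])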

But why change the original query if it worked?

Thank you very much @steevej for your answer.

The original query never worked: it got stuck, and if I limited the input documents, it retrieved many useless documents in the lookup, which I would have to discard in a later stage, increasing the response time immensely.

I have done many tests and realized that indexes are not used when you change the $lookup stage to use let/pipeline. Indeed, based on my tests, mongo has serious problems handling indexes (and, by extension, so do I).

I created this question because I still cannot believe that mongo behaves this way, so I must be forgetting something… But the deeper I dig into the topic, the more certain I am that mongo stops using indexes easily, which is a little bit scary.

My experience differs.

To see if indexes are used or not you should rely on the explain plan.
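For reference, one way to get it for the aggregation from the original question (executionStats reports totalKeysExamined/totalDocsExamined, and the $lookup stage lists indexesUsed):

db.a.explain("executionStats").aggregate([
  { $match: { "join_field": 33 } },
  { $lookup: {
      from: "b",
      localField: "join_field",
      foreignField: "join_field",
      as: "lookup"
  }}
])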

What I am really puzzled with is:

    “If I make a normal $lookup … it uses the index correctly, and it returns all the matches”

but then

    “The original query never worked: it got stuck”

You were right to expect the same result set; you would have to discard the same documents. If you can look up with localField/foreignField, this is what you should do, and if you have indexes, they will be used. Note that you can look up with localField/foreignField and still use let/pipeline to weed out unwanted documents.
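As a sketch, applied to the collections from the original question (this combined syntax requires MongoDB 5.0+; the filter on filter_field is just an example of weeding out documents inside the join):

db.a.aggregate([
  { $match: { "join_field": 33 } },
  { $lookup: {
      from: "b",
      localField: "join_field",   // equality join, eligible for the index on b.join_field
      foreignField: "join_field",
      pipeline: [
        // runs only over the documents already matched by the equality join
        { $match: { "filter_field": "X" } }
      ],
      as: "lookup"
  }}
])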

What needs to be done now is to discard unwanted documents as soon as possible. We cannot help you in this regard with only the details you shared.

Hi @steevej, thanks for your time. Indeed, I have opened a case with mongo on this topic; I can share it with you to resolve your doubts about why I need to use the extended version of the lookup.

You can replicate the problem by following these steps:

Create a sample collection “projects” and index the array field “persons.identifier”:

const times = 1000000

const batch_docs = []
const filters = ["X", "W", "Z"]
for (let i = 0; i < times; i++) {
  batch_docs.push({
    "project_nb": i,
    "filter_attb": filters[Math.floor(Math.random() * filters.length)], // random from filters
    "persons": [
      { "identifier": i },
      { "identifier": i + (Math.floor(Math.random() * 100) + 1) },  // random between 0 and 100
      { "identifier": i + (Math.floor(Math.random() * 101) + 200) } // random between 200 and 300 
    ] })
}

db.projects.deleteMany({})
db.projects.insertMany(batch_docs)

db.projects.createIndex( { "persons.identifier": 1 } )

Now, if you run this query, you will see that it performs well, because indexes are being used (the explain plan confirms this):

db.projects.aggregate([
  { $limit: 1 },
  { $unwind: { path: "$persons", preserveNullAndEmptyArrays: true } },
  { $lookup: {
      from: "projects",
      localField: "persons.identifier",
      foreignField: "persons.identifier",
      as: "other_projects"
  }}
])

But there are other points which make this query inefficient:

  • I would need to remove the source project from “other_projects”
  • I would need to recover only projects with “filter_attb”: “Z”
  • I would only need to check for the existence of one project (a kind of findOne), so I wouldn’t need to retrieve all of them

With this, the previous query changes to this other form:

db.projects.aggregate([
  { $limit: 1 },
  { $unwind: { path: "$persons", preserveNullAndEmptyArrays: true } },
  { $lookup: {
      from: "projects",
      let: { "pr_id": "$_id", "per_id": "$persons.identifier" },
      pipeline: [
        { $match: {
            $and: [
              { $expr: { $in: [ "$$per_id", { $ifNull: [ "$persons.identifier", [] ] } ] } },
              { $expr: { $ne: [ "$$pr_id", "$_id" ] } },
              { $expr: { $eq: [ "$filter_attb", "Z" ] } }
            ]
        }},
        { $limit: 1 }
      ],
      as: "other_projects"
  }}
])

With just this filter in the $lookup pipeline:

{ $expr: { $in: [ "$per_id", { $ifNull: [ "$persons.identifier", [] ] } ] } },

The query keeps using the index (even with the $ifNull, which is necessary in case some documents don’t have the “persons” array).

But as soon as you add either of the other two filters to the query:

  { $expr: { $ne: [ "$pr_id", "$_id" ] } },
  { $expr: { $eq: [ "$filter_attb", "Z" ] } }

It starts to “overthink” and the previous index is not used anymore, as you can see in this extract of the explain plan:

{
  "$lookup" : {
    "from" : "projects",
    "as" : "other_projects",
    "let" : {
      "pr_id" : "$_id",
      "per_id" : "$persons.identifier"
    },
    "pipeline" : [
      {
        "$match" : {
          "$and" : [
            {
              "$expr" : {
                "$in" : [
                  "$$per_id",
                  {
                    "$ifNull" : [
                      "$persons.identifier",
                      [ ]
                    ]
                  }
                ]
              }
            },
            {
              "$expr" : {
                "$ne" : [ "$$pr_id", "$_id" ]
              }
            },
            {
              "$expr" : {
                "$eq" : [ "$filter_attb", "W" ]
              }
            }
          ]
        }
      },
      {
        "$limit" : 1
      }
    ]
  },
  "totalDocsExamined" : 2000298,
  "totalKeysExamined" : 0,
  "collectionScans" : 6,
  "indexesUsed" : [ ],
  "nReturned" : 3,
  "executionTimeMillisEstimate" : 18327
}

Is there a performant way to handle these situations? Am I doing something wrong?

Thank you very much.
Jaime

Have you tried something like:

{ $lookup: {
      from: "projects",
      localField: "persons.identifier",
      foreignField: "persons.identifier",
      let: { "pr_id": "$_id" },
      as: "other_projects",
      pipeline: [
        { $match: { 
            $and: [
              { $expr: { $ne: [ "$$pr_id", "$_id" ] } },
              { $expr: { $eq: [ "$filter_attb", "Z" ] } }
            ]
        }},
        { $limit: 1 }
      ]
  }}
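Note that combining localField/foreignField with a pipeline in the same $lookup requires MongoDB 5.0 or newer. The idea is that the equality part of the join can still use the index on persons.identifier, while the pipeline only has to weed out the already-matched documents.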