Using db.collection.distinct() within the aggregation framework

Javier_Blanco · February 24, 2022, 11:53am

I’m trying to upload information via AF query to a very fussy BI software regarding JSON format. I have trouble with a $match stage:

	{
	    "$match": 
	    {
	        "idSensor": 3,
	        "idDevice": {"$in": db.devices.distinct("_id", {"group": DBRef("groups", 11)})}
	    }
	},

This works OK with a software I use to test my queries against the server, NoSQL Booster:

"idDevice": {"$in": db.devices.distinct("_id", {"group": DBRef("groups", 11)})}

But not with my BI software; it rejects it because it’s not properly formatted.

Using a JSON formatter I get this:

{
   "$match":
   {
      "idSensor": 3,
      "idDevice": {"$in":"db.devices.distinct(_id", {"group":"DBRef(groups", 11)}")"}
   }
},

But now this piece of allegedly properly formatted JSON doesn’t work with NoSQL Booster.

I guess I’m still missing some double quotes somewhere, any hint?

Thanks in advance.

UPDATE:

Syntax might be improved a little:

"idDevice": {"$in": db.devices.distinct("_id", {"group.$id": 11})}

{"group.$id": 11} instead of {"group": DBRef("groups", 11)}, but still have trouble with the main structure, db.devices.distinct(), which is a shell method after all…

Is there an alias of this method to be used within the aggregation framework?

UPDATE 2:

From db.collection.distinct() — MongoDB Manual

The db.collection.distinct() method provides a wrapper around the distinct command.

So I have tried this, that has the format I need:

"idDevice": {"$in": ["distinct": "devices", "key": "_id", "query": {"group.$id": 11}]}

But now the query goes on forever in NoSQL Booster and I get no results.

It seems this new piece of code is not equivalent to the previous one.

santimir · February 25, 2022, 2:31pm

Hello @Javier_Blanco

JSON only accepts this data types:

String surrounded by double quotes
Number
Boolean
Null

And Objects and Arrays storing those values, for objects the keys need to be double quoted too.

This:

"DBRef(groups", 11)}")"

Is none of those data types, also the , would give errors anywhere.

“idDevice” has to be one of those data types. For example "DBRef(groups, 11)})" But What are you trying to include there? If this is the output of a different call to db, then do it in 2 steps:

const idDevice = getMyData()
query.idDevice= getMyData

Where getMyData should return a string, or array etc, and then query is the object you already have, but we are setting idDevice property.

I hope I didn’t get your idea wrong.

Javier_Blanco · February 25, 2022, 5:14pm

Hi, @santimir, and thanks for your answer.

I’m using a BI software that only accepts AF queries strictly formatted as JSON. So I cannot define a constant and then do some other thing, I’m not working in the Mongo shell.

That piece of code works really well within my current query: when testing it against the server using another software, called NoSQL Booster, I get results in less than 5 seconds. So I need to format it properly -which it seems it’s not possible- or to change its syntax in a way it might be properly “jsonized”.

I thought this could be an alternative but equivalent syntax:

"idDevice": {"$in": ["distinct": "devices", "key": "_id", "query": {"group.$id": 11}]}

But I get no results when I try it in NoSQL Booster.

santimir · February 25, 2022, 8:31pm

@Javier_Blanco

I probably don’t know enough but I’ll just add a bite.

In this line:

"idDevice": {"$in": db.devices.distinct("_id", {"group": DBRef("groups", 11)})}

The db.devices.distinct( ) has to be evaluated before sending that to the server. Even if it is run like that in MongoShell, I presume it won’t work (won’t be compiled to the query we expect). So probably this is dependant on which driver/environment etc. we run.

If we stringify the db.device.distinct, this may not produce what you expect, mongodb will search for the string “db.devices.distinct…” in the last case.

Basically you should search how to use the output of a query in a different query in BI in my opinion, probably some examples of how it is supposed to parse. But it will likely need a different approach.

santimir · February 25, 2022, 9:40pm

And if something like that were to be run in mongoshell (which may be similar to where you run it), I’d do (minimal example):

> var a = db.hell.find({a:'b'} )
> var b = db.hell.find({a:a[0].a} )
> b
{ "_id" : ObjectId("621946acbcfb1c97f59c49be"), "a" : "b" }
{ "_id" : ObjectId("621946d1bcfb1c97f59c49bf"), "a" : "b" }
{ "_id" : ObjectId("621946f2bcfb1c97f59c49c0"), "a" : "b" }

Where I am using one query result into another one.

Javier_Blanco · February 28, 2022, 10:39am

OK, here it goes the whole query:

var pipeline = 
[
	{
	    "$match": 
	    {
	        "idSensor": 3,
	        "idDevice": {"$in": db.devices.distinct("_id", {"group.$id": 11})},
	    }
	},
	{
        "$lookup":
        {
            "from": "devices",
            "localField": "idDevice",
            "foreignField": "_id",
            "as": "array"
        }
    },
    {
        "$unwind": "$array"
    },
    {
	   "$project":
	   {
	      "_id": 1,
	      "idLocation": {"$toString": "$array.idLocation"},
	      "coNaLocation": {"$ifNull": ["$array.locationAddress", null]},
	      "idDevice": {"$toString": "$idDevice"},
	      "naDay": {"$substrBytes": ["$data.inicio", 0, 10]},
	      "coInterval": 
	      {
	         "$concat":
	         [
	            "[",
	            {"$substrBytes": ["$data.inicio", 11, 2]},
	            ":00, ",
	            {"$dateToString": {"format": "%H", "date": {"$add": [{"$toDate": "$data.inicio"}, 3600000]}, "timezone": "$dateTime.offset"}},
	            ":00]"
	         ]
	      },
	      "naBroadcast": "$data.archivo",
	      "naType": "$data.tipo",
	      "qtBroadcast": {"$toInt": 1},
	      "qtBroadcastDurationS": {"$divide": [{"$subtract": [{"$toDate": "$data.fin"}, {"$toDate": "$data.inicio"}]}, 1000]}
	   }
	}
]

db.sensorsDataHistoric.aggregate(pipeline)

In NoSQL Booster, it needs less than 5 seconds to retrieve the documents. First one:

{
  "_id": "612f55683e1fd4728b578a16",
  "idLocation": null,
  "coNaLocation": null,
  "idDevice": "2443",
  "naDay": "2021-08-04",
  "coInterval": "[10:00, 11:00]",
  "naBroadcast": " mym_covidMEDIDAS_2021_fer.mp4",
  "naType": "video",
  "qtBroadcast": 1,
  "qtBroadcastDurationS": 31
}

The problem: idDevice cannot be properly formatted as a JSON, so I can’t use the query in my BI software.

Note: I only upload the piece of code between brackets to my BI software. It doesn’t allow anything else. It has another formulary to fill in the rest of parameters: server, db, collection, etc.

I have an alternative query that produces the same outcome:

var pipeline = 
[
	{
	   "$match": {"group.$id": 11}
	},
	{
	   "$lookup":
	   {
	      "from": "sensorsDataHistoric",
	      "localField": "_id",
	      "foreignField": "idDevice",
	      "as": "array"
	   }
	},
	{
	   "$unwind": "$array"
	},
	{
	   "$match": {"array.idSensor": 3}
	},
    {
	   "$project":
	   {
	      "_id": "$array._id",
	      "idLocation": {"$toString": "$idLocation"},
	      "coNaLocation": {"$ifNull": ["$locationAddress", null]},
	      "idDevice": {"$toString": "$array.idDevice"},
	      "naDay": {"$substrBytes": ["$array.data.inicio", 0, 10]},
	      "coInterval": 
	      {
	         "$concat":
	         [
	            "[",
	            {"$substrBytes": ["$array.data.inicio", 11, 2]},
	            ":00, ",
	            {"$dateToString": {"format": "%H", "date": {"$add": [{"$toDate": "$array.data.inicio"}, 3600000]}, "timezone": "$array.dateTime.offset"}},
	            ":00]"
	         ]
	      },
	      "naBroadcast": "$array.data.archivo",
	      "naType": "$array.data.tipo",
	      "qtBroadcast": {"$toInt": 1},
	      "qtBroadcastDurationS": {"$divide": [{"$subtract": [{"$toDate": "$array.data.fin"}, {"$toDate": "$array.data.inicio"}]}, 1000]} 
	   }
	}
]

db.devices.aggregate(pipeline)

But it needs more than 700 s to run. I’m uploading a big bunch of documents to my BI software, so I need the process to be quick. First one:

{
  "_id": "612b8450959e931c0727e32d",
  "idLocation": "2415",
  "coNaLocation": "[0032] Calle Fray Ceferino, 25 - 33001 - Oviedo (Asturias)",
  "idDevice": "2353",
  "naDay": "2021-08-29",
  "coInterval": "[12:00, 13:00]",
  "naBroadcast": " APP_masymas100000_MAY21.mp4",
  "naType": "video",
  "qtBroadcast": 1,
  "qtBroadcastDurationS": 25
}

That’s why I’m trying to find a proper syntax for

"idDevice": {"$in": db.devices.distinct("_id", {"group.$id": 11})}

I thought

"idDevice": {"$in": ["distinct": "devices", "key": "_id", "query": {"group.$id": 11}]}

would be an equivalent one, but it seems it’s not, not sure why.

santimir · February 28, 2022, 5:35pm

One thing that could be done is to add explain:true to the aggregation options in both queries, then you will see how they are parsed, and what can be improved performance wise.

Feel free to paste them here, including only the winning plan, maybe.

Javier_Blanco · February 28, 2022, 6:04pm

Well, the winning one is the first one (less than 5 s vs more than 700 s), that’s why I need to find a suitable syntax for idDevice.

What am I doing wrong here?

"idDevice": {"$in": ["distinct": "devices", "key": "_id", "query": {"group.$id": 11}]}

santimir · February 28, 2022, 6:24pm

I meant the parsed query. The queries are not just what we typed, so we need to see what is actually being executed (indexes used, optimizations etc).

$in, at least in standard JS queries, accepts an array, please see the syntax here. The colons are, clearly, not following that syntax.

Basically, it searches for a field named idDevice equals to any of the elemnts in the array. Ex idDevice:"_id" would match.

Asya_Kamsky · March 1, 2022, 4:52pm

Here’s the underlying issue: this is not a query - this is … two queries. Your distinct command is run first. It returns an array. That array is then used inside your $match $in expression.

There is no way to run this as a “single” query. You could make a more complex pipeline or just store what is returned from first query in a variable and pass it to the second query. It looks like one of the tools you’re using is doing this automatically maybe? But the other one is not and just expects plain normal JSON.

Now, in addition to embedding distinct command output into your $match expression you also have an issue of using DBRef which is not plain JSON nor is it a MongoDB type - it’s a convention that is shorthand for an embedded object which has two or more properties. I think you already found that you can work around this in some cases by using "group.$id" type syntax instead.

Asya

Javier_Blanco · March 2, 2022, 6:43pm

Thanks for your answer, @Asya_Kamsky.

Well, the point is that my BI software needs a proper formatted JSON, so I have trouble with “db.devices.distinct()”. It doesn’t mind if something runs first or second or if there are one or two queries; my BI software doesn’t complain about that: it complains about my code not being a proper formatted JSON because of “db.devices.distinct()”, so I need to find an alternative syntax for it.

"idDevice": {"$in": ["distinct": "devices", "key": "_id", "query": {"group.$id": 11}]}

Is that an alternative syntax? I thought it might be, but it doesn’t work. I just want to know if there’s a mistake I can fix or if I have to forget about it.

Asya_Kamsky · March 3, 2022, 10:33pm

db.collection.distinct is not JSON it’s a command. The syntax ["distinct": "devices", "key": "_id", "query": {"group.$id": 11}] may be legal JSON but it does not represent running a distinct command, it’s just an array with a bunch of values in it. You have two collections and you want to run a pipeline on sensorsDataHistoric but limiting your results to distinct values you get out of another collection (devices).

You need to rewrite the pipeline to be a single query most likely.

Here’s what I don’t understand, you are already doing a $lookup into devices collection, why don’t you limit your $lookup to the correct subset and get rid of that previous distinct?

You tried it but you did it with aggregation on devices and $lookup on sensorsDataHistoric - I’m suggesting doing it the other way around.

By the way, what version is this?

Asya

Javier_Blanco · March 4, 2022, 10:02am

If we keep on reading:

{
  distinct: "<collection>",
  key: "<field>",
  query: <query>,
  readConcern: <read concern document>,
  collation: <collation document>
}

Field Type Description

distinct string The name of the collection to query for distinct values.

key string The field for which to return distinct values.

query document Optional. A query that specifies the documents from which to retrieve the distinct values.

So, according to this, how should I change my piece of code to make it functional?

"idDevice": {"$in": ["distinct": "devices", "key": "_id", "query": {"group.$id": 11}]}

Javier_Blanco · March 4, 2022, 11:13am

var pipeline = 
[
	{
	    "$match": 
	    {
	        "idSensor": 3
	    }
	},
	{ 
        "$lookup": 
        {
            "from": "devices",
            "let": {"sensorsDeviceId": "$idDevice"},
            "pipeline": [{"$match": {"$expr": {"$and": [{"$eq": ["$$sensorsDeviceId", "$_id"]}, {"$eq": ["group.$id", 11]}]}}}],
            "as": "array"
        }
	},
    {
        "$unwind": "$array"
    },
    {
	   "$project":
	   {
	      "_id": 1,
	      "idLocation": {"$toString": "$array.idLocation"},
	      "coNaLocation": {"$ifNull": ["$array.locationAddress", null]},
	      "idDevice": {"$toString": "$idDevice"},
	      "naDay": {"$substrBytes": ["$data.inicio", 0, 10]},
	      "coInterval": 
	      {
	         "$concat":
	         [
	            "[",
	            {"$substrBytes": ["$data.inicio", 11, 2]},
	            ":00, ",
	            {"$dateToString": {"format": "%H", "date": {"$add": [{"$toDate": "$data.inicio"}, 3600000]}, "timezone": "$dateTime.offset"}},
	            ":00]"
	         ]
	      },
	      "naBroadcast": "$data.archivo",
	      "naType": "$data.tipo",
	      "qtBroadcast": {"$toInt": 1},
	      "qtBroadcastDurationS": {"$divide": [{"$subtract": [{"$toDate": "$data.fin"}, {"$toDate": "$data.inicio"}]}, 1000]}
	   }
	}
]

db.sensorsDataHistoric.aggregate(pipeline)

Something like this?

Well, this goes on forever and never gets solved…

In more detail:

	{ 
        "$lookup": 
        {
            "from": "devices",
            "let": {"sensorsDeviceId": "$idDevice"},
            "pipeline": 
            [
            	{
            		"$match": 
            		{
            			"$expr": 
            			{
            				"$and": 
            				[
            					{"$eq": ["$$sensorsDeviceId", "$_id"]}, 
            					{"$eq": ["group.$id", 11]}
            				]
            			}
            		}
            	}
            ],
            "as": "array"
        }
	},

I guess the issue lies within the $and; maybe does it get confused by mixing a variable from sensorsDataHistoric with another one from devices?

Javier_Blanco · March 7, 2022, 10:16am

I’m working with 4.4.6.

Field	Type	Description
`distinct`	string	The name of the collection to query for distinct values.
`key`	string	The field for which to return distinct values.
`query`	document	Optional. A query that specifies the documents from which to retrieve the distinct values.