The same distinct command, the result sometimes contains null and sometimes does not

chaigang · January 16, 2024, 6:51am

version:6.0.12 4.2.8

db.t1.insert({id:1})
db.t1.insert({id:2})
db.t1.insert({})
db.t1.insert({id:2})

db.t1.find()
{ “_id” : ObjectId(“65a62118842d8bddd7ef9699”), “id” : 1 }
{ “_id” : ObjectId(“65a62119842d8bddd7ef969a”), “id” : 2 }
{ “_id” : ObjectId(“65a6211c842d8bddd7ef969b”) }
{ “_id” : ObjectId(“65a6211f842d8bddd7ef969c”), “id” : 2 }

— There are no nulls in the result set

db.t1.distinct(‘id’)
[ 1, 2 ]

– After creating the index, there are nulls in the result set

db.t1.createIndex({id:1})
{
“numIndexesBefore” : 1,
“numIndexesAfter” : 2,
“createdCollectionAutomatically” : false,
“ok” : 1
}
db.t1.distinct(‘id’)
[ null, 1, 2 ]

After adding the index, null values appeared in the result set, causing the application processing logic to change. I would like to ask if this is a bug.

Samuel_84194 · January 16, 2024, 1:26pm

Hello, welcome to the MongoDB community!

This is actually not a bug. You need to create a unique index so that null values are not included in your index. The only unique field by default is _id.

db.t1.createIndex({id:1}, { unique: true })

https://www.mongodb.com/docs/manual/core/index-unique/#:~:text=A%20unique%20index%20ensures%20that,the%20creation%20of%20a%20collection.

steevej · January 16, 2024, 1:48pm

While a unique index would work to ensure null are not in the distinct() result set, they would prevent adding documents with the same id. In the sample documents supplied there is 2 documents with id:2 so creating a unique index would failed with E11000-duplicate-key-error.

The appropriate index would be the partial index:

db.t1.createIndex({id:1},{partialFilterExpression:{id:{$exists:true}}})

Samuel_84194 · January 16, 2024, 2:00pm

It’s true Steve, I didn’t notice the ids being the same. Thanks for the correction.

steevej · January 16, 2024, 2:14pm

Saru mo ki kara ochiru

chaigang · January 17, 2024, 2:24am

Thanks Reply,
db.t1.createIndex({id:1},{partialFilterExpression:{id:{$exists:true}}})
This will indeed avoid null included in the result, but it also brings new problems.
When I query for id:null , the index cannot be used.
db.t1.explain().find({id:null})
I personally think that the result set is different just because of the addition of the index. This should be considered a bug。

steevej · January 17, 2024, 2:36am

It clearly is.

I think other way.

Is id:null a value you want to find or not? If it is, then do not use the partial index and accept the fact that null is a distinct value. If you do not accept the fact that null is a distinct value then you should not try to find( {id:null} ).

Any way, nothing stops you from having another index with {partialFIlterExpression:{id:null}}. So there is a solution.

Any way for me, the following should be true:

looking for all the distinct value should return me all the documents, so I expect null to be there

Now that you know how it works, make it work for you. I am quite happy with the way it works. I design using the specs I am given.

First and for all Have Fun.

chaigang · January 17, 2024, 2:41am

looking for all the distinct value should return me all the documents, so I expect null to be there。

I agree with this point of view, but the current situation is that without index, there is no null in the result set
For now, I can only adapt to it

chaigang · January 17, 2024, 2:51am

looking for all the distinct value should return me all the documents, so I expect null to be there。

I agree with this point of view, but the current situation is that without index, there is no null in the result set
For now, I can only adapt to it。

alexbevi · January 17, 2024, 2:41pm

db.foo.drop()
db.foo.insertMany([ { a: 1 }, { a: 1 }, { }, { a: 2 } ])
db.runCommand({ distinct: "foo", key: "a" }).values
// [ 1, 2 ]
db.foo.createIndex({ a: 1 })
db.runCommand({ distinct: "foo", key: "a" }).values
// [ null, 1, 2 ]

This appears to be a behavioral difference between the how null values are returned to the distinct command when a COLLSCAN plan is used vs. when an IXSCAN plan is used.

I’ve filed SERVER-85298 to follow up on this further.

steevej · January 17, 2024, 3:08pm

Edge cases like that are always interesting because they make you think and are often an occasion to learn.

So I tried to find an alternative that works the same index or not and the aggregation framework came to the rescue. I found:

db.foo.aggregate( { $group : { _id : "$a"}})

gives

{ _id: 1 }
{ _id: null }
{ _id: 2 }

without the index and gives

{ _id: null }
{ _id: 1 }
{ _id: 2 }

with the index. The explain plan confirms the index is used by $group once it is created:

stage: 'DISTINCT_SCAN',
keyPattern: { a: 1 },
indexName: 'a_1',

alexbevi · January 17, 2024, 3:47pm

I believe the issue here is the usage of the distinct command; not whether or not there are other ways to surface a list of distinct values (though it’s good to showcase the power/utility of aggregation here).