Unique Partial Index for queries

Hello,

I am working on a service which stores meta data for a collection of files.

There is one document per file in the database. Each document has a field “sha256” which is the sha256 hash of the file. This field might be null for some documents (project requirement, unfortunately this cannot be changed). If the field is set to a value other than null, then it must be unique.

A solution for this requirement is a unique index with a partial filter expression {"sha256": {"$type": "string"}}. This works as intended.

One problem I have is, that this index is never used for queries. My solution for this is to add another index on the same field as hashed index.

The current solution works. It is not possible to add sha256 duplicates, it is possible to add sha256 null values and it is possible to query for a sha256 value while utilizing an index.

As this database is getting bigger and bigger, my problem is the cost of storing the indexes twice. The unique partial index is 68GB, the hashed index is 14GB.

Is there any way to optimize this in terms of storage? To me it seems unefficient to index one field twice.

Thank your for your help in advance!

why? did you add unique constraint somewhere?
And a hashed index can’t be unique.

Maybe you are looking for this.

You would have to have a query filter matching the query filter or hint(really try to avoid hints) to use the existing index.

I would suggest NOT inserting a null value to the sha256 field. And then instead use a unique partial index of {sha256:{$exists:1}}

The missing field is implicitly null when queried for or projected.

You could also save a few bytes by storing the hash as a binary object as the digest is inherently binary, but possibly not worth the application changes:

test> doca
{
  sha256: 'afba684fd0fa05fd48fea823aa891af26c4072d02a914a6c8c6ae2c0a6436f8a'
}
test> bsonsize(doca)
82
test> docb
{
  sha256: Binary(Buffer.from("afba684fd0fa05fd48fea823aa891af26c4072d02a914a6c8c6ae2c0a6436f8a", "hex"), 0)
}
test> bsonsize(docb)
50


1 Like

Thank you very much for your input!

Changing the partial index to {sha256:{$exists:1}} sounds like a reasonable change. I will definitely try this.

Furthermore, thanks for the hint to store hashes as binary. The collection is quite large, so even small improvements have a large impact.

Thanks again. I will report if the changes were a success, once I am done.

1 Like