Querying for all unique key-value pairs in an array property across all documents

Richard_Hannah · January 6, 2023, 6:38pm

I am trying to work out a way of getting all unique pairs of a key-value array property across all documents in a collection.

The array ‘Tags’ property looks like this on say two documents:
Document1:
{
Tags: [
{ k: “key1”, v: “value1” },
{ k: “key2”, v: “value2” },
…
]
}
Document2:
{
Tags: [
{ k: “key1”, v: “value3” }
…
]
}

With expected query output something like this:
[
{ k: “key1”, v: “value1” },
{ k: “key2”, v: “value2” }
{ k: “key1”, v: “value3” },
…
]
or this:
[
{
k: “key1”,
values: [
“value1”,
“value3”,
…
]
},
{
k: “key2”,
values: [
“value2”,
…
]
}
…
},
{
k: “key2”,
values: [
“value2”,
…
]
}
]

I have an index Tags_k_1_Tags_v_1, but I cannot get a query to return that uses it. I cannot use distinct() as more than one field is required. I have tried aggregation by unwinding the array, then grouping by k, and addToSet the values. No luck using the index. There are possibly millions of documents that will grow over time, so it is important an index is used.

Any help appreciated.

steevej · January 7, 2023, 5:43pm

Please read Formatting code and log snippets in posts and update your sample documents so that we can cut-n-paste into our servers for experimentation.

Richard_Hannah · January 19, 2023, 7:52am

Rather than delve into an example, I will try to articulate the question a bit as I imagine it is a common performance problem to solve.

If you have a multikey index on a property (‘Tags’) that is array of key-value (‘k’ and ‘v’) pairs, is there an index that allows you to efficiently build a complete list of unique key-value pairs across the collection? I want to additionally filter by another property (‘Owner’) matching a value.

I have tried indexes “Owner_1_Tags_1” and “Owner_1_Tags.k_1_Tags.v_1”.
I tried using Distinct(“Tags”, “{ Owner: }”)
I tried a number of aggregation pipeline variants involving $unwind or $group.

I can get the list of key-value pairs for all documents where Owner matches a value, but I cannot get it to efficiently use an index, causing seconds to query across a growing millions of matching documents. How can I do this using an index?

Sumanta_Mukhopadhyay · January 19, 2023, 8:34am

There are a few different ways you could approach this problem depending on the specific requirements of your application and the specific database you are using. Here’s one example of how you could use MongoDB’s aggregation framework to get the unique pairs of key-value properties across all documents in a collection:

Use the $unwind stage to “flatten” the Tags array, so that each document in the pipeline represents a single tag.
Use the $group stage to group the documents by the k field, and use the $addToSet operator to add each unique v value to a set.
Use the $project stage to reshape the documents to your desired output format. For example, you could use the following projection to get the output format you described in the first example:

{ 
    $project: { 
        _id: 0, 
        k: "$_id", 
        v: "$values" 
    } 
}

Note that you need to make sure that your index Tags_k_1_Tags_v_1 is on the correct collection and fields, otherwise it won’t be used.

Also, if you’re just looking for unique pairs of key-value properties and don’t care about the exact format of the output, you could use the $group stage again to group the documents by both the k and v fields, which will automatically eliminate any duplicate pairs.

Richard_Hannah · January 19, 2023, 11:10am

Thanks for the swift reply!

I have tried your suggestion below before and found that, although it works, it does not use an index.

[
  {
    $match: {
      Owner: "Test",
    },
  },
  {
    $unwind: {
      path: "$Tags",
    },
  },
  {
    $group: {
      _id: "$Tags.k",
      values: {
        $addToSet: "$Tags.v",
      },
    },
  },
  {
    $project: {
      _id: 0,
      k: "$_id",
      v: "$values",
    },
  },
]

Explain() does not indicate index Owner_1_Tags.k_1_Tags.v_1 or Owner_1_Tags_1 is used. Is it the Owner filtering stage that is stopping the opportunity to use an index?

Richard_Hannah · January 19, 2023, 11:27am

Further to the reply above:

I have just tried removing the Owner match stage and simplifying the indexes I am trying to use to be just Tags_1 and Tags.k_1_Tags.v_1, but again, neither index is used.

steevej · January 19, 2023, 2:42pm

Some stages, $group for sure and probably $unwind too, modifies the original indexed documents is such a way that an indexes cannot be used.

I have an idea that might work. Might, because the lack of sample documents stop me from being able to test. I am lazy. There is no way I will create documents that the requester can make available to me by simply using the appropriate markup.

project = { $project : { "tags.k" : 1 } }
unwind = { $unwind : "$tags" }
filter = { $filter : { input : "$tags" , cond : { "$eq" : [ "$$this.k" , "$$k"] } } }
lookup = { $lookup : {
    from : "tags" ,
    localField : "tags.k" ,
    foreignField : "tags.k" ,
    let : { k : "$tags.k" } ,
    pipeline : [
        { $project : { _id:0,"v" : filter } } ,
        { $unwind : "$v" } ,
        { $set : { v : "$v.v" } }
    ]
    as : "_result"
} }
cosmetic = { $project : { _id : 0 , k : "$tags.k" , v : "$_result.v" } }
pipeline = [ project , unwind , lookup , cosmetic ]

Richard_Hannah · January 19, 2023, 4:03pm

@steevej apologies for not providing example documents. Given these documents:

[
{
  "Tags": [
    {
      "k": "Type",
      "v": "Asset"
    },
	{
      "k": "Code",
      "v": "DEV"
    },
  ],
  "Owner": "Test"
},
{
  "Tags": [
    {
      "k": "Type",
      "v": "Metric"
    },
	{
      "k": "Code",
      "v": "POL"
    },
  ],
  "Owner": "Test"
},
{
  "Tags": [
    {
      "k": "Type",
      "v": "Asset"
    },
	{
      "k": "Asset",
      "v": "asset1.json"
    },
  ],
  "Owner": "Test"
}
]

I effectively want to run this:
db.event.distinct("Tags", { Owner: "Test" })
to get all unique k-v tag pairs across all documents where Owner = “Test”:

[
	{
      "k": "Type",
      "v": "Asset"
    },
	{
      "k": "Type",
      "v": "Metric"
    },
	{
      "k": "Code",
      "v": "DEV"
    },	
	{
      "k": "Code",
      "v": "POL"
    },
	{
      "k": "Asset",
      "v": "asset1.json"
    }
]

I expect to be able to use a multi-key index on Tags, and it should be a covered query if the index includes Owner and Tags.

I cannot get the lookup pipeline above to work for me in Compass. It complains localField and foreignField cannot be used with Pipeline. I am not sure how anything involving the unwind can work efficiently as it does not appear to be able to use an index.

steevej · January 20, 2023, 1:21am

Try updating Compass. The localField/foreignField with pipeline version is recent.

With your sample documents, the index

{Owner:1,Tags.k:1,Tags.v:1}

and the following pipeline:

[       { '$match': { Owner: 'Test' } },
        { '$sort': { 'Tags.k': 1, 'Tags.v': 1 } },
        { '$project': { 'Tags.k': 1, 'Tags.v': 1 } },
        { '$unwind': '$Tags' },
        { '$group': { _id: { k: '$Tags.k', v: '$Tags.v' } } }
]

produce the result:

{ _id: { k: 'Asset', v: 'asset1.json' } }
{ _id: { k: 'Type', v: 'Asset' } }
{ _id: { k: 'Code', v: 'POL' } }
{ _id: { k: 'Type', v: 'Metric' } }
{ _id: { k: 'Code', v: 'DEV' } }

the explain plan indicates

stage: 'IXSCAN',
keyPattern: { Owner: 1, 'Tags.k': 1, 'Tags.v': 1 },
indexName: 'Owner_1_Tags.k_1_Tags.v_1',
isMultiKey: true,
multiKeyPaths: { Owner: [], 'Tags.k': [ 'Tags' ], 'Tags.v': [ 'Tags' ] },
/* some fields omitted */
indexBounds: 
                       { Owner: [ '["Test", "Test"]' ],
                         'Tags.k': [ '[MinKey, MaxKey]' ],
                         'Tags.v': [ '[MinKey, MaxKey]' ] }

But it looks like it is not covered since IXSCAN is under FETCH and totalDocsExamined is not 0.

And it looks this is the best that can be done according to

Richard_Hannah · January 20, 2023, 10:45am

@steevej thanks for your efforts. I tried this out but unfortunately it is still taking seconds across 100,000s of documents. It also has thrown:
Command aggregate failed: Sort exceeded memory limit of 104857600 bytes

It is odd that there is not a way of extracting the data that is fully covered by an index. I guess the only way to get this to perform is to maintain a separate collection of unique tag key-value pairs with reference counts and keep it in sync with the documents having tags? This would be a pain.

Any thoughts on alternatives would be appreciated.

steevej · January 20, 2023, 1:50pm

For

I will simply remove the $sort. You still get an IXSCAN without it. The $match and $project seems sufficient to use the index.

Note that performance issues are not strictly caused by the code logic. It also depends on the hardware you use. If the working set does not fit in RAM, you will be I/O bound and that is slow.

Read Atlas Search to return all the applicable filters for a given search query without specifying - #2 by Erik_Hatcher, the idea is smart as you may be able to add an extra field to speed up things rather than a new collection. Your documents could look like:

[
{
  "Tags": [
    {
      "k": "Type",
      "v": "Asset"
    },
	{
      "k": "Code",
      "v": "DEV"
    },
  ], 
  "k" : [ "Type" , "Code" ]
  "Owner": "Test"
},
{
  "Tags": [
    {
      "k": "Type",
      "v": "Metric"
    },
	{
      "k": "Code",
      "v": "POL"
    },
  ],
 "k" : [ "Type" , "Code" ] ,
  "Owner": "Test"
},
{
  "Tags": [
    {
      "k": "Type",
      "v": "Asset"
    },
	{
      "k": "Asset",
      "v": "asset1.json"
    },
  ],
  "k" : [ "Type" , "Asset" ] ,
  "Owner": "Test"
}
]

The index Owner:1,k:1 would be a much smaller working set.

Another thing you could try is to use the index Owner:1,Tags:1 to see if the index size is smaller it may help.

Schema wise, do you really need the attribute pattern for tags. The range of value of “k” seems to be quite small so having direct attributes “Type”, “Asset” and “Code” might be much better.

Richard_Hannah · February 6, 2023, 9:18am

Just to let you know that I decided to abandon the approach of getting all unique key-value pairs altogether, managing to get the same information from dedicated properties and elsewhere. Many thanks for your help. I will mark the above as a solution, as ultimately it is as good as it could be.

system · February 11, 2023, 9:18am

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.