How to use conditionals (and more?) to combine and reduce data

Hi All! I’ve got a doozy… I need to reduce some data from a large collection. The data I am focusing on looks like this:

[
  {
    "HistogramData": {
      "HistogramDataColumnTitle": {
        "ColumnTitle": [
          "Tran  PPV  in/s",
          "Tran  Freq  Hz",
          "Vert  PPV  in/s",
          "Vert  Freq  Hz",
          "Long  PPV  in/s",
          "Long  Freq  Hz",
          "Geophone  PVS  in/s"
        ]
      },
      "HistogramDataIntervals": {
        "Time": "2021-05-20 12:22:32",
        "Row": {
          "Cell": [
            "0.1462",
            "46.5",
            "0.0525",
            ">100",
            "0.5494",
            "25.6",
            "0.5577"
          ]
        }
      }
    }
  },
  {
    "HistogramData": {
      "HistogramDataColumnTitle": {
        "ColumnTitle": [
          "Tran  PPV  in/s",
          "Tran  Freq  Hz",
          "Vert  PPV  in/s",
          "Vert  Freq  Hz",
          "Long  PPV  in/s",
          "Long  Freq  Hz",
          "Geophone  PVS  in/s"
        ]
      },
      "HistogramDataIntervals": [
        {
          "Time": "2021-05-22 19:15:51",
          "Row": {
            "Cell": [
              "0.0050",
              "19.7",
              "0.0056",
              "7.4",
              "0.0050",
              "28.4",
              "0.0063"
            ]
          }
        },
        {
          "Time": "2021-05-22 19:30:51",
          "Row": {
            "Cell": [
              "0.0050",
              "60.2",
              "0.0044",
              "20.5",
              "0.0050",
              "48.8",
              "0.0063"
            ]
          }
        }
      ]
    }
  }
]

and I somehow need to reduce it to this:

[
  {
    "Time": "2021-05-20 12:22:32",
    "Max  PPV  in/s": 0.5494,
    "Max  Freq  Hz": 25.6
  },
  {
    "Time": "2021-05-22 19:15:51",
    "Max  PPV  in/s": 0.0056,
    "Max  Freq  Hz": 7.4
  },
  {
    "Time": "2021-05-22 19:30:51",
    "Max  PPV  in/s": 0.0050,
    "Max  Freq  Hz": 60.2
  }
]

What is going on here? The values in the “Cell” array correspond to the “ColumnTitle” array. The values and dates are stored as strings for various reasons that cannot be addressed, let’s take it as a given. A document may have one set of values as you see in the first document, or multiple sets of values as you see in the second document. Regardless, for each set of values, I must compare all “PPV in/s” values, find the max, and pair with the corresponding “Freq Hz”. Add time and that’s what I present as the desired outcome. Presenting the results in a new array after “Cell” is also acceptable.

I’ve been going around in circles trying to figure out how to implement different stages in an aggregation pipeline. Any help would be much appreciated. Thank you in advance!

Are your ColumnTitle always the same and in the same orders?

When you ask for Max PPV, do you mean the max between Tran PPV, Vert PPV and Long PPV?

When HistogramDataIntervals is an array, do you want the Time and Values of the row that has the max as defined in the previous sentence?

Your input values seem to be string and your result shows number. What to do if the Freq of the Max PPV is the string value >100 as you have in one of your input documents?

I’ve identified three types of sensors. The “combo” sensors have all the seven channels I presented above and always report out in that order. The other type does not have the “Geophone” channel. Lastly there are sensors that only have a “Geophone” channel.

Yes, max PPV is the max between Tran PPV, Vert PPV and Long PPV.

If you mean presenting the max values alongside the original values then yes that would be good. Time must be presented, not optional.

Good catch. When frequency is >100 I plot is as 100. It could be >200 too and I plot that as 200.

I’ve been contemplating whether having an intermediate step like this would be beneficial:

[
  {
    "Time": "2021-05-20 12:22:32",
    "Tran  PPV  in/s": "0.1462",
    "Tran  Freq  Hz": "46.5",
    "Vert  PPV  in/s": "0.0525",
    "Vert  Freq  Hz": ">100",
    "Long  PPV  in/s": "0.5494",
    "Long  Freq  Hz": "25.6",
    "Geophone  PVS  in/s": "0.5577"
  },
  {
    "Time": "2021-05-22 19:15:51",
    "Tran  PPV  in/s": "0.0050",
    "Tran  Freq  Hz": "19.7",
    "Vert  PPV  in/s": "0.0056",
    "Vert  Freq  Hz": "7.4",
    "Long  PPV  in/s": "0.0050",
    "Long  Freq  Hz": "28.4",
    "Geophone  PVS  in/s": "0.0063"
  },
  {
    "Time": "2021-05-22 19:30:51",
    "Tran  PPV  in/s": "0.0050",
    "Tran  Freq  Hz": "60.2",
    "Vert  PPV  in/s": "0.0044",
    "Vert  Freq  Hz": "20.5",
    "Long  PPV  in/s": "0.0050",
    "Long  Freq  Hz": "48.8",
    "Geophone  PVS  in/s": "0.0063"
  }
]

Thank you so much, Steeve!

You may skip reading this post as a correct solution is in the next post

Last minute note: You may read after COMPLEX AND WRONG SOLUTION BELOW but I found that I err and I could not produce

in the result set because I was going for a single PPV max per input documents. And it looks you want a PPV max per Row.Cell. This is simpler because a lot of $map can be removed from the solution below once you $unwind.

I find it always beneficial. But I would start with

_unwind = { "$unwind" : "$HistogramData.HistogramDataIntervals" }
_project = { "$project" : {
    "_id" : 0 ,
    "time" : "$HistogramData.HistogramDataIntervals.Time" ,
    "row" : "$HistogramData.HistogramDataIntervals.Row.Cell"
} }

to simply the data to:

{ time: '2021-05-20 12:22:32',
  row: [ '0.1462', '46.5', '0.0525', '>100', '0.5494', '25.6', '0.5577' ] }
{ time: '2021-05-22 19:15:51',
  row: [ '0.0050', '19.7', '0.0056', '7.4', '0.0050', '28.4', '0.0063' ] }
{ time: '2021-05-22 19:30:51',
  row: [ '0.0050', '60.2', '0.0044', '20.5', '0.0050', '48.8', '0.0063' ] }

Some of the stages from the wrong solution below can be use but in a much simpler form. In most only the inner $map is required now that it is $unwind. The _set_indices stays the same. For example _set_PPV with the double $map becomes

_set_PPV = { "$set" : {
    "_PPV" : { "$map" : {
        "input" : "$_ppv_indices" ,
        "as" : "index" ,
        "in" : { "$arrayElemAt" : [ "$$row" , "$$index" ] }
    } }
} }

I will publish a real solution, after I sleep. Good Night.

COMPLEX AND WRONG SOLUTION BELOW

The fact that some times HistogramDataIntervals is an object and at other times it is an array.

My first stage would be a $set to unify everything to an array. Something like:

_set_Data = { "$set" : { 
    "_Data" : { "$cond" : {
        "if" : { "$isArray" : "$HistogramData.HistogramDataIntervals" } ,
        "then" : "$HistogramData.HistogramDataIntervals" ,
       "else" : [ "$HistogramData.HistogramDataIntervals" ]
    } } ,
} }

Then another set stage which could be combine with the above but I like to separate them. This set just set fields to then index needed to get the ppv and freq values in different arrays.

Divide and Conquer

_set_indices = { "$set" : {
    "_ppv_indices" : [ 0 , 2 , 4 ] ,
    "_freq_indices" : [ 1 , 3 , 5 ]
} }

Next transformation would use $map on _Data to extract PPV values into a separate array using _ppv_indices for the Row.Cell embedded array. Something like:

_set_PPV = { "$set" : {
    "_PPV" : { "$map" : {
        "input" : "$_Data" ,
        "as" :  "data" ,
        "in" : {
            "Time" : "$$data.Time" , 
            "ppv" : { "$map" : {
                "input" : "$_ppv_indices" ,
                "as" : "index" ,
                "in" : { "$arrayElemAt" : [ "$$data.Row.Cell" , "$$index" ] }
            } }
         }
    } }
} }

Doing the same as above but for Freq.

_set_Freq = { "$set" : {
    "_Freq" : { "$map" : {
        "input" : "$_Data" ,
        "as" :  "data" ,
        "in" : {
            "Time" : "$$data.Time" , 
            "freq" : { "$map" : {
                "input" : "$_freq_indices" ,
                "as" : "index" ,
                "in" : { "$arrayElemAt" : [ "$$data.Row.Cell" , "$$index" ] }
            } }
         }
    } }
} }

My next transformation would be to user $map to convert the values of _Freq.freq and _PPV.ppv to numbers.

_convert_PPV = { "$set" : {
    "_PPV_converted" : { "$map" : {
        "input" : "$_PPV" ,
        "as" :  "data" ,
        "in" : { 
            "Time" : "$$data.Time" , 
            "ppv" : { "$map" : {
                "input" : "$$data.ppv" ,
                "as" : "value" ,
                "in" : { "$convert" : {
                    "input" : "$$value" ,
                    "to" : "double" ,
                    "onError" : { "$cond" : [ { "$eq" : [ ">100" , "$$value" ] } , 100.0 , 200.0 ] }
                } }
            } }
         }
    } }
} }

Ditto for Freq.

_convert_Freq = { "$set" : {
    "_Freq_converted" : { "$map" : {
        "input" : "$_Freq" ,
        "as" :  "data" ,
        "in" : {
            "Time" : "$$data.Time" , 
            "freq" : { "$map" : {
                "input" : "$$data.freq" ,
                "as" : "value" ,
                "in" : { "$convert" : {
                    "input" : "$$value" ,
                    "to" : "double" ,
                    "onError" : { "$cond" : [ { "$eq" : [ ">100" , "$$value" ] } , 100.0 , 200.0 ] }
                } }
            } }
         }
    } }
} }

Now the data (_PPV_converted and _Freq_converted) is in a workable format to get the max ppv.

_max_ppv = { "$set" : {
    "_max_ppv" : { "$map" : {
        "input" : "$_PPV_converted" ,
        "as" :  "data" ,
        "in" : {
            "Time" : "$$data.Time" , 
            "ppv" : "$$data.ppv" ,
            "max" : { "$max" : "$$data.ppv" }
         }
    } }
} }

Now that we know the max within which time row we can find the max of the document.

_max = { "$set" : {
    "_max" : { "$max" : "$_max_ppv.max" }
} }

The rest gonna be for another day. I am done for today. As a final note a lot of those $set can be done in a single stage. But when developing I find it easy to keep them separate. Optimization can come after accuracy.

Let’s start again.

First 2 stages:

_unwind = { "$unwind" : "$HistogramData.HistogramDataIntervals" }
_project = { "$project" : {
    "_id" : 0 ,
    "time" : "$HistogramData.HistogramDataIntervals.Time" ,
    "row" : "$HistogramData.HistogramDataIntervals.Row.Cell"
} }

Then a simple stage that extracts the PPV values in a separate array.

_set_PPV = { "$set" : {
    "_PPV" : { "$map" : {
        "input" : [ 0 , 2 , 4 ] ,
        "as" : "index" ,
        "in" : { "$arrayElemAt" : [ "$row" , "$$index" ] }
    } }
} }

Then a stage to convert the new PPV.

_convert_PPV = { "$set" : {
    "_PPV_converted" : { "$map" : {
        "input" : "$_PPV" ,
        "as" :  "ppv" ,
        "in" : { "$convert" : {
            "input" : "$$ppv" ,
            "to" : "double"
        } }
    } }
} }

Then we get the _max_PPV

_max_ppv = { "$set" : {
    "_max_ppv" : { "$max" : "$_PPV_converted" }
} }

The we use $indexOfArray to find the index of _max_ppv inside the _PPV_converted array.

_index = { "$set" : {
    "_index" : { "$indexOfArray" : [ "$_PPV_converted" , "$_max_ppv" ] }
} }

My next stage would be to find the corresponding Freq in row which is the index of max ppv + 1

_freq = { "$set" : {
    "_freq" : { "$arrayElemAt" : [ "$row" , { "$add" : [ { "$multiply" : [ "$_index" , 2 ] } , 1 ] } ] }
} }

All this for a pipeline that looks like:

[ _unwind , _project , _set_PPV , _convert_PPV , _max_ppv , _index , _freq ]

That provides the following:

{ time: '2021-05-20 12:22:32',
  row: [ '0.1462', '46.5', '0.0525', '>100', '0.5494', '25.6', '0.5577' ],
  _PPV: [ '0.1462', '0.0525', '0.5494' ],
  _PPV_converted: [ 0.1462, 0.0525, 0.5494 ],
  _max_ppv: 0.5494,
  _index: 2,
  _freq: '25.6' }
{ time: '2021-05-22 19:15:51',
  row: [ '0.0050', '19.7', '0.0056', '7.4', '0.0050', '28.4', '0.0063' ],
  _PPV: [ '0.0050', '0.0056', '0.0050' ],
  _PPV_converted: [ 0.005, 0.0056, 0.005 ],
  _max_ppv: 0.0056,
  _index: 1,
  _freq: '7.4' }
{ time: '2021-05-22 19:30:51',
  row: [ '0.0050', '60.2', '0.0044', '20.5', '0.0050', '48.8', '0.0063' ],
  _PPV: [ '0.0050', '0.0044', '0.0050' ],
  _PPV_converted: [ 0.005, 0.0044, 0.005 ],
  _max_ppv: 0.005,
  _index: 0,
  _freq: '60.2' }

All the data is there (even the intermediate values for debug purpose), what is missing is some cosmetic $project to present the fields time, _max_ppv, and _freq in the format you wish. I will leave that to the reader as an exercise.