How to run a compression aggregation on timeseries data?

Zoltan_Zvara · September 11, 2023, 5:22pm

I’m searching for an algorithm in MongoDB concerning the following problem. Assume the following data (simplified):

{ ts: 1,  x: 1 } // ts: timestamp, x: integer value
{ ts: 2,  x: 2 }
{ ts: 3,  x: 2 }
{ ts: 4,  x: 2 }
{ ts: 5,  x: 1 }
{ ts: 6,  x: 1 }
{ ts: 7,  x: 4 }
{ ts: 8,  x: 4 }
{ ts: 9,  x: 4 }
{ ts: 10, x: 5 }
{ ts: 12, x: 5 }
{ ts: 13, x: 1 }
{ ts: 14, x: 1 }
{ ts: 15, x: 1 }
{ ts: 16, x: 1 }
{ ts: 17, x: 2 }

I’m looking for an algorithm that compresses the data points based on x in the timeseries, essentially converting the timeseries into a series of change points rather than returning all individual data points.

The output would be the following (assuming that I prefiltered the data for 2 < ts < 16):

{ ts: 1,  x: 1 } // filtered out
{ ts: 2,  x: 2 } // filtered out
{ ts: 3,  x: 2 } *
{ ts: 4,  x: 2 } // removed, because no change in `x`
{ ts: 5,  x: 1 } *
{ ts: 6,  x: 1 } // removed, because no change in `x`
{ ts: 7,  x: 4 } *
{ ts: 8,  x: 4 } // removed, because no change in `x`
{ ts: 9,  x: 4 } // removed, because no change in `x`
{ ts: 10, x: 5 } *
{ ts: 12, x: 5 } // removed, because no change in `x`
{ ts: 13, x: 1 } *
{ ts: 14, x: 1 } // removed, because no change in `x`
{ ts: 15, x: 1 } * (kept to mark the end of the timeseries)
{ ts: 16, x: 1 } // filtered out
{ ts: 17, x: 2 } // filtered out

Final output (each data marks the first point of value change):

{ ts: 3,  x: 2 }
{ ts: 5,  x: 1 }
{ ts: 7,  x: 4 }
{ ts: 10, x: 5 }
{ ts: 13, x: 1 }
{ ts: 15, x: 1 }

My implementation ideas and attempts include the following:

Using an aggregation for each data point, look up the previous data point in the collection and merge the previous ts with the current one. | 1M data points, 5 minutes of runtime, frequent OOMs
Using an aggregation group all data together, run a pairwise reduce, detect the changes, and then filter out elements without a change. | 1M data points, 5 minutes of runtime, frequent OOMs
(in progress) Using an aggregation group all data together and run a custom $function that scans the data and detects the change points. | (in progress)

Should MongoDB be used for these algorithms (it seems not), or am I missing something?

John_Sewell · September 11, 2023, 7:05pm

This does seem similar to a post a while back, I came up with an abomination of a solution but the one on SO was performant.
The idea is kind of the same where you want to watch for groups of data that are sequential, i.e. next to each other…

I’ve not played with time series data in Mongo so a similar type of approach may not be suitable…

steevej · September 12, 2023, 12:24pm

It would be great if you could share the complete aggregations you tried. Especially for

because I feel it is the way to go and very surprised that you get OOM.

taglas_tamas · September 12, 2023, 6:44pm

Using mainly the $functions operator, the pipeline would look something like this:

[
  {
    $sort: {
      ts: 1,
    }, // Sort the documents by ts in case they are not sorted
  },
  {
    $set: {
      prevValue: {
        $function: {
          body: function (
            currentValue,
            prevValue
          ) {
            return currentValue - prevValue;
          },
          args: ["$x", 0],
          // 0 for the initial previous value
          lang: "js",
        },
      },
    },
  },
  {
    $match: {
      $expr: {
        $ne: ["$x", "prevValue"],
      },
    },
  },
  {
    $project: {
       ts: 1,
       x: 1
    },
  },
  {
    $out:
      "agg-out",
  },
]