In vetting MongoDB as a Time Series solution I wanted to try a simple data aggregation on financial data. My documents have the following schema:
```js
{
  "time": <BSON Time Field>,
  "metadata": {
    "exchange": "<Exchange Name>",
    "symbol": "<Market Symbol>",
    "timestamp": "<Microsecond Timestamp>",
    "tradeId": "<String TradeID>"
  },
  "side": "<buy or sell>",
  "price": 12345,
  "amount": 1
}
```
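For concreteness, a single trade following this schema would look something like the document below (all values hypothetical):

```js
// One hypothetical trade document matching the schema above.
const sampleTrade = {
  time: new Date("2022-06-01T12:34:56.789Z"), // BSON Date (the timeField)
  metadata: {
    exchange: "ftx",
    symbol: "BTC/USD:USD",
    timestamp: "1654086896789123", // microsecond timestamp, kept as a string
    tradeId: "1234567890"
  },
  side: "buy",
  price: 31250.5,
  amount: 0.01
};
```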
I created my collection in the following manner:
```js
db.createCollection("trades", {
  timeseries: {
    timeField: "time",
    metaField: "metadata"
  }
});
```
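For reference, I left the bucketing granularity at MongoDB's default of `"seconds"`; it could also have been set explicitly when creating the collection, along these lines (mongosh sketch, not something I have benchmarked):

```js
// mongosh: same collection, with the bucketing granularity made explicit.
// "seconds" is the default; "minutes" or "hours" pack more documents per bucket.
db.createCollection("trades", {
  timeseries: {
    timeField: "time",
    metaField: "metadata",
    granularity: "seconds"
  }
});
```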
I tried to follow the approach taken in the article “Currency Analysis with Time Series Collections”, adapting it to my data model, but ran into poor performance on what seems like a trivial OHLC aggregation.
My aggregation:
```js
db.trades.aggregate([
  {
    $match: {
      "metadata.exchange": "ftx",
      "metadata.symbol": "BTC/USD:USD",
    },
  },
  {
    $group: {
      _id: {
        exchange: "$metadata.exchange",
        symbol: "$metadata.symbol",
        time: {
          $dateTrunc: {
            date: "$time",
            unit: "minute",
            binSize: 5,
          },
        },
      },
      high: { $max: "$price" },
      low: { $min: "$price" },
      open: { $first: "$price" },
      close: { $last: "$price" },
    },
  },
  {
    $sort: {
      "_id.time": 1,
    },
  },
]);
```
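To make the intended semantics explicit, the `$dateTrunc` + `$group` stages correspond to this plain-JavaScript sketch over in-memory trades (a hypothetical helper, not part of the actual pipeline; note that `$first`/`$last` depend on input order, just as the comments say):

```js
// Plain-JS sketch of the 5-minute OHLC grouping done by the pipeline above.
// Trades are assumed pre-filtered (one exchange/symbol) and in time order.
function ohlc(trades, binMinutes = 5) {
  const binMs = binMinutes * 60 * 1000;
  const buckets = new Map();
  for (const t of trades) {
    // Equivalent of $dateTrunc { unit: "minute", binSize: 5 }.
    const bucketStart = Math.floor(t.time.getTime() / binMs) * binMs;
    let b = buckets.get(bucketStart);
    if (!b) {
      b = { time: new Date(bucketStart), open: t.price, high: t.price, low: t.price, close: t.price };
      buckets.set(bucketStart, b);
    } else {
      b.high = Math.max(b.high, t.price);
      b.low = Math.min(b.low, t.price);
      b.close = t.price; // $last relies on input order, as does the pipeline
    }
  }
  // Equivalent of the trailing $sort on _id.time.
  return [...buckets.values()].sort((a, b) => a.time - b.time);
}
```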
However, when I run this and inspect the explain output, it seems that no matter which indexes I create I cannot prevent a FETCH stage, and as a result the aggregation is terribly slow. The winning plan from one of my trials looks like this:
```js
{
  stage: 'FETCH',
  filter: { 'meta.symbol': { '$eq': 'BTC/USD:USD' } },
  inputStage: {
    stage: 'IXSCAN',
    keyPattern: {
      'meta.exchange': 1,
      'control.min.price': 1,
      'control.max.price': 1
    },
    indexName: 'metadata.exchange_1_price_1',
    isMultiKey: false,
    multiKeyPaths: {
      'meta.exchange': [],
      'control.min.price': [],
      'control.max.price': []
    },
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: {
      'meta.exchange': [ '["ftx", "ftx"]' ],
      'control.min.price': [ '[MinKey, MaxKey]' ],
      'control.max.price': [ '[MinKey, MaxKey]' ]
    }
  }
}
```
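For reference, the plan above was pulled from an explain run along these lines (mongosh sketch, pipeline abbreviated):

```js
// mongosh: inspect the winning plan for the aggregation.
db.trades.explain("executionStats").aggregate([
  { $match: { "metadata.exchange": "ftx", "metadata.symbol": "BTC/USD:USD" } },
  // ...remaining $group / $sort stages as shown above
]);
```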
(NOTE: I’ve created indexes on almost all permutations of the “used” fields of my data model, i.e. metadata.exchange, metadata.symbol, time, and price, and this seems to be the best plan I can get.)
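The indexes listed below were created with commands of this form, for example:

```js
// mongosh: one of the secondary indexes from the list below.
db.trades.createIndex({ "metadata.exchange": 1, "metadata.symbol": 1, price: 1 });
```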
Current Indexes
```js
[
  { v: 2, key: { price: 1, time: 1, 'metadata.exchange': 1 }, name: 'price_1_time_1_metadata.exchange_1' },
  { v: 2, key: { 'metadata.exchange': 1, price: 1 }, name: 'metadata.exchange_1_price_1' },
  { v: 2, key: { 'metadata.exchange': 1, 'metadata.symbol': 1, price: 1 }, name: 'metadata.exchange_1_metadata.symbol_1_price_1' },
  { v: 2, key: { 'metadata.symbol': 1, price: 1 }, name: 'metadata.symbol_1_price_1' },
  { v: 2, key: { 'metadata.exchange': 1, time: 1 }, name: 'metadata.exchange_1_time_1' },
  { v: 2, key: { 'metadata.exchange': 1, time: 1, price: 1 }, name: 'metadata.exchange_1_time_1_price_1' },
  { v: 2, key: { 'metadata.exchange': 1, price: 1, time: 1 }, name: 'metadata.exchange_1_price_1_time_1' },
  { v: 2, key: { 'metadata.symbol': 1, 'metadata.exchange': 1, price: 1 }, name: 'metadata.symbol_1_metadata.exchange_1_price_1' },
  { v: 2, key: { 'metadata.exchange': 1, 'metadata.symbol': 1 }, name: 'metadata.exchange_1_metadata.symbol_1' },
  { v: 2, key: { 'metadata.exchange': 1 }, name: 'metadata.exchange_1' },
  { v: 2, key: { 'metadata.exchange': 1, time: -1 }, name: 'metadata.exchange_1_time_-1' },
  { v: 2, key: { 'metadata.exchange': 1, price: 1, time: -1 }, name: 'metadata.exchange_1_price_1_time_-1' }
]
```

I’m currently doing this with only 486k documents in my collection, but I expect roughly 100 million documents for just a single symbol that I’m looking to track. This runs in Kubernetes on a sharded MongoDB deployment (7 shards with 2 replicas per shard, 2 mongos routers, and 1 config server with 1 backup), with no resource limits set on the shard servers. Is there any way I can get these queries to start coming back in a reasonable amount of time (sub-second for <1 million documents)?