Help with an aggregation pipeline

I’m trying to put together an aggregation pipeline that filters a large data set and returns a paginated subset, along with the total count and the earliest timestamp across all matches (regardless of the page). I’d very much like to avoid a FETCH stage.

[
  {
    $match: {
      $and: [
        {
          "identifier.arrayField": {
            $in: ["0000000000001"],
          },
        },
        {
          someArrayField: {
            $exists: true,
          },
        },
        {
          someArrayField: {
            $elemMatch: {
              itemId: {
                $in: ["65353aeabf6d627ba9bdc5cc"],
              },
            },
          },
        },
        {
          anotherField: {
            $ne: "65361492e800c0150c7a7ed",
          },
        },
        {
          anIndexedField: "value",
        },
      ],
    },
  },
  {
    $facet: {
      items: [
        {
          $sort: {
            "sort_field": 1,
          },
        },
        {
          $skip: 0,
        },
        {
          $limit: 10,
        },
      ],
      count: [
        {
          $count: "count",
        },
      ],
      ets: [
        {
          $group: {
            _id: null,
            earliestTimestamp: {
              $min: "$createDate",
            },
          },
        },
      ],
    },
  },
  {
    $addFields: {
      total: {
        $arrayElemAt: ["$count.count", 0],
      },
      earliestTimestamp: {
        $arrayElemAt: [
          "$ets.earliestTimestamp",
          0,
        ],
      },
    },
  },
]

The resulting explain output is extremely long, but I can paste it in if it helps.

Ideally, I would like to avoid fetching the entire matching collection just for the sake of a count and the earliest timestamp.

Why not just make two calls: one for the earliest timestamp and the total, and another for the page of data you want (try to avoid $skip if you can, and use a range filter instead)? Do you need the total every time, in case more documents are added while you’re paging? If so, can you guarantee that an added document will sort after the previously shown results?
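A minimal sketch of the two-call approach, expressed as plain JS objects you could pass to a mongosh-style driver. The field names are taken from the pipeline above; everything else (`lastSeenSortValue`, the commented `db.coll` calls) is illustrative, not a definitive implementation:

```javascript
// Shared filter — same conditions as the $match in the $facet pipeline.
const filter = {
  "identifier.arrayField": { $in: ["0000000000001"] },
  someArrayField: {
    $exists: true,
    $elemMatch: { itemId: { $in: ["65353aeabf6d627ba9bdc5cc"] } },
  },
  anotherField: { $ne: "65361492e800c0150c7a7ed" },
  anIndexedField: "value",
};

// Call 1: total + earliest timestamp in a single $group pass; this may
// avoid fetching documents if every referenced field is in an index.
const statsPipeline = [
  { $match: filter },
  {
    $group: {
      _id: null,
      total: { $sum: 1 },
      earliestTimestamp: { $min: "$createDate" },
    },
  },
];
// db.coll.aggregate(statsPipeline);

// Call 2: the page itself. Range-based ("keyset") pagination instead of
// $skip — remember the last sort_field value shown on the previous page.
// db.coll
//   .find({ ...filter, sort_field: { $gt: lastSeenSortValue } })
//   .sort({ sort_field: 1 })
//   .limit(10);
```

The range filter keeps each page request an index-bounded scan, whereas $skip still walks and discards all the skipped entries.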

This avoids the complexity and the possible limitations of a $facet stage.

You seem to have most of the query written already; what’s the problem you’re facing? If you’re seeing non-indexed scans, then your index isn’t matching your filter.

I was thinking it might have to come to that. I was just hoping everything could be done “nicely” within a single aggregation, but what I want resembles a tree of pipelines rather than a linear one.

The query is written and working, but performance benchmarking against the earlier version (which made separate calls) showed that the new solution is worse. That was very surprising, since the old version fetched the whole collection and did the sort and pagination in Java instead. Still, the old version is faster.

We think the problem is a FETCH after the IXSCAN: the index is being used, but the $facet then seems to fetch every matching document.
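One way to confirm this is to compare `totalKeysExamined` with `totalDocsExamined` in the explain output: if they are roughly equal and large, every index hit is also being fetched. A tiny hypothetical helper for reading those numbers (the `fetchRatio` name and the commented `db.coll.explain` usage are illustrative; the field names follow the MongoDB explain format):

```javascript
// Hypothetical helper: given the executionStats section of an explain
// result, return docs examined per index key examined. A ratio near 1.0
// over many keys means the IXSCAN is followed by a FETCH for every match.
function fetchRatio(executionStats) {
  const { totalKeysExamined, totalDocsExamined } = executionStats;
  return totalDocsExamined / Math.max(totalKeysExamined, 1);
}

// Usage in mongosh (illustrative):
// const ex = db.coll.explain("executionStats").aggregate(pipeline);
// fetchRatio(ex.stages[0].$cursor.executionStats); // near 1.0 => full FETCH
```

Note that the sub-pipelines inside $facet cannot use indexes themselves, so every document that passes the initial $match is handed to the facets in full, which would explain the FETCH you’re seeing.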

I have another aggregation that’s similar but isn’t supposed to fetch any documents, just their counts, grouping them into 5 different facets after applying each group’s individual match. Again, this is slower than making 5 separate calls.

As you say, it might just be a limitation of $facet. I confess I didn’t read its documentation thoroughly enough.