Flexible data & aggregation with filtering (Indexes, Just MongoDB vs Parquet + Spark?)

Hello,

My data is highly flexible. I have a collection called contacts, and this is what one document looks like:

{
  "workspace_id": 1
  "attributes": {
    "first_name": "John",
    "last_name": "Doe",
    "email": "john.doe@example.net"
  },
  "events": [
    {
      "event": "order",
      "event_data": {
        "total": 100,
        "items": [
          {
            "product": "T-Shirt",
            "quantity": 1,
            "price": 100,
            "total": 100
          }
        ]
      }
    }
  ]
}

As you can see, a contact can have many events, and their structure can vary. Different users have different event structures.

Currently, I’m using wildcard indexes.
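Roughly like this (a sketch of what I mean, assuming the collection is named contacts and nothing else is indexed beyond _id):

db.contacts.createIndex({ "attributes.$**": 1 })
db.contacts.createIndex({ "events.$**": 1 })

However, I’m wondering whether restructuring the documents into the attribute (key/value) pattern might be better: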

{
  "workspace_id": 1
  "attributes": {
    "first_name": "John",
    "last_name": "Doe",
    "email": "john.doe@example.net"
  },
  "events": [
    {
      "event": "order",
      "event_data": [
        {
          "k": "total",
          "v": 100,
        },
        {
          "k": "items",
          "v": [
            {
              "k": "product",
              "v": "T-Shirt"
            },
            {
              "k": "quantity",
              "v": 1
            },
            {
              "k": "price",
              "v": 100
            },
            {
              "k": "total",
              "v": 100
            }
          ]
        }
      ]
    }
  ]
}

and then add indexes on "k" and "v". The attributes could, of course, follow the same pattern.
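
Something like this is what I have in mind (a sketch, assuming attributes is also restructured into a k/v array):

db.contacts.createIndex({ "events.event_data.k": 1, "events.event_data.v": 1 })
db.contacts.createIndex({ "attributes.k": 1, "attributes.v": 1 })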

I would also need aggregations; for example, filtering all contacts who had the "order" event 3 times in Q1 2023.
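
Against the first (nested) structure, I imagine something like the following pipeline (a sketch, assuming each event carries a timestamp field such as created_at, which I left out of the samples above, and counting "at least 3" orders):

db.contacts.aggregate([
  // limit to one workspace first
  { $match: { workspace_id: 1 } },
  // count "order" events that fall inside Q1 2023
  { $addFields: {
      order_count: {
        $size: {
          $filter: {
            input: "$events",
            as: "e",
            cond: {
              $and: [
                { $eq: ["$$e.event", "order"] },
                { $gte: ["$$e.created_at", ISODate("2023-01-01")] },
                { $lt:  ["$$e.created_at", ISODate("2023-04-01")] }
              ]
            }
          }
        }
      }
  } },
  // keep contacts with at least 3 such events
  { $match: { order_count: { $gte: 3 } } }
])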

So, my questions are:

  1. Which pattern would be more efficient for supporting unstructured, flexible data?

  2. What about streaming the data to an S3 bucket, transforming it into Parquet, and then using Spark for the filtering and aggregations?

  3. What about sharding? Should we use it, or just go with Spark?

I’m looking for the most efficient solution, cost included, since Spark is a RAM-hungry framework.

Thanks!