Need help defining the right document structure and indexes

Hi,

I’m using MongoDB 4.x with the PHP library/extension on PHP 8.1.
I’m building a small customer feedback SaaS where the entries are stored in MongoDB.

The problem I am facing is that I have to query pretty much every key at some point, which makes it very hard to decide on a structure and on indexes.

This is what I have so far:

{
    "_id": {
        "$oid": "62af099d90dcb82b1f750b12"
    },
    "survey_id": 205380,
    "account_id": 1005,
    "ws": 15,
    "date_reg": {
        "$date": "2022-06-19T11:33:49.000Z"
    },
    "time_reg": "11:33:49",
    "read": 0,
    "ip": "x",
    "ent": {
        "person": "Steve",
        "location": "California",
        "date": "2022-06-01"
    },
    "topics": ["terrible", "quality", "bug"],
    "tags": [10091, 15335, 235235, 23235],
    "sentiment": 1,
    "country": "",
    "star": 0,
    "ref": "",
    "page": "x",
    "meta": {
        "os": 1,
        "dev": 0
    },
    "answers": [{
        "answer": "sample answer 1",
        "original_answer": "demo antwoord 1",
        "type": 0,
        "qid": 5768,
        "datatype": 0,
        "so": 0,
        "q": "Let's get started! What is your first name?"
    }, {
        "answer": 2,
        "original_answer": "",
        "type": 4,
        "qid": 5770,
        "datatype": 16,
        "so": 1,
        "q": "Thanks Let's get started..., how likely are you to..."
    }, 
etc]
}

account_id + survey_id will be in pretty much every query.
All the other keys have to be queried individually, or several/all at once (when a customer selects filters).

The answers array contains all the answers; here too, pretty much every key is important, except original_answer and so.

Breakdown of what each key does:

  1. answer: the answer that was given.
    Its function: viewing the given answer, and search queries.

  2. original_answer: the answer in the respondent’s native language.
    My customer can choose to auto-translate the answer into their own language.
    Its function: search queries.

  3. type: the question type, e.g. textfield=0, dropdown=1, textarea=2, checkbox=3, etc.
    Function: for use in summaries/reporting, to quickly select by question type.
    Come to think of it, it may be less important because I can also query by question id.

  4. qid: the question id.
    Function: for use in summaries/reporting, to quickly select entries for that specific question.

  5. datatype: an integer storing what kind of data it is, e.g. multiple choice, sensitive data, etc.
    Function: used to quickly determine which entries contain sensitive data.

  6. so: sort order, only used for displaying feedback questions in the right order.

  7. q: the question the respondent was answering.
    Function: if my customer changes the question in the (MySQL) surveys table, I can tell them the results might be skewed because the question was changed while there was already live feedback.
    And if the original question was deleted, I can still show a small snippet, so even after removal they can see what the question was about.

As for indexes: I currently have an index on almost every field, but that is of course a no-go.
I would make account_id + survey_id a compound index, but when you then filter on additional keys, that index will not be used, or at least not effectively.

Please help.

Hi @Ralph_van_der_Sanden and welcome to the MongoDB Community :muscle:!

It’s really hard to answer without more numbers, like the total size of the collection, the number of docs, and the cardinality of survey_id and account_id.

With the given information, and supposing a “normal” distribution, I would just create two indexes (choosing the field order based on cardinality, to make them as selective as possible; see the mongosh sketch below the list). Either:

  • {account_id: 1, survey_id: 1} and {survey_id: 1}
    OR
  • {survey_id: 1, account_id: 1} and {account_id: 1}
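
In mongosh, the first option would look like this (just a sketch; the collection name entries is a placeholder, your real name will differ):

db.entries.createIndex( { account_id: 1, survey_id: 1 } )
db.entries.createIndex( { survey_id: 1 } )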

My supposition is that filtering on one of these fields, or both at the same time, will already narrow the result set from 1 million docs down to ~100 docs. The remaining 100 docs will need to be fetched from disk to resolve the rest of the query, whatever the other filters. It’s an “OK” trade-off: inserts don’t suffer too much and queries are still performant.

Let it run like this for a month.

  • If nobody complains about performance => go to the Bahamas, you’re done.
  • If they do complain, check the Performance Advisor in MongoDB Atlas, or activate the Profiler, to find the slowest queries. Find the most frequently used slow query and add an index to speed it up. You will probably notice that your users aren’t actually filtering on ALL the fields, but rather always use the same 3 fields to narrow down. => Go to the Bahamas.

Final comment: remember the ESR rule: Equality => Sort => Range.
I saw you mentioned a sort at some point. If you need your results sorted, make sure the sort fields come before the range fields in your index (an $in is treated as a range) to avoid in-memory sorts.
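
For example, with a hypothetical index built on your schema (a sketch, not a recommendation for your exact workload):

db.entries.createIndex( { account_id: 1, survey_id: 1, date_reg: -1, tags: 1 } )

// Equality on account_id and survey_id, Sort on date_reg, Range on tags:
db.entries.find( {
    account_id: 1005,                 // Equality
    survey_id: 205380,                // Equality
    tags: { $in: [ 10091, 15335 ] }   // Range ($in)
} ).sort( { date_reg: -1 } )          // Sort matches the index order => no in-memory sort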

Cheers,
Maxime.

Hi Maxime,
The collection will have around 100 million documents, with at most 1 million per survey.
Having only the indexes you suggested doesn’t work.

Take the tags key, for example: I have to use an aggregation pipeline to unwind that array.
Even if account_id + survey_id brings it down to at most 1 million docs, that’s still too much without an index on “tags”.

Thanks.

To be clear, an index on tags won’t help the $unwind itself, only the $match in front of it; are we good on that?
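
For instance (same placeholder entries collection as above), the multikey index only assists the initial $match, so filter before unwinding:

db.entries.aggregate( [
    { $match: { account_id: 1005, survey_id: 205380, tags: 10091 } },   // index-assisted
    { $unwind: '$tags' },                                               // runs on the filtered docs only
    { $group: { _id: '$tags', count: { $sum: 1 } } }
] )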

If tags is commonly used for filtering, then I would add it to the indexes. It’s a trade-off between the size of the indexes (which consume RAM) and how often you run these queries.

Like: “Is it worth creating this 2 GB index for a query that runs 100 times a day and takes 1 sec?” Maybe yes, maybe no. It depends on the use case, budget, hardware, etc.

Your indexes could be:

{account_id: 1, survey_id: 1, tags: 1}
{survey_id: 1, tags: 1}
{account_id: 1, tags: 1}

This is a bit sad, but after a few weeks you can run:

db.orders.aggregate( [ { $indexStats: { } } ] )

and check which indexes are really used and which ones could go away, perhaps to be replaced by a more strategic / worthwhile one surfaced by the Profiler or the Atlas Performance Advisor.
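
Each document it returns has roughly this shape (illustrative only, not real output); an accesses.ops counter still near zero after a few weeks is a good hint that the index can go:

{
    name: 'account_id_1_tags_1',
    key: { account_id: 1, tags: 1 },
    host: '...',
    accesses: { ops: Long('0'), since: ISODate('...') }
}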

I considered suggesting wildcard indexes as a potential solution to your problem, but I don’t think they would help, as you have a specific and known schema.

I think the easiest solution to your problem is to force some filters on your users when they perform a search, so you can prevent highly inefficient ad-hoc queries hitting the DB.
Otherwise they will click the SEARCH button without any filter and basically send a find({}) to the DB… Nobody will like the result / perf of that one.
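
Something along these lines in the application layer could enforce that (a sketch; all names are made up):

// Refuse to build a query unless the indexed anchor fields are present,
// so a bare find({}) can never reach the DB.
function buildFilter(params) {
    if (params.account_id == null || params.survey_id == null) {
        throw new Error('account_id and survey_id are required filters');
    }
    const filter = { account_id: params.account_id, survey_id: params.survey_id };
    if (params.tags && params.tags.length) {
        filter.tags = { $in: params.tags };   // optional extra narrowing
    }
    return filter;
}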


Hi, thank you.

Will play around with indexes and $indexStats to get the right balance.

Another thing I ran into is this kind of query:
In MySQL I would do SELECT key FROM table WHERE condition=x and get only that value back.
In MongoDB I can do the equivalent, but it returns the whole document, which is logical.
I can use a projection to only return the key I need (answers.answer), but then all the other elements of the answers array are returned too.

For example, I want to find all (non-empty) answers.answer where answers.qid=5.
I would get the answer I need, but also all the other answers with different qids. I could filter them out in PHP, but that’s not very efficient.
How can I write a query that only returns the array elements that matched the conditions?

You need $filter in your projection.

If I insert the sample doc you provided in the top post with mongoimport, I can run this aggregation, which is supported by an index on {“answers.qid”: 1}:

[
  {
    '$match': {
      'answers.qid': 5768
    }
  }, {
    '$project': {
      'items': {
        '$filter': {
          'input': '$answers', 
          'as': 'item', 
          'cond': {
            '$eq': [
              '$$item.qid', 5768
            ]
          }
        }
      }
    }
  }
]

The final output looks like this:

[
  {
    _id: ObjectId("62af099d90dcb82b1f750b12"),
    items: [
      {
        answer: 'sample answer 1',
        original_answer: 'demo antwoord 1',
        type: 0,
        qid: 5768,
        datatype: 0,
        so: 0,
        q: "Let's get started! What is your first name?"
      }
    ]
  }
]

Does that work for you?

If you prefer to only retrieve a single doc at the end with a single array, you could add an extra couple of stages to group all the arrays into a single one (and then flatten it as it’s an array of arrays).

If I insert the same doc a second time in the collection with a different _id, and execute this pipeline:

[
  {
    '$match': {
      'answers.qid': 5768
    }
  }, {
    '$project': {
      'answers': {
        '$filter': {
          'input': '$answers', 
          'as': 'item', 
          'cond': {
            '$eq': [
              '$$item.qid', 5768
            ]
          }
        }
      }
    }
  }, {
    '$group': {
      '_id': null, 
      'all': {
        '$push': '$answers'
      }
    }
  }, {
    '$project': {
      '_id': 0, 
      'answers': {
        '$reduce': {
          'input': '$all', 
          'initialValue': [], 
          'in': {
            '$concatArrays': [
              '$$value', '$$this'
            ]
          }
        }
      }
    }
  }
]

I get this output now:

[
  {
    answers: [
      {
        answer: 'sample answer 1',
        original_answer: 'demo antwoord 1',
        type: 0,
        qid: 5768,
        datatype: 0,
        so: 0,
        q: "Let's get started! What is your first name?"
      },
      {
        answer: 'sample answer 1',
        original_answer: 'demo antwoord 1',
        type: 0,
        qid: 5768,
        datatype: 0,
        so: 0,
        q: "Let's get started! What is your first name?"
      }
    ]
  }
]

Cheers,
Maxime.


Thank you so much Maxime and Steeve, that works great!
OK, a few last questions if you don’t mind:

  1. Do you think the current structure for the answers (nested array of objects) is OK when it gets to 100+ million documents per collection and about 1-2 million documents per account_id & survey_id combination? I will need to perform a lot of aggregations for analytics purposes.
    I guess the $unwind could become the bottleneck, and how would I best index the answers array/objects?

  2. Searching is very problematic (at least with the things I tried).
    I tried full-text search, but that only works if my customers search for a stemmed word, which often won’t be the case. Searching for anything else in my sample data of 2 million documents for one survey_id just times out.
    Any idea how I can search fast across at least 500,000 - 1 million documents?
    answers.answer is the field I would query.

  3. Is there a way to write a query that returns a word cloud over answers.answer: the most frequent words get the largest font size, the least frequent the smallest? I can handle the font-size part, but how would you query this so you get a list of words with their occurrence counts in, say, max 2 seconds? Or should I just run a tokenizer when a survey submission comes in and save those words in a separate array inside the document?

  1. Yes, I think it’s okay as long as the size of the answers array is bounded. If all your docs are more in the KB zone than in the MB zone, you are fine. Data that is accessed together should be stored together. If you don’t need the answers every time you access these docs, they could perhaps be moved to another collection to keep the docs lightweight, reduce the size of the working set in RAM, reduce IOPS, etc.

  2. Even with the right index and the right hardware, you can only go so far. If everything else is maxed out and carefully tuned and you still want to speed things up, maybe it’s time to consider sharding this collection. FTS isn’t designed to improve query speed. Aggregations can be really fast as long as they are not manipulating 500M docs in memory; we reach the limits of physics at some point. In any case, always use explain(true) to check which stage of your query is slow and see whether you can improve it with indexes.

  3. MongoDB Atlas Search supports highlighting search terms in results, but it’s only available on Atlas. The sync between your data and the (hidden) Lucene engine is completely automated in Atlas and included in the pricing; no need for separate Elasticsearch or Solr licences and servers, for example.
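
For the word-cloud counts themselves, an aggregation along these lines could compute them server-side (a rough sketch, assuming answers.answer holds strings and the same placeholder entries collection as before; pre-tokenizing at write time, as you suggested, would be cheaper at your scale):

db.entries.aggregate( [
    { $match: { account_id: 1005, survey_id: 205380 } },                            // narrow with the compound index first
    { $unwind: '$answers' },
    { $match: { 'answers.answer': { $type: 'string', $ne: '' } } },                 // skip numeric / empty answers
    { $project: { words: { $split: [ { $toLower: '$answers.answer' }, ' ' ] } } },
    { $unwind: '$words' },
    { $group: { _id: '$words', count: { $sum: 1 } } },                              // occurrences per word
    { $sort: { count: -1 } },
    { $limit: 100 }                                                                 // top words for the cloud
] )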

Cheers,
Maxime.
