[Proposal] Boost the performance of bioinformatic annotation queries

Disclaimer

This is a copy of a proposal from the feedback portal, since the latter seems to get little traffic.

The documents to be selected look something like this:

{
    "_id": {
        "$oid": "6272c580d4400d8cb10d5406"
    },
    "#CHROM": 1,
    "POS": 286747,
    "ID": "rs369556846",
    "REF": "A",
    "ALT": "G",
    "QUAL": ".",
    "FILTER": ".",
    "INFO": [{
            "RS": 369556846,
            "RSPOS": 286747,
            "dbSNPBuildID": 138,
            "SSR": 0,
            "SAO": 0,
            "VP": "0x050100000005150026000100",
            "WGT": 1,
            "VC": "SNV",
            "CAF": [{
                "$numberDecimal": "0.9381"
            }, {
                "$numberDecimal": "0.0619"
            }],
            "COMMON": 1,
            "TOPMED": [{
                "$numberDecimal": "0.88411856523955147"
            }, {
                "$numberDecimal": "0.11588143476044852"
            }]
        },
        ["SLO", "ASP", "VLD", "G5", "KGPhase3"]
    ]
}

For a basic annotation scenario, we need a query like this:

{'ID': {'$in': ['rs369556846', 'rs2185539', 'rs2519062', 'rs149363311', 'rs55745762', <...>]}}
where <…> stands for hundreds or thousands of values.

The above query executes in a few seconds.
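
For illustration, such a query could be issued with PyMongo like this (a minimal sketch; the database and collection names are placeholders, not the real ones):

from pymongo import MongoClient

client = MongoClient()              # assumes a local default deployment
coll = client["dbsnp"]["variants"]  # placeholder database/collection names

rs_ids = ["rs369556846", "rs2185539", "rs2519062"]  # in practice, thousands of IDs
for doc in coll.find({"ID": {"$in": rs_ids}}):
    print(doc["ID"], doc["REF"], doc["ALT"])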

More complex annotation queries:
{'$or': [{'#CHROM': 1, 'POS': 1499125}, {'#CHROM': 1, 'POS': 1680158}, {'#CHROM': 1, 'POS': 1749174}, {'#CHROM': 1, 'POS': 3061224}, {'#CHROM': 1, 'POS': 3589337}, <...>]}

{'$or': [{'ID': 'rs149434212', 'REF': 'C', 'ALT': 'T'}, {'ID': 'rs72901712', 'REF': 'G', 'ALT': 'A'}, {'ID': 'rs145474533', 'REF': 'G', 'ALT': 'C'}, {'ID': 'rs12096573', 'REF': 'G', 'ALT': 'T'}, {'ID': 'rs10909978', 'REF': 'G', 'ALT': 'A'}, <...>]}

Despite using IXSCAN, they run for many hours.

Please test the aforementioned queries thoroughly and improve their execution performance. This will help science!

Hi @Platon_workaccount

Anything we can do to help science would be great!

Although MongoDB is by no means perfect, we do try to improve performance and features all the time so you’ll get to your data quicker. There are techniques to help you get faster results, and pretty much all of them involve creating indexes; there are many resources covering this topic.
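
For example, compound indexes matching the query shapes in this thread could be created like this (a minimal PyMongo sketch; the database and collection names are placeholders):

from pymongo import ASCENDING, MongoClient

coll = MongoClient()["dbsnp"]["variants"]  # placeholder database/collection names

# Compound indexes matching the query shapes shown in this thread
coll.create_index([("ID", ASCENDING), ("REF", ASCENDING), ("ALT", ASCENDING)])
coll.create_index([("#CHROM", ASCENDING), ("POS", ASCENDING)])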

There are also hardware-related methods such as increasing the amount of RAM for the deployment, deploying larger instances, and scaling horizontally using sharding.

Also, newer MongoDB versions usually contain performance improvements. If you’re not using the latest version (currently 5.0.9) it’s probably worth a try.

However in the immediate term, I’m more interested in this:

Despite using IXSCAN, they run for many hours.

There may be some analysis/recommendations that can be made. Could you provide more details:

  • What’s the db.collection.stats() output of the collection in question? This would tell us the scale of the data you’re working with.
  • What’s the query explain output of the queries that are fast, and of the queries that are slow? Could you provide them? (One way to capture this is sketched after this list.)
  • Could you provide more example documents, to show more of the variation that is possible?
  • Could you provide the actual query that is taking many hours?
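
Regarding the explain output: one way to capture executionStats with PyMongo looks roughly like this (the database, collection, and filter below are placeholders):

from pymongo import MongoClient

db = MongoClient()["dbsnp"]  # placeholder database name
filt = {"ID": {"$in": ["rs369556846", "rs2185539"]}}  # placeholder filter

plan = db.command(
    "explain",
    {"find": "variants", "filter": filt},  # placeholder collection name
    verbosity="executionStats",
)
stats = plan["executionStats"]
print(stats["nReturned"], stats["totalKeysExamined"], stats["executionTimeMillis"])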

Best regards
Kevin


Here are some details. The file names include the approximate query execution time. Be warned that these are only toy queries; real queries take hours or even days to execute.

6_seconds_pretty.json (71.6 KB)
6_seconds_plain.json (38.3 KB)
6_seconds_explain.txt (264.3 KB)

7_minutes_pretty.json (207.3 KB)
7_minutes_plain.json (90.7 KB)
7_minutes_explain.txt (2.4 MB)

21_minutes_pretty.json (277.4 KB)
21_minutes_plain.json (127.5 KB)
21_minutes_explain.txt (2.8 MB)

stats.txt (14.2 KB)

Hi @Platon_workaccount

I found some interesting things in the queries and the explain plans you posted:

6 seconds

This query is of the form:

{ID: {$in: [
  "rs149434212",
  ... <2841 similar items>
]}}

Execution stats:

    "executionSuccess": true,
    "nReturned": 12,
    "executionTimeMillis": 83,
    "totalKeysExamined": 2743,
    "totalDocsExamined": 12,

Index used:

"indexName": "ID_1_REF_1_ALT_1",
"isMultiKey": true,
"multiKeyPaths": {
    "ID": [],
    "REF": [],
    "ALT": [
    "ALT"
    ]
},

Notes: the query is a very large $in query that examined 2743 keys and 12 documents, and returned 12 documents. In other words, it scanned a lot of index keys, but the ratio of documents scanned to documents returned is 1:1. My take is that, other than the very large $in, it’s actually not too bad. Since the ID field is a prefix of the index ID_1_REF_1_ALT_1, the index can be used for this query and definitely contributes to the speed.

7 minutes

The query is of shape:

{"$or": [
  {"#CHROM": 1, "POS": 1499125},
  ...<2841 similar items>
]}

Execution stats:

    "executionSuccess":true,
    "nReturned":12,
    "executionTimeMillis":63,
    "totalKeysExamined":12,
    "totalDocsExamined":12,

Index use:

"indexName":"#CHROM_1_POS_1",
"isMultiKey":false,

Notes: now we’re seeing queries in the minutes range, and the big difference is the use of $or. Notably, $or queries behave very differently from other queries, since they are basically multiple queries that can use a different index for each $or term. In essence, this is like running ~2500 single queries all at once. For this query, using $in might yield a better result, and I believe you’ll see the same output. See $or vs. $in.
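
As a rough illustration only (not a guaranteed speed-up), the (#CHROM, POS) pairs could be regrouped so that each chromosome contributes a single $in clause; this rewrite is only equivalent when every $or term is a plain equality on those two fields:

from collections import defaultdict

# Hypothetical input: the (#CHROM, POS) pairs that would otherwise become $or terms
pairs = [(1, 1499125), (1, 1680158), (1, 1749174), (2, 3061224)]

by_chrom = defaultdict(list)
for chrom, pos in pairs:
    by_chrom[chrom].append(pos)

query = {"$or": [{"#CHROM": c, "POS": {"$in": p}} for c, p in by_chrom.items()]}
# With a single chromosome this collapses to one clause, e.g.
# {"#CHROM": 1, "POS": {"$in": [1499125, 1680158, 1749174]}}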

21 minutes

This query is of the form:

{$or: [
  {"ID": "rs149434212", "REF": "C", "ALT": "T"},
  ... <2841 similar items>
]}

Execution stats:

    "executionSuccess": true,
    "nReturned": 12,
    "executionTimeMillis": 109,
    "totalKeysExamined": 12,
    "totalDocsExamined": 12,

Index use:

"indexName": "ID_1_REF_1_ALT_1",
"isMultiKey": true,
"multiKeyPaths": {
    "ID": [],
    "REF": [],
    "ALT": [
        "ALT"
    ]
},

Notes: this is the longest query, I think due to two things: the use of $or as in the previous query, and the use of a multikey index. Notably, a multikey index creates an index entry for each array element, so if you’re indexing something like {a: 1, b: [array of 10 elements]}, it will create 10 index entries. In contrast, a non-multikey index on {a: 1, b: 1} will create one index entry: 1 vs. 10 in this example. I believe this query is slow due to $or coupled with a lot more index entries.
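
To make that concrete with the fields from this thread (illustrative values only), a document whose ALT is an array produces one entry per element in the ID_1_REF_1_ALT_1 index:

# Illustrative document with hypothetical values
doc = {"ID": "rs100", "REF": "A", "ALT": ["G", "T", "C"]}

# Conceptual index entries in ID_1_REF_1_ALT_1 for this single document:
#   ("rs100", "A", "G")
#   ("rs100", "A", "T")
#   ("rs100", "A", "C")
# A scalar ALT such as "G" would produce only one entry, ("rs100", "A", "G").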

What is interesting in all the explain output is that the executionTimeMillis values are quite small, even in the last query. I’m guessing that you have a bottleneck somewhere else, perhaps when the actual documents are being fetched from disk? How are you running these queries? Are you using the Mongo shell, or are you using other software that’s interfacing with MongoDB?

Also, in your example document the field ALT doesn’t seem to be an array, so I wonder if you have other document examples where ALT is an array, and how many elements they usually contain.

Moving on to check out the collection stats, it doesn’t look like we’re dealing with a gigantic amount of data: it appears that the collection in question is ~1GB in size. Is this correct?

On the hardware side, how much RAM have you allocated to the WiredTiger cache? What’s the overall spec of the machine the MongoDB server is running on, and are you running MongoDB on the same machine as other resource-intensive processes, such as another server?
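
For reference, the configured and currently used WiredTiger cache sizes can be checked via serverStatus; here is a minimal PyMongo sketch (assuming direct access to the mongod in question):

from pymongo import MongoClient

client = MongoClient()  # assumes direct access to the mongod in question
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
print("maximum bytes configured    :", cache["maximum bytes configured"])
print("bytes currently in the cache:", cache["bytes currently in the cache"])

The cache size itself can be adjusted via storage.wiredTiger.engineConfig.cacheSizeGB in the mongod configuration file.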

Best regards
Kevin


For this query, using $in might yield a better result, and I believe you’ll see the same output. See $or vs. $in.

It sounds like I can place not only single values but also subqueries into $in. The documentation doesn’t mention that.

I’m guessing that you have a bottleneck somewhere else, perhaps when the actual documents are being fetched from disk? How are you running these queries? Are you using the Mongo shell, or are you using other software that’s interfacing with MongoDB?

To measure the times (6 s, 7 m, 21 m) I used my high-perf-bio annotate program. To get the explain output I used Compass.

Also, in your example document the field ALT doesn’t seem to be an array, so I wonder if you have other document examples where ALT is an array, and how many elements they usually contain.

In the given example, the collection contains 1055709 documents. {'$expr': {'$isArray': '$ALT'}} returns 271401 documents.
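
For reference, counts like these can be reproduced with PyMongo roughly as follows (the database and collection names are placeholders):

from pymongo import MongoClient

coll = MongoClient()["dbsnp"]["variants"]  # placeholder database/collection names

print(coll.count_documents({}))                               # total documents
print(coll.count_documents({"$expr": {"$isArray": "$ALT"}}))  # documents whose ALT is an array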

Moving on to check out the collection stats, it doesn’t look like we’re dealing with a gigantic amount of data: it appears that the collection in question is ~1GB in size. Is this correct?

I provided only toy examples of both queries and DB data. The real data is much larger. Consequently, with real $or-based queries and real collections, execution time skyrockets to many hours or days.

On the hardware side, how much RAM have you allocated to the WiredTiger cache?

I use the default low-level settings.

What’s the overall spec of the machine the MongoDB server is running on, and are you running MongoDB on the same machine as other resource-intensive processes, such as another server?

My hardware info:

  • VivoBook-ASUSLaptop-X712FA
  • Intel® Core™ i5-8265U CPU @ 1.60GHz
  • 15.5 GiB RAM
  • 498.7 GB disk

are you running MongoDB on the same machine as other resource-intensive processes, such as another server?

No.

The real data is much larger. Consequently, with real $or-based queries and real collections, execution time skyrockets to many hours or days.

From the link you posted, it seems that the collections are about 25GB each, and there are two of them, so the worst case is about 50GB in total. Is this correct?

My hardware info:

  • VivoBook-ASUSLaptop-X712FA
  • Intel® Core™ i5-8265U CPU @ 1.60GHz
  • 15.5 GiB RAM
  • 498.7 GB disk

If I understand correctly, the total collection size absolutely dwarfs the total available RAM in the machine. Unfortunately this makes it very difficult to get good performance, since there isn’t enough hardware to cater for the magnitude of the data you’re working with. This is not MongoDB-specific, by the way; I believe you would see similar performance issues with any other database product on the market.

Depending on your requirements, you might be interested in checking out MongoDB Atlas, since it would allow you to scale your deployment up (or down) depending on your query needs. I understand that you may not need to have queries return in seconds on the biggest deployment, but you may find it easier to strike a good balance between price and performance using Atlas.

Best regards
Kevin

Let’s additionally discuss the heavily truncated example data. I think this small dataset is enough to demonstrate problems on the MongoDB side.

I’ve explored the memory usage of the longest (~21 m) query, and the problem doesn’t seem to be there: after about 15 minutes of calculations, memory consumption had increased only slightly compared with its level before the query was started.

I still ask the MongoDB developers to optimize the DBMS for typical bioinformatic queries.