[Proposal] Boost the performance of bioinformatic annotation queries

Disclaimer

This is a copy of a proposal from the feedback portal, since the portal seems to get little traffic.

The documents to be selected look something like this:

{
    "_id": {
        "$oid": "6272c580d4400d8cb10d5406"
    },
    "#CHROM": 1,
    "POS": 286747,
    "ID": "rs369556846",
    "REF": "A",
    "ALT": "G",
    "QUAL": ".",
    "FILTER": ".",
    "INFO": [{
            "RS": 369556846,
            "RSPOS": 286747,
            "dbSNPBuildID": 138,
            "SSR": 0,
            "SAO": 0,
            "VP": "0x050100000005150026000100",
            "WGT": 1,
            "VC": "SNV",
            "CAF": [{
                "$numberDecimal": "0.9381"
            }, {
                "$numberDecimal": "0.0619"
            }],
            "COMMON": 1,
            "TOPMED": [{
                "$numberDecimal": "0.88411856523955147"
            }, {
                "$numberDecimal": "0.11588143476044852"
            }]
        },
        ["SLO", "ASP", "VLD", "G5", "KGPhase3"]
    ]
}

For a basic annotation scenario, we need a query like this:

{'ID': {'$in': ['rs369556846', 'rs2185539', 'rs2519062', 'rs149363311', 'rs55745762', <...>]}}
Here <…> stands for hundreds or thousands of values.

The above query executes in a few seconds.
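As a sketch, building such a filter in Python looks like this; the helper name `build_in_filter` and the sample rsIDs are illustrative, not from the thread:

```python
# Sketch: construct the $in annotation filter from a list of rsIDs.
# build_in_filter is a hypothetical helper; the field name ID matches
# the example documents above.

def build_in_filter(rs_ids):
    """Return a MongoDB filter matching any of the given rsIDs."""
    return {"ID": {"$in": list(rs_ids)}}

print(build_in_filter(["rs369556846", "rs2185539", "rs2519062"]))
# → {'ID': {'$in': ['rs369556846', 'rs2185539', 'rs2519062']}}
```

With PyMongo, this filter would simply be passed to `collection.find(...)`.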

More complex annotation queries:
{'$or': [{'#CHROM': 1, 'POS': 1499125}, {'#CHROM': 1, 'POS': 1680158}, {'#CHROM': 1, 'POS': 1749174}, {'#CHROM': 1, 'POS': 3061224}, {'#CHROM': 1, 'POS': 3589337}, <...>]}

{'$or': [{'ID': 'rs149434212', 'REF': 'C', 'ALT': 'T'}, {'ID': 'rs72901712', 'REF': 'G', 'ALT': 'A'}, {'ID': 'rs145474533', 'REF': 'G', 'ALT': 'C'}, {'ID': 'rs12096573', 'REF': 'G', 'ALT': 'T'}, {'ID': 'rs10909978', 'REF': 'G', 'ALT': 'A'}, <...>]}

Despite the involvement of IXSCAN, they run for many hours.

Please test the aforementioned queries thoroughly and improve their execution performance. This will help science!

Hi @Platon_workaccount

Anything we can do to help science would be great!

Although MongoDB is by no means perfect, we do try to improve performance and features all the time so you’ll get to your data quicker. There are techniques to help you get faster results, and pretty much all of them involve creating indexes. There are many resources covering this.

There are also hardware-related methods, such as increasing the amount of RAM for the deployment, deploying larger instances, and scaling horizontally using sharding.

Also, newer MongoDB versions usually contain performance improvements. If you’re not using the latest version (currently 5.0.9), it’s probably worth a try.

However in the immediate term, I’m more interested in this:

Despite the involvement of IXSCAN, they run for many hours.

There may be some analysis/recommendations that can be made. Could you provide more details:

  • What’s the db.collection.stats() output of the collection in question? This would tell us the scale of the data you’re working with.
  • What’s the query explain output of the queries that are fast, and the queries that are slow? Could you provide them?
  • Could you provide more example documents to see more variation that is possible?
  • Could you provide the actual query that is taking many hours?

Best regards
Kevin


Here are some details. The file names include the approximate query execution time. Be warned that these are quite toy queries; real queries take hours or even days to execute.

6_seconds_pretty.json (71.6 KB)
6_seconds_plain.json (38.3 KB)
6_seconds_explain.txt (264.3 KB)

7_minutes_pretty.json (207.3 KB)
7_minutes_plain.json (90.7 KB)
7_minutes_explain.txt (2.4 MB)

21_minutes_pretty.json (277.4 KB)
21_minutes_plain.json (127.5 KB)
21_minutes_explain.txt (2.8 MB)

stats.txt (14.2 KB)

Hi @Platon_workaccount

I found some interesting things in the queries and the explain plans you posted:

6 seconds

This query is of the form:

{ID: {$in: [
  "rs149434212",
  ... <2841 similar items>
]}}

Execution stats:

    "executionSuccess": true,
    "nReturned": 12,
    "executionTimeMillis": 83,
    "totalKeysExamined": 2743,
    "totalDocsExamined": 12,

Index used:

"indexName": "ID_1_REF_1_ALT_1",
"isMultiKey": true,
"multiKeyPaths": {
    "ID": [],
    "REF": [],
    "ALT": [
    "ALT"
    ]
},

Notes: the query is a very large $in that examined 2743 keys and 12 documents, and returned 12 documents. In other words, it scanned a lot of index keys, but the ratio of documents scanned to documents returned is 1:1. My take is that, other than the very large $in, it’s actually not too bad. Since the ID field is a prefix of the index ID_1_REF_1_ALT_1, the index can serve this query, which definitely contributes to the speed.
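The prefix rule can be roughly illustrated as follows; `uses_index_prefix` is a hypothetical helper, and the check is simplified (it ignores planner subtleties such as field order among equality predicates):

```python
# Simplified illustration of the compound-index prefix rule: a query on a
# leading subset of the index's fields can use the index.
# uses_index_prefix is a hypothetical helper, not a MongoDB API.

def uses_index_prefix(query_fields, index_fields):
    return list(index_fields)[:len(query_fields)] == list(query_fields)

print(uses_index_prefix(["ID"], ["ID", "REF", "ALT"]))          # → True
print(uses_index_prefix(["REF", "ALT"], ["ID", "REF", "ALT"]))  # → False
```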

7 minutes

The query is of shape:

{"$or": [
  {"#CHROM": 1, "POS": 1499125},
  ...<2841 similar items>
]}

Execution stats:

    "executionSuccess":true,
    "nReturned":12,
    "executionTimeMillis":63,
    "totalKeysExamined":12,
    "totalDocsExamined":12,

Index use:

"indexName":"#CHROM_1_POS_1",
"isMultiKey":false,

Notes: now we’re seeing queries in the minutes range, and the big difference is the use of $or. Notably, $or queries behave very differently from other queries, since they are basically multiple queries that can use a different index for each $or term. In essence, this is like running ~2500 single queries all at once. For this query, using $in might yield better results, and I believe you’ll see the same output. See $or vs. $in
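For the #CHROM/POS query above, where #CHROM is the same in every branch, the rewrite could be sketched like this; the helper `or_to_in` is hypothetical, and with real data spanning several chromosomes you would first group the positions by chromosome:

```python
# Sketch: collapse {'$or': [{shared: c, varying: v1}, ...]} into
# {shared: c, varying: {'$in': [v1, ...]}} when the shared field is constant.
# or_to_in is a hypothetical helper, not part of PyMongo.

def or_to_in(or_query, shared_field, varying_field):
    branches = or_query["$or"]
    const = branches[0][shared_field]
    # Only valid when every branch carries the same value for the shared field.
    assert all(b[shared_field] == const for b in branches)
    return {shared_field: const,
            varying_field: {"$in": [b[varying_field] for b in branches]}}

q = {"$or": [{"#CHROM": 1, "POS": 1499125}, {"#CHROM": 1, "POS": 1680158}]}
print(or_to_in(q, "#CHROM", "POS"))
# → {'#CHROM': 1, 'POS': {'$in': [1499125, 1680158]}}
```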

21 minutes

This query is of the form:

{$or: [
  {"ID": "rs149434212", "REF": "C", "ALT": "T"},
  ... <2841 similar items>
]}

Execution stats:

    "executionSuccess": true,
    "nReturned": 12,
    "executionTimeMillis": 109,
    "totalKeysExamined": 12,
    "totalDocsExamined": 12,

Index use:

"indexName": "ID_1_REF_1_ALT_1",
"isMultiKey": true,
"multiKeyPaths": {
    "ID": [],
    "REF": [],
    "ALT": [
        "ALT"
    ]
},

Notes: this is the longest query, I think due to two things: the use of $or as in the previous query, and the use of a multikey index. Notably, a multikey index creates multiple index entries per document, one per array element, so if you’re indexing something like {a: 1, b: [array of 10 elements]}, it will create 10 index entries. In contrast, a non-multikey index on {a: 1, b: 1} creates a single index entry: 1 vs. 10 in this example. I believe this query is slow due to $or coupled with a lot more index entries.
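The entry-count difference can be shown with a toy model; this is an illustration, not MongoDB internals (and note that MongoDB allows at most one array-valued field per compound-index entry anyway, so in practice only one factor is ever greater than 1):

```python
# Toy model: estimate how many index entries one document contributes to an
# index on the given fields. A multikey index stores one entry per array
# element; scalar fields contribute a single entry.

def approx_index_entries(doc, indexed_fields):
    entries = 1
    for field in indexed_fields:
        value = doc.get(field)
        if isinstance(value, list):
            entries *= max(len(value), 1)
    return entries

print(approx_index_entries({"a": 1, "b": list(range(10))}, ["a", "b"]))  # → 10
print(approx_index_entries({"a": 1, "b": 2}, ["a", "b"]))                # → 1
```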

What is interesting in all the explain output is that the executionTimeMillis values are quite low, even in the last query there. I’m guessing that you have a bottleneck somewhere else, perhaps when the actual documents are being fetched from disk? How are you running these queries? Are you using the Mongo shell, or are you using other software that’s interfacing with MongoDB?

Also, in your example document the field ALT doesn’t seem to be an array, so I wonder if you have other document examples where ALT is an array, and how many elements they usually contain.

Moving on to the collection stats, it doesn’t look like we’re dealing with a gigantic amount of data: the collection in question appears to be ~1GB in size. Is this correct?

As for the hardware spec: how much RAM have you allocated to the WiredTiger cache? What’s the overall spec of the machine the MongoDB server is running on, and are you running MongoDB on the same machine as other resource-intensive processes, such as another server?

Best regards
Kevin


For this query, using $in might yield better results, and I believe you’ll see the same output. See $or vs. $in

It sounds like I can place not only single values into $in, but also subqueries. The documentation doesn’t mention that.

I’m guessing that you have a bottleneck somewhere else, perhaps when the actual documents are being fetched from disk? How are you running these queries? Are you using the Mongo shell, or are you using other software that’s interfacing with MongoDB?

To measure the times (6 s, 7 m, 21 m) I used my high-perf-bio annotate program. To get the explain output I used Compass.

Also, in your example document the field ALT doesn’t seem to be an array, so I wonder if you have other document examples where ALT is an array, and how many elements they usually contain.

In the given example, the collection contains 1055709 documents. The filter {'$expr': {'$isArray': '$ALT'}} returns 271401 documents.
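For reference, a sketch of the same array check: the `$type` form should match the same documents for this schema, and `alt_is_array` is just an in-memory mirror of the semantics for illustration:

```python
# Two ways to express "ALT is an array"; both are filter documents.
expr_filter = {"$expr": {"$isArray": "$ALT"}}
type_filter = {"ALT": {"$type": "array"}}

# In-memory mirror of the predicate, for illustration only.
def alt_is_array(doc):
    return isinstance(doc.get("ALT"), list)

docs = [{"ALT": ["G", "T"]}, {"ALT": "G"}]
print(sum(alt_is_array(d) for d in docs))  # → 1
```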

Moving on to the collection stats, it doesn’t look like we’re dealing with a gigantic amount of data: the collection in question appears to be ~1GB in size. Is this correct?

I provided only toy examples of both queries and DB data. The real data is much larger. Consequently, with real $or-based queries and real collections, execution time skyrockets to many hours or days.

As for the hardware spec: how much RAM have you allocated to the WiredTiger cache?

I use the default low-level settings.

What’s the overall spec of the machine the MongoDB server is running on, and are you running MongoDB on the same machine as other resource-intensive processes, such as another server?

My hardware info:

  • Model: VivoBook-ASUSLaptop-X712FA
  • CPU: Intel® Core™ i5-8265U @ 1.60GHz
  • RAM: 15.5 GiB
  • Storage: 498.7 GB

are you running MongoDB on the same machine as other resource-intensive processes, such as another server?

No.

The real data is much larger. Consequently, with real $or-based queries and real collections, execution time skyrockets to many hours or days.

From the link you posted, it seems that the collections are about 25GB each, and there are two of them, so in the worst case a total of ~50GB. Is this correct?

My hardware info:

  • Model: VivoBook-ASUSLaptop-X712FA
  • CPU: Intel® Core™ i5-8265U @ 1.60GHz
  • RAM: 15.5 GiB
  • Storage: 498.7 GB

If I understand correctly, the total collection size absolutely dwarfs the available RAM on the machine. Unfortunately, this makes good performance all but impossible, given that the hardware isn’t sufficient for the magnitude of the data you’re working with. This is not MongoDB-specific, by the way; I believe you would see similar performance issues with any other database product on the market.

Depending on your requirements, you might be interested in checking out MongoDB Atlas, since it would allow you to scale your deployment up (or down) depending on your query needs. I understand that you may not need to have the query returned in seconds using the biggest deployment, but you may find it easier to get a good balance between price & performance using Atlas.

Best regards
Kevin

Let’s additionally discuss the heavily truncated example data. I think this small dataset is enough to demonstrate the problems on the MongoDB side.

I’ve explored the memory usage of the longest (~21m) query. It looks like the problem is not there. Here is the memory consumption before running the query:

After about 15 minutes of computation, memory usage had increased only slightly:

I still ask the MongoDB developers to optimize the DBMS for typical bioinformatic queries.

Can I ask a couple of things first:

  1. What’s the content of the file /etc/mongod.conf
  2. Why are you running ~40 mongod processes in your laptop?

Best regards
Kevin

  • What’s the content of the file /etc/mongod.conf
# mongod.conf

# for documentation of all options, see:
#   http://docs.mongodb.org/manual/reference/configuration-options/

# Where and how to store data.
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
#  engine:
#  wiredTiger:

# where to write logging data.
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log

# network interfaces
net:
  port: 27017
  bindIp: 127.0.0.1


# how the process runs
processManagement:
  timeZoneInfo: /usr/share/zoneinfo

#security:

#operationProfiling:

#replication:

#sharding:

## Enterprise-Only Options:

#auditLog:

#snmp:

Why are you running ~40 mongod processes in your laptop?

Perhaps this is caused by the simultaneous use of Compass and a PyMongo-based program.

UPD: immediately after rebooting the laptop I see 33 mongod processes. I don’t know why that is.

Is there any chance that the issue will be fixed in time for MongoDB 7.0 release?

Hi @Platon_workaccount

I believe that the last time we touched base on this, I suggested trying more powerful server hardware, since you’re running all this work on a laptop: you have 16GB of RAM, with ~50GB of data to process.

The laptop’s hardware doesn’t look nearly sufficient for the workload, and you’re also running ~40 mongod processes on it. Have you been successful in sourcing more powerful hardware in the meantime, and have you seen any improvement in performance?

On another note, if actual hardware is difficult to source, I also suggested trying MongoDB Atlas and selecting a deployment with sufficient RAM for the workload. Note that you can pause an Atlas cluster if you’re not using it 24/7 to save some expense, and unpause it when you need it.

Unfortunately, for such a big workload, I don’t believe the laptop is up to the task 🙂

Best regards
Kevin

I have now done a fresh test on a new and relatively powerful computer. The difference between a single-field query and a multi-field query is still huge.

Hardware specifications:

RAM: Kingston Fury Beast RGB DDR4-3600; 64 GB
Processor: 12th Gen Intel® Core™ i5-12400 × 12
SSD: WD Black SN750; PCI-E 3.0; 1 TB

Software specifications:

OS: Ubuntu 22.04.1 LTS
MongoDB version: 6.0.2
mongod.conf change: cacheSizeGB: 56
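
For reference, that setting sits under storage.wiredTiger in mongod.conf (values from this test; dbPath as in the config posted earlier):

```yaml
# mongod.conf fragment: cap the WiredTiger cache at 56 GB on the 64 GB machine
storage:
  dbPath: /var/lib/mongodb
  wiredTiger:
    engineConfig:
      cacheSizeGB: 56
```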

Toy data used in the test:

Number of queried items: 2842
Number of documents in the collection: 1055709
Field indexes: #CHROM_1_POS_1; ID_1

Abbreviated versions of the queries and their execution times (minutes:seconds):
{'ID': {'$in': ['rs149434212', 'rs72901712', 'rs145474533', 'rs12096573', 'rs10909978', '...']}}
00:02

{'$or': [{'#CHROM': 1, 'POS': 1499125}, {'#CHROM': 1, 'POS': 1680158}, {'#CHROM': 1, 'POS': 1749174}, {'#CHROM': 1, 'POS': 3061224}, {'#CHROM': 1, 'POS': 3589337}, '...']}
02:23

{'$or': [{'ID': 'rs149434212', 'REF': 'C', 'ALT': 'T'}, {'ID': 'rs72901712', 'REF': 'G', 'ALT': 'A'}, {'ID': 'rs145474533', 'REF': 'G', 'ALT': 'C'}, {'ID': 'rs12096573', 'REF': 'G', 'ALT': 'T'}, {'ID': 'rs10909978', 'REF': 'G', 'ALT': 'A'}, '...']}
07:47