Ideal hardware setup for a PhD researcher drowning with a huge dataset?

I saw your reply in the linked thread. I did not participate at first because using Atlas made so much technical sense. But it looks like it is out of the question, so I will try to give you some of my ideas.

The most important factor after

for performance is to have your working set (which includes the appropriate indexes) in memory, otherwise you end up with a disk I/O bottleneck. The indexes are taken care of:

and you are aware of

Absolutely none. In particular with:

See https://docs.mongodb.com/manual/core/wiredtiger/ for the cache size calculation. If I/O is your bottleneck, then it means your working set is bigger than roughly (64 - 1) / 2 = 31.5 GB. With 4 RPI, you can have 4 x ((8 - 1) / 2) = 14 GB of WT cache. To get close to 31.5 GB of WT cache you would need more than 8 RPI, and you would need to implement sharding for the data itself, but you would also need some RPI to run the config servers. Having a cluster of machines does not help performance if you do not shard; it helps availability only. But

I agree with the additional complexity. Do not consider sharding UNLESS there is no other way. So forget your 4 RPI or your 4 jailbroken phones. I am not here to put down RPI. I love RPI. I own 3 and would not consider anything else for what I do with them.
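
If you want to redo the cache arithmetic above for other machine sizes, here is a small Python sketch. It assumes the default rule from the WiredTiger page linked above (the larger of 50% of (RAM - 1 GB) and 256 MB) and uses the 64 GB and 8 GB figures from this thread. The function name is just something I made up for illustration; it is not a MongoDB API.

```python
import math


def default_wt_cache_gb(ram_gb: float) -> float:
    """Default WiredTiger cache size in GB for a host with ram_gb of RAM.

    Assumed rule: the larger of 50% of (RAM - 1 GB) and 256 MB.
    """
    return max(0.5 * (ram_gb - 1), 0.25)


# Figures taken from the discussion: one 64 GB server vs. 8 GB RPI boards.
single_server = default_wt_cache_gb(64)
per_pi = default_wt_cache_gb(8)
four_pis = 4 * per_pi

# Number of 8 GB boards needed to match the single server's cache,
# ignoring the extra nodes a sharded cluster needs for config servers.
pis_needed = math.ceil(single_server / per_pi)

print(f"64 GB server cache: {single_server} GB")
print(f"4 x 8 GB RPI cache: {four_pis} GB")
print(f"RPI needed to match: {pis_needed}")
```

Keep in mind this only compares cache sizes. It says nothing about CPU, network, or the sharding overhead itself.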

Since

using your disks in a RAID 0 configuration might be a better avenue than having them in a single logical volume, but what you gain in performance you lose in resiliency. But I suspect you do many more reads than writes and you are not live, so if you have a good backup you do not need that much resiliency. You could do RAID 0 if and only if your current LVM2 volume is less than 50% used.

The best way to remove I/O bottleneck would be to increase RAM.

If you have the budget for SSDs, you might want to look at storage.wiredTiger.engineConfig.directoryForIndexes. It makes mongod keep index files in their own subdirectory of the data directory, which you can then place on a different disk so that reads of indexes that do not fit in the WT cache are faster.

If I can conclude:

  • increase RAM as much as you can on your current machine
  • distribute disk I/O as much as you can: RAID 0, indexes on other disks, logs on other disks (or null syslog)
  • shard if and only if the above 2 are not working