Ideal hardware setup for a PhD researcher drowning with a huge dataset?

Gabriel_Fair · December 6, 2020, 7:45pm

Background

I am a computational social science phd student (and citizen scientist) and love mongodb! All my research data is organized using mongodb (much to the chagrin of my advisor that would prefer to use flat files, which I begrudgingly admit can be much faster sometimes for analysis purposes).

This data collection project started out small but quickly grew into something I’m barely able to wield control of.

Attempts to Improve Performance

I have read every article and documentation regarding optimizing and speeding up my database and they have been very useful. They are mostly OS and Software (index, query) related. But I have reached a point where my hardware setup is holding me back from analyzing this data.

Desire

I am trying to grok where my bottlenecks are so I can make cost-benefit determinations regarding hardware options and changes.

Request

Can I get some advice on things I can do to make my research project more successful? Any advice could really help me. While limited on funds, If I knew more about various hardware optimizations, I could determine if it is worth spending my student-loan money on some upgrades.

Technical Details

I have one research digital anthropology project regarding social media interactions that are stored in 1 database with 4 collections with just the primary shard (no sharding). On a computer solely dedicated to mongodb. And I have many useful columns indexed. The media collection can be ignored, that setup is straightforward and while it doesn’t (to my knowledge) have any performance impact, I don’t even use it in my analysis.

Other than occasional data fixes, this database is offline and thus does not handle transactions, and is used for data science. (WORM)

Collection	Documents	Fields	(When flattened)	Indexes
Posts	45,012,061	32	69	13
Comments	24,551,804	21	40	8
User Profiles	819,957	30	35	2
Media	12,119,957	3	3	1

Current Setup:

MongoDB version: 3.6.21
Processor: 16 core Intel Xeon CPU E5-2630 v3 @ 2.40GHz
Memory: 64gb DDR4 ECC
Swap: Dedicated 64gb ssd
GPU: GTX 1070, 8gb vRAM, 1920 CUDA cores
OS: Ubuntu 18.04
OS Disk: Dedicated 16gb ssd
Database Location: 4TB LVM2 formatted XFS (4- 1TB Seagate Constellation 7200RPM)

Questions

In addition to any feedback from the community, I do have some specific questions as well:

What would the benefit be if I were to use a cluster of raspberry pi’s instead?
I did not consider sharding due to the additional complexity and risk when I was first setting up this project. But based on what I have read, it seems that I stand to gain significant performance by sharding the database. But I’m unsure of the various ways I could achieve this. Kubernetes, Raspberry pi’s, some dell Optiplex 7040 thin clients, etc… I am also worried about corrupting the data somehow and not noticing it until too late.
Based on watching my resource monitor during various operations and tasks, it is clear that I am I/O bottlenecked. Would it be worth it to move to shards or stick with my computer detailed above and use SSDs for the LVM2 instead? My neophyte knowledge prevents me from making a good mental comparison of the trade-off.
- What if I scaled up my LVM2 to use 8 500gb ssds instead of sharding? Or for that price, I could get 4x Raspberry v4 (model B-8Gb) pi’s and attach one of the 1TB hard drives to each.
Would it be smart to upgrade to version 4? Last year I tried to upgrade to mongodb version 4 and learned the hard way to carefully read the instructions. And lost about three weeks when the database had to be rebuilt using a single thread. (Mongo really needs multithreaded rebuilds!)
What would the trade-off be for creating a sharded cluster using shards that were not identical? Could I unintentionally cause a config bottleneck?
- I ask b/c I also have 4 different samsung android phones running ubuntu that meet the ARM microarchitecture requirement for running mongodb. Would it be worth the time & risk to enlist them in the cause?

Pavel_Duchovny · December 7, 2020, 6:21am

Hi @Gabriel_Fair,

That sounds like a fascinating project and we have a great series of blogs on performance improvement including sharding configuration:
Performance Best Practices: Benchmarking | MongoDB Blog

I recommend you go through them.

Having said that, I cannot understand why would someone want to.manage and tune HW without specific knowledge of the software when they can Use MongoDB Atlas and all its tools to scale up or harizontal seamlessly.

If the cluster does not require to be up all the time you cab pause it. We offer students packages and other promotions like Day Zero so you could try the product with topped up credits.

I think this is the solution I would go for using NVME storage for example.

Best
Pavel

Rhys_Campbell · December 7, 2020, 10:24am

Before looking at any hardware optimizations I would take a look at the queries an indexes first. More often than not it’s the area where you get the biggest payback in term of performance.

Gabriel_Fair · December 7, 2020, 1:48pm

Thank you @Pavel_Duchovny for the best practices. Very useful when I start dabbling in using shards.

I have learned a lot from administering my mongodb database. I have a background in enterprise HDFS admin, so it was neat to learn something new. I am also a tinkerer. All my phones are jailbroken and I’ve gotten debian running on my kitchen oven. So manage and tuning HW is like managing my truck and tuning it so I can get better mileage and performance. It also helps me better defend and recommend technology solutions to my colleagues when they come to me with a problem.

Thank you for your recommendation of trying Atlas. I guess I did not consider it since I do not have to pay for electricity, egress or ingress with my self managed solution. So in the two years of using this project I have had to pay zero dollars. I would be spooked to move everything to Atlas, only to find out 6 months later that my department’s volatile funding for students is canceling my grant, and then have to pay more to get everything off Atlas.

I looked around the web for an Atlas cost calculator and couldn’t find one.

Also my use case is analysis and discovery. Which means I don’t yet fully understand my data. So a lot of my quieres run for a bit only to come back empty. Which is why I’m worried about trying Atlas and blowing through my budget with queries that in hindsight were pretty dumb. But of course, I would be more careful if I were to start using Atlas. I will consider it.

Pavel_Duchovny · December 7, 2020, 10:16pm

Hi @Gabriel_Fair

I appreciate your interest in managing MongoDB and mastering it, so if you like it keep on going.

I would say that going to 4.0 or 4.2 will open a new world of opportunity to you as they introduce new aggregation/new indexes, like wild card and materialized views which can assist you digest your arbitrary queries better .

I suggest you to look for our whats new blogs at MongoDB Blog

Now regarding atlas you can open a free account and play with topologies and see the hourly and monthly bill for your deployment. In atlas you are paying for the time the cluster is up (when paused you pay a small storage fee)

So there is no billing per queries , you can run 100 or 1M and the cost will be the same.

I suggest you to explore this option, perhaps. You can consider offloading some small workload to ease the current system or try some new version without upgrading the current env.

Best
Pavel

steevej · January 17, 2021, 1:38am

I saw your reply in the linked thread. I did not participate at first because using Atlas was making so much technical sense. But it looks like it is out of the question. So I will try to give you some of my ideas.

The most important factor after

for performance is to have your working set (which included the appropriate indexes) in memory, otherwise you end up with disk I/O bottleneck. The indexes are taken care:

and you are aware of

Absolutely none. In particular with:

See https://docs.mongodb.com/manual/core/wiredtiger/ for the cache size calculation. If your I/O is your bottleneck then it means your working set is roughly bigger than ( (64 - 1) / 2) = 30.5 Gb. With 4 RPI, you can have 4 x ( (8-1)/2 ) = 14 Gb of WT cache. To get close to 30.5Gb of WT cache you would need more than 8 RPI and you would need to need to implement sharding for the data itself but your would need so RPI to run the config server. Having a cluster of machine does not help performance if you do not shard, it helps availability only. But

I agree with the additional complexity. Do not consider sharding UNLESS there is no other way. So forget your 4 RPI or your 4 jail broken phones. I am not here to put down RPI. I love RPI. I own 3 and would not consider anything else for what I do with them.

Since

using your disks in RAID 0 configuration might be a better avenue that having them in a single logical volume but what you gain in performance you loose in resiliency. But I suspect you do much more read than write and you are not live so if you have a good backup you do not need that much resiliency. You could do RAID 0 if and only if your current LVM2 is less than 50% usage.

The best way to remove I/O bottleneck would be to increase RAM.

If you have budget for SSDs you might want to look at storage.wiredTiger.engineConfig.directoryForIndexes so that index files are stored in different disks so that reads of indexes that do not fit the WT cache are faster.

If I can conclude:

increase RAM as much as you can on your current machine
distribute disk I/O as much as you can, RAID 0, index on other disks, log on other disks (or null syslog),
shard if and only if the above 2 are not working

system · February 1, 2021, 5:40pm

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.