I am a computational social science PhD student (and citizen scientist) and love MongoDB! All my research data is organized using MongoDB (much to the chagrin of my advisor, who would prefer flat files, which I begrudgingly admit can sometimes be much faster for analysis).
This data collection project started out small but quickly grew into something I can barely keep under control.
Attempts to Improve Performance
I have read every article and piece of documentation on optimizing and speeding up my database, and they have been very useful, but they are mostly OS- and software-level (indexes, queries). I have reached a point where my hardware setup is holding me back from analyzing this data.
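For context, here is the kind of software-level check I have been doing: a small Python sketch that walks an `explain()` winning plan and flags `COLLSCAN` stages (the sample plan documents below are made up for illustration, not from my real database):

```python
# Sketch: walk a MongoDB explain() document and flag collection scans.
# The sample plans below are illustrative, not from my actual data.

def scan_stages(plan):
    """Yield every stage name in a winningPlan tree."""
    yield plan["stage"]
    if "inputStage" in plan:
        yield from scan_stages(plan["inputStage"])
    for child in plan.get("inputStages", []):   # e.g. OR / SORT_MERGE stages
        yield from scan_stages(child)

def uses_collscan(explain_doc):
    """True if the winning plan falls back to a full collection scan."""
    plan = explain_doc["queryPlanner"]["winningPlan"]
    return "COLLSCAN" in set(scan_stages(plan))

# Abridged example explain() outputs:
indexed = {"queryPlanner": {"winningPlan": {
    "stage": "FETCH",
    "inputStage": {"stage": "IXSCAN", "indexName": "user_id_1"}}}}
unindexed = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}

print(uses_collscan(indexed))    # prints False
print(uses_collscan(unindexed))  # prints True
```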
I am trying to grok where my bottlenecks are so I can make cost-benefit decisions about hardware options and changes.
Can I get some advice on things I can do to make my research project more successful? Any advice would really help. Funds are limited, but if I knew more about the various hardware optimizations, I could decide whether it is worth spending my student-loan money on some upgrades.
I have one digital anthropology research project on social media interactions, stored in 1 database with 4 collections on just the primary shard (no sharding), on a computer dedicated solely to MongoDB. Many useful fields are indexed. The media collection can be ignored: its setup is straightforward, it has (to my knowledge) no performance impact, and I don't use it in my analysis.
Other than occasional data fixes, this database is offline; it does not handle transactions and is used purely for data science (write once, read many).
MongoDB version: 3.6.21
Processor: 16-core Intel Xeon CPU E5-2630 v3 @ 2.40 GHz
Memory: 64 GB DDR4 ECC
Swap: dedicated 64 GB SSD
GPU: GTX 1070, 8 GB VRAM, 1920 CUDA cores
OS: Ubuntu 18.04
OS disk: dedicated 16 GB SSD
Database location: 4 TB LVM2 volume, XFS-formatted (4× 1 TB Seagate Constellation 7200 RPM drives)
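To make my I/O worry concrete, here is the back-of-envelope math I have been doing. The per-drive throughput and IOPS figures are rough assumptions for illustration, not benchmarks of my actual drives:

```python
# Back-of-envelope I/O estimate for the setup above. All device figures
# are assumptions for illustration, not measurements.

DATA_TB = 4          # on-disk data size
RAM_GB = 64          # WiredTiger's cache defaults to roughly half of RAM

hdd_seq_mb_s = 150   # assumed per 7200 RPM drive, sequential reads
hdd_iops = 100       # assumed per drive, random small reads
drives = 4           # LVM2 stripe across 4 drives (best case: all used)

# Best case: a fully sequential scan of the whole data set
scan_hours = (DATA_TB * 1e6) / (hdd_seq_mb_s * drives) / 3600
print(f"full sequential scan: ~{scan_hours:.1f} h")

# Worst case: random document fetches that miss the cache
docs = 1e9           # hypothetical document count
random_days = docs / (hdd_iops * drives) / 86400
print(f"{docs:.0e} random cache-miss reads: ~{random_days:.0f} days")
```

The gap between those two numbers is why my access pattern (random reads over a 4 TB set with only ~64 GB of RAM) feels so slow.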
In addition to any feedback from the community, I do have some specific questions as well:
What would the benefit be if I used a cluster of Raspberry Pis instead?
I did not consider sharding when first setting up this project because of the additional complexity and risk. But based on what I have read, it seems I stand to gain significant performance by sharding the database. I'm unsure of the various ways I could achieve this: Kubernetes, Raspberry Pis, some Dell OptiPlex 7040 thin clients, etc. I am also worried about somehow corrupting the data and not noticing until it's too late.
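For reference, my understanding of what the sharding setup would look like, sketched in the mongo shell. The database, collection, and field names are placeholders, the commands run against a mongos router (not a standalone mongod), and I have not actually tried this:

```javascript
// Sketch only: placeholder names, run against a mongos router.
sh.enableSharding("socialdata")

// A hashed shard key spreads inserts and reads evenly across shards,
// at the cost of efficient range queries on that field.
sh.shardCollection("socialdata.interactions", { post_id: "hashed" })

// Afterwards, check how the chunks ended up distributed:
db.getSiblingDB("socialdata").interactions.getShardDistribution()
```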
Based on watching my resource monitor during various operations and tasks, it is clear that I am I/O bottlenecked. Would it be worth moving to shards, or should I stick with the computer detailed above and use SSDs in the LVM2 volume instead? My neophyte knowledge prevents me from making a good mental comparison of the trade-offs.
- What if I scaled up my LVM2 volume to use eight 500 GB SSDs instead of sharding? Or, for that price, I could get 4× Raspberry Pi 4 (Model B, 8 GB) boards and attach one of the 1 TB hard drives to each.
Would it be smart to upgrade to version 4? Last year I tried to upgrade to MongoDB 4 and learned the hard way to read the instructions carefully; I lost about three weeks when the database had to be rebuilt using a single thread. (MongoDB really needs multithreaded rebuilds!)
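From re-reading the manual, my understanding of the safe upgrade path this time looks like the following (commands sketched from the docs, against a live server, with a full backup taken first):

```javascript
// Sketch of the stepwise upgrade; take a full backup before starting.
// 1. Confirm the current compatibility version:
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })
// 2. Replace the 3.6 binaries with 4.0 binaries and restart mongod.
// 3. Only after verifying the 4.0 deployment works, bump compatibility:
db.adminCommand({ setFeatureCompatibilityVersion: "4.0" })
// Repeat the same binary-then-FCV dance for 4.0 -> 4.2 and onward;
// major versions cannot be skipped.
```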
What would the trade-offs be of creating a sharded cluster from shards that are not identical? Could I unintentionally create a config bottleneck?
- I ask because I also have 4 different Samsung Android phones running Ubuntu that meet the ARM microarchitecture requirement for running MongoDB. Would it be worth the time and risk to enlist them in the cause?