Performance issue loading data into MongoDB

I have an export from a database running on SQL Server that I am trying to load into a MongoDB cluster. After about 48 hours the load rate comes to a screeching halt, and I need help understanding why and/or figuring out how to get the data loaded efficiently.

The export consists of about a hundred tab-delimited files totaling about 5 billion rows, and each row consists of [ hash (hash of username+teamnumber), username (varchar(75)), teamnumber (int), stamp (timestamp), score (bigint), wu (bigint) ]. The combination of username+teamnumber is unique for a given timestamp.

My MongoDB 5.0.6 cluster lives on AWS EC2 r5.xl Windows instances (4 CPU, 32GB RAM, 50GB OS C:, 500GB Data/Logs D:) with 12 database nodes (6 pairs of 2-node replica sets, each pair acting as primary/secondary for one shard). I have a database sharded across the 6 pairs with a collection using the field “hash” as the shard key. The collection has an additional index across [username, teamnumber, stamp].
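For reference, the sharding and index setup looked roughly like this (the database name is a placeholder; the collection and field names are the ones described above):

sh.enableSharding("<database>")                                                    // enable sharding for the database
sh.shardCollection("<database>.users_daily", { hash: "hashed" })                   // hashed shard key on the precomputed hash field
db.users_daily.createIndex( { "username" : 1, "teamnumber" : 1, "stamp" : 1 } )    // the additional compound index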

The collection currently holds 885 million documents, so the load is approximately 20% complete.

The data represents 3100 days of user data, and initially it was taking on average 5.25 minutes to load a day’s worth of data. However, the most recent two days of data have taken 111 minutes each.

At 5-6 minutes per daily data set, it would take an estimated 12 days to load this data. At almost two hours per data set, it will take almost two hundred days to load this data.

This can’t possibly be “normal” performance, right?

Further information:

For each of the 12 mongod servers, data files and log files are stored on the same 500GB gp3 EBS volume capable of 3000 IOPS, and monitoring shows an average of 1300 IOPS with the occasional spike up to 2200. The four CPUs are steady at around 12%, and 40% of the 32GB RAM allocated is free with less than 1% swap file use.

So it doesn’t feel like a hardware problem, meaning I’ve probably configured something incorrectly. But I don’t know what to look for.

I’ve turned off the balancer and deleted my one index, and it seems to have sped things up (down from 111 minutes per batch to ~7 minutes).
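For reference, this is roughly what I ran (the index key is the one described above):

sh.stopBalancer()                                                                  // pause chunk migrations during the bulk load
db.users_daily.dropIndex( { "username" : 1, "teamnumber" : 1, "stamp" : 1 } )      // drop the extra compound index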

So far I’m pretty unimpressed by the performance of MongoDB vis-a-vis SQL Server. SQL Server loaded this dataset, WITH indexes WITH concurrent data processing steps thrown in, in less than a week.

With all the hoopla around MongoDB, it makes me think I’m missing something, some kind of misconfiguration or hidden bottleneck / overhead somewhere, that is keeping me from the promised land.

What is the hardware configuration of your SQL setup?

I am not sure if I understand your setup correctly. Do you mean that you have 12 mongod servers running on a single EC2 instance? Running more than one mongod on a single machine is definitely not a good setup for performance, and it is definitely a bad implementation for data redundancy, especially if they are writing to the same disk: lose the disk and you lose it all.

A 2-node replica set is also not a recommended installation.

Where is mongos running?

How do you insert the documents? Could you please share your code? Is the insert code running close to the mongo system or remotely?

What is the total size of your input files?

Here is my physical server configuration:
1 mongos router server (quad 2.5GHz / 16GB RAM)
2 config servers (quad 2.5GHz / 16GB RAM)
12 mongod servers (quad 2.5GHz / 32GB RAM)

Each mongod server has a single instance of mongod.exe running, and each of the 12 is paired into a replica set, so I have 6 replica sets, each holding a single shard. I think.

(Yes, I know the recommended installation is a 3-node replica set, but this is not a production environment so a 2-node will suffice for me for now.)

Here is the code. It is running on a standalone server in a different subnet than the MongoDB instances (~50 ms away). The workflow is straightforward: a single-threaded DOS batch script walks the list of data CSV files, copies each one locally, imports it into MongoDB using mongoimport.exe, deletes the local file, and grabs the next one.

There are 105 files, each roughly 2.2G in size for a total of 233 gigs of data in 5.6 billion rows.

FOR /F "delims=" %%G IN ('dir \\<server>\f$\export\*.csv /b') DO (

  REM copy the next export file locally
  copy \\<server>\f$\export\%%G C:\Scripts\MongoDB\daily\

  REM import it into the sharded collection through the mongos router
  mongoimport.exe --db <database> --collection users_daily --file C:\Scripts\MongoDB\daily\%%G --type csv --headerline --host=<router> --port=27017

  REM remove the local copy before grabbing the next file
  erase C:\Scripts\MongoDB\daily\%%G

)

Since you do not care about data integrity (all mongod processes are writing to the same volume), you might be better off having 12 single-node replica sets on your 12 EC2 instances.

That would reduce the replication traffic (from primary to secondary) to its minimum and give something like 12 * 32/2 = 192GB of memory (compared to 96GB) to the storage engine.

Wait, what? What about this setup indicates that I don’t care about data integrity? Is it the two-member replica set? Damn…

Also: how does one edit a post on this forum? (I accidentally hit ctrl-enter, which is apparently “post”.)

Am I understanding right that in order to tolerate the failure of a node, that node needs to be part of a three-node replica set, otherwise it won’t fail over properly?

If that is the case, then to build a six-shard database across replica sets it would require 18 servers? All to get performance that is (so far) sub-par to a single SQL Server with 128 gigs of RAM.

And no, each shard has its own 500GB volume for data. I mis-typed it in my earlier post. Each mongod node has its own 500GB volume.

Apples and oranges.

Have you tried a standalone mongod on the same hardware? That’s the only way you can compare.

Yes. To have a primary node, you need a majority of the nodes in the replica set to vote for one. With 2 nodes, if one fails, you end up with 1 node and no majority is possible.

Please give details about your shard key.


I think you are right. I’m going to let the load run for a bit longer, then I’ll probably tear it all down, set up a single r5.4xl mongod server, and see how its performance behaves.

That’s a bummer about the way replica sets work. I’d hoped that a two-node cluster would simply become a one-node cluster since one node is automatically a majority. After reading up on it, it looks like I need an arbiter to make it work right. Which begs the question: can I set up a single machine that can be an arbiter for all six 2-node replica sets?

And my shard key is a hash. Technically it is a hash of a hash: The unique record for this collection is [(userID),(teamnumber)] so in preprocessing I create an additional column called “hash” which is the hash of the combination of userID and teamNumber. I ran the command

sh.shardCollection("database.collection",{hash: "hashed"})

to create the hash key for the shard collection.

It seems to be working correctly, as network traffic, storage used and memory used is similar across all six primary nodes.

Probably. Arbiters are not data-bearing, so they are not really involved in replication. You might even be able to use the mongos instance for that purpose.
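If you go that route, a rough sketch would be one small arbiter mongod per replica set, all on that single machine, each with its own port and dbpath; then, connected to each replica set’s primary (the host name, port, and replica set name below are just placeholders):

// e.g. started with: mongod --replSet shard1rs --port 27021 --dbpath D:\arbiters\shard1rs
rs.addArb("arbiter-host:27021")    // repeat against each replica set's primary, with a different port per set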

As for the index, I think you incur processing overhead by making it a hashed index; I think mongos will rehash your hash value before distributing it. I think you are better off having a normal index on your hash field, or forgoing the hash field altogether and making a hashed index on your userid and teamnumber fields. The former will save processing time but the latter will save space. Saving space often saves time too, as there is less data to move and to read/write.

Hi @A_Grikk

There is a lot of info to unpack here, and I believe @steevej has touched on all of it.

Having said that, I’d like to take a step back a little and maybe add my take on it :slight_smile:

I have a database sharded across the 6 pairs

Is there a reason for you to start with a sharded cluster, which is arguably the most complex deployment method and requires deeper knowledge? Could the workload not be served by a replica set with larger machines instead of a sharded cluster with smaller machines?

At 5-6 minutes per daily data set, it would take an estimated 12 days to load this data. At almost two hours per data set, it will take almost two hundred days to load this data.

Since you didn’t mention the sharded collection setup, and you mentioned that turning off the balancer sped things up, what I’m guessing happened is that the chunks were not pre-distributed to the shards before the import, so the import was happening while the cluster was constantly trying to rebalance itself, slowing the import down. For a sharded collection, the recommended approach is to pre-distribute the chunks before importing (see Create Chunks in a Sharded Cluster).
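For a hashed shard key on an empty collection, the pre-splitting can be done when the collection is sharded, along these lines (the database name and chunk count here are just examples):

// Pre-create and distribute the initial chunks, then keep the balancer quiet during the load:
sh.shardCollection( "<database>.users_daily", { hash: "hashed" }, false, { numInitialChunks: 120 } )   // e.g. ~20 chunks per shard for 6 shards
sh.stopBalancer()        // optional: pause the balancer while mongoimport runs
// ... run the imports ...
sh.startBalancer()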

(Yes, I know the recommended installation is a 3-node replica set, but this is not a production environment so a 2-node will suffice for me for now.)

As @steevej mentioned earlier, the recommended setup is a 3-node replica set, for many reasons (high availability, majority writes, majority reads, sharded transaction support, etc.). Unless you cannot avoid it, or unless you fully understand the implications that using arbiters entails, having arbiters in a replica set is not a recommended setup.

If that is the case, then to build a six-shard database across replica sets it would require 18 servers?

Yes. A sharded cluster is not a small deployment. Also, in your case with 6 shards, you would need a minimum of 3 more nodes to act as the config servers, so a total of 21. It was designed for very large workloads.

I guess my question is twofold: first, have you considered using a single replica set, which would more closely mimic your monolithic SQL Server setup? And second, if you really need the performance of a sharded cluster and the complexity that comes with it, have you considered using Atlas so that all of these operational concerns are managed for you?

Best regards
Kevin


First off: thank you both for your time and effort in helping a MongoDB beginner get his bearings! I really appreciate it.

As for the index, I think you incur processing overhead by making it a hashed index; I think mongos will rehash your hash value before distributing it. I think you are better off having a normal index on your hash field, or forgoing the hash field altogether and making a hashed index on your userid and teamnumber fields. The former will save processing time but the latter will save space. Saving space often saves time too, as there is less data to move and to read/write.

It was my understanding that only one field in an index can be hashed, so only userID or teamNumber can be the hashed one. Right? So I’d create a hashed index like this:

db.collection.createIndex( { "userName" : "hashed", "teamNumber" : 1 } )

or alternatively, I could just create an index on

db.collection.createIndex( { "hash" : 1} )

then shard on it

sh.shardCollection("database.collection", { "hash" : 1 } )

Right?

Is there a reason for you to start with a sharded cluster, which is arguably the most complex deployment method and requires deeper knowledge? Could the workload not be served by a replica set with larger machines instead of a sharded cluster with smaller machines?

Because I was interested in using the sharding and replica set functionality of MongoDB. I already had a working setup with SQL Server, so I wanted to investigate the performance characteristics of a multi-shard NoSQL solution. I freely admit to being a NoSQL beginner, but standing up a single server creates an environment essentially the same as a MySQL/MSSQL/Postgres deployment, albeit with a NoSQL twist. Teaching myself the complexities of a sharded deployment while trying to get a more performant solution is the goal here.

Since you didn’t mention the sharded collection setup, and you mentioned that turning off the balancer sped things up, what I’m guessing happened is that the chunks were not pre-distributed to the shards before the import, so the import was happening while the cluster was constantly trying to rebalance itself, slowing the import down. For a sharded collection, the recommended approach is to pre-distribute the chunks before importing (see Create Chunks in a Sharded Cluster).

You are correct. I did not predeploy chunks. I’ll be reading up on chunks now, thank you. :slight_smile:

As @steevej mentioned earlier, the recommended setup is a 3-node replica set, for many reasons (high availability, majority writes, majority reads, sharded transaction support, etc.). Unless you cannot avoid it, or unless you fully understand the implications that using arbiters entails, having arbiters in a replica set is not a recommended setup.

I was trying to cut down on the number of instances by having a 2-node RS instead of three, but I understand now that three is the way to go.

I guess my question is twofold: first, have you considered using a single replica set, which would more closely mimic your monolithic SQL Server setup? And second, if you really need the performance of a sharded cluster and the complexity that comes with it, have you considered using Atlas so that all of these operational concerns are managed for you?

With all of these posts, I think I’m going to cancel my current upload effort and work with a single mongod replica set to see how it performs. Then if I can get it to perform near to what SQL Server does I’ll move on to a sharded deployment.

And the reason I’m not using Atlas is that I want to learn about those operational concerns to better understand the platform from an architecture perspective. Also, I get AWS access for free, so I’m using EC2 instances for my build.


Hi @A_Grikk

In that case, welcome to the community!

Ah now I understand the goal of this exercise. Glad to be able to help on the start of your journey.

However I must mention that although the main function of both MongoDB and SQL Server is to store and retrieve data, they’re built very differently, have different ways to optimize things, and take different approaches to modelling data (tabular vs. document). To jump-start your journey, may I recommend some free MongoDB University courses, in particular M001 (MongoDB Basics) and M103 (Basic Cluster Administration).

There are other courses as well that touch on performance, data modeling, and many more topics, but considering your interest, I’d suggest starting with M001 and M103.

Best regards
Kevin


I want to add 3 things.

Field names

Field names are part of each JSON/BSON document, so in this case, to reduce the amount of data sent and stored, you may want to use user and team rather than username and teamnumber.
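For example, a document would then look something like this (values made up):

{
  hash:  "c0ffee1234",                      // precomputed hash of user+team
  user:  "some_user",                       // instead of "username"
  team:  12345,                             // instead of "teamnumber"
  stamp: ISODate("2022-03-01T00:00:00Z"),
  score: NumberLong("123456789"),
  wu:    NumberLong("4321")
}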

The field hash

If the hash field is only present to enable a hashed shard key, you could remove it and let the system hash user and team. Like I mentioned earlier, I think you incur the cost of computing the hash even if you supply a field that is already a hash value.

Day to day use-cases

Loading the data into the system is one use-case. Before throwing out your shard setup, you should consider your day-to-day use-cases, which are probably more important to optimize. Would your typical queries be better served by a monolithic server or a sharded cluster? The shard key also has an impact on your day-to-day use-cases. With a hashed shard key, you end up with most queries being scatter/gather.


Field names - Field names are part of each JSON/BSON document, so in this case, to reduce the amount of data sent and stored, you may want to use user and team rather than username and teamnumber.

Interesting. That’s good to know.

The field hash - If the hash field is only present to enable a hashed shard key, you could remove it and let the system hash user and team. Like I mentioned earlier, I think you incur the cost of computing the hash even if you supply a field that is already a hash value.

How is hashing user and team together accomplished? I thought that a command like

db.collection.createIndex( { "userName" : "hashed", "teamNumber" : 1 } )

hashes only on userName and leaves teamNumber unhashed. How can I create a hash based on userName+teamNumber?

Day to day use-cases - Loading the data into the system is one use-case. Before throwing out your shard setup, you should consider your day-to-day use-cases, which are probably more important to optimize. Would your typical queries be better served by a monolithic server or a sharded cluster? The shard key also has an impact on your day-to-day use-cases. With a hashed shard key, you end up with most queries being scatter/gather.

The day-to-day operation of this collection is a daily add of a single batch of 2.3 million additional documents, with constant end-user read queries being performed throughout the day. The most common query will be SELECT * FROM collection WHERE userName=? AND teamNumber=? ORDER BY stamp
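Expressed in mongosh terms, that would be roughly (the values here are placeholders):

db.collection.find( { "userName" : "some_user", "teamNumber" : 12345 } ).sort( { "stamp" : 1 } )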

So a hash key on username, teamnumber will ensure that the most common query by far will be accessing specific data located on a single shard.

I have to admit that you are stretching my knowledge. B-(

One thing that I am certain of is that adding stamp to the index will improve the performance of that most common query.

You probably asked that because you tried:

c.createIndex( { "name" : "hashed" , "team" : "hashed" } )

and got:

MongoServerError: A maximum of one index field is allowed to be hashed but found 2 for 'key' { name: "hashed", team: "hashed", stamp: 1 }

What is nice is that a field does not have to be a simple value; it can be an object. So if your data looks like:

{ _id: 0,
  actor: { name: 'steevej', team: 'Elder' },
  stamp: 0 }

you could:

c.createIndex( { "actor" : "hashed" , stamp : 1 } )

which would effectively compute the hash of name+team. But I have never used that in a real application setup, so I have no clue about the implications.

What I cannot confirm or refute is whether such a hashed key will really keep your most common query on a single shard.

Thanks everyone for your replies and insights! It’s been very helpful as I dive into this platform.

I ended up tearing down my whole environment and building a single mongod server on an r5.4xl box (16 CPU, 128GB RAM) with a collection with a single index on username/teamnumber/stamp. Initially the rate of data ingestion had it on pace to finish in 2.7 days. I recognize that the index will slow things down a bunch, but the SQL Server setup has this index on it, and to make it an apples-to-apples comparison I wanted the environment to be as similar as possible.

When I woke up this morning, the rate of ingestion had slowed to a projected completion in 4.32 days. Regardless of the completion date, I’m going to let this one run its course so I can manipulate the data set and see how it behaves in its more typical use case. If it runs well, /then/ I’ll consider replica sets and sharding to see if I can boost performance.


Hi @A_Grikk

Here’s some additional thoughts from me:

  1. Reading the production notes and ensuring the deployment conforms to the recommended settings is worthwhile. This will ensure minimal issues in the long run.
  2. The recommended production setup for MongoDB remains a replica set instead of a standalone instance (a minimum of 3 data-bearing nodes, sans arbiter). Properly set up, a replica set should be competitive with a standalone instance in terms of performance. Plus, you gain additional features not available on a standalone node, such as change streams and transactions.
  3. Schema design and indexing can make or break MongoDB performance, and the optimization methods are very different from tabular databases. So please don’t be discouraged if the typical SQL optimization methods don’t really work. I’m sure people in the community would be able to help you should you have issues with either subject. Pro tip: be familiar with the explain results to optimize queries and index usage; see the sketch after this list.
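For example, something along these lines (collection and field names taken from earlier in the thread, values are placeholders) will show which index a query uses and how many documents it examines:

// Check the winning plan, index usage, and documents examined for the common query
db.collection.find( { "userName" : "some_user", "teamNumber" : 12345 } ).sort( { "stamp" : 1 } ).explain("executionStats")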

Best regards
Kevin
