Poor driver performance, how to optimize?

Hi,

We are comparing MongoDB vs our current SQL database solution.
One thing we noticed is that in the C# driver, retrieving around 1 million documents is super slow compared to the SQL version.

Some background information: we have a collection containing 70 million documents. One query we are currently testing retrieves some documents based on a filter. When that result is retrieved, it takes over 1 minute before the list is filled. In SQL, the same query on the same table takes 15 seconds to fill the same list.

Is there a way to optimize this?

Have you performed your comparison on the same hardware?

Have you performed your comparison on a cold or a hot server? Filling the cache with data from disk can be slow, so if SQL was hot and Mongo was cold then you are comparing apples and oranges.

Do you have indexes that support your query?

I think that a use-case that needs to download 1 million documents/rows should be redesigned. For example, if you download 1 million documents to do some calculations, then you should redesign so that the calculations are done on the server with the aggregation framework.
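For illustration only, here is a minimal sketch of that idea with the .NET driver. The connection string, collection name and field names ("sales", "year", "amount") are placeholders I made up, not your schema:

    using MongoDB.Bson;
    using MongoDB.Driver;

    // Minimal sketch: total an "amount" field per "year" on the server instead of
    // downloading 1M documents and summing client-side. Names are placeholders.
    var client = new MongoClient("mongodb://localhost:27017");
    var collection = client.GetDatabase("test").GetCollection<BsonDocument>("sales");

    var pipeline = new[]
    {
        new BsonDocument("$group", new BsonDocument
        {
            { "_id", "$year" },
            { "totalAmount", new BsonDocument("$sum", "$amount") }
        })
    };

    var totalsPerYear = collection.Aggregate<BsonDocument>(pipeline).ToList();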

Note that, being schemaless, MongoDB needs to send field names over the wire with every document. With 1M documents you might have hit a bandwidth bottleneck. One way to optimize is to $project only the fields needed for the computation.
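With the .NET driver that could look roughly like this (again a sketch; the field names are invented and the collection is the one from the previous snippet):

    // Ask the server to send only the fields the computation needs (and drop _id).
    var projection = Builders<BsonDocument>.Projection
        .Include("year")
        .Include("amount")
        .Exclude("_id");

    var slimDocs = collection.Find(new BsonDocument("year", 2023))
                             .Project<BsonDocument>(projection)
                             .ToList();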

Also, blindly copying one SQL table to one MongoDB collection fails to tap into the flexible nature of JSON documents. For example, if your rows/docs carry dates, you could use the bucket pattern and group all rows/docs with the same date into a single document. This reduces the total number of documents and the data duplication. Your 1_000_000 docs might end up being only 100_000.
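As a rough illustration of that bucket pattern (document shape and field names are invented): one document per date, with the individual rows pushed into an array inside it.

    // Upsert the bucket for a given date and push the row into its array.
    var bucketFilter = Builders<BsonDocument>.Filter.Eq("date", "2023-05-01");

    var update = Builders<BsonDocument>.Update
        .Push("rows", new BsonDocument { { "qty", 3 }, { "amount", 42.5 } })
        .Inc("rowCount", 1);

    collection.UpdateOne(bucketFilter, update, new UpdateOptions { IsUpsert = true });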

That is it for now. I hope others will join the conversation.

Hi Steeve,

Thanks for your insights. Let me indeed provide some more info:
Yes, the comparison is done on the same hardware, in fact on the exact same server, so I don't think hot versus cold servers is the issue. We do have an index that supports the query, both in SQL and in MongoDB.

I also think you’re right about the use case, that it’s bad design. In this case it was merely a benchmark.

Hopefully others will chime in as well :slight_smile:


Hi there,
I'm using C# too, and I'm also doing a benchmark project in an isolated environment (yes, my laptop: i7, SSD).
I replicated the same dataset on MS SQL Express and MongoDB Community 7.0, around 931k records (documents), and indexed the properties the query needs in both.

In MongoDB Compass the result is very, very impressive: a filtered result with nearly 20k records, and the explain plan said it took 400 ms. Really cool. But in C# with the MongoDB driver, the same query condition returning the same records took nearly 10 seconds to convert into a model array.

Can anyone tell me what's happening in the C# driver, or how I can improve the read speed?
Thanks!

Do you have the source code you’re running to do the comparison?

What SQL flavour are you using? Maria or MS or Oracle or something else?

/Edit I see in the follow-up from James that he's using SQL Express.

In the C# code, is it pulling ALL results back or just the first set of results? I assume that Compass will open a cursor and get the first batch of results back, which could be quick. You could check the network usage between the two use cases to verify that the same amount of data is flowing.

Hi John,
Thanks for your feedback,

var collection = IntraMongoDB.GetCollection<RawBsonDocument>("br_Report");

var filter = new BsonDocument();   // assumed: filter is a plain BsonDocument (its declaration wasn't in the snippet)
filter.Add("dbname", new BsonDocument()
    .Add("$in", new BsonArray(new string[] { "TMT" })));
filter.Add("Data_year", new BsonDocument()
    .Add("$in", new BsonArray().Add("2023")));

var projection = Builders<RawBsonDocument>.Projection.Exclude("_id");

var rawBsonDoc = collection.Find(filter).Project<RawBsonDocument>(projection).ToList();

I've tested another object type, RawBsonDocument, after seeing another forum article, and boom, the result was outstanding: it took only ~1100 ms to get all the data (around 22k records), about 8 to 9 times faster than BsonDocument.
But all I need is my data model, so I used BsonSerializer.Deserialize to convert each RawBsonDocument to my model; unfortunately, that's very slow.
Here’s the forum thread I mentioned.
https://www.mongodb.com/community/forums/t/c-net-core-3-1-driver-2-11-1-slow-tolist-data-manifestation/8783/28

So I think the bottleneck shouldn't be the network, since I can fetch the data very quickly (and yes, MongoDB is located on localhost).
Might it be a mapping or converter issue?

Depending on your model, could you create your own deserialiser optimised for your model?
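For example, something like this (purely a sketch; ReportRow and the field names are hypothetical) maps only the handful of elements the model needs out of each RawBsonDocument in your rawBsonDoc list, instead of running the general-purpose deserialiser over the whole document:

    // Hypothetical cut-down model, for illustration only
    public class ReportRow
    {
        public string DataYear { get; set; }
        public decimal Amount { get; set; }
        public int Qty { get; set; }
    }

    // Hand-rolled mapping from the raw documents fetched earlier
    var rows = new List<ReportRow>(rawBsonDoc.Count);
    foreach (var raw in rawBsonDoc)
    {
        rows.Add(new ReportRow
        {
            DataYear = raw["Data_year"].AsString,
            Amount   = raw["Amount"].ToDecimal(),
            Qty      = raw["Qty"].ToInt32()
        });
    }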

Some questions:

Do you really really need to download 22_000 documents and convert them?

Are you sure that whatever you do with that many documents cannot be done with an appropriate aggregation pipeline?

You only exclude the _id; are all other fields needed for whatever you do with these documents? Could you include only the fields that are needed? Even in SQL, doing SELECT * is considered wrong.


Good point. Our model has 1700 fields, so everything we do has a projection down to just what we're dealing with. If we tried to deserialise a full document, that's a serious overhead when you may only need a handful of fields.


Yes, maybe others don’t need such a huge amount of data.

But my main point is that the driver seems to need a significant amount of time to do the deserialization.

However, in my case, we need to pull this data to perform some post-calculation (pivot analysis) within our application.
If we want the flexibility of the pivot, the data has to be close to raw; otherwise we would need to prepare 10+ kinds of aggregation pipelines.

*We've also grouped the data before putting it into MongoDB to reduce the dataset, but currently these ~22k data points are all we need.

Since mongod is running on localhost on a laptop, I suspect that your laptop is memory-starved at this point. I do not know much about C#, but I suspect that RawBsonDocument is essentially how the data is received from the server. When you start converting to your model, you might need twice the amount of memory.

Do you have any metrics about CPU and RAM usage during the process?

Rather than using a high-level API like ToList(), you might want to try converting to your model in a loop, one document at a time, making sure each RawBsonDocument is released from the received list (to free memory) before converting the next one.
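Something along these lines, perhaps (just a sketch, reusing the collection, filter and projection from your snippet; MyModel stands in for your model class):

    // Stream the results one server batch at a time rather than materialising
    // everything with ToList().
    using (var cursor = collection.Find(filter)
                                  .Project<RawBsonDocument>(projection)
                                  .ToCursor())
    {
        while (cursor.MoveNext())
        {
            foreach (var raw in cursor.Current)
            {
                var model = BsonSerializer.Deserialize<MyModel>(raw);
                // ...use the model; the raw document can then be collected
                //    before the next batch arrives...
            }
        }
    }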

What does a document look like? Can you give a sample one?

The most likely cause of this isn't the driver itself, but rather an issue of syntax and formatting.

Make sure that you're not building your data model the same way you would an SQL model, because I can tell you, as someone who has built and run a 700 TB blockchain node, that SQL doesn't even hold a candle to the speed and performance of the MongoDB C# driver.

Please post your data model and how you have it laid out, and I'm sure we can help you optimize it.

SQL will never be even close to being as fast as JSON/BSON No-SQL…

I'm being dead honest: there's nothing close. And that's from recent experience with 970 million transactions a minute across a multi-sharded cluster, in a setup that's already far from ideal.

I would encourage checking out the MongoDB University "MongoDB For SQL Developers" class, as that may highlight some issues you might not realize you've built in.

This also isn't some fanboy thing either; I'm very agnostic about database administration and usage. But do make sure that your use case is even relevant to NoSQL DBs, as NoSQL DBs are inherently for horizontal, not vertical, data model designs. So make sure your use case is a fit for MongoDB and vice versa.

Hi Brock and John,

Thanks for your advice, I'll dive into MongoDB University.
But as you said, an RDB can never handle a mega dataset the way MongoDB can, so perhaps I just haven't found the right way to work with MongoDB yet.

Because my dataset contains sensitive data, please give me some time to create a similar setup to simulate my issue. Thanks.

Hi John, I've created a sample project to demonstrate the issue I encountered.
Please take a look

MongoDbBenchmark
The GenerateFakeData can generate 1M documents.
And the MongoDbBenchmark measures two model types, List<…> and List<…>.

FYI

Thanks James, I'm away from home at the moment but shall take a look when I'm back tomorrow.

When I worked at MongoDB, just for general reference, there were customers handling hundreds of thousands of transactions a second just using Atlas Functions and JavaScript.

That's just for a reference of the capabilities; some customers had even more than that, just on Atlas Functions, when you have third-party intermediary services via middleware and proxies.

It's not a cached-service design like DragonflyDB, though; but your use case hasn't even scratched the surface of what it can handle. So most commonly it will be an index or aggregation issue.

Or it can be just how you modeled your data, if that makes sense. I'll take a look at your sample project later.

I don't advocate much at all for the Device Sync product line, but the core product is extremely solid; solid enough that 10% of its features can handle 90% of actual production workloads.

There's full ACID support, among many other things, which is why when I see complaints like this it's usually just a training issue.


Out of interest, how long does the generation routine take on your machine? It's taking a long time to run on my workstation with a local MongoDB.

Debugging it, it seems that nearly all of the time goes into the re-creation of the Faker class within each function call.

Not using the faker, it takes about 10 ms to create 100 data points; with the faker it takes about 500 ms per 100.

Pulling the Faker class outside the loop and passing it into the generation routines takes about the same (10 ms), so spinning up a Faker instance is REALLY expensive:

            // excerpt from the Bogus rule set that builds each fake document
            .RuleFor(x => x.GrossProfitWoReturn_TMC_Spec, f => f.Random.Double(0, 2000));

            // create the Faker once, outside the generation loop...
            var faker = new Faker();

            // ...and pass it into the generation routines
            var issueDate = GenIssueDate(faker);
            s.IssueDate = issueDate.IssueDate;
            s.IssueYear = issueDate.IssueYear;

        private static (DateTime IssueDate, int IssueYear, string IssueMonth, string IssueQuater) GenIssueDate(Faker faker)
        {
            var issueDate = faker.Date.Between(new DateTime(2012, 1, 1), DateTime.Now);

            int issueYear = issueDate.Year;
            string issueMonth = issueDate.Month.ToString().PadLeft(2, '0');
            string issueQuater = ConvertMonthToQuater(issueDate.Month);

            return (issueDate, issueYear, issueMonth, issueQuater);
        }

Example without faker class:

        private static (DateTime IssueDate, int IssueYear, string IssueMonth, string IssueQuater) GenIssueDate2()
        {
            int range = (DateTime.Now - new DateTime(2012, 1, 1)).Days;
            var issueDate = new DateTime(2012, 1, 1).AddDays(_random.Next(range));   // random date between 2012-01-01 and now

            int issueYear = issueDate.Year;
            string issueMonth = issueDate.Month.ToString().PadLeft(2, '0');
            string issueQuater = ConvertMonthToQuater(issueDate.Month);

            return (issueDate, issueYear, issueMonth, issueQuater);
        }

        private static (int Qty, decimal UnitCost, decimal Amount, decimal AmountLocalCurrency) GenSalesCost2()
        {
            var qty = _random.Next(2000);
            var unitCost = _random.Next(2000);
            var amount = qty * unitCost;
            var amountLocalCurrency = amount * (0.5m + (decimal)_random.NextDouble() * 30m);

            return (qty, unitCost, amount, amountLocalCurrency);
        }

Shall take a look at the retrieval code next.

/Edit I re-did the test and the graph is now as expected. I had a limited collection for testing, so once it hit the size of the smaller collection the graph flattened because it was processing the same data volume! My bad.

I ran a quick check on scaling. This is doing a .ToList(), so you're pulling all the data back as opposed to streaming it and processing it as it comes in.

If there is a large overhead in converting to an in-memory object and you do a lot of processing on the object, you probably want to be able to run things in parallel, so as data comes in (perhaps in batches) you pass it to another routine to process while the system is preparing the next batch.
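A rough sketch of that idea (hedged; ProcessBatch is a placeholder for whatever the client-side work is, and the collection/filter are the ones from the earlier snippets):

    // Overlap fetching/deserialisation with processing: hand each server batch
    // to a worker task while the cursor goes off and fetches the next batch.
    var workers = new List<Task>();

    using (var cursor = collection.Find(filter).ToCursor())
    {
        while (cursor.MoveNext())
        {
            var batch = cursor.Current.ToList();              // copy this batch
            workers.Add(Task.Run(() => ProcessBatch(batch))); // process it in parallel
        }
    }

    Task.WaitAll(workers.ToArray());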

I wanted to check whether there was a flat performance hit or whether it scaled with volume; it seems the cost scales linearly with data volume (which, to be honest, is as expected):

[image: benchmark chart, one line per model type]

X-Axis is data volume and Y is time in ms to process.

I’m tied up most of today on work deliveries but shall have a play and see what could be done to bring the three lines together.

I guess another question is WHY you're pulling this data into the app, what are you going to do with it? I may have missed it earlier in the thread, but what calculation or processing did you want to do, and then what do you want to do with the result (i.e. chuck it back to Mongo, update documents with calculated fields, or push it to another system)?

I'm not sure about some of the comments on performance in this thread; a variety of database engines can all easily deal with large data volumes. It's what you want to do with the data that's the key, and what you want the client to do.
I work with a variety of platforms, from Mongo to IBM i-series machines. The iSeries is a complete beast for transaction processing in volume, as the banking and insurance world can attest; processing all direct debits in the UK for a bank's clients, for example, involves a rather large number of transactions. As far as I'm aware most airlines still run on IBM hardware, and that's also a lot of data volume.

As has been mentioned, if you're doing a grouping on this data, do it on the server and THEN pull the result back; there is no point pulling all the data into a .NET model to then do a sum by year when you can just do this on the server, where the data lives.

If you're doing something crazy on the client, then you may want to look at a view model: project the data down and have a model that represents just what you want.
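For example (a sketch; ReportSummaryView and the fields it keeps are invented, and the collection/filter are from the earlier snippet), combining a view model marked with [BsonIgnoreExtraElements] and a server-side projection:

    // Cut-down view model: only the fields the client actually works with.
    [BsonIgnoreExtraElements]
    public class ReportSummaryView
    {
        public string Data_year { get; set; }
        public double Amount { get; set; }
    }

    // Project on the server so only those fields cross the wire and get deserialised.
    var viewProjection = Builders<RawBsonDocument>.Projection
        .Include("Data_year")
        .Include("Amount")
        .Exclude("_id");

    var view = collection.Find(filter)
                         .Project<ReportSummaryView>(viewProjection)
                         .ToList();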

Anyway, I’ll have a play more later when some time opens up.

2 Likes

Using a cut-down view model to return just some of the data:

Which makes sense; it takes a while to take a document from the server, convert it to a .NET object with type checking etc., and then serve that up.

It seems that the time difference between returning a model and a BsonDocument is on par; it's the conversion from raw data to the BSON representation in .NET that is taking the time.

I've not delved into the .NET driver internals, but the use cases seem to be: if you want raw performance to plough through the data as fast as possible, get the documents as raw BSON objects and deal with those; if you are going to process them into a strongly typed model in your app, then accept the overhead of the conversion, but only do that when you're not pulling a million records from the server.

Checking about a bit I found this:
https://jira.mongodb.org/browse/CSHARP-666

This seems to allow getting the document as raw BSON and then deserialising just the elements you need, as they are used, as opposed to deserialising the whole document. This could be a halfway approach: grab the data as fast as you can and then process it as and when needed, at which point you take the overhead. Note the details in that ticket about how it deserialises nested documents.
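In practice that might look something like this (the field names are assumptions, and rawBsonDoc is the list from the earlier snippet; the point is just that untouched elements never pay the deserialisation cost):

    // Fetch as RawBsonDocument (fast), then materialise individual elements
    // lazily, only when they are actually needed.
    foreach (var raw in rawBsonDoc)
    {
        if (raw["Data_year"].AsString != "2023")
            continue;                               // skipped docs are never fully deserialised

        var amount = raw["Amount"].ToDecimal();     // only this element is materialised
        // ...feed 'amount' into the pivot calculation...
    }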

/Edit I created a pull request so you can see what I did to generate the above. I was trying to use generics to pass in different view models, but I kept failing on the syntax and didn't have the time to work through it:

Apologies to anyone who finds the code layout offensive :slight_smile: