Slow performance when iterating over large documents

I currently have a collection of ~50 documents (each around 1 MB), where each document contains some metadata and an array of up to 10,000 basic elements, each holding a time and a value.
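For reference, each document is shaped roughly like this (the field names below are just illustrative, not the exact schema):

```python
# Approximate shape of one document: a little metadata plus a large array of
# samples, ~1 MB in total. Field names are placeholders.
doc = {
    "device_id": "sensor-42",              # metadata
    "run": 17,
    "samples": [                           # up to ~10,000 entries
        {"time": 0.000, "value": 1.23},
        {"time": 0.001, "value": 1.25},
        # ...
    ],
}
```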

When iterating over the queried cursor, fetching each document takes up to 0.5 s. Is that normal performance for documents this large, or is there something wrong?

It could be normal. It depends on many factors.

Please, share the characteristics of your installation.

Personally, I am worried about having a small number of documents that each contain an order of magnitude more basic elements. It feels unbalanced; it looks like the bucket pattern has been over-exploited.

Do you really need all 10k basic elements in the majority of your use cases? If so, and depending on what you are doing with those 10k data points, maybe you should consider having the server do the work using the aggregation framework.
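As a rough sketch (assuming the array field is called `samples` and holds `time`/`value` pairs, as described above), a pipeline like this lets the server reduce each 10k-point array to a handful of numbers instead of shipping ~1 MB per document to the client:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # adjust to your deployment
coll = client["testdb"]["measurements"]             # placeholder names

# Compute per-document summaries on the server; only a few numbers per
# document are returned instead of the full samples array.
pipeline = [
    {"$project": {
        "count": {"$size": "$samples"},
        "mean":  {"$avg": "$samples.value"},
        "std":   {"$stdDevSamp": "$samples.value"},
        "max":   {"$max": "$samples.value"},
    }},
]
for summary in coll.aggregate(pipeline):
    print(summary)
```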


Installation:
I have the default configuration of MongoDB 3.6 installed on a machine with 64 GB of RAM.

Regarding use case:
In the majority of use cases I end up going through all 10k data points iteratively, doing some statistical analysis that is a bit too complicated for the aggregation framework.
I tried reducing the bucket size (from 10k to 100 elements), but that provided only a minor performance improvement.

I thought about just storing each data point as a separate document, but I figured the bucket pattern would be a decent fit here.

In case it makes any difference, I use PyMongo to access the database and perform my analysis.

Is that a dedicated server, or is something else running on it?

Is your client code running on the same machine or remotely?

What kind of permanent storage?

What kind of CPU?

Downloading that much data for your normal use case will be hard to optimize.

Despite being complicated, it would be a good idea to try to do it with the aggregation framework.

Is the data changing regularly?

You could pre-compute part of the intermediate results, something along the lines of Building With Patterns: The Computed Pattern | MongoDB Blog.
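For example (a sketch only, reusing the placeholder `samples`/`value` field names from above), a one-off or on-write job could store a small `stats` sub-document next to the raw array:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # adjust to your deployment
coll = client["testdb"]["measurements"]             # placeholder names

# Pre-compute cheap summary statistics once and keep them on the document,
# so reads that only need the summary never pull the 10k-point array.
for doc in coll.find({}, {"samples.value": 1}):
    values = [s["value"] for s in doc.get("samples", [])]
    if not values:
        continue
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    coll.update_one(
        {"_id": doc["_id"]},
        {"$set": {"stats": {"count": len(values),
                            "mean": mean,
                            "variance": variance}}},
    )
```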

No

The client connects to the machine remotely. To clarify, the client connects to read test data for an application from the DB.

SSD

16-core CPU

The DB is mostly used for reading data; the data is modified rarely.

I understand that for this use case it's hard to optimize the data reading, but I thought that perhaps I made some DB design/usage mistakes that impacted performance.
I also wondered what speed could be expected from MongoDB with my setup when a 1 MB document is being fetched.

Definitely a good case for the Computed Pattern mentioned earlier.


Hello, I'm a colleague of Arnas. We are using MongoDB as a test data database, i.e. we read test inputs for our test application from the database, so this unfortunately rules out the Computed Pattern. The raw data is always needed.

So is MongoDB feasible only for use cases where a small amount of data is needed at once? If this is the case, we might need to go back to the drawing board.

Nothing stops you from keeping the raw data along with some computed metrics for the data that will not change.
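On the read side, whenever the pre-computed metrics are enough, the big array can be excluded with a projection so only a few hundred bytes cross the network (again assuming the placeholder `samples`/`stats` field names):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # adjust to your deployment
coll = client["testdb"]["measurements"]             # placeholder names

# Exclude the large array; only the metadata and the small stats
# sub-document are transferred for these queries.
for doc in coll.find({}, {"samples": 0}):
    print(doc["_id"], doc.get("stats"))
```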

I understand the need to keep the raw data for some Ad hoc queries.

Any data storage system that you use to download that much data for every use case will be slow.

Yes.

Hi @Arnas_Stonkus ,

You said that in the majority of use cases you end up going through all 10k points iteratively.

Is 0.5 seconds the time taken to iterate or just to fetch? And how do you measure this?
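For example, `find()` is lazy in PyMongo, so one way to separate fetch time from analysis time is something like this (connection details and names are placeholders):

```python
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # adjust to your deployment
coll = client["testdb"]["measurements"]             # placeholder names

t0 = time.perf_counter()
docs = list(coll.find({}))     # network transfer + BSON decoding happen here
t1 = time.perf_counter()
# ... run the statistical analysis on `docs` here ...
t2 = time.perf_counter()
print(f"fetch: {t1 - t0:.2f} s, analysis: {t2 - t1:.2f} s")
```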

I would suggest performing a test from the same server instance (DB and application) to remove network latency. Your network performance would be an important factor here.

Also, do you have any database server metrics / performance monitoring enabled to check where the bottlenecks are?

Regards,
Wan.
