Queries with large volumes of returned data

I have a large time-series data collection. Think 10 Hz sampled data running 24/7. It’s mostly jagged tabular data.
The primary index on this is time.
If you run a query to return a week’s worth of data, this is roughly 6K rows per column in each document.
The goal is to do some fairly compute-intensive calculations on this data.
Running the computation inside the database performs about as I’d expect for a query followed by a CPU-based execution engine. However, returning that volume of data to be processed by different computing resources is slower than I’d expect for an I/O operation. I suspect this is because the data is returned in a text format and not a binary format. I’m not sure how to change how data is returned. If this is possible, I’m not searching for the right things.

What is the best way to deal with large volumes of data being returned from a MongoDB query?

Hi @winterberry,

Welcome to the MongoDB Community forums :sparkles:

Can you be a little more specific about the dataset size? For example, what is the size of your time-series collection?

Please share a sample document.

Also, share the output of

db.collection.stats()

and

db.getCollectionInfos({name: "<time-series collection name>"})

Do you mean 6K documents? MongoDB doesn’t have the concept of rows and columns. Can you clarify this?

Also, what specific query are you executing to get the result?
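For reference, here is a rough count of the samples I would expect from a 10 Hz stream over one week. This is only a back-of-the-envelope sketch: it assumes a constant sampling rate with no gaps, and the function name `samplesInWindow` is just an illustrative helper, not anything from your setup.

```javascript
// Rough sample count for a continuously sampled signal.
// Assumptions (not from the original post): constant rate, no gaps.
function samplesInWindow(rateHz, days) {
  const secondsPerDay = 24 * 60 * 60; // 86,400 seconds
  return rateHz * secondsPerDay * days;
}

const perWeek = samplesInWindow(10, 7);
console.log(perWeek); // 6048000
```

That is about 6 million samples per week per signal, so knowing how those samples map to documents would help explain the 6K figure.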

Can you clarify your approach to running the calculations on the data? Will they be done at the database level using aggregation pipelines, at the application level, or through some other method?
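To illustrate what I mean by the database-level option, below is a minimal sketch of an aggregation pipeline that reduces the data on the server before it is returned. The field names `ts` and `value`, the helper name `buildDownsamplePipeline`, and the date range are all hypothetical, since I have not seen your schema yet. It also assumes MongoDB 5.0 or later, where `$dateTrunc` is available.

```javascript
// Build a pipeline that averages a measurement per time bucket.
// Hypothetical field names: "ts" (time field) and "value" (measurement).
function buildDownsamplePipeline(start, end, unit) {
  return [
    // Restrict to the requested time window first.
    { $match: { ts: { $gte: start, $lt: end } } },
    // Group samples into buckets of the given unit (e.g. "minute").
    {
      $group: {
        _id: { $dateTrunc: { date: "$ts", unit: unit } },
        avgValue: { $avg: "$value" },
        count: { $sum: 1 },
      },
    },
    // Return buckets in chronological order.
    { $sort: { _id: 1 } },
  ];
}

const pipeline = buildDownsamplePipeline(
  new Date("2023-01-01T00:00:00Z"), // placeholder start of window
  new Date("2023-01-08T00:00:00Z"), // placeholder end of window
  "minute"
);
// In mongosh, this would be passed to aggregate() on your collection.
```

If your calculations cannot be expressed as a reduction like this, the full data set still has to be transferred, which is why the details of your workload matter.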

What do you mean by “CPU-based execution engine” here? Also, which specific computing resources are you referring to when you mention “different computing resources”?

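The query returns a cursor over the result set, which we can then iterate. Note that the server sends documents over the wire as BSON, a binary format. The shell displays them as JSON-like text, and drivers decode them into native objects.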

Could you kindly help me understand what you mean by “best” and what “deal” refers to? Also, is the large amount of data returned by the query itself causing a problem?

Also, please provide some information about your MongoDB deployment. Specifically:

  1. What version of MongoDB are you currently running?
  2. Is this an on-premises deployment or a hosted deployment such as MongoDB Atlas?

Furthermore, refer to the Best Practices for Time Series Collections to learn how to improve performance and data usage for time series collections.

Best,
Kushagra
