Time-series inserts unreasonably slow

My team is investigating using time-series collections in Mongo 6. From our experiments, inserting the same dataset into a time-series collection with the same insertion code, on the same hardware (3x replica set, via a mongos with a single shard, self-hosted), with a fresh collection each time, as compared to a standard, unindexed collection, is slower by a factor of about 60. That’s not 60 percent more time, that’s 60 times as much time (an hour vs a minute). Obviously, we were expecting some loss of write speed in exchange for the promised improvements in query speed and storage size, but this is egregious, which leads us to believe we are doing something profoundly wrong.

The data in question is sensitive, so we cannot provide it, or our code, but we can share the following:

  • Each document consists of a timestamp (which we provide as a native timestamp), a single numeric series id (with cardinality on the order of 100 to 1000, set as the metadata field), plus anywhere between ~10 and several dozen numeric measurement fields, with no arrays or nested documents.
  • All documents with the same series id share identical sets of measurement fields, and fields do not change type between documents.
  • The data we receive is locally ordered by time, but not guaranteed to be globally ordered (i.e. we receive “chunks” of sorted data, but chunks may be received out of order).
  • No indexes except those created automatically by mongo are used.
  • We are using the Java sync driver.
  • We have tried batching anywhere between 100 and 100,000 documents before calling insertMany, with no noticeable change in total upload time.
  • We have tried both single-threaded and multi-threaded insertion (up to 32 threads on a 16-core machine), with no noticeable change in total insert time (measured from log timestamps, counting only from when the first insertMany is called to when the last returns). We have not tried parallelizing at the node level, but we have confirmed we are not bottlenecked by network I/O.
  • We have tried disabling ordered writes, with either no noticeable change or a slight increase in total upload time (which runs counter to our expectations based on the documentation). A minimal sketch of the batched, unordered insert pattern follows this list.
  • No other workloads are accessing the cluster during test runs
  • mongostat shows “spikes” of 1,000–2,000 inserts in one printout row, followed by several seconds to a few minutes of absolutely no activity
  • No errors or warnings are logged from the shard data servers, mongos servers, config servers, or the java driver.
  • Our estimated working set of one full run (based on raw data size multiplied by a fudge factor) comfortably fits within a single server’s memory, several times over. We have confirmed via Prometheus metrics that we are not getting anywhere close to our CPU or memory limits, and we do not have swap enabled.
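
For reference, the sketch below shows the batched, unordered insertMany pattern described in the list above, using the Java sync driver. The connection string, database/collection names, batch size, and field values are placeholders rather than our real ones (the real documents are built from a proprietary input format).

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.InsertManyOptions;
import org.bson.Document;

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class BatchInsertSketch {
    public static void main(String[] args) {
        // Placeholder URI and names; the real cluster details are not shown here.
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("metrics").getCollection("measurements_ts");

            List<Document> batch = new ArrayList<>();
            for (int i = 0; i < 100_000; i++) {
                // Straw-man document: timestamp, integer series id as the metaField,
                // and a few flat measurement fields.
                batch.add(new Document("timestamp", new Date())
                        .append("metadata", i % 500)
                        .append("a", i)
                        .append("c", i * 2.5));

                if (batch.size() == 10_000) {   // batch sizes from 100 to 100,000 were tried
                    coll.insertMany(batch, new InsertManyOptions().ordered(false));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                coll.insertMany(batch, new InsertManyOptions().ordered(false));
            }
        }
    }
}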

Any advice as to what we can investigate or change is greatly appreciated.

Hi :wave: @Andrew_Melnick,

Welcome to the MongoDB Community forums :sparkles:

  • We understand the data is sensitive, but can you provide a sample document (not an actual one), just so we can understand what the documents look like?

Also, please share the following information so we can better understand the problem:

  • The size and structure of the dataset you are inserting into the time-series collection
  • The index details of both the time-series and non-time-series collections
  • The Java sync driver configuration and the Java driver version
  • The MongoDB version you are using
  • The metaField and granularity used when creating the time-series collection
  • The hardware configuration
  • Have you considered using a different driver or client library to see if there are any performance improvements?

Best,
Kushagra

  • The size and structure of the dataset you are inserting into the time-series collection
  • We understand the data is sensitive, but can you provide a sample document (not an actual one), just so we can understand what the documents look like?
  • The metaField and granularity used when creating the time-series collection

As mentioned, each document consists of a native BSON timestamp field, an integer metadata field, and anywhere between 10 and a few hundred measurement fields, depending on the document (a mix of ints, doubles, and strings), in a flat structure. The same set of fields is present every time for a given metadata field value. The granularity is set to “seconds”, as time-adjacent documents with the same metafield generally fall within the same second.
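
For completeness, the collection is created roughly along the lines of the following sketch (driver calls only; the URI and names are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.model.CreateCollectionOptions;
import com.mongodb.client.model.TimeSeriesGranularity;
import com.mongodb.client.model.TimeSeriesOptions;

public class CreateTimeSeriesSketch {
    public static void main(String[] args) {
        // Placeholder URI and names.
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            // timeField "timestamp", metaField "metadata", granularity "seconds",
            // matching the document layout described above.
            client.getDatabase("metrics").createCollection("measurements_ts",
                    new CreateCollectionOptions().timeSeriesOptions(
                            new TimeSeriesOptions("timestamp")
                                    .metaField("metadata")
                                    .granularity(TimeSeriesGranularity.SECONDS)));
        }
    }
}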

A straw-man version of one of the documents might look like:
{ "timestamp": ISODate("…"), "metadata": 1, "a": 0, "b": "1", "c": 2.5, … }

The total data size of the test runs is around 250MiB of (highly compressed) raw data; I don’t have a good estimate of the on-disk size once it hits mongo. The time taken for both collection types seems to scale linearly up from smaller test data sets. We do not have a good _id field, so we allow the driver to generate it client-side.

  • The index details of both the time-series and non-time-series collections

No indexes were created for either. The time-series collection has the clustered index that is silently created by virtue of being a time-series collection, and both have the implicit unique index on _id.
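
If it helps, this is a minimal sketch of how we can dump the index listings with the same driver to double-check that (URI and names are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class ListIndexesSketch {
    public static void main(String[] args) {
        // Placeholder URI and names.
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            MongoDatabase db = client.getDatabase("metrics");
            // Print whatever indexes exist on both collections.
            for (Document idx : db.getCollection("measurements").listIndexes()) {
                System.out.println("regular:     " + idx.toJson());
            }
            for (Document idx : db.getCollection("measurements_ts").listIndexes()) {
                System.out.println("time-series: " + idx.toJson());
            }
        }
    }
}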

  • The Java sync driver configuration and the Java driver version
  • The MongoDB version you are using

We are running MongoDB 6.0.1 on Debian 11, using mongo-driver-sync 4.8.2.

  • Have you considered using a different driver or client library to see if there are any performance improvements?

We have not. The input data is in a proprietary format for which the conversion code is written in Java, so using a different language would introduce additional complexity; aside from the C++ driver, we didn’t see any supported languages that we thought would produce substantial enough performance gains to be worth investigating. We have tried to follow all of the best practices, such as using a single MongoClient for the application, using multiple threads to perform the pre-insert conversion, batching documents and using insertMany, disabling ordered writes, etc. As mentioned, almost the entirety of the time is spent in the insertMany calls.
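
For context, the threading pattern looks roughly like the sketch below: one shared MongoClient, a worker pool doing the conversion, and an unordered insertMany per batch. The URI, names, pool size, and convertChunk are stand-ins for the proprietary conversion code, not the real thing.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.InsertManyOptions;
import org.bson.Document;

import java.util.Date;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelInsertSketch {
    // One MongoClient shared by the whole application, per the driver best practices.
    private static final MongoClient CLIENT = MongoClients.create("mongodb://mongos-host:27017");

    public static void main(String[] args) throws InterruptedException {
        MongoCollection<Document> coll =
                CLIENT.getDatabase("metrics").getCollection("measurements_ts");
        ExecutorService pool = Executors.newFixedThreadPool(32);

        // Each locally-ordered "chunk" of input is converted and inserted on a worker thread.
        for (int chunk = 0; chunk < 100; chunk++) {
            final int chunkId = chunk;
            pool.submit(() -> coll.insertMany(convertChunk(chunkId),
                    new InsertManyOptions().ordered(false)));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        CLIENT.close();
    }

    // Stand-in for the proprietary-format conversion; returns fully built documents.
    private static List<Document> convertChunk(int chunkId) {
        return List.of(new Document("timestamp", new Date())
                .append("metadata", chunkId % 500)
                .append("a", chunkId));
    }
}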

  • The hardware configuration
    We have 2 mongos, 3 configsvr, and 3 data nodes which form a replica set for a single shard. The exact hardware details are abstracted from us, as we are running on a private cloud, but we have at least 16 cores and 32GiB memory for each data server. The inserter application is dynamically allocated hardware at a rate of 1 core per 2 threads.

While we obviously plan to utilize more powerful hardware in production, possibly even dedicated (virtual) nodes, I’d like to stress the fact that our issue is not “this particular mongo cluster is too slow” but “timeseries collections are so much slower that they break our plans and budget”. We obviously plan to add more shards as volume grows, but we planned to do so at a given rate, and going from 1 to 60 shards right now to try and claw back that 60x slowdown is not in our budget.

Thank you for reaching out, let me know if there are any other details that would be of use, and I can try to get them.

Hey :wave: @Andrew_Melnick,

Welcome to the MongoDB Community forums :sparkles:

I’ve generated some random sample data from a script. Could you please confirm whether the format of the data below matches your data?

{
  "_id": {
    "$oid": "6417ffc4918a044f1b529663"
  },
  "timestamp": {
    "$date": "2023-03-20T12:02:09.265Z"
  },
  "meta": 124,
  "field_0": "irvwmcvctzyeuzuxicg",
  "field_1": "hljgf",
  "field_2": "frkkzeytdwhdvfs",
  "field_3": "ndzdkxv",
   ... <120 more fields in some documents, fewer in some other documents, randomly> ...
  "field_123": "tkxwugdqsfnlgmmzpctn"
}

While inserting the 1 million documents in my environment, the time-series collection took about twice as long as the regular one, which is far less than the 60 times you are experiencing.

Can you confirm specifically here whether it is 100 or more than that?

In general, time-series collections work best when the schema is consistent, so the server can take advantage of the columnar storage pattern they were designed around. An inconsistent schema runs counter to an efficient columnar pattern and may result in suboptimal storage/performance for time-series collections.

For more information refer to the Best Practices for Time Series Collections

What is the approximate/actual number of documents you are inserting? Also, please share the collStats of your regular collection.
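
For example, the stats could be fetched from the Java driver you are already using with something like this sketch (URI and names are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class CollStatsSketch {
    public static void main(String[] args) {
        // Placeholder URI and names; run against the regular (non-time-series) collection.
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            Document stats = client.getDatabase("metrics")
                    .runCommand(new Document("collStats", "measurements"));
            System.out.println(stats.toJson());
        }
    }
}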

Is the 60x slowdown seen when you are doing the workflow simulation or while importing the data?

Further, the bottleneck could be anywhere in the system, from the insertion process to the write being acknowledged by the server.

However, based on the information you have shared, it appears that a time-series collection is probably not the right solution for your use case; a regular collection is likely more suitable.

Best,
Kushagra

We believe we have found the solution. The issue stemmed from a disconnect between the description of the data we were receiving and what was actually in the data.

A consequence of this is that the metafield was not being produced correctly, resulting in 2 unique values as opposed to the ~1000 that were expected. This meant all of the data was getting funneled into the same bucket. As you mentioned, we were striving for a consistent schema (each unique metafield value uniquely determines a set of fields in this data, with all fields present in every document for that metafield value), but the incorrect metafield values disrupted that.

Once we updated our pre-processing code to work with the data as it is, not as it was described, and to generate the correct metafields, performance returned to being on par with the unindexed collection.
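
For anyone hitting something similar, the shape of the fix was along the lines of the sketch below: make sure the real series id (cardinality in the hundreds to ~1000) ends up in the metaField. The ParsedRow type and field names are made up for illustration; the actual conversion code is proprietary.

import org.bson.Document;

import java.util.Date;
import java.util.Map;

public class MetafieldFixSketch {
    // Hypothetical stand-in for one parsed row of the proprietary input.
    record ParsedRow(Date timestamp, int seriesId, Map<String, Object> measurements) {}

    // Build one time-series document, putting the actual series id into the metaField
    // ("metadata") instead of a value that collapsed to only 2 distinct values.
    static Document toDocument(ParsedRow row) {
        Document doc = new Document("timestamp", row.timestamp())
                .append("metadata", row.seriesId());
        row.measurements().forEach(doc::append);
        return doc;
    }
}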

