How to deal with a big array of objects in a document? Poor performance as the array gets bigger

Hello guys, could anyone give me some insight into how to solve my DB’s performance issue?
This is the scenario: I have an API endpoint that returns an array of objects, and each object contains a set of coordinates:

[
    {
        cid: 1234,
        name: "john",
        latitude: -27.98161,
        longitude: -66.71234
    },
    ...
]

Since I need to collect the coordinates to build a track, I’m fetching this data every 15 seconds and appending the coordinates to the document. This is a simplified version of my schema:

const trackSchema = new mongoose.Schema({
  cid: {
    type: Number,
  },
  track: [
    {
      latitude: {
        type: Number,
      },
      longitude: {
        type: Number,
      },
    },
  ],
})
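
For context, the update that runs every 15 seconds looks roughly like this (a simplified sketch, not my exact code; the model name, the bulkWrite batching, and the upsert behaviour are just illustrative):

const TrackModel = mongoose.model("Track", trackSchema)

// Called every 15 seconds with the array returned by the API.
// Each object's coordinates are pushed onto the matching document's
// track array; a document is created the first time a cid is seen.
async function appendCoordinates(points) {
  const ops = points.map((p) => ({
    updateOne: {
      filter: { cid: p.cid },
      update: {
        $push: { track: { latitude: p.latitude, longitude: p.longitude } },
      },
      upsert: true,
    },
  }))
  await TrackModel.bulkWrite(ops)
}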

I have a function that removes documents from the collection if they are not in the latest fetched data; however, most of the time data with the same cid keeps being fetched for hours, which means a HUGE track array gets built. In fact, it really didn’t take that long to slow my response time from 4 seconds to almost 2 minutes. The first query after I drop the collection and import data responds within 5 seconds, and it gets slower and slower from there.
This is the result if I call TrackModel.find({}):


I’ve run explain("executionStats") and find({}) in mongosh; these are the results:
[screenshot of the explain("executionStats") output omitted]
I don’t know why executionTimeMillis shows 6 when the actual response takes over a minute.

So my question is: what is the best practice to solve this issue? I’m thinking of creating another collection solely to store track points, associating each point with a cid while maintaining an order when I query the data, as in the sketch below. But I’m not sure if this is the best way to do it.
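
Roughly what I have in mind (just a sketch; the model names, the fetchedAt field, and the index are assumptions rather than a settled design): one small document per coordinate, referenced by cid, sorted when the track is rebuilt.

const trackPointSchema = new mongoose.Schema({
  cid: { type: Number, index: true },
  latitude: Number,
  longitude: Number,
  fetchedAt: { type: Date, default: Date.now },
})

const TrackPoint = mongoose.model("TrackPoint", trackPointSchema)

// Rebuild one cid's track, in insertion order, only when it is needed.
async function getTrack(cid) {
  return TrackPoint.find({ cid }).sort({ fetchedAt: 1 }).lean()
}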

Arrays are great but not necessarily the best schema for all scenarios. Continuously updating an array is one such scenario: because the storage engine does copy-on-write, the same data is written over and over.

Performance issues might be caused by many factors other than the schema:

  • Server hardware specification
  • Client hardware specification
  • Network specification
  • Size of data returned

If returning 410 documents takes a minute on the wall clock while the server executes the query in 6 ms, then either your documents are huge and network latency explains everything, or your whole setup lacks the resources required for your use case.

Thanks, I’m thinking the same thing, perhaps I need to change my schema to something else.

I’m testing this on my local machine with 36 GB of RAM and a 500 Mbps internet connection; my CPU is an AMD 5600X, and the DB is hosted in Atlas.
[screenshot omitted]
This is my network chart from Atlas.
[screenshot omitted]

I already disabled the update and rewrote the function so that when I query the documents I don’t perform any operation except .find(). However, the problem still exists: even if I only use TrackModel.find({}), it still takes over a minute, with no copy/write operations happening.
Is it possible that it has something to do with my index?
This is my current index; each element in the track array gets its own _id.

On a shared tier or dedicated?

Shared means shared with others. Everything others do on their clusters affects your cluster, and everything you do on your cluster affects theirs. To mitigate this effect, some limits are imposed.

You are absolutely correct. I contacted support yesterday, and it turned out my network outbound usage was hitting the limit because of this, which caused the throttling.
So are there ways to reduce network usage?

Downloading the whole collection with

TrackModel.find({})

is certainly not the best in terms of server and network usage. The best way to reduce network usage is to download less data. This is where the aggregation framework is important: whatever computation you do with your documents on the local machine should be moved to the server using an aggregation pipeline.
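
For example (just a sketch; which fields your client actually needs is an assumption on my part), if you only need the latest position per cid rather than the full track, a projection like this sends far less data over the wire than TrackModel.find({}):

// Project only what the client needs: the cid and the last point of each
// track, instead of the entire (and ever-growing) coordinates array.
async function getLatestPositions() {
  return TrackModel.aggregate([
    { $project: { cid: 1, lastPoint: { $slice: ["$track", -1] } } },
  ])
}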

These are great suggestions, but even if I can keep my data outbound and inbound within 3 MB per 15 seconds, that would still be around 17 GB per day. Support suggested the serverless option since it doesn’t have any network limit; however, after reading Usage Cost Summary for serverless instances, I’m not sure how the data transfer would be counted.
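
For what it’s worth, a rough back-of-the-envelope check of that number (assuming a constant 3 MB every 15 seconds; the real payload size obviously varies):

// Daily transfer volume at a constant 3 MB per 15-second cycle.
const mbPerCycle = 3
const cyclesPerDay = (24 * 60 * 60) / 15        // 5760 cycles per day
const gbPerDay = (mbPerCycle * cyclesPerDay) / 1024
console.log(gbPerDay.toFixed(1))                // ≈ 16.9 GB per day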

So, for example, if I update 1 MB of data and output 3 MB every 15 seconds, that would be counted as 4 MB of data transfer, right?

The answer to your last post is outside the realm of my knowledge.