Is it better to make frequent read operations or to pull all data periodically?

I have a collection that is updated frequently and contains a large amount of data. The schema is essentially:

{
  productType: Array,
  productName: Array,
  productPrice: Array
}

My users only care about finding products with prices lower than $100. Prices are constantly changing, so every 10 minutes I query a database that contains over 100,000 products and filter down to all products under $100. This brings the dataset down to about 20,000 items per array with a total size of about 2MB. Since these values change so frequently, I feel it’s best to store them in a single collection instead of writing each product to its own collection.
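A simplified sketch of that refresh job (the collection and field names here are placeholders, not my real ones, and the source read is stubbed out):

// every 10 minutes, rebuild the single "under $100" document
setInterval(async () => {
  // fetch the filtered products from the source (placeholder for the real source query)
  const cheap = await sourceDb.collection('allProducts')
    .find({ price: { $lt: 100 } })
    .project({ _id: 0, type: 1, name: 1, price: 1 })
    .toArray();

  // overwrite the single snapshot document with parallel arrays
  await appDb.collection('snapshot').replaceOne(
    { _id: 'under100' },
    {
      _id: 'under100',
      productType: cheap.map(p => p.type),
      productName: cheap.map(p => p.name),
      productPrice: cheap.map(p => p.price)
    },
    { upsert: true }
  );
}, 10 * 60 * 1000);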

On my frontend, the users would like to search for products by name or type to find the lowest prices. My current design is to pull the entire collection into memory with an API call that takes about 0.5-1 seconds. Then when they perform a search, all prices are already in memory and easily accessible. If they open the search window later and more than 30 minutes have passed, they pull the updated collection into memory.

The other alternative is to create API calls that show them only the product name or type they’re currently searching for. Each character typed would trigger a call, and the search results would show the lowest price for that input.
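Something like this hypothetical Express route (names are made up for illustration, and it assumes one document per product):

// one call per keystroke, returning the cheapest matches for the typed text
app.get('/search/:text', async (req, res) => {
  const results = await db.collection('products')
    .find({ productName: { $regex: req.params.text, $options: 'i' } })
    .sort({ productPrice: 1 })   // lowest price first
    .limit(20)
    .toArray();
  res.json(results);
});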

My question is: is it better to store the entire collection in memory, which results in faster searches on the client’s machine, or is it better to make frequent API calls that only return the relevant information? Which solution is more sustainable in the long run? If I have thousands of users, are fewer calls pulling more data better, or many frequent small calls?

Hi @John_N_A1 ,

I am not sure I fully understand the use case …

What do you mean by arrays with 20,000 elements? Is that the way the data is stored in the database?

Also, I’m not sure what you mean by a collection per product?

If so, it sounds fundamentally wrong for MongoDB, as massive arrays are a known antipattern.

Usually a data design for storing products is done with one document per product, where the price field can be indexed. This allows a quick find of products where the price is lower than $100.
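For example, a minimal sketch (field names are guesses based on your description):

// one document per product
{ productID: 1, productName: 'nikeShoe1', productPrice: 100 }

// index the price once
db.products.createIndex({ productPrice: 1 })

// this find can then use the index
db.products.find({ productPrice: { $lt: 100 } })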

Now, if this is a constant critical query in the database, you can have a separate collection for items lower than $100 and other collections for additional ranges. The problem here is that when a price is updated to more than $100, it involves a delete from one collection and a move to another, potentially using transactions… This can be a performance problem if this is a frequent scenario…
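Roughly, such a move with the Node.js driver would look like this (a sketch; collection and variable names are made up):

// move a product from the "under 100" collection to the "over 100" one atomically
const session = client.startSession();
try {
  await session.withTransaction(async () => {
    const doc = await under100.findOne({ productID: id }, { session });
    if (doc) {
      await under100.deleteOne({ productID: id }, { session });
      doc.productPrice = newPrice;
      await over100.insertOne(doc, { session });
    }
  });
} finally {
  await session.endSession();
}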

Can you elaborate more on the document you store and what is the ultimate goal of this design… Is it mainly for reporting?

Hi Pavel. Yes the document is structured like this (forget the productType):

productName: ['nikeShoe1', 'nikeShoe2', 'acura1', 'honda1',... 20,000 items],
productPrice: ['100', '120', '20000', '10000',... 20,000 items]

The dataset I’m pulling from actually has around a million items. I pull from that database every 10 minutes because the prices of the products frequently change and new products are constantly added. It wouldn’t be efficient to loop through and create or update a single collection per product, because I need all the updates as quickly as possible.

Then on my frontend, I read in all 20,000 elements in a single API call and store them in memory. I then convert the collection to a dataframe that is searchable by the user. So if a user types in “ni” it will display all Nike shoes with the lowest price shown first. I like this solution because all of the data is stored in the user’s memory for fast access with less frequent API calls.
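The search logic is roughly this (a simplified sketch in plain JavaScript; the real code goes through a danfo dataframe, but the idea is the same):

// products was loaded once from the single document and kept in memory
function search(products, input) {
  const term = input.toLowerCase();
  return products
    .filter(p => p.name.toLowerCase().includes(term))
    .sort((a, b) => a.price - b.price);   // lowest price first
}

search(products, 'ni');   // -> all the Nike shoes, cheapest first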

Or would it be better if I somehow did an insertMany (or an update if it already exists) to create a document per item? Then I could make the API call a search with something like:

db.users.find({'name': {'$regex': req.params.searchtext}})

?
Then I wouldn’t get the benefit of storing everything in memory and it would require many API calls on the client’s side.

Each product also has a unique identifier called productID

Hi @John_N_A1 ,

Having those massive arrays is a bad design for MongoDB.

I don’t see how it is easier to search that on the application side; do you traverse the arrays every time? As you can see, it also leaves you with the problem that changes are not reflected in real time.

So if you need the price updates as soon as possible, you need a structure that lets you query the data live from the database on every user interaction, but not with a design like the one you have at the moment.

Read about the massive arrays anti-pattern here:

Why is a document per product not good? Have you tried it?

Regex searches cannot use indexes effectively, and we recommend a text search solution instead: Atlas Search if you are on Atlas, or text indexes otherwise.
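For example, with a text index (assuming a products collection with a productName field):

// build a text index on the product name
db.products.createIndex({ productName: 'text' })

// full-text search instead of an unanchored regex
db.products.find({ $text: { $search: 'nike' } })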

I don’t follow why you search the users collection in your example?

I can try to do a bulkWrite instead of a single document. It seems like it would be slower to pull all documents instead of a single document that contains everything.

My application is written in Node.js, so I don’t think I’d be able to use Atlas Search. What do you mean by text indexes? If I have a document per product, I can try pulling all documents at once in an API call and cleaning up the data in a dataframe like I’m currently doing. But again, it seems like it would be slower to pull 20,000 documents at once rather than a single one.

Sorry, that code example was copied from Stack Overflow. In my case it would be db.products.find

Hi @John_N_A1 ,

Atlas Search is agnostic to the driver used. It’s an index built on the collection you are working against.

Read more here :

The driver only issues an aggregate with a $search stage to do a full-text search on the indexed collection.
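A rough sketch (assuming an Atlas Search index named “default” on a products collection with a productName field):

db.products.aggregate([
  { $search: { index: 'default', text: { query: 'nike', path: 'productName' } } },
  { $limit: 20 }
])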

Now I agree that loading all products will be faster if they are in one document than loading 20,000 documents, but it sounds like you need to do different filters and searches in real time.

Why do you need to load all 20,000 documents at once? Do you run analysis on them? You keep mentioning dataframes; is this related to pandas or machine learning? Can you explain what those dataframes are and how you use them?

Updating and filtering inside those massive arrays is not efficient. MongoDB has indexing, and a text index is a less performant alternative to Atlas Search, but it is still a better way to search text than regex operators.

Thanks
Pavel

The reason for loading it all in one document is speed. I have a script that makes sure the data is clean and pushes it to the one document. So the user can load the single document, and I can use a dataframe (https://danfo.jsdata.org/) to quickly sort the data and display what the user is searching for. The size of the document is relatively small, so it’s stored in memory for quick access.

If I use many documents, that means I’ll potentially have thousands of concurrent API calls from different users. My main question is: will my MongoDB database see more stress and eventually increased costs if users make infrequent API calls pulling a single document, or if they make thousands of calls querying for the documents that fit their search criteria?

Hi @John_N_A1 ,

I cannot say, as it depends on multiple factors and requires testing in a load environment.

It sounds like pulling the data every 10 minutes does not really work for you, and frequently pulling all the data into memory sounds like a waste of resources if it’s just to filter out a small portion.

I wonder, since the arrays hold plain values, how do you align the price with the product? By array index? I mean nikeShoe1 = 100, nikeShoe2 = 120?

Isn’t that a problematic way of aligning data? If a shift happens, all of the prices will be wrong.

If you are going to store all the data in a single document, doesn’t it make more sense to use an array of documents?

products: [ { name : 'nikeShoe1' , price :  100 } ... ]

Also, do you store the prices as strings, or is that just in the example?

You might know your application and this pandas-like framework better than I do, but it sounds like this design contradicts a lot of our best practices and potentially won’t scale…

I don’t think pulling all data into memory is a waste of resources, because it’s a fairly small amount of data (1-2MB) and the users may query many different products.

Yes, the product name and price arrays will always have the same number of values, and this is ensured by the script that loads the document. The Danfo dataframe can then convert everything into records like this example:
{ name : 'nikeShoe1' , price : 100 }

An array of documents would require more memory, and yes, the prices are actually floats.

Right now my solution does seem to be working efficiently. But I’ve also tried loading each product as its own document and will test Atlas Search. The bulkWrite takes about 30 seconds to a minute, and an API call that loads all the documents takes about 1.5 seconds, which is twice as long as pulling the single document with everything in it. Also, I’m not sure if Atlas Search is limited (I’m still reading the documentation), but if I search for something like:
nike
it will only return results like this:
nike, nike shoes
and it won’t return a result like this:
nikeshoes
I have wildcard enabled.

@John_N_A1 ,

I am not sure why you refer to bulkWrite when comparing it to reading a single document. bulkWrite is for writing chunks of data…

You probably understand your use case better, but I would not go with this single massive data array solution…

I mentioned bulkWrite because I used it to change my entire schema so I could test Atlas Search like you suggested. I loaded each product into its own document like this:

// build one upsert operation per product, then send them all in a single bulkWrite
const updates = [];
for (let i = 0; i < allProducts.length; i++) {
  updates.push({
    updateOne: {
      filter: { productID: allProducts[i].productID },
      update: {
        $set: {
          productName: allProducts[i].productName,
          productPrice: allProducts[i].productPrice
        }
      },
      // on upsert the productID from the filter is written into the new document
      upsert: true
    }
  });
}
const result = await searches.bulkWrite(updates);

So that allowed me to have a single document per product like you suggested. But then I saw the slower speeds I mentioned in my above post and the issue with Atlas Search. Is there a way to alter the search parameters so it would pick up nikeshoes if I searched nike?

Perhaps the regex operator?
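For example, a sketch with the Atlas Search regex operator (assuming productName is the indexed field and an index named 'default'; not tested against your index definition):

db.products.aggregate([
  {
    $search: {
      index: 'default',
      regex: {
        query: 'nike.*',
        path: 'productName',
        allowAnalyzedField: true   // needed when the field goes through an analyzer
      }
    }
  }
])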

