Read entire collection with each query OR pre-process data into a separate smaller collection?

Z_Knight_Z · October 13, 2022, 5:12am

Imagine I have a large collection of articles (say 20,000 - which may not be that large). Each article has a rating that changes as people interact with it - view it, share it, etc. The front page of my website will present the top 500 rated articles - so each time a user opens the front page they will extract these from the database.

Would it be better to serve this data using:

Letting each user query the entire collection to extract the articles sorted by rating and limited to 500 returned articles. I presume this requires the database to process the query by searching the entire collection until the top 500 articles are found - this would be repeated for each person requesting the data. I think this is essentially a order-n process.
Periodically, say every hour, have the backend process the collection and find the 500 top articles, but store their IDs in a separate collection: say called “top_articles”. Then, when each user access the front page, they would instead request the “top_articles” and then POPULATE the results with the actual article data before returning this list. Theoretically the users wouldn’t be constantly asking the database to sort the articles based on rating because that would be done once every hour and hence reduce the number of times the users do it - effectively preventing constant sorting that seems duplicative.

I’m trying to understand which is a better design? Or if you have an even better suggestion? I realize that if I am ranking the articles by their interactivity numbers (shares, likes, etc), those things would change during the hour and hence I would theoretically have stale rankings until the next hour where the backend does the processing - that’s ok.

Pavel_Duchovny · October 13, 2022, 2:09pm

Hi @Z_Knight_Z ,

If you index your rating field (the field in your article collection that indicate its rating) and have it sorted in descanding order , a query fetching the top 500 documents will only need to query the first indexed 500 (no colleciton scan)

For example if my article collection looks like the following:

{ _id : ARTICLE_ID,
  title : "SOME TITLE",
  rating : 10,
...
 }

Indexing {rating : -1} will allow me to quickly get the top 500 articles:

db.articles.find({}).sort({rating : -1}).limit(500);

Is that what you’ve been looking for?

Thanks
Pavel

Z_Knight_Z · October 14, 2022, 2:36am

Thank you for this - it helps a lot. I think this will work assuming me indexing already on title/slug (for search purposes) won’t interfere. Although I would still be curious to know which of my original options would be more efficient - or are you also saying indexing is more efficient too? What happens if I had say 100,000 items in the collection, instead of 20,000 - would indexing still be advisable?

Pavel_Duchovny · October 14, 2022, 5:28am

The bigger the collection the bigger the index , this makes the chance of the entire index to fit into memory leas likely (but it depnds on the ram as one field index in a int should not be large).

Otherwise 20k and 100k with an index still scan the same number of entries (500) …

Therefore this is the advisable way.

Thanks
Pavel

system · October 19, 2022, 5:29am

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.