Imagine I have a large collection of articles (say 20,000 - which may not be that large). Each article has a rating that changes as people interact with it - view it, share it, etc. The front page of my website will present the top 500 rated articles - so each time a user opens the front page they will extract these from the database.
Would it be better to serve this data using:
-
Letting each user query the entire collection to extract the articles sorted by rating and limited to 500 returned articles. I presume this requires the database to process the query by searching the entire collection until the top 500 articles are found - this would be repeated for each person requesting the data. I think this is essentially a order-n process.
-
Periodically, say every hour, have the backend process the collection and find the 500 top articles, but store their IDs in a separate collection: say called “top_articles”. Then, when each user access the front page, they would instead request the “top_articles” and then POPULATE the results with the actual article data before returning this list. Theoretically the users wouldn’t be constantly asking the database to sort the articles based on rating because that would be done once every hour and hence reduce the number of times the users do it - effectively preventing constant sorting that seems duplicative.
I’m trying to understand which is a better design? Or if you have an even better suggestion? I realize that if I am ranking the articles by their interactivity numbers (shares, likes, etc), those things would change during the hour and hence I would theoretically have stale rankings until the next hour where the backend does the processing - that’s ok.