Modal advice, indices and transitioning from Datastore to Mongodb

Hello there - new to this community but love how active it seems and looking forward to learn more.

I’ll keep it short, i’m looking for a good model that will fit my requirements. Will give short example of the bottlenecks i’m experiencing.

I have 2 ‘tables’, Company and Reviews

Company has id, name, timezone
Reviews has id, companyid, reviewer, reviewee, score, etc.

To keep it simple, assume i have >20k rows in Reviews.
I am currently, in my application, querying through the Reviews and filtering by the companyid at runtime when a user lands on a page.

It is taking ~16 seconds to query and get that data (around 5-6k entities) - I am using google cloud Datastore and transitioning now to mongodb thus looking for best practices on this.

  • What is high level advice which I can look into? companyid seems like a good index to start with.
  • Do you see anything wrong with this model growing quickly? the Reviews can be expected to go upwards of 500k in the next 3 months.
  • I assume the best thing is to keep the Reviews rows on their own - or can i use the concept of ‘parents’ aka the companyid being the parent of a list of review rows - is that pretty much just another ‘index’? I’m bringing terms I have seen on GC Datastore so take it easy on me on this one :slight_smile: Historically, I’ve worked with relational databases.
  • Update 1: Inserted 25000 sample Reviews rows and created an index on the companyid. To get all reviews for a company still took ~11 seconds - assuming right out of the gate that this is not the way to go about it. Now considering having a collection for each month and storing the review ids there. But then I’d have to be making 2 calls to the DB everytime i need to get reviews (not the worst solution, but looking for scalable alternatives)

Thanks,
Mihai

Hi @Mihai_Oprescu,

Welcome to MongoDB community.

It sounds as a company might have a large number of reviews (more than thousnds).

Therefore it doesn’t make sense to embed those in the company collection (doc size is 16mb max).

Storing the reviews in its own collection make sense and indexing the companyid as the relationship value is also good

If you query the reviews with a desc date you should add this field with -1 direction into the index.

It might be worth considering a partitioning strategy per month but I have a question.

How is the data presented in the application, do you need all reviews in a single page or just a batch of first documents?

What you might do is fetch only X first documents per company and when a user clicks on more get the next from the cursor. Or do sort by _id and limit , performing a $gt and limit for next batches…

You can maintain a totalReviews field in the company document incrementing its value each review so users can still get how many reviews there are without the need to count them over and over.

Thanks
Pavel

1 Like

Hi Pavel!

Yes, the bigger the company, the more reviews are coming in. I’m actually (pleasently?) surprised by how fast it’s growing but at the same time, some fires are here to be put out :smiley:

I will clarify that the reviews are currently in their own collection and I perform queries on the reviews collection.

I’m really glad you brought up the indexing advice. thanks for that ^.

Usually, there is a 3 month default range for which the reviews are needed and the user has quick date range options (6 months, 1 year, All Time)

I like the partitioning strategy per month. Furthermore, I was thinking of having user + month partitioning of reviews. For example, I will store a user’s reviews for january in collection user_12345_01 (user_:userid_:month)
Thoughts on that? If i rewrite the model, I might as well try to go for one which I won’t have to rewrite in 6-12 months.

  • Using mongo with nodejs, does the operation .find(…) on an index actually lookup by hash? I clearly need to get more intimately familiar with implementation details of mongodb.

  • Yes, the reason why all reviews for those date ranges need to be queried at once is because there are scores calculated using them.

As an immediate solution, I’ve added an in-memory caching for the functions which are bottle-necking right now.

Thanks Pavel!

If i create a (dynamic) collection for each user for each month, am i going to create too many indexes and then everything gets slow? for each user for each month sort of seems overkill, but i’d like to know the cons of it.

A collection per company per month might be a better choice.

Hi @Mihai_Oprescu,

Yes creating a collection per user per month sounds like an antipattern as you will have to many collections.

Plus I don’t think that user only needs to see ita own review… So that doesn’t sound like a good way to store the data.

What do you think about a database per company storing only company specific reviews or collection per company reviews with company id prefix ?

This sounds like reviews are strongly coupled with a company…

Now why wouldn’t you use a $merge materialize view that will pre calculate the scores every x minutes and than the user will use a query to score by the prepopulated score … You can actually index the score field to show top score then paging this make sense.

MongoDB identify a query shape and best index is with order of Equility Sort and Range ordered fields in the index.

So every field which is used as equality should be first than any sort field and finally the range filters …

Indexes are btrees which are sorted so finding an equality is considerably fast if its cardinality is low. So is sorting which is just advancing in the index.

Thanks
Pavel

1 Like