The best model between relational collections

IATN · June 24, 2024, 10:30am

for example i have 3 collection:
user:
db.users.insert({ _id: 1, username: ‘mark’ });
db.users.insert({ _id: 2, username: ‘leon’ });
db.users.insert({ _id: 3, username: ‘leontin’});

providers:
db.providers.insert({ _id: 1, name: ‘prov1’ });
db.providers.insert({ _id: 2, name: ‘prov2’ });
db.providers.insert({ _id: 3, name: ‘prov3’ });
db.providers.insert({ _id: 39, name: ‘prov39’ });

data:
db.data.insert({
_id: ObjectId(‘6675309f82bdfbe3b00febfa’),
user: 1,
provider: 1,
message: ‘94451 Laos Sweden Iran’,
});
db.data.insert({
_id: ObjectId(‘6675309f82bdfbe3b00febfb’),
user: 2,
provider: 3,
message: ‘message 22222’,
});
db.data.insert({
_id: ObjectId(‘6675309f82bdfbe3b00febfc’),
user: 3,
provider: 39,
message: ‘message 3333’,
});

P.S. i have more 2 related collection with ‘date’ but i write the example only with 2 related (normally are 4 collections)

i search ‘2 likes’ … over 3.200.000 rows in data, 20.000 users, 10000 providers

db.data.aggregate([
{
“$lookup”: {
“from”: “users”,
“localField”: “user”,
“foreignField”: “_id”,
“as”: “user_info”
}
},
{
“$lookup”: {
“from”: “providers”,
“localField”: “provider”,
“foreignField”: “_id”,
“as”: “provider_info”
}
},
{
$match: {
$expr: {
$and: [
{
$regexMatch: {
input: {
$first: “$user_info.username”
},
regex: “^leo”,
options: “i”
}
},
{
$regexMatch: {
input: {
$first: “$provider_info.name”
},
regex: “^prov3”,
options: “i”
}
}
]
}
}
}
])

example query: Mongo playground

the result are about 15+ seconds and i need over 100.000.000 in data collections (users and providers are ok)

i have index in data.user, data.provider, users.username, providers.name

how to improve this relations ? thinking that the 4 collections will constantly change the data (users.username, providers,name, …, …)

Regards, IATN

steevej · June 25, 2024, 2:20pm

I think the main problem with your issue is that you $match after 2 $lookup by starting the aggregation on data. You lookup all your data for both users and providers and match to only get a subset.

Assuming you have more users and providers, start the aggregation on users matching right away with your regex. This way you would only $lookup for providers for matching users.

An alternative way would be to move the $match on users, in a pipeline of the first $lookup, then $match out empty user_info. The do the same with providers.

After thinking about it, the alternative way is easier to implement from the what you already have. See this playground modified from yours.

steevej · June 26, 2024, 12:21pm

If I may add. A model that is normalized a-la SQL and which does not analyze the use-case will most likely caused performance issues.

In your case I would ask:

Do I allow duplicate user names? Probably not.

Do I allow duplicate provider names? Most likely not.

Is changing a user name a frequent use case? Hardly,

Is changing a provider name a frequent use case? That would be confusing.

Is finding data with user name and/or provider name a frequent use-case? I would think so since a provider would like to see its data and ditto for a user.

So in your case I would:

Store the user name and provider name directly in my data keeping or not the _id. Your find data use-case above would become trivial and an order of magnitude faster as it would not invoice $lookup. Yes changing a user name and provider name would be complex and slow. But I prefer simple and fast frequent use-case at the expense of complex and slow rare use-case.

steevej · July 5, 2024, 10:37pm

@IATN, thanks for appreciating the feedback you got.