Collection Modelling for user segmentation

Hi,

I’m fairly new to MongoDB and what it can do, so I’d like some help with a data-modelling problem I’m having.
I’m building a customer data platform where you can segment users based on different user attributes and then push those users to Facebook Audiences / Google Audiences, etc.

Explanation of what I currently have:

Let’s assume we have the following collections with their schemas:

Customers:

_id: ObjectId
first_name: string
last_name: string
email: string
gender: 'male' | 'female'
source_id: ObjectId (ref customer sources)
date_added: Date

Customer sources (the sources where the customers come from):

_id: ObjectId
name: string
config: {
   // different configs about where to pull data from and how to do it
}

Segments config:

_id: ObjectId
config: {
   // attributes for filtering Customers
}

Segments (an aggregation of customers) - using Bucket/Outlier Pattern:

_id: ObjectId
users: Array<ObjectId>
segment_config_id: ObjectId
version: number
count: number
hasExtras: boolean

So basically, I have a customer source that adds customers to the Customers collection. I can then create a segment config based on different attributes. For example:

I want to create a segment with the following attributes:
gender: male
date_added LTE last 7 days
customer source id 123122321321

So that will create a segment config:

_id: 12345
config: {
   customer_source_id: ObjectId(123122321321)
   gender: male
   date_added: {$lte: 7 days}
}
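
A minimal sketch of how a stored config like this might be turned into a query filter on Customers. The helper name `buildCustomerFilter` and the `{ op, days }` shape for the relative date are assumptions made for illustration, and a plain string stands in for the ObjectId. The sketch keeps whichever comparison operator the config stores; whether "last 7 days" should be `$lte` or `$gte` against the cutoff depends on the intended meaning.

```javascript
// Hypothetical helper: turns a stored segment config into a MongoDB filter.
// The config shape ({ op, days } for relative dates) is an assumption.
const DAY_MS = 24 * 60 * 60 * 1000;

function buildCustomerFilter(config, now = new Date()) {
  const filter = {};
  if (config.customer_source_id !== undefined) {
    // Customers store the reference in `source_id` (see the Customers schema).
    filter.source_id = config.customer_source_id;
  }
  if (config.gender !== undefined) {
    filter.gender = config.gender;
  }
  if (config.date_added !== undefined) {
    const { op, days } = config.date_added;
    // Resolve the relative "N days" into an absolute cutoff at evaluation time.
    const cutoff = new Date(now.getTime() - days * DAY_MS);
    filter.date_added = { [op]: cutoff };
  }
  return filter;
}

const exampleFilter = buildCustomerFilter(
  {
    customer_source_id: '123122321321', // a string stands in for an ObjectId
    gender: 'male',
    date_added: { op: '$lte', days: 7 },
  },
  new Date('2024-01-08T00:00:00Z')
);
```

Because the cutoff is computed each time the filter is built, the cron would rebuild the filter on every run rather than store an absolute date.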

and a Segment (bucket pattern):

_id: ObjectId(random)
users: [ObjectId(userId1day1), ObjectId(userId2day1), ..., ObjectId(userId450day1)]
segment_config_id: ObjectId(12345)
version: 1
count: 450
hasExtras: false

Now let’s say that the next day my customer source adds 2000 more users to the Customers collection, and the first 200 users in the segment no longer match the segment config.
A cron job checks whether a segment needs to be updated, so my Segment becomes:

_id: ObjectId(random)
users: [ObjectId(userId201day1), ..., ObjectId(userId450day1), ObjectId(userId1day2), ..., ObjectId(userId750day2)]
segment_config_id: ObjectId(12345)
version: 2
count: 1000
hasExtras: true

_id: ObjectId(9809)
users: [ObjectId(userId751day2), ObjectId(userId752day2), ..., ObjectId(userId1750day2)]
segment_config_id: ObjectId(12345)
version: 2
count: 1000
hasExtras: true

_id: ObjectId(9810)
users: [ObjectId(userId1751day2), ObjectId(userId1752day2), ..., ObjectId(userId2000day2)]
segment_config_id: ObjectId(12345)
version: 2
count: 250
hasExtras: false

Basically, I’ve created a new segment version. This will happen over and over again, because each time the cron re-evaluates a segment config it rewrites the segments I have created.
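
The bucketing step described above could be sketched roughly as follows. This is an illustration under assumptions, not the exact job: the function name `buildSegmentBuckets` is hypothetical, plain strings stand in for ObjectId values, and `_id` is left out so that the database would generate it on insert.

```javascript
// Sketch: split the matching customer ids into bucket documents holding at
// most `bucketSize` ids each. Plain strings stand in for ObjectId values.
function buildSegmentBuckets(userIds, segmentConfigId, version, bucketSize = 1000) {
  const buckets = [];
  for (let start = 0; start < userIds.length; start += bucketSize) {
    const users = userIds.slice(start, start + bucketSize);
    buckets.push({
      users,
      segment_config_id: segmentConfigId,
      version,
      count: users.length,
      // true when another bucket of the same version follows this one
      hasExtras: start + bucketSize < userIds.length,
    });
  }
  return buckets;
}

// 2250 matching ids produce three buckets with counts 1000, 1000 and 250.
const ids = Array.from({ length: 2250 }, (_, i) => `user${i + 1}`);
const buckets = buildSegmentBuckets(ids, '12345', 2);
```

Only the last bucket of a version has `hasExtras: false`, which is what lets a reader stop paginating.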

Later, when I need to deploy a segment to Facebook Audiences, for example, I run an aggregation pipeline like the one below, fetching one bucket per page for as long as hasExtras is true:

[{
    $match: {
        segment_config_id: ObjectId('12345'),
        version: 2
    }
}, {
    $skip: 0
}, {
    $limit: 1
}, {
    $lookup: {
        from: 'customers',
        localField: 'users',
        foreignField: '_id',
        as: 'users'
    }
}]

This outputs a Segment with the customer details embedded, which is then pushed to Facebook Audiences:

_id: ObjectId(random)
users: [{_id: ObjectId(), first_name: 'Test', last_name: 'Test', email: 'test@mongodb.com'}, ... etc],
segment_config_id: ObjectId(12345)
version: 2
count: 1000,
hasExtras: true
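
The page-by-page read could be sketched like this. Two things here are assumptions rather than part of the pipeline above: the `$sort` stage, added so that `$skip` walks the buckets in a stable order, and the `runPipeline` callback, a hypothetical function supplied by the caller that executes a pipeline and resolves to an array of documents.

```javascript
// Sketch: build the pipeline for one page (one bucket) of a segment version.
// The $sort stage is an assumed addition so that $skip sees a stable order.
function buildSegmentPagePipeline(segmentConfigId, version, page) {
  return [
    { $match: { segment_config_id: segmentConfigId, version } },
    { $sort: { _id: 1 } },
    { $skip: page },
    { $limit: 1 },
    {
      $lookup: {
        from: 'customers',
        localField: 'users',
        foreignField: '_id',
        as: 'users',
      },
    },
  ];
}

// Walk the pages until a bucket reports hasExtras === false.
// `runPipeline` is a hypothetical callback, for example:
//   (pipeline) => db.collection('segments').aggregate(pipeline).toArray()
// where the collection name 'segments' is also an assumption.
async function* iterateSegmentPages(runPipeline, segmentConfigId, version) {
  for (let page = 0; ; page += 1) {
    const [bucket] = await runPipeline(
      buildSegmentPagePipeline(segmentConfigId, version, page)
    );
    if (!bucket) return; // no bucket at this position: nothing more to read
    yield bucket;
    if (!bucket.hasExtras) return;
  }
}
```

A caller would consume it with `for await (const bucket of iterateSegmentPages(...))` and push each bucket's users before requesting the next page.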

You may ask why the version key is needed. I also want to tell Facebook Audiences which users were added to a segment and which are no longer part of it, so in NodeJS I load the current and previous versions of the audience into Sets and take the difference in both directions, which gives me the added and the removed users.
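
A minimal sketch of that Set-based difference, assuming the ids of both versions have already been collected into arrays. The function name `diffSegmentVersions` is hypothetical. Ids are converted to strings first, because two ObjectId instances holding the same value are distinct objects and would not match inside a Set.

```javascript
// Sketch: compare the membership of two versions of a segment.
// Ids are compared as strings so that equal ids match inside the Sets.
function diffSegmentVersions(previousIds, currentIds) {
  const previous = new Set(previousIds.map(String));
  const current = new Set(currentIds.map(String));
  return {
    // in the current version but not in the previous one
    added: [...current].filter((id) => !previous.has(id)),
    // in the previous version but not in the current one
    removed: [...previous].filter((id) => !current.has(id)),
  };
}

const { added, removed } = diffSegmentVersions(['u1', 'u2', 'u3'], ['u3', 'u4']);
// added   -> ['u4']
// removed -> ['u1', 'u2']
```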

What I’m trying to find out:

  • Is it a good idea to keep the customer references bucketed and use a pipeline to fetch the user details? I’ve tested it with 1 million test users and the performance is good-ish (7 seconds to iterate over all pages one by one and get the full data). I’m fetching only one document at a time because I don’t want to run into Mongo’s limits.
  • Should I store the segments a user belongs to in the Customers collection? That would mean adding a segments key to each customer document, and that key would have to be bucketed as well if a user is part of more than 1000 segments.
  • Another idea was to store the customers belonging to a segment in a separate collection, but that would create a data duplication problem (storing millions or billions of duplicated users when you actually have fewer than 1 million is a big problem).
  • Is there anything you would change on this?

My main concern is how I’m storing the segments and their versions, and I wonder if there is a better pattern that I’ve missed or wasn’t aware of.