Choosing a composite shard key involving an identifier and a date field

Hello, Folks!

We have a collection with two important fields: userid and a date (the time at which the document was created/inserted).

Use cases / system behaviour

  1. Many users create many documents over time, e.g. on the order of 500k docs over a 2-year date range.
  2. Some users are relatively less active, generating only about 200-300 docs over their lifetime in our system.

Goals

  1. Distribute the docs generated by a user across the shards.
  2. Get the most recent docs created by a given user. This is a ranged query on the date field mentioned above (see the sketch below).
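
As a rough sketch of what that ranged read looks like (assuming the collection and field names used later in this thread, and a hypothetical userid value):

db.CompoundHashedShardKeyTest.find({
  userid: 42,  // hypothetical user
  created_at: { $gte: ISODate("2022-06-01"), $lt: ISODate("2022-07-01") }
}).sort({ created_at: -1 }).limit(20)  // most recent docs first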

Shard key choices

  1. userid (hashed): this helps target queries to exactly one shard, but that shard becomes hotter, probably skewing the write distribution.
  2. userid (hashed), date (ranged): the expectation is that docs created by a user within a certain time range (an hour, day, week, or month) are grouped into a chunk, while other chunks reside on other shards (both options are sketched below).
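
For reference, the two candidate keys would be declared roughly as follows (a sketch; the database name mydb is an assumption, and compound hashed shard keys require MongoDB 4.4+):

// Choice 1: hashed shard key on userid only
sh.shardCollection("mydb.CompoundHashedShardKeyTest", { userid: "hashed" })

// Choice 2: compound shard key - hashed userid prefix, ranged date suffix
sh.shardCollection("mydb.CompoundHashedShardKeyTest", { userid: "hashed", created_at: 1 })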

Problem
We’re inserting around 10k docs, each having the same userid but a different date (each one hour later than the previous doc). All 10k docs end up on only one shard, even with the composite shard key mentioned above.
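
The test load looks roughly like this (a sketch of what is described above, not the exact script; the start timestamp is arbitrary):

// 10,000 documents, a single userid, created_at values one hour apart
var start = ISODate("2022-01-01T00:00:00Z").getTime();
for (var k = 0; k < 10000; k++) {
  db.CompoundHashedShardKeyTest.insert({ userid: 1, created_at: new Date(start + k * 3600 * 1000) });
}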

Shouldn’t MongoDB split the chunks and distribute the data across both shards? As per the MongoDB blog on-selecting-a-shard-key-for-mongodb, consecutive docs should reside on the same shard while other docs may reside on other shards. Please clarify what’s going wrong in our test.

if “Asya” is a prolific poster, and there are hundreds of chunks with her postings in them, then the {time_posted:1} portion of the shard key will keep consecutive postings together on the same shard

Setup details
We are running a 2-shard MongoDB v5.0.13 cluster on-premises. More details are posted in this Stack Overflow post.

Please ask if there is any other question/detail you would like to see. Thank you so much!

Trying to revive this thread.

Sorry for tagging, but I just wanted to get some visibility on this. Any info would be very helpful.
@Aasawari @Stennie

Hi @A_S_Gowri_Sankar,

We’re inserting around 10k docs, each having the same userid but a different date (each one hour later than the previous doc). All 10k docs end up on only one shard, even with the composite shard key mentioned above.

Based on those two shard keys and your test data, the behaviour you mention appears to be expected: all 10,000 documents having the same 'userid' value would belong to the same shard, because the hash function produces the same value for the same 'userid'. Just to clarify, are you seeing these 10K documents on one shard or in one chunk? Lastly, could you advise whether you set a custom chunk size?
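
You can see this in the shell with the convertShardKeyToHashed() helper: the hash depends only on the field value, so every document with the same userid maps to the same point in the hashed key space (the userid values below are just examples):

convertShardKeyToHashed(1)   // returns some NumberLong value
convertShardKeyToHashed(1)   // identical result - same userid, same hash
convertShardKeyToHashed(2)   // a different userid maps elsewhere in the key space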

Shouldn’t MongoDB split the chunks and distribute the data across both shards?

Normally, MongoDB splits a chunk after an insert if the chunk exceeds the maximum chunk size. The balancer monitors the number of chunks per shard and migrates chunks accordingly, attempting to keep the number of chunks relatively even amongst the shards. More details are noted in the migration thresholds documentation for sharded clusters.
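
If you'd like to double-check the configured chunk size and balancer state, something along these lines (run against the mongos) should do it; a missing chunksize document means the 64 MB default is in effect:

db.getSiblingDB("config").settings.findOne({ _id: "chunksize" })  // chunk size in MB, null = 64 MB default
sh.getBalancerState()  // true if the balancer is enabled
sh.status()            // per-collection chunk distribution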

As per the MongoDB blog on-selecting-a-shard-key-for-mongodb, consecutive docs should reside on the same shard while other docs may reside on other shards. Please clarify what’s going wrong in our test.

The example on that blog page doesn’t appear to use hashed sharding, whereas the two shard keys you’ve provided both contain a hashed field, which I presume you based your testing on.

In this particular scenario, I would advise testing with a more relevant set of data, closer to what you are getting in your production environment or even to what you have described (multiple users, some creating more documents than others, rather than just 10k documents from a single user).

Regards,
Jason


Hello @Jason_Tran, thank you for the response!

No, we use the default chunk size of 64 MB.

All of these are in the same chunk on the same shard, as seen below:

db.CompoundHashedShardKeyTest.getShardDistribution()
Shard mongos-01-shard-02 at mongos-01-shard-02/mongos-01-shard-02-01.company.net:27017,mongos-01-shard-02-02.company.net:27017
{
  data: '1.1MiB',
  docs: 20000,
  chunks: 1,
  'estimated data per chunk': '1.1MiB',
  'estimated docs per chunk': 20000
}
---
Totals
{
  data: '1.1MiB',
  docs: 20000,
  chunks: 1,
  'Shard mongos-01-shard-02': [
    '100 % data',
    '100 % docs in cluster',
    '58B avg obj size on shard'
  ]
}

That’s correct. After reading the blog, I initially used range sharding on both of these fields, and the distribution wasn’t good either; in fact, it was skewed towards one shard no matter how many unique userid and date combinations were inserted into the collection. That’s when I leaned towards hashing the userid field while keeping range sharding on the date field.

Sure. That test was merely done to understand how documents belonging to one user get distributed among the shards.

I have now inserted data corresponding to the example I mentioned: 100 users in total, each inserting 100 documents, except every 10th user, who inserts only 10 records. So this test setup simulates 90 users with 100 docs each, while the remaining 10 users have 10 docs each.

I’m attaching the test script and results from two scenarios: the first with a shard key with no hashing on either field, and the second with hashing on the userid field and ranged sharding on the date field.
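
For clarity, scenario 1 corresponds to a ranged shard key on both fields, roughly like this (a sketch; the database name mydb is an assumption):

sh.shardCollection("mydb.CompoundHashedShardKeyTest", { userid: 1, created_at: 1 })

Scenario 2 uses the compound hashed key sketched earlier, { userid: "hashed", created_at: 1 }.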

My observations are

  1. Range sharding doesn’t seem to behave as described in the blog.
  2. Hashed sharding is definitely better when we have a good number of unique userid values, but it’s not perfect (a 58/42 split).

I just want to solve this problem: distributing documents belonging to one userid with different date values across shards, to achieve better distribution and read locality when targeting documents within a given input date range.

Please let me know if you need more info. Thank you very much for helping! Much appreciated :slight_smile:

Test data script

// Insert 100 docs for each of 100 users; every 10th user gets only 10 docs.
for (var i = 1; i <= 100; i++) {
  inner:
  for (var j = 1; j <= 100; j++) {
    // Dates start at 2022-01-01T00:00:00Z (epoch ms 1640995200000) and advance one hour per document.
    var date = new Date(1640995200000 + j * 1000 * 60 * 60);
    db.CompoundHashedShardKeyTest.insert({ "userid": i, "created_at": date });
    if (i % 10 == 0 && j >= 10) {
      print("Inserted " + j + " records for user " + i);
      break inner;
    }
    print(i + "-" + date);
  }
}

Shard distribution with ranged shard key on both fields as mentioned in the blog

db.CompoundHashedShardKeyTest.getShardDistribution()
Shard mongos-01-shard-02 at mongos-01-shard-02/mongos-01-shard-02-01.company.net:27017,mongos-01-shard-02-02.company.net:27017
{
  data: '479KiB',
  docs: 9100,
  chunks: 1,
  'estimated data per chunk': '479KiB',
  'estimated docs per chunk': 9100
}
---
Totals
{
  data: '479KiB',
  docs: 9100,
  chunks: 1,
  'Shard mongos-01-shard-02': [
    '100 % data',
    '100 % docs in cluster',
    '54B avg obj size on shard'
  ]
}

Shard distribution with hashed sharding on userid and range sharding on date

db.CompoundHashedShardKeyTest.getShardDistribution()
Shard mongos-01-shard-01 at mongos-01-shard-01/mongos-01-shard-01-01.company.net:27017,mongos-01-shard-01-02.company.net:27017
{
  data: '203KiB',
  docs: 3860,
  chunks: 2,
  'estimated data per chunk': '101KiB',
  'estimated docs per chunk': 1930
}
---
Shard mongos-01-shard-02 at mongos-01-shard-02/mongos-01-shard-02-01.company.net:27017,mongos-01-shard-02-02.company.net:27017
{
  data: '276KiB',
  docs: 5240,
  chunks: 2,
  'estimated data per chunk': '138KiB',
  'estimated docs per chunk': 2620
}
---
Totals
{
  data: '479KiB',
  docs: 9100,
  chunks: 4,
  'Shard mongos-01-shard-01': [
    '42.41 % data',
    '42.41 % docs in cluster',
    '54B avg obj size on shard'
  ],
  'Shard mongos-01-shard-02': [
    '57.58 % data',
    '57.58 % docs in cluster',
    '54B avg obj size on shard'
  ]
}

@Jason_Tran, just tagging you again in case you missed this one. Thank you!