Need advice on a sharding key

hi all,

we store some 20M documents per day, with an average document size of 3.5KB; storage in MongoDB accounts for some 20GB/day, and index size is approx 21GB/day.
To keep the size of our indexes reasonable (the working set has to fit in RAM, remember), we create one collection per week of data.

We are not sure how to define our sharding key properly to satisfy the following 2 requirements:

  • store this data evenly across our shards (we currently have 108 shards)
  • be able to run queries very fast, using selection criteria on one or more of some 5 crucial search fields (values are mostly 7 to 12 digits, some alphanumeric as well)

Until now we had a sharding key based on a single numeric field derived from the timestamp in each document: the minute and second of the hour combined, i.e. the second within the hour.
This yields 3600 possible values, so we pre-split each new (empty) collection into 3600 chunks and then move these chunks manually across our 108 shards (≈33 chunks per shard). This is done before any data is loaded, because otherwise the chunk migrations are terribly slow.
With this sharding key, our data gets very evenly distributed across all shards and chunks.
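For reference, the pre-split roughly looks like this (a minimal pymongo sketch run against a mongos; the namespace, the shard key field name "secondOfHour" and the shard names are just placeholders for illustration, and I assume sharding is already enabled on the database):

```python
from pymongo import MongoClient
from pymongo.errors import OperationFailure

client = MongoClient("mongodb://our-mongos:27017")   # connect to a mongos
admin = client.admin
ns = "mydb.docs_week07"                               # placeholder namespace

# Shard the still-empty weekly collection on the second-of-hour field.
admin.command("shardCollection", ns, key={"secondOfHour": 1})

# Pre-split into 3600 chunks: one split point per possible value 1..3599.
for s in range(1, 3600):
    admin.command("split", ns, middle={"secondOfHour": s})

# Spread the chunks round-robin over the 108 shards (about 33 per shard).
for s in range(3600):
    target = "shard%04d" % (s % 108)                  # placeholder shard names
    try:
        admin.command("moveChunk", ns, find={"secondOfHour": s}, to=target)
    except OperationFailure:
        pass  # chunk already lives on the target shard
```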

But our users can launch queries with criteria on these 5 fields, for a specific date or for a date range. We don’t know which of these 5 fields they will filter on; it could be one, two, or more of them. They very seldom select within a time range short enough to translate (via the sharding key) into only a few chunks. So typically ALL chunks on ALL shards get queried each time, which is obviously far from optimal:

While data is being loaded into the current week’s collection, its indexes are present in RAM/cache. So queries on this week’s data run very fast (sub-second responses). That’s the good part.

But when we launch a query against another (i.e. older) week, the indexes of that week’s collection first need to be loaded into RAM/cache, and this has to happen on ALL shards. If data loading for the current week’s collection is busy at the same time, not all indexes fit in RAM together. Under these conditions queries can take 10-18 minutes!

So we need to reduce the number of shards and the number of chunks involved in these typical random queries.

I understand that, in order to help MongoDB identify the limited number of chunks it should search for a given query, the query must also include selection criteria on the shard key values.

But we don’t know which of the 5 typical selection fields a user will enter criteria for.
So we don’t see how we could possibly compute a selection criterion on the sharding key.

Any hints/tips are most welcome!
Rob

An example:

if my 5 selection fields are e.g. named “A”, “B”, “C”, “D”, “E”,
then I might have a document with “A=ab1c”, “B=d2ef”, “C=ghab”, “D=12389”, “E=526_ab”

Now users might run queries with selection values (criteria) for e.g. only field “A”, or for both fields “B” and “C”, or for only field “E”, or for a partial value of field “B” (like “*2ef”)

-> when I define the sharding key on all 5 fields, the shard key can be computed at insert time (when we know the values of all 5 fields) to place that document in some specific chunk
-> but when a user specifies query criteria for fewer than all 5 fields, we cannot compute the corresponding shard key value under which the original document was stored
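To make that concrete, here is a small pymongo sketch of what a compound sharding key on all 5 fields would look like (the namespace “mydb.docs” is just a placeholder; the field names are the ones from my example above):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://our-mongos:27017")
client.admin.command(
    "shardCollection", "mydb.docs",
    key={"A": 1, "B": 1, "C": 1, "D": 1, "E": 1},
)
docs = client.mydb.docs

# At insert time all 5 values are known, so mongos can compute the full
# shard key and route the document to exactly one chunk.
docs.insert_one({"A": "ab1c", "B": "d2ef", "C": "ghab", "D": "12389", "E": "526_ab"})

# A query that includes a prefix of the shard key ("A", or "A"+"B", ...)
# can be routed to a limited set of chunks:
targeted = docs.find({"A": "ab1c", "B": "d2ef"})

# But a query on only "E" (or a partial/regex match on "B") gives mongos
# nothing to route on, so it is broadcast to all shards (scatter-gather):
scattered = docs.find({"E": "526_ab"})
```

So such a compound key only helps when the user happens to filter on the leading field(s); for the other fields we are back to querying all shards.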

I found a web article suggesting the use of multiple sharded collections:

  • one for the original, entire documents, and
  • one for each filter field (cf. my example above: one collection for the values of field “A”, another for the values of field “B”, and so on); each document in such a collection holds the “_id” of the original full document.

The application then queries these sub-collections according to which filter fields the user has specified. Each such query returns a set of “_id” values.
The application intersects the “_id” values obtained from these sub-collections.
Then the application runs a second query against the ‘main’ collection where the full documents are stored, filtering on the “_id” values it got from the previous sub-queries.

Each such collection is fully sharded, so these sub-queries and the final query are expected to run very fast, compensating for the overhead of running multiple queries instead of a single one.
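If it helps the discussion, this is roughly how I picture that lookup flow (a pymongo sketch under my own assumptions: lookup collections named “by_A”..“by_E” with documents of the form {value, doc_id}; none of these names come from the article itself):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://our-mongos:27017")
db = client.mydb

def search(criteria):
    """criteria: dict mapping a filter field ("A".."E") to the value to match."""
    id_sets = []
    for field, value in criteria.items():
        lookup = db["by_" + field]   # sub-collection for that filter field
        ids = {d["doc_id"] for d in lookup.find({"value": value}, {"doc_id": 1})}
        id_sets.append(ids)

    # Intersect the "_id" sets returned by the per-field sub-queries.
    matching_ids = set.intersection(*id_sets) if id_sets else set()
    if not matching_ids:
        return []

    # Second query: fetch the full documents from the main collection.
    return list(db.main.find({"_id": {"$in": list(matching_ids)}}))

# Example: the user filters on "B" and "C" only.
results = search({"B": "d2ef", "C": "ghab"})
```

If the main collection is sharded on “_id” (or a hashed “_id”), the final $in query should only touch the shards that actually hold the matching documents, rather than all 108.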