hi all,
we store some 20M documents per day, documents have an average size of 3.5KB , storage in MongoDB accounts for some 20GB/day , indexsize is approx 21GB/day
To keep the sizes of our indexes reasonable (working set size in RAM, remember) we create collections per week of data.
We are not sure how we should define our sharding-key properly, for the following 2 requirements:
- store this data evenly on our shards (we have 108 shards by now)
- be able to run queries very fast, using selection criteria on one or more of some 5 crucial search-fields (mostly 7 to 12 digits, some have alphanumerics as well)
We had a sharding-key based on a single numeric field in our data, a field that is taken from a timestamp in the data, and of which we use the seconds of each hour in the data-timestamp combined.
This yields 3600 possible seconds, so we pre-split our new (empty) collections into 3600 chunks, then move these chunks manually over 108 shards (=33 chunks per shard). This is done prior to any data loading because otherwise the chunk movements go terribly slow.
With this sharding-key, our data gets very evenly distributed across all shards and chunks.
But our users can launch queries with criteria for these 5 fields, for a specific date or for a date-range. We don’t know which of these 5 fields they are filtering on, it could be one, two, or more of the fields. They very seldomly would select within a very short timerange that would translate (via the sharding-key) to very few chunks. So typically ALL chunks on ALL shards get queried each time, which is obviously far from optimal:
While data is being loaded into the current’s week collection, its indexes are present in RAM/cache. So queries on this week’s data go terribly fast (sub-second responses). That’s the good part.
But when we launch a query of another (thus older) week, the indexes of that other week’s collection need to be loaded into RAM/cache first. And this needs to be done on ALL shards. If data loading happens to be busy at the same time for this week’s collection, then not all indexes fit in RAM at the same time. It happens that the queries take 10-18mins in such conditions !
So we need to reduce the number of shards involved in these typical random queries, and the number of chunks involved.
I understood that, in order to help MongoDB identify in which limited number of chunks it should search for some query, we must include some selection-criterium also for the shard key values.
But we don’t know for which of the typical 5 selection fields a user would enter criteria.
So we don’t know how we possibly might be able to compute a selection-criterium for a sharding key.
Any hints/tips are most welcome !
Rob