We place a tremendous value on community engagement at MongoDB, specifically around working with you on your toughest technical problems. Even in the early days of the project, MongoDB’s co-founders Dwight and Eliot were committed to the success of all users, and we want to continue that spirit as the community grows.
Today we are pleased to announce the William Zola Award for Community Excellence to honor those whose support contributions make a significant difference to users around the globe.
It’s not easy to be a champion at community support, and yet, at MongoDB we see it every day, from community members who provide advice on the MongoDB User Forum, to our own engineers who diagnose and debug problems for users around the globe.
One of our strongest community support advocates was William Zola, who passed away unexpectedly this past summer. William, a Lead Technical Services Engineer, had a passion for creating user success and helped thousands of users with support problems, much of it on his own time. He was so effective at meeting users in their time of distress that people often asked for him by name on the MongoDB User Forum. Most engineers at MongoDB went through his customer skills training to learn how to create an ideal user experience while maintaining technical integrity. William taught us:
- How the user feels is every bit as important as solving their technical problem
- We should work to solve the problem and not just close a case or ticket
- Every user interaction should drive the case one step closer to resolution
- It’s all about the user
Over time, William’s advice and philosophy towards user success came to permeate MongoDB’s entire organization and community.
This December at MongoDB SF we will announce the first winner of what we will call “The Zola”, awarded to a user who has offered exceptional support to our community in line with William’s philosophy. The winner of the Zola will receive a complimentary trip to San Francisco to receive the award at the event along with a $1,000 Amazon Gift Card and an award plaque.
Today we open nominations and begin the search for the first winner of the Zola. MongoDB users who support others on StackOverflow, the MongoDB User Forum, and in person through ad-hoc or structured mentoring are all qualified to receive the award.
Nominations will be accepted until November 10 through this form, so please send in names of people who have positively impacted your experience with MongoDB during your time as a user. Individuals will be judged on the impact of their work and their demonstration of William’s values.
William’s extraordinary contributions are remembered in users like you who pass along your knowledge of MongoDB and do it with gusto. Even if you do not qualify for the Zola now, there is always an opportunity for you to contribute to the MongoDB ecosystem by sharing your ideas and experience on StackOverflow, the MongoDB User Forum and in your local communities.
Tell us who should receive the first “Zola.”
MongoDB employees are not eligible for the “Zola.”
Sharding Pitfalls Part II: Running a Sharded Cluster
By Adam Comerford, Senior Solutions Engineer

In Part I we discussed important considerations when picking a shard key. In this post we will go through some recommendations for running a sharded cluster at scale. Scalability is one of the core benefits of sharding in MongoDB, but it can give you a false sense of security: even with that flexibility, you still have to make smart decisions about how and when you deploy resources. Here we cover a couple of common mistakes that people make when running a sharded cluster.

3. Waiting too long to add a new shard (overloaded)

You sharded your database and scaled horizontally for a reason, perhaps to add more memory or disk capacity. Whatever the reason, if your application usage grows over time, so (generally) does your database utilization. Eventually, your sharded cluster will pass a certain point, let’s call it 80% utilized (as a nice round estimate), beyond which it becomes problematic to add another shard. Why? Adding a new shard to a cluster is not free, and it is not instantaneous. It consumes resources and (initially) accepts very little traffic. Essentially, at the start of its existence, a newly added shard costs you capacity instead of adding capacity. How long it stays in this state depends on the balancer and on how long it takes for a significant portion of the “busy/active” chunks to move onto the new shard.

It is often easier to visualize this process, so let’s make up some hypothetical numbers and set the bar relatively low. Our imaginary existing cluster is a set of 2 shards with 2000 chunks (500 of them considered “active”), to which we need to add a 3rd shard. This 3rd shard will eventually store one third of the active chunks (and of the total chunks). The question is: when does this shard stop adding overhead and instead become an asset?
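That question can be made concrete with a small back-of-the-envelope model. This is a sketch only: the migration rate and the assumption that the balancer picks chunks without regard to how busy they are are invented for illustration, chosen to match the hypothetical cluster above (2000 chunks, 500 active).

```python
# Back-of-the-envelope model: when does a newly added shard become a net gain?
# Assumed numbers match the hypothetical cluster above: 2000 chunks total,
# 500 of them "active", and an invented migration rate of 1 chunk per minute.

def minutes_until_net_gain(active_chunks=500,
                           net_gain_fraction=0.05,
                           active_chunk_share=0.25,
                           chunks_per_minute=1.0):
    """Estimate minutes until `net_gain_fraction` of the active chunks
    have landed on the new shard.

    active_chunk_share: fraction of migrated chunks that happen to be
    "active" ones (here 500 active out of 2000 total = 0.25), assuming
    the balancer does not distinguish busy chunks from idle ones.
    """
    chunks_needed = active_chunks * net_gain_fraction       # 25 active chunks
    migrations_needed = chunks_needed / active_chunk_share  # ~100 total migrations
    return migrations_needed / chunks_per_minute            # minutes of balancing

print(round(minutes_until_net_gain()))  # 100 minutes with these invented numbers
```

With these made-up inputs the new shard needs roughly 100 minutes of balancing before it crosses the net-gain threshold, which is in line with the “almost 2 hours” scenario discussed below; on an already overloaded cluster the migration rate drops and the wait grows.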
In reality, this will vary from cluster to cluster and depends on many variables; in other words, you need good metrics about your cluster, particularly about your load bottleneck. So let us once again use our imaginations and set a relatively low bar: in our imaginary system, we have evaluated our load levels and the expected impact of migrations, and determined that once 5% of the active chunks (that is, the chunks seeing the most traffic) have migrated to the new shard, the shard becomes a net gain for the overall system. Once all chunks have been balanced, the migration overhead disappears, but initially it is an expected trade-off.

[Chart: time for new shards to reach net positive contribution in the cluster; the dotted line marks the net-gain threshold.]

In this fabricated example, it takes almost 2 hours for the new shard to attain a viable share of active chunks and be considered a net gain for the overall system. Although fictional, these numbers are based on setups we have seen in real systems with moderate load. From there it is easy to imagine the migrations taking even longer on an overloaded set of shards, and the newly added shard taking far longer to cross the threshold. It is therefore best to be proactive and add capacity before it becomes a necessity.

Possible Mitigation Strategies
- Manually move targeted “hot” chunks (chunks that are being accessed more than others) to shift activity to the new shard more quickly
- Add the shard at a low-traffic time so that there is less competition for resources
- Disable balancing on some collections and prioritise balancing the busy collections first

4. Under-provisioning Config Servers

Provisioning enough resources without being wasteful is always tricky, and all the more so in a complicated distributed system like a MongoDB sharded cluster. Everyone wants to use their hardware, virtual instances, virtual machines, containers and the like as efficiently as possible and get the best bang for their buck. Hence it is only natural to look across the pieces of a distributed cluster for lightly utilized components that could be put on less expensive resources.

The most common pitfall here with MongoDB is the config servers, which are often neglected when stress testing a cluster. In testing environments and smaller deployments (unless specific measures are taken to stress them) they are relatively lightly loaded, so they are usually identified as candidates for lesser instances or hardware. The problem is that they are critical pieces of infrastructure. They may not be heavily loaded all the time, but when they do see load and struggle to service requests, that can impact all queries (reads, writes, authentication) and add latency to every request made of the cluster in question.

In particular, the first config server in the list supplied to your mongos processes is vital. This is the config server that all mongos processes will default to read from when fetching or refreshing their view of the data distribution in your cluster. Similarly, it is the server that will be hit when attempting to authenticate a user. If it is under-provisioned and cannot service queries, or if it has networking problems (packet loss, congestion), the effects will be significant.
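As a reminder of where that ordering comes from, the config server list is the one passed to each mongos at startup. The hostnames below are hypothetical, and this shows the legacy three-member (SCCC-style) list in use at the time of writing:

```shell
# Start a mongos with its three config servers. The FIRST entry
# (cfg1.example.net) is the one mongos will prefer for metadata reads
# and authentication lookups, so provision that host accordingly.
mongos --configdb cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019 \
       --port 27017
```

If the first-listed host is the weakest machine in the set, every mongos in the cluster inherits that weakness.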
Possible Mitigation Strategies
- Ensure the config servers are load tested and slightly over-provisioned (the first config server in particular)
- If using virtual machines or cloud-based instances, investigate increasing the available resources
- Turning off the balancer and disabling chunk splitting will reduce read traffic to the config servers (no migrations, no metadata refreshes), but this is only a temporary fix unless you have a perfect write distribution, and it may not eliminate issues completely

5. Using the count() command on sharded collections

This pitfall is very common, and it seems to strike somewhat randomly regardless of how long someone has been running a sharded environment. At some point, a question will arise along the lines of: “How are we tracking/verifying/checking how many documents we have in each collection on each shard, how balanced are they, and do they agree with <some other system that holds the same data>?” Hopefully no one actually phrases questions this way in your organization, but you get the basic idea. The most obvious quick check is to count the documents and see whether the numbers make sense and/or agree with counts elsewhere. That thinking naturally leads people to the count command, which they then use to gather figures for their documents and collections. Unfortunately, on a busy, mature sharded cluster, the results will very rarely be what is expected. The reason is that the count command as implemented today has several optimizations in place to make it faster to run in general, and those optimizations bypass a key piece of the sharding functionality needed to return accurate results in this case. This is a known bug, tracked in SERVER-3645, but that does not stop people from consistently hitting the issue.
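The effect can be illustrated with a toy model in pure Python (no MongoDB required; the chunk ranges and documents are invented). It contrasts a count that simply sums each shard's local total with a query-style count that filters each shard's documents against the chunk ranges that shard actually owns:

```python
# Toy model of the SERVER-3645 behaviour: a sharded count() sums each shard's
# raw local total, orphans included, while a regular query is filtered against
# the chunk ranges each shard owns before results are returned.

# Hypothetical chunk ownership on shard key "x":
owned_ranges = {"shard0": (0, 50), "shard1": (50, 100)}

# Documents physically present on each shard; {"x": 10} on shard1 is an
# orphan left behind by an earlier chunk migration.
shard_docs = {
    "shard0": [{"x": 5}, {"x": 10}, {"x": 42}],
    "shard1": [{"x": 10}, {"x": 60}, {"x": 99}],
}

def raw_count():
    # What the optimized count command does: sum each shard's local total.
    return sum(len(docs) for docs in shard_docs.values())

def filtered_count():
    # What a regular query yields: only documents in ranges the shard owns.
    total = 0
    for shard, docs in shard_docs.items():
        lo, hi = owned_ranges[shard]
        total += sum(1 for d in docs if lo <= d["x"] < hi)
    return total

print(raw_count())       # 6 -- over-counts the orphan
print(filtered_count())  # 5 -- the accurate figure
```

The raw total counts the orphaned document twice (once legitimately on shard0 and once as leftover data on shard1), which is exactly the discrepancy people observe on busy clusters.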
The nature of the issue means that count will report documents in its results that it should not, for example:
- Documents that are being deleted as part of a chunk migration
- Documents that have been left behind by previous chunk migrations (also known as orphans)
- Documents currently being copied as part of an in-flight chunk migration

A regular query (rather than a count) will have its results filtered by the respective primary and does not suffer from the same problem. Hence, if you were to manually count the results of a query client-side, you would get an accurate result. This quirk of sharded environments will eventually be fixed, but for now it will inevitably crop up from time to time in any active sharded cluster used by a large team.

Possible Mitigation Strategies
- Do counts on the client side, or use targeted, range-based queries (with a primary read preference) to count instead
- Use cleanUpOrphaned and disable the balancer (making sure it has finished its current round) when performing counts across the cluster

If you want to learn more about managing MongoDB deployments at scale, sign up for my online education course, MongoDB Advanced Deployment and Operations.

Planning for scale? No problem: MongoDB is here to help. Get a preview of what it’s like to work with MongoDB’s Technical Services Team. Give us some details on your deployment and we can set you up with an expert who can provide detailed guidance on all aspects of scaling with MongoDB, based on our experience with hundreds of deployments.

Read the whole series: Part I | Part III
Flowhub Relies on MongoDB to Meet Changing Regulations and Scale Its Business