Indeed has more than 25 million open jobs online at any one time, stores more than 225 million resumes on its systems, and serves 250 million unique users every month.
Indeed operates enterprise-wide global clusters in the cloud across multiple availability zones all around the world, including the United States, Asia-Pacific, Europe, and Australia. Indeed is also a MongoDB super user. About 50% of everything Indeed does is built on MongoDB. In a recent session at MongoDB World 2022, Indeed senior cloud database engineer Alex Leong shared real-world experiences of the performance issues that arise when replica sets span multiple data centers. He also covered how to identify these issues and, most importantly, how to fix them. This article provides highlights from Leong’s presentation, including how to deal with changes in sync sources, replication lag, and more.
Resilience and performance
Indeed maintains multiple data centers for resiliency. Having multiple data centers ensures there's no single point of failure and keeps data in close proximity to job seekers' locations. This approach facilitates faster response times and better overall end user experience.
Running multiple data centers can introduce other performance issues, however. One issue involves the initial sync of new nodes in the system, which needs to happen as quickly as possible to avoid returning stale data. Write concern is another critical consideration because it determines how many nodes must acknowledge a write before the write is treated as successful. If the primary node is interrupted and a secondary takes over, writes that had not yet replicated beyond the old primary can be rolled back when that node rejoins the replica set, so the write concern has to be strict enough to preserve the changes you care about across a failover.
Also, when you're running multiple data centers, a node's sync source can change without anyone noticing, and replication lag can build up when data centers are located far apart from each other.
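One way to make replication lag visible is to compare each secondary's last applied operation time with the primary's. The sketch below assumes a document shaped like the output of MongoDB's replSetGetStatus command (fields members, name, stateStr, and optimeDate); the host names and timestamps are made up for illustration and are not from Leong's session.

```python
from datetime import datetime

def replication_lag_seconds(status: dict) -> dict:
    """Return the lag, in seconds, of each SECONDARY behind the PRIMARY."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }

# Hand-written sample in the shape of replSetGetStatus output (hypothetical hosts).
sample_status = {
    "members": [
        {"name": "dc1-a:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2022, 6, 7, 12, 0, 45)},
        {"name": "dc2-a:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2022, 6, 7, 12, 0, 43)},
        {"name": "dc3-a:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2022, 6, 7, 12, 0, 0)},
    ]
}

lags = replication_lag_seconds(sample_status)
print(lags)  # {'dc2-a:27017': 2.0, 'dc3-a:27017': 45.0}
```

With a driver such as PyMongo, the status document would typically come from running replSetGetStatus against the admin database; that call is left out here so the example runs without a live cluster.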
Overriding sync sources
When you have an environment with hundreds of millions of users and enormous volumes of data spanning several geographic regions, spinning up and synchronizing a new node in a replica set creates logistical hurdles. To start, you have to decide where the new node syncs from. It seems logical that the default decision would be to sync with the nearest node. But, as Leong said in his session, at times you may not get the nearest sync source, and you may have to override the default sync source to choose the best one. This decision needs to be made early, Leong said, because doing so later means any progress you've made toward syncing the new node will have been wasted.
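As a rough illustration of overriding the sync source, the sketch below picks the reachable member with the lowest round-trip time and builds the document for MongoDB's replSetSyncFrom admin command. It assumes the status document was gathered on the new node itself, because the pingMs field in replSetGetStatus is measured from the node that ran the command. The helper names, hosts, and ping values are hypothetical, and lowest ping is only one possible definition of "best"; the article does not say how Leong chooses.

```python
from typing import Optional

def nearest_sync_source(status: dict, self_name: str) -> Optional[str]:
    """Pick the data-bearing member with the lowest ping, excluding this node."""
    candidates = [
        m for m in status["members"]
        if m["name"] != self_name
        and m["stateStr"] in ("PRIMARY", "SECONDARY")
        and "pingMs" in m
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m["pingMs"])["name"]

def sync_from_command(host: str) -> dict:
    """Build the admin command document that asks a node to sync from `host`."""
    return {"replSetSyncFrom": host}

# Hypothetical view from a new node in data center 2.
status_on_new_node = {
    "members": [
        {"name": "dc1-a:27017", "stateStr": "PRIMARY", "pingMs": 85},
        {"name": "dc2-a:27017", "stateStr": "SECONDARY", "pingMs": 2},
        {"name": "dc2-new:27017", "stateStr": "STARTUP2"},  # this node
        {"name": "dc3-a:27017", "stateStr": "SECONDARY", "pingMs": 140},
    ]
}

target = nearest_sync_source(status_on_new_node, "dc2-new:27017")
command = sync_from_command(target)
print(command)  # {'replSetSyncFrom': 'dc2-a:27017'}
```

The command document would be sent to the new node's admin database, for example with PyMongo's client.admin.command(command). In mongosh the equivalent helper is rs.syncFrom().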
Replication lag can occur between the primary and secondary nodes for several reasons, including downtime (planned or unplanned) on the primary server, a network failure, or a disk failure. Whatever the reason, there are ways to speed things up. In his session, Leong illustrated how to use the WiredTiger cache size to accelerate replication between nodes.
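The article does not say which value Leong uses or in which direction he adjusts it, so the following is only a sketch of the moving parts. By default, WiredTiger sizes its internal cache at the larger of 50% of (RAM minus 1 GB) or 256 MB. The cache can be set at startup with the storage.wiredTiger.engineConfig.cacheSizeGB setting, or changed on a running node through the wiredTigerEngineRuntimeConfig server parameter. The function names and the 64 GB and 8 GB figures below are illustrative assumptions.

```python
def default_wiredtiger_cache_gb(ram_gb: float) -> float:
    """Default internal cache: the larger of 50% of (RAM - 1 GB) or 0.25 GB."""
    return max(0.5 * (ram_gb - 1), 0.25)

def cache_size_runtime_command(cache_gb: float) -> dict:
    """Build a setParameter document that resizes the cache on a running node.

    The size is expressed in megabytes so fractional gigabytes stay exact.
    """
    cache_mb = int(cache_gb * 1024)
    return {
        "setParameter": 1,
        "wiredTigerEngineRuntimeConfig": f"cache_size={cache_mb}M",
    }

print(default_wiredtiger_cache_gb(64))  # 31.5
print(cache_size_runtime_command(8))
# {'setParameter': 1, 'wiredTigerEngineRuntimeConfig': 'cache_size=8192M'}
```

Runtime changes to the storage engine are best tried on a non-production replica first, and the right size depends on how much memory the node needs for everything else it is doing.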
Changes in sync sources
Leong uses the term sync topology to describe how primary and secondary nodes are configured for syncing data between them. In some scenarios, a secondary node can change its sync source, and with it the sync topology, from one node to another, perhaps because the first node was busy at the time. MongoDB makes this change automatically, and it can easily go unnoticed unless someone checks the log.
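Besides reading the log, a sync source change can be spotted by comparing two snapshots of replica set status taken at different times. The sketch below assumes documents shaped like replSetGetStatus output, where each member reports its current source in syncSourceHost (empty for the primary). The hosts are hypothetical.

```python
def sync_sources(status: dict) -> dict:
    """Map each member name to the host it currently syncs from."""
    return {m["name"]: m.get("syncSourceHost", "") for m in status["members"]}

def sync_source_changes(before: dict, after: dict) -> dict:
    """Return {member: (old_source, new_source)} for members whose source moved."""
    prev, curr = sync_sources(before), sync_sources(after)
    return {
        name: (prev[name], curr[name])
        for name in curr
        if name in prev and prev[name] != curr[name]
    }

# Two hand-written snapshots: dc2-a switches from the primary to a remote secondary.
snapshot_1 = {"members": [
    {"name": "dc1-a:27017", "syncSourceHost": ""},
    {"name": "dc2-a:27017", "syncSourceHost": "dc1-a:27017"},
    {"name": "dc3-a:27017", "syncSourceHost": "dc1-a:27017"},
]}
snapshot_2 = {"members": [
    {"name": "dc1-a:27017", "syncSourceHost": ""},
    {"name": "dc2-a:27017", "syncSourceHost": "dc3-a:27017"},
    {"name": "dc3-a:27017", "syncSourceHost": "dc1-a:27017"},
]}

changes = sync_source_changes(snapshot_1, snapshot_2)
print(changes)  # {'dc2-a:27017': ('dc1-a:27017', 'dc3-a:27017')}
```

Running a comparison like this on a schedule turns a silent topology change into something that can be alerted on.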
Fixing cross-data center write concerns
According to Leong, when write performance decreases, 99% of the time it's because of a change in sync sources. To be proactive, Leong created a write performance monitor that identifies and self-heals decreases in write performance, so he doesn't have to find out the hard way (from users).
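The article does not describe how Leong's monitor is built, so the following is a hypothetical minimal version of the detection half only. It keeps a rolling window of write latencies and reports degradation when the window's median exceeds a multiple of a known-good baseline. The class name, thresholds, and sample latencies are all assumptions.

```python
from collections import deque
from statistics import median

class WriteLatencyMonitor:
    """Flag sustained write slowdowns relative to a known-good baseline."""

    def __init__(self, baseline_ms: float, factor: float = 3.0, window: int = 20):
        self.baseline_ms = baseline_ms
        self.factor = factor
        self.samples = deque(maxlen=window)

    def degraded(self) -> bool:
        # Wait for a full window so a single slow write does not trigger an alert.
        if len(self.samples) < self.samples.maxlen:
            return False
        return median(self.samples) > self.baseline_ms * self.factor

    def record(self, latency_ms: float) -> bool:
        """Add one measured write latency and report the current state."""
        self.samples.append(latency_ms)
        return self.degraded()

monitor = WriteLatencyMonitor(baseline_ms=10, factor=3.0, window=5)
results = [monitor.record(ms) for ms in [9, 11, 10, 12, 10, 80, 85, 90]]
print(results)  # [False, False, False, False, False, False, False, True]
```

Using the median means the alert fires only after most of the window is slow, which is the third slow write in this run. A self-healing step could then check the sync sources and, if one has moved, point the node back at its intended source.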
Other critical performance issues covered in the session include chained replication, which is the process by which secondary nodes replicate from other secondaries rather than from the primary; changing the write concern when a secondary node goes down; and configuring write concerns across Availability Zones in AWS.
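To make these settings concrete, the sketch below edits a replica set configuration document of the kind returned by replSetGetConfig. It sets settings.chainingAllowed and registers a custom write concern mode under settings.getLastErrorModes that requires acknowledgment from members in two distinct values of an az tag. The replica set name, hosts, tag values, and the multiAZ mode name are hypothetical, and whether to disable chaining is a judgment call that the article does not settle.

```python
import copy
from typing import Optional

def reconfig_command(config: dict, *, chaining_allowed: bool,
                     write_modes: Optional[dict] = None) -> dict:
    """Return a replSetReconfig command built from a modified copy of `config`."""
    new = copy.deepcopy(config)
    settings = new.setdefault("settings", {})
    settings["chainingAllowed"] = chaining_allowed
    if write_modes:
        settings.setdefault("getLastErrorModes", {}).update(write_modes)
    # A reconfig must carry a higher version than the current configuration.
    new["version"] = config["version"] + 1
    return {"replSetReconfig": new}

# Hypothetical three-member set tagged by AWS Availability Zone.
current_config = {
    "_id": "rs0",
    "version": 7,
    "members": [
        {"_id": 0, "host": "a.example:27017", "tags": {"az": "us-east-1a"}},
        {"_id": 1, "host": "b.example:27017", "tags": {"az": "us-east-1b"}},
        {"_id": 2, "host": "c.example:27017", "tags": {"az": "us-east-1c"}},
    ],
}

command = reconfig_command(
    current_config,
    chaining_allowed=False,
    write_modes={"multiAZ": {"az": 2}},  # two distinct "az" tag values must acknowledge
)

# Write concern documents an application could then use.
majority_concern = {"w": "majority", "wtimeout": 5000}
multi_az_concern = {"w": "multiAZ", "wtimeout": 5000}

print(command["replSetReconfig"]["settings"])
# {'chainingAllowed': False, 'getLastErrorModes': {'multiAZ': {'az': 2}}}
```

The wtimeout value keeps a write from blocking indefinitely if the required members are unreachable, which matters when a secondary in a remote data center goes down.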
For more details, watch the complete session from MongoDB World 2022: Performance Gotchas of Replicas Spanning Multi Datacenters.