Geospatial Performance Improvements in MongoDB 3.2
Background
MongoDB supports geospatial data and specialized indexes that make building applications with geospatial features easy and scalable. One of the most popular features, the $geoNear operator, returns documents in order, from nearest to farthest with respect to a given point. To avoid sorting the entire collection in one go, the $geoNear algorithm iteratively expands its search in distance intervals (the red annulus shown below), aiming to have a few hundred documents per interval. Searching all documents in an interval is accomplished by finding an index cell covering (the set of grey index cells shown below). This covering ensures that all of the documents in the interval can be found using an index scan. The documents in the covering but not in the interval are filtered out afterwards. After all of the documents in an interval are found, they are sorted and returned.

This cover-filter-and-sort procedure is repeated for each interval as the search area expands, until a limit is reached or the whole world has been scanned. The following image shows the first three stages (intervals) of this algorithm.

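For reference, a $geoNear query in the aggregation framework looks like the following. This is a generic usage example rather than code from this project; the collection name, coordinates, and distance are placeholders, and it assumes a 2dsphere index on the location field.

db.places.aggregate([
    {
        $geoNear: {
            near: { type: "Point", coordinates: [ -73.9667, 40.78 ] },   // query point (longitude, latitude)
            distanceField: "dist.calculated",                            // computed distance is written here
            maxDistance: 2000,                                           // in metres, since spherical is true
            spherical: true
        }
    }
])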
In MongoDB 3.0, the performance of $geoNear was sometimes slow for queries on dense datasets. This post will discuss how the $geoNear algorithm has been substantially improved in MongoDB 3.2 by explaining a series of issues and their solutions.
1. Repeated Index Scans
Overview
An issue arises when distance intervals have coverings that overlap significantly. Coverings must include the entire area of the interval in order to ensure the ordering of returned documents is correct, so there will always be some extra area covered. Since coverings can never be exact, the coverings of two consecutive intervals will always overlap, and the index keys in the overlap are scanned multiple times. This results in a disproportionately high number of index keys scanned compared to the number of documents returned.

The original $geoNear algorithm did not account for this problem. At every interval, it would create a covering independent of what was already covered during searches of previous intervals. Although it would filter out index keys that were disjoint from the interval to avoid too many extra document fetches, the repeated index scans could become very expensive in dense datasets where the search intervals were thin.
Solution
To avoid repeated index scans, the new $geoNear algorithm fetches every document in the covering and buffers the documents that are not in the search interval for use in a following search interval. Since it has already fetched every document in previous coverings, it also keeps track of the area those coverings spanned to ensure that the current covering has no overlap with them. To accomplish this, the algorithm maintains a union of the cells that have already been visited. At each interval, it takes the difference between the current covering and the union of visited cells.
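The bookkeeping can be sketched as follows. This is an illustrative sketch only, assuming a covering is available as a list of cell identifiers; coveringOf() is a hypothetical helper, and the real logic is C++ inside the $geoNear stage, operating on S2 cell unions that also account for parent/child cell containment rather than a flat set of IDs.

var visited = new Set();                             // cells already scanned in earlier intervals

function cellsToScan(interval) {
    var covering = coveringOf(interval);             // cells covering the current annulus (hypothetical helper)
    var fresh = covering.filter(function (c) { return !visited.has(c); });
    covering.forEach(function (c) { visited.add(c); });
    return fresh;                                    // only these cells need a new index scan
}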


Since the algorithm now only needs to visit the cells that have not been visited before, the number of index scans required in large queries drops dramatically. Also, the cost of maintaining a union of visited cells and taking the difference between the coverings is negligible compared to the time saved on index scans because it only needs to be done once per interval. However, this new algorithm introduces unnecessary document fetches in queries with few search intervals.
2. Unnecessary Document Fetches
Overview
While the new algorithm significantly outperforms the original algorithm when there are many search intervals, it causes a slight performance regression in cases where there are very few search intervals. As seen in Figure 6 below, the original $geoNear algorithm avoided document fetches by filtering out index keys that were disjoint from the current interval before passing them to the fetch stage. The new algorithm prevents this behavior because every document in a previous covering must already have been fetched to ensure that none of the documents are missed. In queries with very few search intervals, the cost of fetching the extra documents in the covering can sometimes outweigh the benefits of avoiding repeated index scans.

Solution Attempt
Instead of buffering all of the documents in the covering, buffering both index keys and documents seemed to be a promising approach. Since the algorithm could know before the fetch stage whether or not an index key was disjoint from the interval, it could buffer disjoint index keys to be fetched in a later stage. This would be done by inserting an extra stage between the index scan and the fetch that passed scanned keys to the fetch stage if they intersected the interval, or added them to the buffer if not. If it added a key to the buffer, it would need to calculate the distance from the key to the query point so that the keys could be ordered in a priority queue. This ordering would allow the following search stages to just check the top of the queue to find a key that may intersect with their interval.
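A sketch of that attempted design, which was ultimately not shipped: intersects() and distanceTo() are hypothetical helpers, and a sorted array stands in for the priority queue.

var keyBuffer = [];                                  // ordered by distance from the query point

function processKey(key, interval, queryPoint) {
    if (intersects(key, interval)) {                 // key overlaps the current annulus
        return key;                                  // pass it straight to the fetch stage
    }
    keyBuffer.push({ key: key, dist: distanceTo(key, queryPoint) });
    keyBuffer.sort(function (a, b) { return a.dist - b.dist; });   // keep the nearest key at the head
    return null;                                     // nothing to fetch for this key yet
}

// A later interval only needs to peek at keyBuffer[0] to find a buffered key
// that may intersect its annulus.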

The problem with this approach was that the cost of maintaining this key buffer and calculating the distances for the keys often outweighed the cost of fetching the documents. While it showed improvement in dense datasets on queries with few search intervals, it caused a performance regression in other queries. The benefits of this change were insufficient to warrant the additional complexity introduced by the extra stage. However, other changes such as improved query and indexing levels (explained below) significantly mitigated the problem of unnecessary fetches.
3. Large Initial Radius
Overview
$geoNear determines its initial search radius using a density estimator, which returns an estimated distance aiming to retrieve thirty documents in the first interval. However, the indexing and query constraints caused the density estimator to return a distance that was often much larger than needed. While this was not an issue for sparse datasets or queries with a high $maxDistance, it performed poorly for dense datasets and queries over small areas. Figure 8 shows an example of this problem.

The density estimator begins its search at the finest level at which data can be indexed, then recurses up through parent cells until it finds any indexed data. Because the density estimator is meant to be a lightweight precursor to the $geoNear algorithm, it does not actually scan data; instead, it only checks whether any indexed data exists in each cell it visits. Once it finds data, it naïvely assumes that the average distance to that data is the average cell edge length at that level. Unfortunately, the finest possible level at which data could be indexed was too coarse for many cases: it had an average cell size of 500m x 500m. This often made the assumption incorrect and inflated the initial estimated distance.
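The estimator's walk up the cell hierarchy can be sketched roughly as follows; cellContaining(), parentOf(), hasIndexedData(), and avgEdgeLength() are hypothetical helpers, and the real estimator is implemented in C++ on S2 cells, where a higher level means a finer cell.

function estimateInitialRadius(queryPoint, finestIndexedLevel) {
    var level = finestIndexedLevel;                  // start as fine as the index allows
    var cell = cellContaining(queryPoint, level);
    while (level > 0 && !hasIndexedData(cell)) {     // an existence check only, no data is scanned
        level -= 1;                                  // recurse up to the parent cell
        cell = parentOf(cell);
    }
    return avgEdgeLength(level);                     // assume nearby documents are about one cell edge away
}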
To give a concrete example of why this occurs, Figure 9 shows a toy example of a density estimator. In this example, points are indexed at level 1 (the coarsest level), and we try to estimate the distance necessary to retrieve one document starting at level 3 (the finest level).

Solution
Starting the density estimator at a finer level than the level where data is indexed would not solve our problem. As shown in Figure 9, no data will be found until the estimator at least reaches the level where data is indexed. Originally, data could not be indexed finer than the 500m x 500m level. This was strictly enforced by a constraint called the finest indexed level.
Increasing the finest index level constraint would solve this issue, but it would cause other problems. Figure 10 shows the covering of an indexed polygon. Increasing the finest index level causes coverings of some geometries to have many more cells and increases the overall index size.

However, most use cases of $geoNear queries involve querying for points. Since points are always covered by a single cell (with the exception of rare edge cases), increasing the finest indexed level exclusively for points does not increase the number of index keys generated. Now the density estimator can see points at a finer granularity and generate a more accurate estimated distance.

4. Different Index Keys
Overview
Changing the index keys of points meant that previously indexed points no longer had valid index keys.
Solution
We introduced version 3 of the 2dsphere index, which distinguishes the new index keys from the old ones and adds other performance improvements.
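In MongoDB 3.2, newly created 2dsphere indexes use version 3 by default; the version can also be requested explicitly when creating the index (the collection and field names below are placeholders):

db.places.createIndex(
    { location: "2dsphere" },
    { "2dsphereIndexVersion": 3 }
)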
5. Longer Index Keys
Overview
Although indexing points at the finest level did not increase the number of index keys, it did increase the size of the index because of the way index keys were represented. The index keys were originally stored as strings, and indexing at a finer cell level meant adding characters to the string form of each key.
Solution
Each index key can be represented in 64 bits. Since we were bumping the index version anyway, we decided to change the format to a NumberLong to bring down the index size. We also get the additional performance benefit of comparing numbers instead of strings. Figure 12 shows the differences in index sizes for 500,000 points indexed in three different formats: the original string at the coarser level, a string at the finest level, and a NumberLong at the finest level. Both MMAPv1 and WiredTiger are shown here.
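To compare index sizes on your own data, the shell can report per-index sizes in bytes (the collection name here is just an example):

db.parse.stats().indexSizes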

6. Coarse Query Level
Overview
Fixing the density estimator still didn’t solve our problem of fetching too many documents in the first interval. Although the initial radius was now more accurate, the covering of the query region was still too coarse.

Solution
The query region was covered with the same level constraints as indexed geometries. However, the problem with increasing the finest level constraint, shown in Figure 10, does not apply to the query region: a query covering may have many cells without any significant impact. Therefore, we split the level constraints into separate index and query constraints. The query finest level was then set much finer, to a level where the average cell edge length is 1m.
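Conceptually, the split looks something like the following sketch; levelWithAvgEdge() is a hypothetical helper mapping a target edge length to an S2 cell level, and the actual values live in MongoDB's C++ covering parameters.

// The coarser constraint still applies when covering non-point indexed geometries.
var indexCoveringParams = { finestLevel: levelWithAvgEdge("500m") };
// Query coverings may be much finer without growing the index.
var queryCoveringParams = { finestLevel: levelWithAvgEdge("1m") };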

Finally, we can now see the improvement in our original query in Figure 15.

Conclusion
By implementing these changes, we saw significant improvements across several types of queries.
Queries over dense datasets saw improvement from avoiding repeated index scans. We measured the difference in queries over a sample dataset provided by Parse. As an example, the following query highlights the improvement.
db.parse.find({ location: { $near: { $geometry: { type: "Point", coordinates: [ 106.6331, 10.7395 ] }, $maxDistance: 10000 } } })
We compared results in MongoDB 3.0 with the old index format against MongoDB 3.1.6 with the new index format. Here were our results:

Queries over small areas also saw a significant improvement from the more accurate initial radius and tighter query covering. Our internal performance tool, mongo-perf, observed the following improvements in throughput over a uniformly distributed dataset:

Acknowledgements
We’d like to thank Siyuan Zhou and Eric Milkie for their mentorship throughout this project. We’re also grateful to David Storch and Max Hirschhorn for reviewing our work and Kamran Khan, Rui Zhang, and David Daly for their guidance on testing.
About the Authors - Brandon & Kevin
Brandon Zhang is a rising junior at Cornell University studying Computer Science. He worked on the Kernel team during his internship to improve the performance of $geoNear queries.
Kevin Albertson is a rising senior at Rutgers University studying Computer Science and Mathematics. He worked on the Kernel team during his internship to improve performance on the 2dsphere index.