Geospatial Performance Improvements in MongoDB 3.2

Brandon Zhang and Kevin Albertson

#Technical

Background

MongoDB supports geospatial data and specialized indexes that make building applications with geospatial features easy and scalable. One of the most popular features, the $geoNear operator, returns documents in order, from nearest to farthest with respect to a given point. To avoid sorting the entire collection in one go, the $geoNear algorithm iteratively expands its search in distance intervals (the red annulus shown below), aiming to have a few hundred documents per interval. Searching all documents in an interval is accomplished by finding an index cell covering (the set of grey index cells shown below). This covering ensures that all of the documents in the interval can be found using an index scan. The documents in the covering but not in the interval are filtered out afterwards. After all of the documents in an interval are found, they are sorted and returned.

Figure 1: Interval and its corresponding Index Cell Covering

This cover-filter-and-sort procedure is repeated for each interval as the search area expands, until a limit is reached or the whole world has been scanned. The following image shows the first three stages / intervals of this algorithm.

Figure 2: First three stages of a $geoNear search.
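To make the control flow concrete, here is a minimal, planar sketch of the interval-expansion idea in Python. It is illustrative only: a brute-force bounding-box check stands in for the real index scan and S2 cell coverings, and the `step` and `max_radius` parameters are assumptions.

```python
import math

def geo_near(points, query, step, max_radius):
    """Toy, planar model of the $geoNear interval expansion. A brute-force
    bounding-box check stands in for the real index scan and covering."""
    dist = lambda p: math.hypot(p[0] - query[0], p[1] - query[1])
    inner = 0.0
    while inner < max_radius:
        outer = min(inner + step, max_radius)
        # "Cover": gather candidates with an over-approximate bounding box,
        # the way a cell covering over-approximates the annulus.
        candidates = [p for p in points
                      if abs(p[0] - query[0]) <= outer
                      and abs(p[1] - query[1]) <= outer]
        # "Filter": drop candidates that fall outside the current annulus.
        in_interval = [p for p in candidates if inner <= dist(p) < outer]
        # "Sort": order this interval's documents and emit them.
        yield from sorted(in_interval, key=dist)
        inner = outer

# Nearest-first traversal of a few points around the origin:
pts = [(1, 1), (5, 0), (0.5, 0.2), (3, 4)]
print(list(geo_near(pts, query=(0, 0), step=2.0, max_radius=10.0)))
```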

In MongoDB 3.0, the performance of $geoNear was sometimes slow for queries on dense datasets. This post will discuss how the $geoNear algorithm has been substantially improved in MongoDB 3.2 by explaining a series of issues and their solutions.

1. Repeated Index Scans

Overview

An issue arises when distance intervals have coverings that overlap significantly. A covering must include the entire area of its interval to guarantee the ordering of returned documents, so there will always be some extra area covered. Since coverings can never be exact, two consecutive intervals will always have overlapping coverings, and the same index keys will be scanned multiple times. This results in a disproportionately high number of index keys scanned compared to the number of documents returned.

Figure 3: Stage 2 (left) and Stage 3 (right) of a `$geoNear` search. Red represents the overlapping coverings of the consecutive intervals.

The original $geoNear algorithm did not account for this problem. At every interval, it would create a covering independent of what had already been covered during searches of previous intervals. Although it filtered out index keys that were disjoint from the interval to avoid too many extra document fetches, the repeated index scans could become very expensive in dense datasets where the search intervals were thin.

Solution

To avoid repeated index scans, the new $geoNear algorithm fetches every document in the covering and buffers the documents that fall outside the search interval for use in a following interval. Since it has already fetched every document in previous coverings, it also keeps track of the area those coverings spanned to ensure that the current covering has no overlap. To accomplish this, the algorithm maintains a union of the cells that have already been visited. At each interval, it takes the difference between the current covering and that union.

Figure 4: Taking the difference between the current covering and the visited cells gives the cells that need to be scanned for the current interval.

Figure 5: Stage 2 (left) and Stage 3 (right) of a $geoNear search. The covering of the third stage has no overlap with any previous coverings.
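The bookkeeping reduces to a set difference. The sketch below models cell coverings as plain Python sets with hypothetical cell IDs; MongoDB's implementation operates on S2 cell unions, but the difference-and-accumulate idea is the same.

```python
def cells_to_scan(current_covering, visited):
    """Return only the not-yet-scanned cells and fold them into the union
    of visited cells. Coverings are modeled as plain sets of cell IDs."""
    new_cells = current_covering - visited   # difference with the visited union
    visited |= new_cells                     # grow the union of visited cells
    return new_cells

visited = set()
stage2 = {"a1", "a2", "b1", "b2"}            # hypothetical cell IDs
stage3 = {"b1", "b2", "c1", "c2"}            # overlaps stage2 on b1 and b2
print(cells_to_scan(stage2, visited))        # all four cells get scanned
print(cells_to_scan(stage3, visited))        # only c1 and c2 get scanned
```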

Since the algorithm now only needs to visit the cells that have not been visited before, the number of index scans required in large queries drops dramatically. Also, the cost of maintaining a union of visited cells and taking the difference between the coverings is negligible compared to the time saved on index scans because it only needs to be done once per interval. However, this new algorithm introduces unnecessary document fetches in queries with few search intervals.

2. Unnecessary Document Fetches

Overview

While the new algorithm significantly outperforms the original when there are many search intervals, it causes a slight performance regression when there are very few. As seen in Figure 6 below, the original $geoNear algorithm avoided document fetches by filtering out index keys that were disjoint from the current interval before passing them to the fetch stage. The new algorithm prevents this behavior because every document in a previous covering must have been fetched to ensure that none of the documents are missed. In queries with very few search intervals, the cost of fetching the extra documents in the covering can sometimes outweigh the benefit of avoiding repeated index scans.

Figure 6: The original $geoNear algorithm would only need to fetch the documents in the green area, since it filtered out disjoint index keys

Solution Attempt

Instead of buffering all of the documents in the covering, buffering both index keys and documents seemed like a promising approach. Since the algorithm could know before the fetch stage whether or not an index key was disjoint from the interval, it could buffer disjoint index keys to be fetched in a later stage. This would be done by inserting an extra stage between the index scan and the fetch that passed scanned keys to the fetch stage if they intersected the interval, or added them to a buffer if not. When adding a key to the buffer, it would need to calculate the distance from the key's cell to the query point so that the buffered keys could be ordered in a priority queue. This ordering would allow later search stages to simply check the top of the queue for a key that might intersect their interval.

Figure 7: Buffer index keys disjoint from the interval (red) to be fetched later.
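A sketch of what that buffering stage might have looked like, using Python's heapq as the priority queue. The class and method names here are illustrative, not MongoDB's internals.

```python
import heapq

class KeyBuffer:
    """Sketch of the attempted key buffer: index keys disjoint from the
    current interval wait in a min-heap ordered by their distance to the
    query point, so later intervals only inspect the top of the queue."""
    def __init__(self):
        self._heap = []

    def add(self, distance, key):
        heapq.heappush(self._heap, (distance, key))

    def pop_within(self, outer_radius):
        # Keys closer than the interval's outer edge may intersect it;
        # everything deeper in the heap is farther and keeps waiting.
        while self._heap and self._heap[0][0] < outer_radius:
            yield heapq.heappop(self._heap)[1]

buf = KeyBuffer()
buf.add(7.5, "cell-x")                 # hypothetical keys and distances
buf.add(2.1, "cell-y")
print(list(buf.pop_within(5.0)))       # ['cell-y']; 'cell-x' waits
```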

The problem with this approach was that the cost of maintaining the key buffer and computing distances for the keys often outweighed the cost of fetching the documents. While it showed improvement on dense datasets for queries with few search intervals, it caused a performance regression in other queries. The benefits of this change were insufficient to warrant the additional complexity introduced by the extra stage. However, other changes, such as the improved query and indexing levels explained below, significantly mitigated the problem of unnecessary fetches.

3. Large Initial Radius

Overview

$geoNear determines its initial search radius using a density estimator, which returns an estimated distance aiming to retrieve thirty documents in the first interval. However, the index and query level constraints caused the density estimator to return a distance that was often much larger than needed. While this was not an issue for sparse datasets or queries with a high $maxDistance, it performed poorly for dense datasets and queries over small areas. Figure 8 shows an example of this problem.

Figure 8: A $geoNear query with limit of 1, but the document is 1cm away!

The density estimator would begin its search at the finest level at which data could be indexed, then recurse up through parent cells until it found any indexed data. Because the density estimator is meant to be a lightweight precursor to the $geoNear algorithm, it does not actually scan data; it only checks whether any indexed data exists in each cell it visits. Once it finds data, it naïvely assumes that the average distance between documents is the average cell edge length at that level. Unfortunately, the finest possible level at which data could be indexed was too coarse for many cases, with an average cell size of 500m × 500m. This often made the assumption incorrect and inflated the initial estimated distance.

To give a concrete example of why this occurs, Figure 9 shows a toy example of a density estimator. In this example, points are indexed at level 1 (the coarsest level), and we try to estimate the distance necessary to retrieve one document starting at level 3 (the finest level).

Figure 9: Because points are not found until level 1, the distance is overestimated
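The estimator's logic can be sketched in a few lines. The callbacks `has_data_at` and `edge_length_at` are assumptions standing in for the cheap index existence check and the per-level cell geometry.

```python
def estimate_initial_radius(has_data_at, finest_level, edge_length_at):
    """Sketch of the density estimator: walk from the finest indexed level
    up through parent levels until any indexed data is found, then use
    that level's average cell edge length as the initial search radius.
    No documents are scanned, only cheap existence checks."""
    for level in range(finest_level, -1, -1):    # walk up toward level 0
        if has_data_at(level):
            # Naive assumption: average document distance equals this
            # level's average cell edge length.
            return edge_length_at(level)
    return float("inf")                          # empty index

# Toy setup mirroring Figure 9: data is only visible at level 1 of 3, and
# each coarser level doubles the average cell edge (roughly true of S2).
edge = lambda level: 500.0 * 2 ** (3 - level)    # hypothetical meters
print(estimate_initial_radius(lambda lvl: lvl <= 1, 3, edge))   # 2000.0
```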

Solution

Starting the density estimator at a finer level than the level where data is indexed would not solve our problem. As shown in Figure 9, no data will be found until the search at least reaches the level where data is indexed. Originally, data could not be indexed finer than the 500m × 500m level. This was strictly enforced by a constraint called the finest indexed level.

Increasing the finest indexed level constraint would solve this issue, but it would cause other problems. Figure 10 shows the covering of an indexed polygon: increasing the finest indexed level causes the coverings of some geometries to have many more cells, which increases the overall index size.

Figure 10: Covering of a polygon with a high finest indexed level

However, most $geoNear use cases involve querying for points. Since a point is always covered by a single cell (with the exception of rare edge cases), increasing the finest indexed level exclusively for points does not increase the number of index keys generated. Now the density estimator can see points at a finer granularity and generate a more accurate estimated distance.

Figure 11: Indexing points at a finer level
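A toy quadtree model shows why this is safe for points but not for regions: a point always falls in exactly one cell per level, while a region's full-resolution covering grows roughly fourfold per level. (Real S2 coverings mix cell levels, so this overstates the growth, but the asymmetry is the point.)

```python
def point_cell(x, y, level):
    """The single level-`level` quadtree cell containing (x, y) in [0,1)^2."""
    cell_id = ""
    x0 = y0 = 0.0
    size = 1.0
    for _ in range(level):
        size /= 2
        qx = int(x >= x0 + size)
        qy = int(y >= y0 + size)
        cell_id += str(2 * qy + qx)
        x0 += qx * size
        y0 += qy * size
    return cell_id

def region_cells(xmin, ymin, xmax, ymax, level):
    """All level-`level` cells intersecting an axis-aligned box (full
    resolution; real S2 coverings mix levels to stay small)."""
    n = 2 ** level
    return [(i, j) for i in range(n) for j in range(n)
            if xmin < (i + 1) / n and xmax > i / n
            and ymin < (j + 1) / n and ymax > j / n]

print(point_cell(0.3, 0.7, 8))                   # one cell id at any level
for lvl in (2, 4, 6):
    print(lvl, len(region_cells(0.1, 0.1, 0.4, 0.4, lvl)))  # 4, 36, 400
```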

4. Different Index Keys

Overview

Changing the index keys of points meant that previously indexed points no longer had valid index keys.

Solution

We introduced version 3 of the 2dsphere index, which distinguishes the new index keys from the old ones and adds other performance improvements.
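For example, the index version can be chosen explicitly at creation time, shown here with pymongo (on MongoDB 3.2 and later, version 3 is already the default for new 2dsphere indexes; the database and collection names are hypothetical):

```python
from pymongo import MongoClient

coll = MongoClient()["test"]["places"]
# "2dsphereIndexVersion" selects the index key format; it is passed via
# dict unpacking because the option name starts with a digit.
coll.create_index([("location", "2dsphere")], **{"2dsphereIndexVersion": 3})
```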

5. Longer Index Keys

Overview

Although indexing points at the finest level did not increase the number of index keys, it did increase the size of the index because of the way index keys were represented. Index keys were originally stored as strings, and indexing at a finer cell level meant appending characters to the string form of each key.

Solution

Each index key can be represented in 64 bits. Since we were bumping the index version anyway, we decided to change the key format to a NumberLong to bring down the index size. We also gained the performance benefit of comparing numbers instead of strings. Figure 12 shows the index sizes for 500,000 points indexed in three different formats: the original string at the coarser level, a string at the finest level, and a NumberLong at the finest level. Both MMAPv1 and WiredTiger are shown.

Figure 12: Comparison of Index Sizes
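The size difference is easy to see with a back-of-the-envelope sketch. The cell path below is hypothetical and the exact string encoding MongoDB used differed, but the shape of the trade-off is the same: one character per extra level versus a fixed 8 bytes.

```python
import struct

# Hypothetical cell: a face number plus one base-4 child digit per level.
face, path = "2", "0301231002113002"                  # a level-16 cell

string_key = face + path                              # grows 1 byte per level
int_key = (int(face) << 61) | (int(path, 4) << 29)    # illustrative 64-bit layout

print(len(string_key.encode()), "bytes as a string key")         # 17, and growing
print(len(struct.pack(">Q", int_key)), "bytes as a NumberLong")  # always 8
```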

6. Coarse Query Level

Overview

Fixing the density estimator still didn’t solve our problem of fetching too many documents in the first interval. Although the initial radius was now more accurate, the covering of the query region was still too coarse.

Figure 13: A coarse covering of a small region

Solution

The query region was covered under the same level constraints as indexed geometries. However, the problem with increasing the finest level constraint, shown in Figure 10, does not apply to the query region: a query covering may have many cells without any significant impact. We therefore split the level constraints into separate index and query constraints, and set the query finest level much finer, to a level where the average cell edge length is 1m.

Figure 14: A more selective covering
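As a rough sanity check on that choice: each successively finer S2 level roughly halves a cell's average edge length, so going from ~500m cells down to ~1m cells takes about nine extra levels.

```python
import math

# Each finer S2 level roughly halves the average cell edge length.
extra_levels = math.ceil(math.log2(500 / 1))
print(extra_levels, "extra levels ->", 500 / 2 ** extra_levels, "m edges")
# 9 extra levels -> ~0.98 m edges
```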

Finally, we can now see the improvement in our original query in Figure 15.

Figure 15: The before and after of a $geoNear query on one document 1 cm away

Conclusion

By implementing these changes, we saw significant improvements across several types of queries.

Queries over dense datasets saw improvement from avoiding repeated index scans. We measured the difference in queries over a sample dataset provided by Parse. As an example, the following query highlights the improvement.

db.parse.find({ location: { $near: { $geometry: { type: "Point", coordinates: [ 106.6331, 10.7395 ] }, $maxDistance: 10000 } } })

We compared results in MongoDB 3.0 with the old index format against MongoDB 3.1.6 with the new index format. Here were our results:

Queries over small areas also saw a significant improvement from the more accurate initial radius and tighter query covering. Our internal performance tool, mongo-perf, observed the following improvements in throughput over a uniformly distributed dataset:

This certainly does not mean that performance has improved for all geospatial queries, but we have not found a regression in any of the other geospatial queries we tested.

Acknowledgements

We’d like to thank Siyuan Zhou and Eric Milkie for their mentorship throughout this project. We’re also grateful to David Storch and Max Hirschhorn for reviewing our work and Kamran Khan, Rui Zhang, and David Daly for their guidance on testing.


Want to learn more about MongoDB 3.2? Register for our upcoming webinar, What's New in MongoDB 3.2.

About the Authors - Brandon & Kevin

Brandon Zhang is a rising junior at Cornell University studying Computer Science. He worked on the Kernel team during his internship to improve the performance of $geoNear queries.

Kevin Albertson is a rising senior at Rutgers University studying Computer Science and Mathematics. He worked on the Kernel team during his internship to improve performance on the 2dsphere index.