Single-Collection Designs in MongoDB with Spring Data (Part 1)
Graeme RobinsonPublished Oct 13, 2022 • Updated Oct 14, 2022
Rate this tutorial
Modern document-based NoSQL databases such as MongoDB offer advantages over traditional relational databases for many types of applications. One of the key benefits is data models that avoid the need for normalized data spread across multiple tables requiring join operations that are both computationally expensive and difficult to scale horizontally.
In the first part of this series, we will discuss single-collection designs — one of the design patterns used to realize these advantages in MongoDB. In Part 2, we will provide examples of how the single-collection pattern can be utilized in Java applications using Spring Data MongoDB.
In this blog post, we discuss a database design for collecting and analyzing Automatic Dependent Surveillance-Broadcast (ADSB) data transmitted by aircraft. ADSB is a component of a major worldwide modernization of air-traffic control systems that moves away from dependency on radar (which is expensive to maintain and has limited range) for tracking aircraft movement and instead has the aircraft themselves transmit their location, speed, altitude, and direction of travel, all based on approved Global Navigation Satellite Systems such as GPS, GLONASS, Galileo, and BeiDou. Find more information about ADSB.
A number of consumer-grade devices are available for receiving ADSB transmissions from nearby aircraft. These are used by pilots of light aircraft to feed data to tablet and smart-phone based navigation applications such as Foreflight. This provides a level of situational awareness and safety regarding the location of nearby flight traffic that previously was simply not available even to commercial airline pilots. Additionally, web-based aircraft tracking initiatives, such as the Opensky Network, depend on community-sourced ADSB data to build their databases used for numerous research projects.
Whilst most ADSB receivers retail in the high hundreds-of-dollars price range, the rather excellent Stratux open-source project allows a complete receiver system to be built using a Raspberry Pi and cheap USB Software Defined Radios (SDRs). A complete system can be built from parts totalling around $200 (1). The Stratux receiver transmits data to listening applications either over a raw TCP/IP connection with messages adhering to the GDL90 specification designed and maintained by Garmin, or as JSON messages sent to subscribers to a websocket connection. In this exercise, we will simulate receiving messages from a Stratux receiver — a working receiver is not a prerequisite for completing the exercises. The database we will be building will track observed aircraft, the airlines they belong to, and the individual ADSB position reports picked up by our receiver.
In a traditional RDBMS-based system, we might settle on a normalized data model that looks like this: Each record in the airline table can be joined to zero or more aircraft records, and each aircraft record can be joined to zero or more ADSB position reports. Whilst this model offers a degree of flexibility in terms of querying, queries that join across tables are computationally intensive and difficult to scale horizontally. In particular, consider that over 3000 commercial flights are handled per day by airports in the New York City area and that each of those flights are transmitting a new ADSB position report every second. With ADSB transmissions for a flight being picked up by the receiver for an average of 15 minutes until the aircraft moves out of range, an ADSB receiver in New York alone could be feeding over 2.5 million position reports per day into the system. With a network of ADSB receivers positioned at major hubs throughout the USA, the possibility of needing to be able to scale out could grow quickly.
MongoDB has been designed from the outset to be easy to scale horizontally. However, to do that, the correct design principles and patterns must be employed, one of which is to avoid unnecessary joins. In our case, we will be utilizing the document data model, polymorphic collections, and the single-collection design pattern. And whilst it’s common practice in relational database design to start by normalizing the data before considering access patterns, with document-centric databases such as MongoDB, you should always start by considering the access patterns for your data and work from there, using the guiding principle that data that is accessed together should be stored together.
In MongoDB, data is stored in JSON (2) like documents, organized into collections. In relational database terms, a document is analogous to a record whilst a collection is analogous to a table. However, there are some key differences to be aware of.
A document in MongoDB can be hierarchical, in that the value of any given attribute (column in relational terms) in a document may itself be a document or an array of values or documents. This allows for data to be stored in a single document within a collection in ways that tabular relational database designs can’t support and that would require data to be stored across multiple tables and accessed using joins. Consider our airline to aircraft one-to-many and aircraft to ADSB position report one-to-many relationships. In our relational model, this requires three tables joined using primary-foreign key relationships. In MongoDB, this could be represented by airline documents, with their associated aircraft embedded in the same document and the ADSB position reports for each aircraft further embedded in turn, all stored in a single collection. Such documents might look like this:
By embedding the aircraft information for each airline within its own document, all stored within a single collection, we are able retrieve information for an airline and all its aircraft using a single query and no joins:
Embedded, hierarchical documents provide a great deal of flexibility in our data design and are consistent with our guiding principle that data that is accessed together should be stored together. However, there are some things to be aware of:
- For some airlines, the number of embedded aircraft documents could become large. This would be compounded by the number of embedded ADSB position reports within each associated aircraft document. In general, large, unbounded arrays are considered an anti-pattern within MongoDB as they can lead to excessively sized documents with a corresponding impact on update operations and data retrieval.
- There may be a need to access an individual airline or aircraft’s data independently of either the corresponding aircraft data or information related to other aircraft within the airline’s fleet. Whilst the MongoDB query aggregation framework allows for such shaping and projecting of the data returned by a query to do this, it would add extra processing overhead when carrying out such queries. Alternatively, the required data could be filtered out of the query returns within our application, but that might lead to unnecessary large data transmissions.
- Some aircraft may be operated privately, and not be associated with an airline.
One approach to tackling these problems would be to separate the airline, aircraft, and ADSB position report data into separate documents stored in three different collections with appropriate cross references (primary/foreign keys). In some cases, this might be the right approach (for example, if synchronizing data from mobile devices using Realm). However, it comes at the cost of maintaining additional collections and indexes, and might necessitate the use of joins ($lookup stages in a MongoDB aggregation pipeline) when retrieving data. For some of our access patterns, this design would be violating our guiding principle that data that is accessed together should be stored together. Also, as the amount of data in an application grows and the need for scaling through sharding of data starts to become a consideration, having related data separated across multiple collections can complicate the maintenance of data across shards.
Another option would be to consider using
the Subset Pattern which limits the number of embedded documents we maintain according to an algorithm (usually most recently received/accessed, or most frequently accessed), with the remaining documents stored in separate collections. This allows us to control the size of our hierarchical documents and in many workloads, cover our data retrieval and access patterns with a single query against a single collection. However, for our airline data use case, we may find that the frequency with which we are requesting all aircraft for a given airline, or all position reports for an aircraft (of which there could be many thousands), the subset pattern may still lead to many queries requiring joins.
One further solution, and the approach we’ll take in this article, is to utilize another feature of MongoDB: polymorphic collections. Polymorphic collections refer to the ability of collections to store documents of varying types. Unlike relational tables, where the columns of each table are pre-defined, a collection in MongoDB can contain documents of any design, with the only requirement being that every document must contain an “_id” field containing a unique identifier. This ability has led some observers to describe MongoDB as being schemaless. However, it’s more correct to describe MongoDB as “schema-optional.” You can define restrictions on the design of documents that are accepted by a collection using JSON Schema, but this is optional and at the discretion of the application developers. By default, no restrictions are imposed. It’s considered best practice to only store documents that are in some way related and/or will be retrieved in a single operation within the same collection, but again, this is at the developers’ discretion.
Utilizing polymorphic collection in our aerodata example, we separate our Airline, Aircraft, and ADSB position report data into separate documents, but store them all within a single collection. Taking this approach, the documents in our collection may end up looking like this:
There are a couple of things to note here. Firstly, with the airline, aircraft, and ADSB position reports separated into individual documents rather than embedded within each other, we can query for and return the different document types individually or in combination as needed.
Secondly, we have utilized a custom format for the “_id” field in each document. Whilst the “_id” field is always required in MongodB, the format of the value stored in the field can be anything as long as it’s unique within that collection. By default, if no value is provided, MongoDB will assign an objectID value to the field. However, there is nothing to prevent us using any value we wish, as long as care is taken to ensure each value used is unique. Considering that MongoDB will always maintain an index on the “_id” field, it makes sense that we should use a value in the field that has some value to our application. In our case, the values are used to represent the hierarchy within our data. Airline document “_id” fields contain the airline’s unique ICAO (International Civil Aviation Organization) code. Aircraft document “_id” fields start with the owning airline’s ICAO code, followed by an underscore, followed by the aircraft’s own unique ICAO code. Finally, ADSB position report document “_id” fields start with the airline ICAO code, an underscore, then the aircraft ICAO code, then a second underscore, and finally an incrementing message number.
Whilst we could have stored the airline and aircraft ICAO codes and ADSB message numbers in their own fields to support our queries, and in some ways doing so would be a simpler approach, we would have to create and maintain additional indexes on our collection against each field. Overloading the values in the “_id” field in the way that we have avoids the need for those additional indexes.
Lastly, we have added a helper field called recordType to each document to aid filtering of searches. Airline documents have a recordType value of 1, aircraft documents have a recordType value of 2, and ADSB position report documents have a recordType value of 3. To maintain query performance, the positionType field should be indexed.
With these changes in place, and assuming we have placed all our documents in a collection named “aerodata”, we can now carry out the following range of queries:
Retrieve all documents related to Delta Air Lines:
Retrieve Delta Air Lines’ airline document on its own:
Retrieve all aircraft documents for aircraft in Delta Air Lines’ fleet:
Retrieve the aircraft document for Airbus A319 with ICAO code "a36f7e" on its own:
Retrieve all ADSB position report documents for Airbus A319 with ICAO code "a36f7e":
In each case, we are able to retrieve the data we need with a single query operation (requiring a single round trip to the database) against a single collection (and thus, no joins) — even in cases where we are returning multiple documents of different types. Note the use of regular expressions in some of the queries. In each case, our search pattern is anchored to the start of the field value being searched using the “^” hat symbol. This is important when performing a regular expression search as MongoDB can only utilize an index on the field being searched if the search pattern is anchored to the start of the field.
The following search will utilize the index on the “_id” field:
The following search will not be able to utilize the index on the “_id” field and will instead perform a full collection scan:
In this first part of our two-part post, we have seen how polymorphic single-collection designs in MongoDB can provide all the query flexibility of normalized relational designs, whilst simultaneously avoiding anti-patterns, such as unbounded arrays and unnecessary joins. This makes the resulting collections highly performant from a search standpoint and amenable to horizontal scaling. In Part 2, we will show how we can work with these designs using Spring Data MongoDB in Java applications.
The example source code used in this series is available on Github.
(1) As of October 2022, pandemic era supply chain issues have impacted Raspberry Pi availability and cost. However for anyone interested in building their own Stratux receiver, the following parts list will allow a basic system to be put together:
(2) MongoDB stores data using BSON - a binary form of JSON with support for additional data types not supported by JSON. Get more information about the BSON specification.