Why XML-in-RDBMS != MongoDB
Introduction
There are many use cases where an application must store complex data structures that are either:
Sufficiently complex so as to make proper relational modeling burdensome.
"Application bespoke" the data structure is flexible and dynamic and it is inefficient or effectively impossible to model in the RDBMS.
The MongoDB document model and APIs and query language make these tasks easy to implement, grow, and maintain.
Why the XML-as-Column Approach Comes with a Large Set of Gotchas
Of late, an alternative design option using XML in an RDBMS has surfaced as a way to address these use cases. The basic approach is to encode the complex structure into XML and store it as a column element "peer" to other simpler, scalar fields in a table. There are many reasons why this is not "the same thing" as what can be done in MongoDB and why the XML-as-column approach comes with a large set of gotchas.
Let's explore some of the more important reasons, and in doing so, we'll use a small piece of rich data for reference.
data = {
name: "buzz",
lastUpdate: ISODate("20140606"),
pets: [ "dog", "cat" ],
milestones: [
{ what: "bday", d: ISODate("20140101") }
{ what: "hireDate", d: ISODate("20150101") }
]
}
The reference example above is Javascript which is the easiest to use to get the point across without extra language overhead. It looks a lot like JSON but it is not. Although it is often convenient to think about MongoDB documents in terms of JSON from a data modeling standpoint, in fact very few of the APIs deal directly with JSON as a string. It is undesirable to deal with JSON as a string for some of the same reasons as dealing with XML as a string, which we'll see in a moment.
Rules and Assumptions:
We'll use Java as the language for implementation but the concepts apply to any typed language.
The data access layer (DAL) communicates to the application with real types: dates, doubles, lists, and maps.
Issues with XML and the Advantage of the MongoDB Document Model
1. XML is a string
There are no dates or lists or binary data in XML. It is a formatted string, period. This means that something in the data access layer must be responsible for converting incoming types like dates into some appropriate string representation -- and therein loses type fidelity:
String xmlString = convertMapToXML(data);
buzz
20140606
dog
cat
bday
20140101
hireDate
20150101
If we save this XML in a column and then retrieve it, how do we properly and accurately reproduce the original structure in a generic way, i.e. without custom coding the XML-to-data logic?
Map data = convertXMLtoMap(xmlString); // how do we know 20150101 is a date?
A solution seems to be storing the XSD for the XML in another peer column and using it to create the proper types. Now, beyond the obvious performance implications of both extracting the XSD and running it through a validating/data conversion utility, the more subtle issue is that there are great many options available to parse, validate, and convert XML. It is difficult to enforce and test the combinations of these options to ensure that, say, a floating point number in Java encoded to XML can be properly decoded by a python script into the same number. The problem is magnified when XML attributes are brought into the picture.
With MongoDB, the Map -- lists, substructures, dates, and all -- is simply set into a field and persisted. No conversions, no loss of type fidelity, no extra software stack dependencies, and a guarantee that data saved by a Java program:
          
collection.insert(data);
can be extracted precisely by a python program:
          
d = collection.find_one({“name”:”buzz”})
2. XML has poor handling for arrays
XML provides no native way to tell a consumer if an element is a scalar or an enclosing array. In the worst case, without a convention for enclosing tags (like "pets" in the above example), you could encounter this:
     
dog
// Is this a scalar or a single entry for an array?
How about this?
dog
cat
Without an accompanying XSD or a strong set of conventions like
dog
cat
there is no way to precisely reproduce the list datatype in the host language.
3. Attributes in XML are on a different axis than nested tags
The following XML is legal but clearly confusing:
bar
Which field named type “wins” when creating the object in the data access layer? In virtually all host languages, the available Map and similar objects have only one key axis, so some sort of convention must be applied to disambiguate “type” as an attribute from “type” as a nested tag. Typically, the approach looks like this:
name: {
@type: "foo",
type: "bar"
}
This irritation is a hallmark of XML's legacy as a SGML text markup format. It was not designed to easily and robustly carry rich structured data, and it can be avoided by religiously not using attributes.
4. Features/Capabilities on XML in a column vary widely from RDBMS to RDBMS
In the worst case, the XML is really treated as a string and no query or update capabilities exist; the entire XML CLOB is fetched and saved. In the best case, XPath operators exist but the following important caveats exist:
a. Indexing capabilities, including supported types and options, vary widely if they exist at all
b. The XPath query language is very different than SQL and it is a non-trivial
exercise to robust construct queries that blend both.
c. XPath is even harder to dynamically construct than SQL.
Here is a representative example of the complexity of RDBMS XML using SQLServer2005. Product ID is a numeric type (e.g. not a string) and ProductDate is a peer column to ProductXML.
SELECT ProductXML.query('//description') FROM ProductList
WHERE ProductXML.exist('//product/header/[number(./@id[1]) = sql:variable("@product_id")]') = 1 AND ProductDate > ‘20140101’
In MongoDB, bespoke data, dynamic data, and "regular" data are all treated exactly the same way. The query language lends itself to powerful parameterized construction and is applied easily and uniformly to the entire data shape regardless of its complexity:
ProductList.find({
"product.header.id", productId,
“ProductDate", {“$gt”: ISODate("20140101")}
},{ "description":1 });
5. Interaction with XML varies across languages
Because XML is fundamentally a string, software must exist in each language to both turn it into some set of objects that can be used, and to format those objects to an XML string. It becomes difficult and error-prone to maintain and test this software across languages, and it is even harder to ensure and enforce its use in all circumstances. Failure to do so will result in subtle errors where, for example, a floating point number persisted in one language will be fetched with a different precision in another. Another common gotcha is inconsistent (or outright incorrect) handling of datetime and timezone.
All datatypes supported by MongoDB behave the same way in each of the 10
drivers supported by the company.
Summary
Designing rich data in XML is not the difficult part of the overall problem. One hard part is precisely converting it to the objects you are using in your software. And an even greater challenge is easily and performantly interacting with the database to query, filter, and index this rich data. The XML-in-RDBMS architecture quickly leads to a platform where an excessive amount of developer-maintained utility code must be created and maintained and impairs the smooth operation of SQL-based tools and products that interact well only with the basic scalar types. A flexible, rich shape document model coupled with a query language designed to work well with these rich shapes makes MongoDB the better solution.
If you’re interested in learning more about the architecture of MongoDB, read our guide:
Read the MongoDB Architecture Guide
June 16, 2015