BSON and Data Interchange

MongoDB

#Releases

There’s a lot of good things about JSON – it’s a standards based, language independent, representation of object-like data. Also, it’s easy to read (for users and programmers alike). Each document is only about data, not complex object graphs and links. Thus it’s easy to inspect without knowing all the code of an application.

Further, JSON is “schemaless”. We do not have to predefine our (protocol) schema. This can be quite helpful: imagine RPC’ing data from client A to server B with a fixed schema for the messages. On a schema change both need to be ‘updated’ with the new schema. If there are many components to the system it’s even more complicated of course. There is some analogy here to XML, which can (optionally) be schemaless.

It would be nice to have a binary representation of JSON. That is what BSON is all about.

So what are the goals of BSON? They are:

  1. Fast scan-ability. For very large JSON documents, scanning can be slow. To skip a nested document or array we have to scan through the intervening field completely. In addition as we go we must count nestings of braces, brackets, and quotation marks. In BSON, the size of these elements is at the beginning of the field’s value, which makes skipping an element easy.
  2. Easy manipulation. We want to be able to modify information within a document efficiently. For example, incrementing a field’s value from 9 to 10 in a JSON text document might require shifting all the bytes for the remainder of the document, which if the document is large could be quite costly. (Albeit this benefit is not comprehensive: adding an element to an array mid-document would result in shifting.) It’s also helpful to not need to translate the string “9” to a numeric type, increment, and then translate back.
  3. Additional data types. JSON is potentially a great interchange format, but it would be nice to have a a few more data types. Most importantly is the addition of a “byte array” data type. This avoids any need to do base64 encoding, which is awkward.


One thing that is not a goal of BSON: compactness. The JSON document {“field”:7} represents the number seven as a single byte. That’s pretty good.

Perhaps the best example to date of usage of BSON is MongoDB. MongoDB uses it heavily – for sending documents over the network, persisting them to disk, as well as for internal data manipulations. In fact this is where BSON originated, although today it is a separate spec that should not be considered coupled to any one particular project.

There is BSON serialization and deserialization code for most languages; implementations are open source and mostly available under Apache 2 license. Quite a few implementations originated from a MongoDB drivers; work is underway in most drivers to fully decouple, although independent use works fine today.

If you or someone you know is using BSON in a project, please let us know by posting on the BSON mailing list. Check out bsonspec.org for more information.

Spencer & Dwight