Docs Menu
Docs Home
/ / /
Ruby Driver
/

Document Data Format: BSON

In this guide, you can learn about the BSON data format, how MongoDB uses BSON to organize and store data, and how to install the BSON library independently of the Ruby driver.

BSON, or Binary JSON, is the data format that MongoDB uses to organize and store data. This data format includes all JSON data structure types and adds support for types including dates, differently-sized integers (32-bit and 64-bit), ObjectIds, and binary data. For a complete list of supported types, see the BSON Types in the MongoDB Server documentation.

BSON is not human-readable, but you can use the Ruby BSON library to convert it to the human-readable JSON representation. You can read more about the relationship between these formats in the JSON and BSON guide on the MongoDB website.

You can install the BSON library (bson) from Rubygems manually or by using the bundler.

Run the following command to install the bson gem:

gem install bson

To install the gem by using bundler, include the following line in your application's Gemfile:

gem 'bson'

The BSON library is compatible with MRI v2.5 and later and JRuby v9.2 and later.

Serialization for classes defined in Active Support, such as TimeWithZone, is not loaded by default to avoid a hard dependency of BSON on Active Support. When using BSON in an application that also uses Active Support, you must require the Active Support code support:

require 'bson'
require 'bson/active_support'

You can retrieve a Ruby object's raw BSON representation by calling to_bson on the object. The to_bson method returns a BSON::ByteBuffer.

The following code demonstrates how to call the to_bson method on Ruby objects:

"Shall I compare thee to a summer's day".to_bson
1024.to_bson

You can generate a Ruby object from BSON by calling from_bson on the class you wish to instantiate and passing it a BSON::ByteBuffer instance:

String.from_bson(byte_buffer)
BSON::Int32.from_bson(byte_buffer)

bson v4.0 introduces the use of native byte buffers in MRI and JRuby instead of using StringIO for improved performance.

To create a ByteBuffer for writing, instantiate a BSON::ByteBuffer with no arguments:

buffer = BSON::ByteBuffer.new

To write raw bytes to the byte buffer with no transformations, use the put_byte and put_bytes methods. Each method takes a byte string as its argument and copies this string into the buffer. The put_byte method enforces that the argument is a string of length 1. put_bytes accepts any length of strings. The strings can contain null bytes.

The following code demonstrates how to write raw bytes to a byte buffer:

buffer.put_byte("\x00")
buffer.put_bytes("\xff\xfe\x00\xfd")

Note

put_byte and put_bytes do not write a BSON type byte to the buffer before writing the byte string. This means that the buffer does not information about the type of data that the raw byte string encodes.

The write methods described in the following sections write objects of particular types in the BSON specification. The type indicated by the method name takes precedence over the type of the argument. For example, if a floating-point value is passed to put_int32, it is coerced into an integer, and the driver writes the resulting integer to the byte buffer.

To write a UTF-8 string (BSON type 0x02) to the byte buffer, use the put_string method:

buffer.put_string("hello, world")

BSON strings are always encoded in UTF-8. This means that the argument to put_string must be either in UTF-8 or in an encoding convertable to UTF-8 (not binary). If the argument is in an encoding other than UTF-8, the string is first converted to UTF-8 and then the UTF-8 encoded version is written to the buffer. The string must be valid in its claimed encoding. The string can contain null bytes.

The BSON specification also defines a CString type, which is used, for example, for document keys. To write CStrings to the buffer, use put_cstring:

buffer.put_cstring("hello, world")

As with regular strings, CStrings in BSON must be UTF-8 encoded. If the argument is not in UTF-8, it is converted to UTF-8 and the resulting string is written to the buffer. Unlike put_string, the UTF-8 encoding of the argument given to put_cstring cannot have any null bytes, since the CString serialization format in BSON is null-terminated.

Unlike put_string, put_cstring also accepts symbols and integers. In all cases the argument is stringified prior to being written to the buffer:

buffer.put_cstring(:hello)
buffer.put_cstring(42)

To write a 32-bit or a 64-bit integer to the byte buffer, use put_int32 and put_int64 methods, respectively. Note that Ruby integers can be arbitrarily large; if the value being written exceeds the range of a 32-bit or a 64-bit integer, put_int32 and put_int64 raise a RangeError error.

The following code demonstrates how to write integer values to a byte buffer:

buffer.put_int32(12345)
buffer.put_int64(123456789012345)

Note

If put_int32 or put_int64 are given floating point arguments, the arguments are first coerced into integers and the integers are written to the byte buffer.

To write a 64-bit floating point value to the byte buffer, use put_double:

buffer.put_double(3.14159)

To retrieve the serialized data as a byte string, call to_s on the buffer:

buffer = BSON::ByteBuffer.new
buffer.put_string('testing')
socket.write(buffer.to_s)

Note

ByteBuffer keeps track of read and write positions separately. There is no way to rewind the buffer for writing. The rewind method affects only the read position.

To create a ByteBuffer for reading, or deserializing from BSON, instantiate BSON::ByteBuffer with a byte string as the argument:

buffer = BSON::ByteBuffer.new(string)

You can read from the buffer by using following methods that correspond to different data types:

buffer.get_byte # Pulls a single byte from the buffer
buffer.get_bytes(value) # Pulls n number of bytes from the buffer
buffer.get_cstring # Pulls a null-terminated string from the buffer
buffer.get_double # Pulls a 64-bit floating point from the buffer
buffer.get_int32 # Pulls a 32-bit integer (4 bytes) from the buffer
buffer.get_int64 # Pulls a 64-bit integer (8 bytes) from the buffer
buffer.get_string # Pulls a UTF-8 string from the buffer

To restart reading from the beginning of a buffer, use rewind:

buffer.rewind

Note

ByteBuffer keeps track of read and write positions separately. The rewind method affects only the read position.

The following list provides the Ruby classes that have representations in the BSON specification and have a to_bson method defined:

  • Object

  • Array

  • FalseClass

  • Float

  • Hash

  • Integer

  • BigDecimal

  • NilClass

  • Regexp

  • String

  • Symbol (deprecated)

  • Time

  • TrueClass

In addition to the core Ruby objects, BSON also provides some special types specific to the specification. The following sections describe other types that are supported in the driver.

Use BSON::Binary objects to store arbitrary binary data. You can construct Binary objects from binary strings, as shown in the following code:

BSON::Binary.new("binary_string")
# => <BSON::Binary:0x47113101192900 type=generic data=0x62696e6172795f73...>

By default, Binary objects are created with BSON binary subtype 0 (:generic). You can explicitly specify the subtype to indicate that the bytes encode a particular type of data:

BSON::Binary.new("binary_string", :user)
# => <BSON::Binary:0x47113101225420 type=user data=0x62696e6172795f73...>

The valid subtype specifications are :generic, :function, :old, :uuid_old, :uuid, :md5, and :user.

You can use the data and type attributes to retrieve a Binary object's data and the subtype, as shown in the following code:

binary = BSON::Binary.new("binary_string", :user)
binary.data
# => "binary_string"
binary.type
# => :user

You can compare Binary objects by using the <=> operator, which allows you to sort objects that have the same binary subtype. To compare Binary objects, ensure that you install v5.0.2 or later of the BSON library.

Note

BINARY Encoding

BSON::Binary objects always store the data in BINARY encoding, regardless of the encoding of the string passed to the constructor:

str = "binary_string"
str.encoding
# => #<Encoding:US-ASCII>
binary = BSON::Binary.new(str)
binary.data
# => "binary_string"
binary.data.encoding
# => #<Encoding:ASCII-8BIT>

To create a UUID BSON::Binary (binary subtype 4) from its RFC 4122-compliant string representation, use the from_uuid method:

uuid_str = "00112233-4455-6677-8899-aabbccddeeff"
BSON::Binary.from_uuid(uuid_str)
# => <BSON::Binary:0x46986653612880 type=uuid data=0x0011223344556677...>

To stringify a UUID BSON::Binary to an RFC 4122-compliant representation, use the to_uuid method:

binary = BSON::Binary.new("\x00\x11\x22\x33\x44\x55\x66\x77\x88\x99\xAA\xBB\xCC\xDD\xEE\xFF".force_encoding('BINARY'), :uuid)
# => <BSON::Binary:0x46942046606480 type=uuid data=0x0011223344556677...>
binary.to_uuid
# => "00112233-4455-6677-8899aabbccddeeff"

You can explicitly specify standard UUID representation in the from_uuid and to_uuid methods:

binary = BSON::Binary.from_uuid(uuid_str, :standard)
binary.to_uuid(:standard)

You can use the :standard representation only with a Binary value of subtype :uuid, not :uuid_old.

Data stored in BSON::Binary objects of subtype 3 (:uuid_old) can be persisted in one of three different byte orders depending on the driver that created the data. The byte orders are CSharp legacy, Java legacy, and Python legacy. The Python legacy byte order is the same as the standard RFC 4122 byte order. The CSharp legacy and Java legacy byte orders have some of the bytes in different locations.

The Binary object containing a legacy UUID does not encode which format the UUID is stored in. Therefore, methods that convert to and from the legacy UUID format take the desired format, or representation, as their argument. An application may copy legacy UUID Binary objects without knowing which byte order they store their data in.

The following methods for working with legacy UUIDs are provided for interoperability with existing deployments storing data in legacy UUID formats. In new applications, use the :uuid (subtype 4) format only, which is compliant with RFC 4122.

To stringify a legacy UUID BSON::Binary, use the to_uuid method and specify the desired representation. Accepted representations are :csharp_legacy, :java_legacy and :python_legacy. A legacy UUID BSON::Binary cannot be stringified without specifying a representation.

binary = BSON::Binary.new("\x00\x11\x22\x33\x44\x55\x66\x77\x88\x99\xAA\xBB\xCC\xDD\xEE\xFF".force_encoding('BINARY'), :uuid_old)
# => <BSON::Binary:0x46942046606480 type=uuid data=0x0011223344556677...>
binary.to_uuid
# => ArgumentError (Representation must be specified for BSON::Binary objects of type :uuid_old)
binary.to_uuid(:csharp_legacy)
# => "33221100-5544-7766-8899aabbccddeeff"
binary.to_uuid(:java_legacy)
# => "77665544-3322-1100-ffeeddccbbaa9988"
binary.to_uuid(:python_legacy)
# => "00112233-4455-6677-8899aabbccddeeff"

To create a legacy UUID BSON::Binary from the string representation of the UUID, use the from_uuid method and specify the desired representation:

uuid_str = "00112233-4455-6677-8899-aabbccddeeff"
BSON::Binary.from_uuid(uuid_str, :csharp_legacy)
# => <BSON::Binary:0x46986653650480 type=uuid_old data=0x3322110055447766...>
BSON::Binary.from_uuid(uuid_str, :java_legacy)
# => <BSON::Binary:0x46986653663960 type=uuid_old data=0x7766554433221100...>
BSON::Binary.from_uuid(uuid_str, :python_legacy)
# => <BSON::Binary:0x46986653686300 type=uuid_old data=0x0011223344556677...>

You can use these methods to convert from one representation to another:

BSON::Binary.from_uuid('77665544-3322-1100-ffeeddccbbaa9988',:java_legacy).to_uuid(:csharp_legacy)
# => "33221100-5544-7766-8899aabbccddeeff"

This type represents a string of JavaScript code:

BSON::Code.new("this.value = 5;")

This is a subclass of BSON::Document that provides accessors for the collection, identifier, and database of the DBRef.

BSON::DBRef.new({"$ref" => "collection", "$id" => "id"})
BSON::DBRef.new({"$ref" => "collection", "$id" => "id", "database" => "db"})

Note

The BSON::DBRef constructor validates the given hash and raises an ArgumentError if it is not a valid DBRef. The BSON::ExtJSON.parse_obj and Hash.from_bson methods do not raise an error if passed an invalid DBRef, and parse a Hash or deserialize a BSON::Document instead.

Note

All BSON documents are deserialized into instances of BSON::DBRef if they are valid DBRef instances, otherwise they are deserialized into instances of BSON::Document. This is true even when the invocation is made from the Hash class:

bson = {"$ref" => "collection", "$id" => "id"}.to_bson.to_s
loaded = Hash.from_bson(BSON::ByteBuffer.new(bson))
=> {"$ref"=>"collection", "$id"=>"id"}
loaded.class
=> BSON::DBRef

BSON::Document is a subclass of Hash that stores all keys as strings, but allows access to them by using symbol keys.

BSON::Document[:key, "value"]
BSON::Document.new

Note

All BSON documents are deserialized into instances of BSON::Document, or BSON::DBRef, if they are valid DBRef instances, even when the invocation is made from the Hash class:

bson = {test: 1}.to_bson.to_s
loaded = Hash.from_bson(BSON::ByteBuffer.new(bson))
# => {"test"=>1}
loaded.class
# => BSON::Document

BSON::MaxKey represents a value in BSON that always compares higher than any other value:

BSON::MaxKey.new

BSON::MinKey represents a value in BSON that always compares lower than any other value:

BSON::MinKey.new

BSON::ObjectId represents a 12 byte unique identifier for an object:

BSON::ObjectId.new

BSON::Timestamp represents a time with a start and increment value:

BSON::Timestamp.new(5, 30)

BSON::Undefined represents a placeholder for a value that is undefined:

BSON::Undefined.new

BSON::Decimal128 represents a 128-bit decimal-based floating-point value that can emulate decimal rounding with exact precision:

# Instantiate with a String
BSON::Decimal128.new("1.28")
# Instantiate with a BigDecimal
d = BigDecimal(1.28, 3)
BSON::Decimal128.new(d)

The BigDecimal#from_bson and BigDecimal#to_bson methods use the equivalent BSON::Decimal128 methods internally. This leads to some limitations on BigDecimal values that can be serialized to BSON and those that can be deserialized from existing decimal128 BSON values.

Serializing BigDecimal instances as BSON::Decimal128 instances allows for more flexibility when querying and performing aggregations in MongoDB. The following list describes the limitations on BigDecimal:

  • Decimal128 has a limited range and precision, while BigDecimal has no restrictions in terms of range and precision. Decimal128 has a max value of approximately 10^6145 and a min value of approximately -10^6145, and has a maximum of 34 bits of precision.

  • Decimal128 is able to accept signed NaN values, while BigDecimal is not. All signed NaN values that are deserialized into BigDecimal instances will be unsigned.

  • Decimal128 maintains trailing zeroes when serializing to and deserializing from BSON. BigDecimal, however, does not maintain trailing zeroes and therefore using BigDecimal may result in a lack of precision.

Note

In BSON library v5.0, Decimal128 is deserialized into BigDecimal by default. In order to have Decimal128 values in BSON documents deserialized into BSON::Decimal128, you can set the mode: :bson option when calling from_bson.

Some BSON types have special representations in JSON. The following table describes the serialization behavior for the specified types when you call to_json on them.

Ruby BSON Object
JSON Representation

BSON::Binary

{ "$binary" : "\x01", "$type" : "md5" }

BSON::Code

{ "$code" : "this.v = 5" }

BSON::CodeWithScope

{ "$code" : "this.v = value", "$scope" : { v => 5 }}

BSON::DBRef

{ "$ref" : "collection", "$id" : { "$oid" : "id" }, "$db" : "database" }

BSON::MaxKey

{ "$maxKey" : 1 }

BSON::MinKey

{ "$minKey" : 1 }

BSON::ObjectId

{ "$oid" : "4e4d66343b39b68407000001" }

BSON::Timestamp

{ "t" : 5, "i" : 30 }

Regexp

{ "$regex" : "[abc]", "$options" : "i" }

Times in Ruby have nanosecond precision. Times in BSON have millisecond precision. When you serialize Ruby Time instances to BSON or Extended JSON, the times are rounded to the nearest millisecond.

Note

Time values are rounded down. If the time precedes the Unix epoch (January 1, 1970 00:00:00 UTC), the absolute value of the time increases:

time = Time.utc(1960, 1, 1, 0, 0, 0, 999_999)
time.to_f
# => -315619199.000001
time.floor(3).to_f
# => -315619199.001

Because of this rounding behavior, we recommend that you perform all time calculations by using integer math, as inexactness of floating point calculations might produce unexpected results.

Note

JRuby 9.2.11.0 rounds pre-Unix epoch times up rather than down. To learn more about this behavior, see the related GitHub issue. The BSON library corrects this behavior and floors the times when serializing on JRuby.

BSON supports storing time values as the number of seconds since the Unix epoch. Ruby DateTime instances can be serialized to BSON, but when the BSON is deserialized the times will be returned as Time instances.

The DateTime class in Ruby supports non-Gregorian calendars. When non-Gregorian DateTime instances are serialized, they are first converted to Gregorian calendar, and the respective date in the Gregorian calendar is stored in the database.

BSON supports storing time values as the number of seconds since the Unix epoch. Ruby Date instances can be serialized to BSON, but when the BSON is deserialized the times will be returned as Time instances.

When Date instances are serialized, the time value used is midnight on the Date in UTC.

Both MongoDB and Ruby provide support for working with regular expressions, but they use regular expression engines. The following subsections detail the differences between Ruby regular expressions and MongoDB regular expressions.

MongoDB uses Perl-compatible regular expressions implemented by using the PCRE library. Ruby regular expressions are implemented by using the Onigmo regular expression engine, which is a fork of the Oniguruma library.

The two regular expression implementations generally provide equivalent functionality but have several important syntax differences, which are described in the following sections.

There is no simple way to programmatically convert a PCRE regular expression into the equivalent Ruby regular expression, as there are currently no Ruby bindings for PCRE.

Both Ruby and PCRE regular expressions support modifiers. These are also called "options" in Ruby contexts and "flags" in PCRE contexts. The meaning of s and m modifiers differs in Ruby and PCRE in the following ways:

  • Ruby does not have the s modifier. Instead, the Ruby m modifier performs the same function as the PCRE s modifier, which is to make the period (.) match any character including newlines. The Ruby documentation refers to the m modifier as enabling multi-line mode.

  • Ruby always operates in the equivalent of PCRE's multi-line mode, enabled by the m modifier in PCRE regular expressions. In Ruby the ^ anchor always refers to the beginning of line and the $ anchor always refers to the end of line.

When writing regular expressions intended to be used in both Ruby and PCRE environments, including MongoDB Server and most other MongoDB drivers, avoid using the ^ and $ anchors. The following sections provide workarounds and recommendations for authoring portable regular expressions that can be used in multiple contexts.

In Ruby regular expressions, the ^ anchor always refers to the beginning of line. In PCRE regular expressions, the ^ anchor refers to the beginning of input by default, and the m flag changes its meaning to the beginning of line.

Both Ruby and PCRE regular expressions support the \A anchor to refer to the beginning of input, regardless of modifiers. The following suggestions allow you to write portable regular expressions:

  • Use the \A anchor to refer to the beginning of input.

  • Use the ^ anchor to refer to the beginning of line if you set the m flag in PCRE regular expressions. Alternatively, use one of the following constructs which work regardless of modifiers:

    • (?:\A|(?<=\n)): handles LF and CR+LF line ends

    • (?:\A|(?<=[\r\n])): handles CR, LF and CR+LF line ends

In Ruby regular expressions, the $ anchor always refers to the end of line. In PCRE regular expressions, the $ anchor refers to the end of input by default and the m flag changes its meaning to the end of line.

Both Ruby and PCRE regular expressions support the \z anchor to refer to the end of input, regardless of modifiers.

The following suggestions allow you to write portable regular expressions:

  • Use the \z anchor to refer to the end of input.

  • Use the $ anchor to refer to the beginning of line if you set the m flag in PCRE regular expressions. Alternatively, use one of the following constructs which work regardless of modifiers:

    • (?:\z|(?=\n)): handles LF and CR+LF line ends

    • (?:\z|(?=[\n\n])): handles CR, LF and CR+LF line ends

Since there is no simple way to programmatically convert a PCRE regular expression into the equivalent Ruby regular expression, the BSON library provides the BSON::Regexp::Raw class for storing PCRE regular expressions.

You can create instances BSON::Regexp::Raw by using the regular expression text as a string and optional PCRE modifiers:

BSON::Regexp::Raw.new("^b403158")
# => #<BSON::Regexp::Raw:0x000055df63186d78 @pattern="^b403158", @options="">
BSON::Regexp::Raw.new("^Hello.world$", "s")
# => #<BSON::Regexp::Raw:0x000055df6317f028 @pattern="^Hello.world$", @options="s">

The BSON::Regexp module is included in the Ruby Regexp class, such that the BSON:: prefix can be omitted:

Regexp::Raw.new("^b403158")
# => #<BSON::Regexp::Raw:0x000055df63186d78 @pattern="^b403158", @options="">
Regexp::Raw.new("^Hello.world$", "s")
# => #<BSON::Regexp::Raw:0x000055df6317f028 @pattern="^Hello.world$", @options="s">

The following code converts a Ruby regular expression to a BSON::Regexp::Raw instance:

regexp = /^Hello.world/
bson_regexp = BSON::Regexp::Raw.new(regexp.source, regexp.options)
# => #<BSON::Regexp::Raw:0x000055df62e42d60 @pattern="^Hello.world", @options=0>

The BSON::Regexp::Raw constructor accepts both the Ruby numeric options and the PCRE modifier strings.

To convert a BSON regular expression to a Ruby regular expression, call the compile method on the BSON regular expression:

bson_regexp = BSON::Regexp::Raw.new("^hello.world", "s")
bson_regexp.compile
# => /^hello.world/m
bson_regexp = BSON::Regexp::Raw.new("^hello.world", "")
bson_regexp.compile
# => /^hello.world/
bson_regexp = BSON::Regexp::Raw.new("^hello.world", "m")
bson_regexp.compile
# => /^hello.world/

The s PCRE modifier was converted to the m Ruby modifier in the first example in the preceding code, and the last two examples were converted to the same regular expression even though the original BSON regular expressions had different meanings.

When a BSON regular expression uses the non-portable ^ and $ anchors, its conversion to a Ruby regular expression can change its meaning:

BSON::Regexp::Raw.new("^hello.world", "").compile =~ "42\nhello world"
# => 3

When a Ruby regular expression is converted to a BSON regular expression, for example as part of a query, the BSON regular expression always has the m modifier set, reflecting the behavior of ^ and $ anchors in Ruby regular expressions.

Both Ruby and BSON regular expressions implement the to_bson method for serializing to BSON:

regexp_ruby = /^b403158/
# => /^b403158/
regexp_ruby.to_bson
# => #<BSON::ByteBuffer:0x007fcf20ab8028>
_.to_s
# => "^b403158\x00m\x00"
regexp_raw = Regexp::Raw.new("^b403158")
# => #<BSON::Regexp::Raw:0x007fcf21808f98 @pattern="^b403158", @options="">
regexp_raw.to_bson
# => #<BSON::ByteBuffer:0x007fcf213622f0>
_.to_s
# => "^b403158\x00\x00"

Both Regexp and BSON::Regexp::Raw classes implement the from_bson class method that deserializes a regular expression from a BSON byte buffer. Methods of both classes return a BSON::Regexp::Raw instance that must be converted to a Ruby regular expression by using the compile method as described in the preceding code.

The following code demonstrates how to use the from_bson method to deserialize a regular expression:

byte_buffer = BSON::ByteBuffer.new("^b403158\x00\x00")
regex = Regexp.from_bson(byte_buffer)
# => #<BSON::Regexp::Raw:0x000055df63100d40 @pattern="^b403158", @options="">
regex.pattern
# => "^b403158"
regex.options
# => ""
regex.compile
# => /^b403158/

BSON documents preserve the order of keys, because documents are stored as lists of key-value pairs. Hashes in Ruby also preserve key order, so the order of keys specified in your application are preserved when you serialize a hash to a BSON document, and when you deserialize a BSON document into a hash.

The BSON specification allows BSON documents to have duplicate keys, because documents are stored as lists of key-value pairs. Avoid creating documents that contain duplicate keys, because MongoDB Server behavior is undefined when a BSON document contains duplicate keys.

In Ruby, hashes cannot have duplicate keys. When you serialize Ruby hashes to BSON documents, no duplicate keys are generated.

Because keys in BSON documents are always stored as strings, specifying the same key as as string and a symbol in Ruby retains only the most recent specification:

BSON::Document.new(test: 1, 'test' => 2)
# => {"test"=>2}

When loading a BSON document with duplicate keys, the last value for a duplicated key overwrites previous values for the same key.

Back

Time Series Data