Building MongoDB Applications with Binary Files Using GridFS: Part 2

< View all blog posts
Jay Runkel
December 05, 2014
Category: Technical

How GridFS Works

In part 1 we looked at some of the use cases that are appropriate for GridFS. In this post we take a closer look at how GridFS works and how you can use it in your apps.

GridFS stores large binary files by breaking the files into smaller files called “chunks” and saving them in MongoDB. It essentially saves you, the application developer, from having to write all the code to break large files up into chunks, saving all the individual chunks into MongoDB, and then, when retrieving the files, combining all the chunks back together. GridFS gives you all this functionality for free.

The way GridFS works is shown in Figure 1. GridFS uses two collections to save a file to a database: fs.files and fs.chunks. (The default prefix is “fs”, but you can rename it.) The fs.chunks collection contains the binary file broken up into 255k chunks. The fs.files collection contains the metadata for the document.

GridFS Structure
Figure 1 - GridFS Structure

An Example

Let’s work through an example to see how this works. I wrote a small Perl program to load a set of videos my wife had taken on her iPhone of my son’s baseball games and my daughter’s gymnastics meets. Figure 2 shows the core of the program (you can download the entire program from github). The program performs the following steps:

  • Establishes a connection to MongoDB and creates database object for the “digital” database into which we want to load the files
  • Creates a gridFS object for the database to gain access to the gridFS API
  • Reads the Movies directory on my laptop and finds all the .MOV files
  • Each .MOV file is inserted into MongoDB with three metadata fields: filename, content-type, and author.
my $movieDir = "/Users/jayrunkel/Movies";
my $client = MongoDB::MongoClient->new(host => 'localhost', 
                                       port => 27017);
my $database = $client->get_database( 'digital' );
my $grid = $database->get_gridfs;
my @vidFiles = getFiles($movieDir);

$grid->drop();
foreach my $file (@vidFiles) {
    my $fh = IO::File->new("$movieDir/$file", "r");
    $grid->insert($fh, {"filename" => $file,
    "content-type" => "video/quicktime", 
    "author" => "deb"});
}
Figure 2 - Loading binary files using GridFS

As the files are inserted into the database, MongoDB calculates the MD5 checksum of the file and compares that to the checksum calculated by the file system. If they don’t match an error condition occurs. My program will simply die, but it could handle the error and do something more intelligent.

Going to the mongoshell and looking at the digital database shows that there are two collections: fs.chunks and fs.files, just as we expect.

> use digital
> show collections
fs.chunks
fs.files
system.indexes

Let’s first look at the fs.files collection (Figure 3). This collection contains one document for each of the MOV files we loaded into MongoDB. Each document contains the metadata for the corresponding file. This metadata consists of the GridFS defined metadata fields (length, chunksize, and uploadDate) plus the metadata fields I supplied when I inserted the MOV files: content-type, author, and filename. The length field is the size of the document in bytes, the chunk field is the size of each chunk in bytes, and the uploadDate is the date the document was first inserted into MongoDB.

You can define any metadata fields as appropriate and due to the flexibility of MongoDB, each document can have the specific set of fields that make sense. There isn’t a requirement that all files have the same set of metadata fields and you can feel free to add or modify fields as your application evolves.

{
  "_id": ObjectId("546fb304f9781507293f9891"),
  "length": NumberLong("834284"),
  "content-type": "video/quicktime",
  "uploadDate": ISODate("2014-11-21T21:47:48Z"),
  "chunkSize": NumberLong("261120"),
  "filename": "IMG_0425.MOV",
  "author": "deb"
}
Figure 3 - fs.files collection

Each document in the fs.files collection is associated with a set of documents in the fs.chunks collection (Figure 4). The fs.chunks documents are numbered from 0 to n (see the “n” field) with binary data for the chunk stored in the “data” field. The fs.files document is related to its corresponding chunk documents in the fs.chunks collection by the files_id field. The files_id field is the _id of the corresponding document in the fs.files collection. Here are the first two chunks associated with the document whose ObjectId is 546fb304f9781507293f9896.

{
  "_id": ObjectId("546fb304f9781507293f98a6"),
  "files_id": ObjectId("546fb304f9781507293f9896"),
  "data": BinData(0, "+QWU+Pwf1u03d………..."),
  "n": NumberLong("0")
}
{
  "_id": ObjectId("546fb304f9781507293f9898"),
  "files_id": ObjectId("546fb304f9781507293f9896"),
  "data": BinData(0, "EgKwOoOAOYEwFoi…."), 
  "n": NumberLong("1")
}
...
Figure 4 - fs.chunks collection

Because the goal of the GridFS API is to make it easy and efficient to retrieve and reassemble the binary document, GridFS will automatically create a compound index on the fs.chunks collection on the files_id and n fields. This will enable MongoDB to quickly identify all the chunks associated with a particular file, sort them from 0 to n using the n field, and return them to the driver so that they can be assembled back into the original document. To demonstrate this, here is the output of running getIndexes on the fs.chunks collection. As you see the Perl driver automatically created the compound index when I inserted the first document into my database.

> db.fs.chunks.getIndexes()
[
  {
    "v": 1,
    "key": {
      "_id": 1
    },
    "name": "_id_",
    "ns": "digital.fs.chunks"
  },
  {
    "v": 1,
    "unique": true,
    "key": {
      "files_id": NumberLong("1"),
      "n": NumberLong("1")
    },
    "name": "files_id_1_n_1",
    "ns": "digital.fs.chunks"
  }
]
Figure 5 - fs.chunks indexes

How do you use the GridFS API to build an application

Using the GridFS API to build an application is simple. GridFS does all the hard work. All you need to do is insert your binary documents into MongoDB using the GridFS API. When you insert the documents, you can provide the document metadata and you can update this metadata during the life of the document in your application using the same update queries you use to update other documents in MongoDB.

Retrieving and deleting documents is simple as well. The GridFS API provides methods for retrieving a document, as a whole or in parts, which can be very useful for “skipping” to the middle of an audio or video file. This function combines the chunk files to recreate the original file and returns a file object to your application. The delete method removes the documents from the fs.file and fs.chunks collection so that the original document has been removed from MongoDB.

That is all there is to it. Use the GridFS API to insert, retrieve, and remove binary documents to and from MongoDB. GridFS will do all complex work leaving you to focus on the rest of your application. And if you’re interested in learning more about the architecture of MongoDB, download our guide:

Download the Architecture Guide

About Jay Runkel

Jay Runkel is a Solutions Architect Manager at MongoDB and has been working with Fortune 500 companies to architect enterprise solutions using NoSQL document databases for over five years. Before MongoDB, Runkel was a Principal Technologist at MarkLogic where he worked with Financial Service, Medical, and Media organizations to develop operational systems for analytics and custom publishing. Runkel has also recently been a Sales Engineering Manager at Venafi where he he assisted large financial institutions, retailers, health care and insurance organizations improve security by securing, protecting, and managing their encryption assets. Runkel has also held various positions developing automated underwriting, product information management, and CRM solutions. Runkel has a BS in Applied Mathematics from Carnegie Mellon and a Masters in Computer Science from the University of Michigan.

comments powered by Disqus