MongoDB for the PHP Mind, Part 2

MongoDB

#Releases

This is part 2 of a series, with part 1 covering the bare essentials to get you going. In this post we are going to take a closer look at queries and how indexes work in MongoDB.

Introduction

I’d like to kick off this post with a thanks to the folks behind the PHP extension for MongoDB, who have done a fantastic job of matching the functionality of the Mongo shell client. This is important when you start to see how similarly the two function, and you might find that you can tweak your logic using the shell and quickly implement the same logic from within PHP.

The PHP extension supports something that is rather new to a lot of folks in the PHP world, a feature called method chaining: The ability to run several methods at the same time on one object. For example, you might want to run a query and then apply a limit to it. Most folks would think that this is two operations, and they are correct, however with method chaining, you can do both in one shot, like this:

$result = $songs->find()->limit(2);

Of course this works in the Mongo shell too. You would basically do the same thing:

result = db.songs.find().limit(2);

For more reading on method chaining, there’s an excellent blog post about method chaining in PHP 5.

Before we dig deep into finding and manipulating your data, let’s discuss the different data types that MongoDB supports.

MongoDB Data Types

All databases have their own data types, and MongoDB is no different. A summary of MongoDB’s available data types are as follows:

  • ObjectId: Also known within PHP as MongoId, is a unique object usually provided as a primary key by default in the property _id. It is 12 bytes long, and is automatically created by the database when you insert a document without an _id property set. You can also set your own values, but remember that this is used as the primary key and so must be unique.
  • String: Just like strings in PHP, however all strings must be UTF-8. You need to convert non-UTF-8 strings before inserting into your database.
  • Binary: Used for non-UTF-8 strings and other binary data.
  • Boolean: You can use the familiar TRUE and FALSE right from PHP.
  • Numbers: This is a bit more complex, based on whether you are running on a 32 or 64 bit system. On a 32 bit system, numbers are generally stored as 32 bit integers, and 64 bit systems default to 64 bit integers. You can read more detail on MongoInt32 and MongoInt64.
  • Dates: Known as the MongoDate class, which are based on milliseconds since the epoch.
  • Null values: You can assign NULL values from PHP as well.

A Special Word About MongoId

Thoughtful consideration needs to be given to the MongoId data type, as it is used as a primary key for most documents. It is recommended that you allow this feature to run automatically unless you have very specific needs and your own naturally unique primary keys.

A common mistake is the assumption by PHP engineers that MongoIds are strings. They are not. A MongoId is stored as an object. So if you are working with a document whose _id property is set as a MongoId instance with the value of 4cb4ab6d7addf98506010000, you will need to search for that document with an instance of the MongoId class with that value. For instance, imagine you are looking for the document with the previously mentioned MongoId as the _id property:

// This is only a string, this is NOT a MongoId
$mongoid = '4cb4ab6d7addf98506010000';

// You will not find anything by searching by string alone $nothing = $collection->find(array('_id' => $mongoid)); echo $nothing->count(); // This should echo 0

// THIS is how you find something by MongoId $realmongoid = new MongoId($mongoid);

// Pass the actual instance of the MongoId object to the query $something = $collection->find(array('_id' => $realmongoid)); echo $something->count(); // This should echo 1

Always keep this in mind when working with MongoDB. Types are important here, just like PostgreSQL, which will punish you if you attempt to join using columns with slightly different data types.

Another note on the previous example: I assigned the result of the find() to variables $nothing and $something, and then called methods on them. That is because the find() method returns a recordset called a MongoCursor, which provides its own methods. You can get a count on the number of documents returned by a query, as well as iterate through them, and even get an explain plan to see how the query is being executed.

Here is a very common question: So what should I store for _id values in other collections such as user_id or article_id? The solution is simple: Always use MongoId instances. I’ve made the mistake of storing _id values as strings in another collection, and was rewarded by having to always instantiate a new MongoId object for every query. Apt punishment for not thinking things all the way through.

Simply put, if you have a users collection where a given user has an _id property, and they need to store that same value in the posts collection as author_id, then make sure you save author_id as a MongoId object and not a string. Otherwise, every time you wish to display the details of an author to a post, you have to manually instantiate author_id as a MongoId object so you can find the user document by _id primary key.

Oh yeah, MongoDate

MongoDate is also stored as an object, as opposed to an integer or string. Like MongoIds, you need to treat MongoDate objects with additional care; however it’s then possible to do some neat things like find a document that has a MongoDate between 1971 and 1999 for instance:

// Instantiate dates for the range of the query
$start = new MongoDate(strtotime('1971-01-01 00:00:00'));
$end = new MongoDate(strtotime('1999-12-31 23:59:59'));

// Now find documents with create_date between 1971 and 1999 $collection->find(array("create_date" => array('$gt' => $start, '$lte' => $end)));

Queries from PHP

Now it is time to construct some more complex documents to demonstrate how you can find and manipulate your data in MongoDB. Let’s create several documents with a few properties, including a nested array, nested document and a variety of data types discussed earlier in this post. Note the deliberate difference between strings and numbers, as I spell out numbers as strings for simplicity. I’m using the shell to insert these documents quickly and easily, and suggest you follow along:

one =   {
  "string" : "This is not my beautiful house",
  "number" : 42,
  "boolean" : true,
  "list" : ["one", "two", "three"],
  "doc" : {"one" : 1, "two" : 2}
};
db.things.save(one);
two = {
  "string" : "This is not my beautiful wife",
  "number" : 666,
  "boolean" : false,
  "list" : [1, 2, 3],
  "doc" : {"1" : "one", "2" : "two"}
};
db.things.save(two);
three = {
  "string" : "Same as it ever was",
  "number" : 117,
  "boolean" : true,
  "list" : ["one", "two", "four"],
  "doc" : {"one" : 1, "four" : 4}
};
db.things.save(three);

You probably want to see how this went, so you can get a nicely formatted list of what is in your things collection thusly:

> db.things.find().pretty()
{
  "_id" : ObjectId("4fdc77f74e300a45bea9897a"),
  "string" : "This is not my beautiful house",
  "number" : 42,
  "boolean" : true,
  "list" : ["one", "two", "three"],
  "doc" : {"one" : 1, "two" : 2}
}
{
  "_id" : ObjectId("4fdc77f74e300a45bea9897b"),
  "string" : "This is not my beautiful wife",
  "number" : 666,
  "boolean" : false,
  "list" : [1, 2, 3],
  "doc" : {"1" : "one", "2" : "two"}
}
{
  "_id" : ObjectId("4fdc77f94e300a45bea9897c"),
  "string" : "Same as it ever was",
  "number" : 117,
  "boolean" : true,
  "list" : ["one", "two", "four"],
  "doc" : {"one" : 1, "four" : 4}
}

Notice that each of your new documents has an _id property. Chances are your _id values are different than mine, as they have been designed to be unique based on hardware, time and other aspects. This is greatly useful when you are running hundreds (thousands!) of MongoDB servers and need a single value to be unique across all of them.

You can now do some interesting things both from the shell and PHP. I’m hopping back to PHP as, um, this is a series on PHP…

// Connect to test database on localhost
$db = new Mongo('mongodb://localhost/test');

// Get the users collection $c_things = $db->things;

// Get a count of documents in the things collection $count_things = $c_things->count(); echo "There are $count_things documents in the things collection.\n";

// How many have the boolean property set to true? $count_things = $c_things->count(array('boolean' => true)); echo "There are $count_things true documents in the things collection.\n";

// How many have a string property set, regardless of value? $count_things = $c_things->count(array('string' => array('$exists' => true))); echo "There are $count_things documents with strings in the things collection.\n";

// How many have a list property with array values including "one" and "two"? $count_things = $c_things->count(array('list' => array('$in' => array('one','two')))); echo "There are $count_things documents with 'one' and 'two' as list array values in the things collection.\n";

// How many have a list property with array values not including 'three'? $count_things = $c_things->count(array('list' => array('$nin' => array('three')))); echo "There are $count_things documents not including the string 'three' in list array values in the things collection.\n";

// How many have include 'ever was' in the string property? Using a regular expression: $regex = new MongoRegex("/ever was/"); $count_things = $c_things->count(array('string' => $regex)); echo "There are $count_things documents including the string 'ever was' in string property in the things collection.\n";

This is what you should see when running this script on your machine:

$ php -f example.php 
There are 3 documents in the things collection.
There are 2 true documents in the things collection.
There are 3 documents with strings in the things collection.
There are 2 documents with 'one' and 'two' as list array values in the things collection.
There are 2 documents not including the string 'three' in list array values in the things collection.
There are 1 documents including the string 'ever was' in string property in the things collection.

Most importantly, notice that we searched on embedded values, including an array and an embedded object. We were able to search by the existence of a property, values set for a property, and pass an array to see if any of those values were set in an embedded array.

That last example illustrates how MongoDB can search with regular expressions - which are case sensitive by default. There are a great many more query options available, with explanation for advanced queries being your best start.

Returning Documents to PHP

So far we’ve stuck to the command line and simple counts as our results. Now we will take a look at how MongoDB returns documents to your PHP applications. The last item in the previous example used a regular expression, but returned just the count. What if you wanted the document instead?

// Find a document that includes 'ever was' in the string property using a regular expression:
$regex = new MongoRegex("/ever was/");
$ever_was = $c_things->findOne(array('string' => $regex));
var_dump($ever_was);

Running this script should look like this, which will probably look very familiar to many of you who have been working in PHP with other databases:

$ php -f example.php 
array(6) {
  '_id' =>
    class MongoId#7 (1) {
      public $$id =>
      string(24) "4fdc77f94e300a45bea9897c"
    }
  'string' =>
    string(47) "Same as it ever was"
  'number' =>
    double(117)
  'boolean' =>
    bool(true)
  'list' =>
    array(3) {
      [0] =>
        string(3) "one"
      [1] =>
        string(3) "two"
      [2] =>
        string(4) "four"
    }
  'doc' =>
    array(2) {
      'one' =>
        double(1)
      'four' =>
        double(4)
    }
}

The result was an array, including all the elements of the document returned by the query. By calling findOne() we ensured only one document would be returned, and for multiple documents you could shorten this to just find() and iterate over the results like your ordinary database query.

Indexing your MongoDB Data

This section is not PHP specific, but critical if you want your PHP apps to perform adequately.

Indexes in MongoDB are similar to what you are familiar with for other databases. When you reach a certain number of documents (and data set size) indexes will become necessary to ensure your queries execute fast and efficiently. Being a document database, however, means that you can index array values and even embedded objects.

Creating an index is simple, as the following shell example illustrates:

> db.things.ensureIndex({"string":1});
> db.things.ensureIndex({"number":1});
> db.things.ensureIndex({"boolean":1});

We just plopped indexes on the string, number and boolean properties for all documents in this collection. What is interesting about this is that there could be documents in this collection that do not have any of these properties set. With a relational database you would be forced to allow NULL values for those columns, which is not always accurate.

What about those nested arrays and objects? We can index the properties at the top level, and we can even index an embedded property if we wanted to. Look at the following example:

> db.things.ensureIndex({"list":1});
> db.things.ensureIndex({"docs.two":1});

We just indexed a property that has an embedded array. What this means is that MongoDB is smart enough to figure out how to provide index values for each element in that array, per document. So if you search for all documents that have the string one in their list property, MongoDB will still use the index. Type conversion is handled as well, so you can use the same index to search for all documents that have the number 2 instead. MongoDB refers to indexed arrays as multikeys.

The second example demonstrates one of the many powers of BSON: reaching inside a document for embedded information. We just created an index on the docs property, which applies to all documents, including those that do not have that property set.

This can be extremely useful when embedding arrays in your documents. For example, I have an application where I have multiple third parties that have users with special privileges only for their applications that are running within the main website. I can now store an embedded object called partners and store each partner name and a value based on their access levels for their own applications. All of this can live happily in the users collection, making maintenance and reporting a breeze!

But what about compound indexes? If you are doing a ton of queries based on the values of two properties, you can create a single index that includes both:

> db.things.ensureIndex({"string":1,"boolean":1});

Of course, you can search on the first property and still use the index, so you don’t have to create separate indexes for each property in the compound index. That said, this process only works left-to-right, meaning that using the above index, you can search on string, string and boolean; but if you search solely on boolean you will not use the index.

You’re probably wondering what the numbers are behind the properties in the index creation statements. Those numbers (1 and -1) tell MongoDB whether this is an ascending or descending index, respectively. Note that index order is irrelevant for single-key indexes, and mainly comes into play with operations like sorting.

Let’s take a quick look at what we’ve done to the things collection today, using the shell:

>db.things.stats()
{
  "ns" : "test.things",
  "count" : 3,
  "size" : 656,
  "avgObjSize" : 218.66666666666666,
  "storageSize" : 12288,
  "numExtents" : 1,
  "nindexes" : 7,
  "lastExtentSize" : 12288,
  "paddingFactor" : 1,
  "flags" : 1,
  "totalIndexSize" : 57232,
  "indexSizes" : {
    "_id_" : 8176,
    "string_1" : 8176,
    "number_1" : 8176,
    "boolean_1" : 8176,
    "list_1" : 8176,
    "docs.two_1" : 8176,
    "string_1_boolean_1" : 8176
  },
  "ok" : 1
}

This is the collStats feature in the shell, which gives you statistics about the mentioned collection. This can be useful if you are experiencing unexpected behavior with your collections or indexes.

A new index feature in MongoDB is the sparse index. Imagine having a users collection with around 300 million documents, with only 35 of them also having a specific property set. Do you really want to have an index that includes entries for all 300 million documents when searching for those with that property? A sparse index basically only includes documents that have that indexed property. So if you only have 35 documents in your users collection with that property set, you will want to use a sparse index.

One final word on indexes in MongoDB: You can do many more things like dropping duplicates, unique indexes, and indexing geospatial data. A great place to read in detail is indexing advice and FAQ where a lot of common questions are answered.

Outro, or What’s Coming Next

There are a few more posts coming in this series, including detailed coverage on document data design, a comparison between ODM/ORM/driver approaches and frameworks, advanced queries and how they relate to PHP, taking advantage of map reduce, and a few sample applications demonstrating a few common use cases that I’ll be sharing on GitHub. It is safe to say that there is a lot more coming for the PHP universe on this blog!