How to Perform Random Queries on MongoDB

Norberto Leite


As part of my day-to-day tasks I occasionally run random queries on datasets. I could need these queries to get a nice example of a random set or to put together a list of MUG members for a quick raffle or swag winner.

MUG member holding prize winning raffle

To pull these queries using MongoDB (obviously I'm going to use MongoDB for the task!) we can apply a few different approaches.

The traditional approach

Not long ago, to raffle a few conference tickets at one of the MongoDB User Groups, I invited all members to come up with an implementation on how to get a random document from MongoDB.

I was hoping for highly efficient approaches that would involve changing the MongoDB kernel and implementing a new feature in the database, but that’s not what our MUG members came up with. All proposed algorithms were instead client based, meaning that all the randomization phase of the algorithm was performed on the client side.

The algorithms would be based on the client random libraries and would consist of the following approach:

  • Get data from MongoDB
  • Run it through a random library
  • Spill out a winner

A simple example for this type of approach can be the following:

def load_data(collection, n=100):
    for i,d in load_data_file(n):
        collection.insert( d )
<p>def get_data(collection, query={}):
for d in collection.find(query):
yield d</p>
<h1>Load the entire contents of the collection in memory.</h1>
<p>elements = []
for e in get_data(collection):
<p>idx = random.randint(0, len(elements))</p>
<p>print "AND THE WINNER IS ..... " + elements[idx]['name']

We could adjust the above code to avoid loading the entire contents of the collection into memory at once by using reservoir sampling. Reservoir sampling is generally equivalent to the process described above, except that we sample the winners immediately from the incoming MongoDB result set, as a stream, so that we don't need to keep all documents in memory at once. The algorithm works by replacing the current winner with decreasing probability such that all elements--without even knowing how many there are beforehand--have an equal probability of being selected. This is actually how mongodb-js/collection-sample works when communicating with older versions of the server that don't support the new $sample operator.

Adding a bit of salt ...

Although the previous approach can be slightly improved by adding filters and indexes, they are client bound for choosing a random document. If we have 1M records (MongoDB User groups are quite popular!), then iterating over that many documents becomes a burden on the application and we are not really using any MongoDB magic to help us complete our task.

#### Adding an incremental value

A typical approach would be to mark our documents with a value that we would then use to randomly pick elements.

We can start by setting an incremental value to our documents and then query based on the range of values that we recently marked the documents with.

def load_data(collection, n=100):
    #for each element we will insert the `i` value
    for i in xrange(n):
        name = ''.join(random.sample( string.letters, 20))
        collection.insert( {'name': name, 'i': i})

After tattooing our documents with this new element we are now able to use some of the magic of a MongoDB query language to start collecting our well deserved prize winner.

mc = MongoClient()
db = mc.simplerandom
collection = db.names

number_of_documents = 100

load_data(collection, number_of_documents )

query = {'i': random.randint(0, number_of_documents )  }

winner = collection.find_one(query);

print "AND THE WINNER IS ..... " + winner['name']

While this approach seems fine and dandy there are a few problems here:

  • we need to know the number of documents inserted
  • we need to make sure that all data is available

Double trip to mother ship

To attend the first concern we need to know the number of documents inserted, which we can count with the MongoDB count operator.

number_of_documents = collection.count()

This operator will immediately tell us the number of elements that a given collection contains

But it does not solve all of our problems. Another thread might be working concurrently and deleting or updating your documents.

We need to know the higher value of i since that might differ from the count of documents in the collection and we need to account for any eventual skip of the value of i (deleted document, incorrect client increment).

For accomplishing this in a truly correct way we would need to use distinct to make sure we are not missing any values and then querying for a document that would contain such a value.

def load_data(collection, n=100):
    #let's skip some elements
    skiplist = [10, 12, 231 , 2 , 4]
    for i,d in load_data_file(n):
        d['i'] = i
        if i in skiplist:
        collection.insert( d )

load_data(collection, 100)

distinct = collection.distinct('i')

ivalue = random.sample(distinct, 1)

winner = collection.find_one({ 'i': ivalue })

print "AND THE WINNER IS ..... " + winner['name']

Although we are starting to use MongoDB magic to give us some element of randomness this is not truly a good solution.

  • Requires multiple trips to the database
  • Becomes computationally bound on the client (again)
  • Very prone to errors while data is being used

Making more of the same tatoo

To avoid a large computational bound on the client due to the high number of distinct i values, we could use a limited amount of i elements to mark our documents and randomize the occurrence of this mark:

def load_data(collection, n=100):
    #fixed number of marks
    max_i = 10
    for j,d in load_data_file(n):
        d['i'] = random.randint(0, max_i)
        collection.insert( d )

This way we are limiting the variation of the i value and limiting the computational task on both the client and on MongoDB:

number_of_documents = 100

load_data(collection, number_of_documents )

query = {'i': random.randint(0, 10 )  }

docs = [x for x in collection.find(query)]

winner = random.sample(docs, 1)[0]

print "AND THE WINNER IS ..... " + winner['name']

... but then again, we have a better solution!

The above implementations, no matter how magical they might be, more or less simplistic, multiple or single trips to the database ... they tend to:

  • Be inefficient
  • Create artificial workarounds on data
  • Are not natively / pure MongoDB implementations
  • And the random implementation is always bound to the client

But fear not! Our 3.2 release brings a solution to this simple wished task: $sample

$sample is a new aggregation framework operator that implements a native random sample operation over a collection data set:

  • no more playing with extra fields and extra indexes
  • no more double trips to the database to get your documents
  • native, optimized, sampling operator implemented in the database:
number_of_documents = 100
<p>load_data(collection, number_of_documents)</p>
<p>winner = [ d for d in collection.aggregate([{'$sample': {'size': 1 }}])][0]</p>
<p>print "AND THE WINNER IS ..... " + winner['name']

Just beautiful!

Learn more about the latest features included in the release of MongoDB 3.2.

What's new in MongoDB 3.2

About the Author - Norberto Leite

Norberto is a Software Engineer at MongoDB working on the content and materials that make up the MongoDB University curriculum. Prior to this role, Norberto served as Developer Advocate and Solutions Architect helping professionals develop their skills and understanding of MongoDB. Norberto has several years of software development experience in large scalable systems.