Text Analysis Using MongoDB

Aaron Cordova, The Interllective Inc.

October 3 2011


In this talk we present our experiences building a text analysis system on top of MongoDB. The Interllective, is building research tools to discover... conceptually related documents in technical collections (patents, journal articles, etc.). We use sharded MongoDB to scale out our document collections. We discuss experiences performing text clustering, with collection design and indexing strategies for using Hadoop MapReduce over MongoDB. We also discuss building a real-time distributed relevance scoring engine on top of MongoDB. And since our data is too large to fit into memory, we address tradeoffs to consider when scaling using a disk based approach.