Narrative Science Quill™ and MongoDB: A Match Made In Heaven



This is a guest post by Craig Booth and Katy De Leon of Narrative Science

Big Problems with Big Data

Billions of dollars have been invested in Big Data, producing innumerable business intelligence and visualization tools. No amount of data is too large. No combination of algorithms is too many. No visualization is too complex. But no spreadsheet or dashboard can tell you what you need to know to improve your business. That you must figure out for yourself, or rely on someone else to figure out for you.

At Narrative Science, we approach data in a different way. We find the insight hidden in data and communicate it in a form that makes natural sense to everyone: a written narrative. Whether it’s an earnings report, a system performance review or a product description, narratives leave you with nothing left to decipher. All you have to do is read.

Since Big Data is vital to what we do, we require a data storage solution that is reliable, high-performing, scalable, flexible and expressive enough to capture the structure and hierarchy necessary for generating plain English. MongoDB meets all of these criteria and has been the backend of our entire infrastructure since 2010. Over the years, our MongoDB replica sets have remained available even during region-wide outages in the Amazon cloud computing centers. We also use our MongoDB deployments as a job queuing system and as a store for temporary data for our web services (using TTL indexes).

MongoDB also offers us schema flexibility, a critical feature since we are constantly adding new data from a diverse range of industries and clients. Changes can be made iteratively and instantaneously, a huge advantage for us over a traditional SQL database. As our data requirements and client demands have grown ever more complex, MongoDB has grown with them, providing access to the data structures necessary to generate natural language directly from data.

Quill Automatically Explains MongoDB Usage

Quill™ is our automated narrative generation platform, powered by Artificial Intelligence. Quill integrates structured data from disparate sources, understands what is important to the end user and then generates perfectly written narratives that convey meaningful information to any intended audience. In essence, we do what data visualizations cannot. Although charts and dashboards look appealing, they still require people to explain them. Quill, on the other hand, adds value to data by identifying relevant data points and relaying them through professional, conversational language that people can immediately comprehend, act on and trust.

In order to diagnose potential performance and scaling issues before they affect us or our clients, we decided to put Quill on the task of monitoring our MongoDB usage. Here is a snapshot from a recent Quill-generated report explaining the weekly performance of one of our environments:

Data Use

This report, generated on July 25, 2014, compares today’s MongoDB usage to that from one week ago. Between last week and today:

  • The database “prod_ns4” has grown the most in the past week, increasing by 4.07 Gb. The collection inside of “prod_ns4” that drove most of this growth was “metadata”, which now contains 531,126 documents.
  • The total data size increased from 354.79 Gb to 365.31 Gb.
  • The total number of collections remained constant at 8407.
  • The number of documents has increased from 119.4 million to 121.1 million.
  • The following databases are the largest of those in your MongoDB instance: “prod_pub_service” (with 196 collections, containing a total of 18.4 million objects in 128 Gb) and “prod_harpoon” (with 25 collections, containing a total of 9.9 million objects in 44.54 Gb).
  • The following databases were dropped: “prod_assignment_service” and “stg_sla_monitor”.
  • No new databases were created in the past week.
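The comparisons in a report like this can be derived from two snapshots of per-database statistics. The following is a minimal sketch, not Quill's actual implementation: it assumes each snapshot is a dictionary keyed by database name whose values carry a `dataSize` field in bytes (similar to what MongoDB's `dbStats` command returns), and the function names and sample figures are hypothetical.

```python
GB = 1024 ** 3


def quote_list(names):
    """Render names as '"a"', '"a" and "b"', or '"a", "b" and "c"'."""
    quoted = [f'"{name}"' for name in names]
    if len(quoted) == 1:
        return quoted[0]
    return ", ".join(quoted[:-1]) + " and " + quoted[-1]


def describe_growth(last_week, today):
    """Compare two {db_name: stats} snapshots and return English sentences."""
    sentences = []

    # Growth can only be measured for databases present in both snapshots.
    common = set(last_week) & set(today)
    if common:
        def growth_of(db):
            return today[db]["dataSize"] - last_week[db]["dataSize"]

        fastest = max(common, key=growth_of)
        if growth_of(fastest) > 0:
            sentences.append(
                f'The database "{fastest}" has grown the most in the past '
                f"week, increasing by {growth_of(fastest) / GB:.2f} GB."
            )

    # Total data size across all databases in each snapshot.
    old_total = sum(s["dataSize"] for s in last_week.values()) / GB
    new_total = sum(s["dataSize"] for s in today.values()) / GB
    if new_total == old_total:
        sentences.append(
            f"The total data size remained constant at {new_total:.2f} GB."
        )
    else:
        verb = "increased" if new_total > old_total else "decreased"
        sentences.append(
            f"The total data size {verb} from {old_total:.2f} GB "
            f"to {new_total:.2f} GB."
        )

    # Databases that disappeared or appeared between the snapshots.
    dropped = sorted(set(last_week) - set(today))
    created = sorted(set(today) - set(last_week))
    if dropped:
        sentences.append(
            f"The following databases were dropped: {quote_list(dropped)}."
        )
    if created:
        sentences.append(
            f"The following databases were created: {quote_list(created)}."
        )
    else:
        sentences.append("No new databases were created in the past week.")
    return sentences


if __name__ == "__main__":
    # Hypothetical snapshots; real ones would be collected a week apart.
    last_week = {"prod_a": {"dataSize": 10 * GB}, "stg_old": {"dataSize": GB}}
    today = {"prod_a": {"dataSize": 12 * GB}}
    for sentence in describe_growth(last_week, today):
        print("  • " + sentence)
```

Keeping the comparison logic separate from the data collection means the same function can be run daily or weekly against whatever snapshots are stored.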

Slow Queries

The three collections that accounted for the largest number of slow queries this week were ‘stg_finance.qm’, ‘stg_cft.tweets’, and ‘stg_statprovider.latest’ with 8675, 1405 and 601 slow queries, respectively.

72% of the total number of slow queries were on the collection ‘stg_finance.qm’. Consider adding indexes. The slowest queries were:

  • Wed Jul 16 05:01:48 [conn61548] query stg_finance.qm query: { $query: { ticker: “LTG”, exchange: “NYSE” }, $orderby: { date: -1 } } ntoreturn:1 ntoskip:0 nscanned:643 scanAndOrder:1 keyUpdates:0 numYields: 20 locks(micros) r:9929394 nreturned:1 reslen:110 7102ms
  • Wed Jul 16 04:53:55 [conn61548] query stg_finance.qm query: { $query: { ticker: “ATR”, exchange: “NYSE” }, $orderby: { date: -1 } } ntoreturn:1 ntoskip:0 nscanned:643 scanAndOrder:1 keyUpdates:0 numYields: 18 locks(micros) r:13086372 nreturned:1 reslen:110 7050ms
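To illustrate how slow-query figures like these could be tallied, here is a small sketch that parses log lines in the format shown above. It is an assumption about one possible approach, not a description of how Quill works: the regular expression covers only this `query` line shape, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   Wed Jul 16 04:53:55 [conn61548] query stg_finance.qm query: {...} ... 7050ms
SLOW_QUERY = re.compile(
    r"\[conn\d+\] query (?P<ns>\S+) query: .* (?P<ms>\d+)ms$"
)


def summarize_slow_queries(lines, top_n=2):
    """Return (slow-query count per namespace, the top_n slowest entries)."""
    counts = Counter()
    timed = []
    for line in lines:
        line = line.strip()
        match = SLOW_QUERY.search(line)
        if match is None:
            continue  # not a slow-query line in the expected format
        counts[match["ns"]] += 1
        timed.append((int(match["ms"]), line))
    timed.sort(key=lambda pair: pair[0], reverse=True)
    return counts, timed[:top_n]


def worst_offender_sentence(counts):
    """Describe the namespace responsible for the most slow queries."""
    namespace, hits = counts.most_common(1)[0]
    share = hits / sum(counts.values())
    return (
        f"{share:.0%} of the total number of slow queries were on the "
        f"collection '{namespace}'. Consider adding indexes."
    )


if __name__ == "__main__":
    # Sample lines adapted from the report above (straight quotes used).
    sample = [
        'Wed Jul 16 05:01:48 [conn61548] query stg_finance.qm query: '
        '{ $query: { ticker: "LTG", exchange: "NYSE" }, $orderby: { date: -1 } } '
        'ntoreturn:1 ntoskip:0 nscanned:643 scanAndOrder:1 keyUpdates:0 '
        'numYields: 20 locks(micros) r:9929394 nreturned:1 reslen:110 7102ms',
        'Wed Jul 16 04:53:55 [conn61548] query stg_finance.qm query: '
        '{ $query: { ticker: "ATR", exchange: "NYSE" }, $orderby: { date: -1 } } '
        'ntoreturn:1 ntoskip:0 nscanned:643 scanAndOrder:1 keyUpdates:0 '
        'numYields: 18 locks(micros) r:13086372 nreturned:1 reslen:110 7050ms',
    ]
    counts, slowest = summarize_slow_queries(sample)
    print(worst_offender_sentence(counts))
    for ms, line in slowest:
        print(ms, line[:60])
```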

Better Transparency and Decision-Making

This machine-generated explanation distills thousands of numbers, describing more than 5TB of data, into an easily digestible, relevant, and informative story that is sent to our engineering team every day and every week. It allows them to zero in on only those issues that are important and ignore the information that’s irrelevant to performance issues. And, unlike what you get from other monitoring tools, even a non-techy type can immediately understand the report. It gives our team what we need, when we need it: actionable insights specific to the performance of our MongoDB deployment.

These reports have already allowed us to make performance tweaks to our deployments. For example, when inefficient MongoDB queries make it into production code, Quill immediately flags them so we can place indexes in the relevant places. It also makes our lives easier by providing transparency into which projects are hogging data space so we can make the relevant collections smaller, either by making different choices about what data we store, or adding TTL indexes that allow irrelevant data to expire.
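As a rough illustration of the two fixes described here, the sketch below creates a compound index shaped like the slow queries in the report (equality on `ticker` and `exchange`, descending sort on `date`) and a TTL index that lets old documents expire. It assumes a pymongo-style collection exposing `create_index`; the `created_at` field, the retention period, and the `RecordingCollection` stand-in (used so the example runs without a database server) are all hypothetical.

```python
# Key order follows the query shape: equality fields first, then the sort key.
QUERY_INDEX = [("ticker", 1), ("exchange", 1), ("date", -1)]


def add_query_index(collection):
    """Index supporting {ticker, exchange} lookups sorted by date descending."""
    return collection.create_index(QUERY_INDEX)


def add_ttl_index(collection, field="created_at", seconds=7 * 24 * 3600):
    """TTL index: documents expire `seconds` after the date stored in `field`.

    The indexed field must hold date values for expiry to take effect.
    """
    return collection.create_index([(field, 1)], expireAfterSeconds=seconds)


class RecordingCollection:
    """Stand-in for a real collection; records calls instead of contacting
    a server, so this sketch can run anywhere."""

    def __init__(self):
        self.calls = []

    def create_index(self, keys, **options):
        self.calls.append((keys, options))
        # Mimic the default "<field>_<direction>" index naming.
        return "_".join(f"{name}_{direction}" for name, direction in keys)


if __name__ == "__main__":
    collection = RecordingCollection()
    print(add_query_index(collection))
    print(add_ttl_index(collection))
    print(collection.calls)
```

With a real deployment, the same two functions would be passed an actual collection object in place of the stand-in.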

MongoDB is a powerful tool for dealing with the deeply nested documents that are needed for Natural Language Generation (NLG). It is a tool that is essential to Quill’s performance. At the same time, natural language reporting from Quill provides us with a powerful way to continually understand and improve our MongoDB usage.

Want to learn how Quill can immediately start adding value to your business? We’ve written a white paper that explains how Quill works and outlines examples of how companies are using it today. You can also visit our website, or contact us directly to see a demonstration.