Leaf in the Wild: Hekima Unlocks Social Media Analytics with Cloud Manager, Hadoop and Apache Spark

Mat Keep


Leaf in the Wild posts highlight real world MongoDB deployments. Read other stories about how companies are using MongoDB for their mission-critical projects.

Everyone from the C-suite to the factory floor understands that social media presents their organizations with additional channels to engage communities and customers. But the smartest companies also recognize that they can outmaneuver competitors by analyzing social data streams, so they can, for example, monitor, respond and influence customer sentiment in real time. They can better engage key influencers, implement and optimize promotions, and forecast trends.

Hekima is at the forefront of enabling one of the world’s fastest growing markets to ride the wave of social media, building its service on a cutting edge big data pipeline running in the cloud. I met with Thiago Cardoso, co-founder and CTO of Hekima to learn more.

Can you start by telling us a little bit about your company?

Hekima is a technology startup focused on using data mining and machine learning to help our customers analyze and extract relevant information from data. Our goal is to automate the analysis process and help extract insights as efficiently and quickly as possible. These insights can come from different data sources, enabling our customers to take truly data-driven decisions. Our customers include the Federal Government of Brazil, Caixa and Grupo Abril.

One of our services, Zahpee Monitor, helps our customers track conversations about their brands and products, identify key influencers and monitor sentiment. Zahpee Monitor generates both real time and historical analytics, powered by a big data pipeline built on MongoDB, Hadoop and Apache Spark.

Please describe how you are using MongoDB.

The Zahpee Monitor service is built around our social media monitoring tool, which processes and extracts insights from huge volumes of social media streams. Understanding what is being said about a given topic can be trivial if the number of posts is small. Achieving a conclusion is a matter of reading all the associated content. Extracting valuable insight becomes a challenge when you start looking to broader searches like: “what is being said about health insurance companies?” and “what are people interested in when talking about cars?”. The volume of available content makes it impractical to manually process and read all this data, forcing companies to resort to sampling strategies and analysis.

Zahpee Monitor provides the necessary tools to help analysts tame this complexity by automatically contextualizing and classifying data. In a matter of hours, a single marketeer can understand the different topics being discussed, along with how the information flows in millions of posts. All these characteristics make Zahpee Monitor well suited for popular B2C companies, research institutes and governments.

MongoDB is our main database. It is used to store all social media content, and perform real time analytics against the data. For example, we can alert against critical events as they happen, generate dashboards of positive and negative sentiment over the past 60 minutes, profile the geo-distribution of posts and more. For this type of analytics, we started out using MapReduce in MongoDB, but have since migrated to MongoDB’s Aggregation Framework, which makes building queries much faster and more reliable, and delivers higher query throughput.

Can you describe where Hadoop and Spark fit into your data pipeline?

MongoDB gives us immediate speed-to-insight as events break on social media, and the ability to track brand engagement, such as clickthroughs to articles, in real time. To understand longer-term trends, we also perform exploratory analytics and train our machine learning algorithms against large historical data sets. To do this, we extract data from MongoDB every one to two hours, transform it into the Parquet column-oriented file format and load it into our Hadoop cluster where we analyze it with Spark. Data is partitioned by day into Resilient Distributed Datasets (RDDs) where it can be JOINed with historic data sets and queried with SparkSQL.

_Zahpee’s powerful visualizations, powered by MongoDB, unlocks sophisticated social media analytics in real time_

Is this a new project, or the migration of an existing project?

We have always used MongoDB as our operational database. On the Hadoop side, we started out with our own on-premise cluster running Apache Hive, MapReduce and Impala, before migrating to Spark. The unified set of APIs made development and ongoing application maintenance much simpler, while preserving performance levels. As a part of the upgrade, we migrated our Hadoop cluster to AWS’ Elastic MapReduce (EMR) service running on YARN, with data stored in HDFS on S3.

What made you select MongoDB as your operational database?

There were multiple factors that drove our decision:

  • MongoDB’s flexible data model with dynamic schema enables us to add new attributes as social media data formats change, all without downtime.
  • The expressive query language, aggregation pipeline, and secondary indexes allow us to query data however we want – which is essential to mining data in real time.
  • Elastic scalability and the depth of operational tooling enable us to dynamically expand and contract our MongoDB cluster based on workload demands.

Please describe your MongoDB deployment.

We have just upgraded to MongoDB 3.0, currently configured with the MMAPv1 storage engine. We are in the process of evaluating the WiredTiger storage engine within our test environment. We believe its improved write performance with document level concurrency control, coupled with storage compression will help increase the performance and efficiency of our environment.

Currently we are using RabbitMQ for data ingestion from the social media stream. Most of our codebase uses the MongoDB Java driver and Morphia for object mapping. Some of our components and tools are written in Go and use the great MongoDB Go (mgo) driver. mgo is a good choice because it enriches the default set of driver operations with new functionality.

We run a sharded MongoDB cluster on AWS’s EC2 service. The ability to scale the cluster on-demand helps us to optimize our cost model. For example, during the FIFA 2014 World Cup here in Brazil, we had to process and report on millions of social media posts in real time for the government. To keep pace with the workload, we were able to scale our MongoDB cluster from three to six shards, all hosted on m2.4xlarge EC2 instances. We are currently running our MongoDB cluster on r3.8xlarge instances.

Aged data is automatically purged from MongoDB collections after three months using Time to Live (TTL) indexes.

*Zahpee Big Data Pipeline*

How do you manage your MongoDB deployment?

We use MongoDB Cloud Manager for proactive monitoring of the cluster, and for backups. Incremental backup with point in time recovery is important to us for disaster recovery, and to quickly provision new development environments.

We have also used MongoDB’s consulting services to help us optimize our deployment. We asked the consulting engineer to review our use of the aggregation framework (and migrating from MapReduce), TTL indexes, warming new secondaries, sharding strategy, and more. The result of our consulting engagement was a higher confidence in our MongoDB deployment, as well as a hardware cost reduction while maintaining the same performance.

Can you share any best practices you have observed on using MongoDB?

Monitoring memory usage is crucial. Keeping track of the working set size with tools like Cloud Manager will help you understand the proper dimensions of your cluster and decide when to scale.

How are you measuring the impact of MongoDB on your business?

Social media enables companies to track sentiment, customer behavior, and competitive activity in real time. Speed-to-insight is critical. This is where MongoDB provides the greatest value. High performance data ingest coupled with high speed analytics enables our customers to act on social media immediately, giving them insight ahead of their competitors.

Do you have plans to use MongoDB and Spark for other applications?

Yes, we are developing additional custom big data analytics products that leverage our infrastructure. These products bring our expertise to other areas like marketing, logistics and education.

What advice would you give someone who is considering using MongoDB for their next project?

If choosing between MongoDB's MapReduce and the aggregation framework, always go with the latter if you can. The aggregation framework is implemented directly in the database core and delivers higher performance.

What advice would you give someone who is considering using Spark for their next project?

Fine-tuning is an essential part of using Spark and YARN. Tuning workers, executors, cores and memory usage is mandatory for good performance.

Thiago – thank you for sharing your experiences with the MongoDB community.

To learn more, read our white paper on MongoDB and Apache Spark:
MongoDB and Apache Spark

About the Author - Mat Keep

Mat is a director within the MongoDB product marketing team, responsible for building the vision, positioning and content for MongoDB’s products and services, including the analysis of market trends and customer requirements. Prior to MongoDB, Mat was director of product management at Oracle Corp. with responsibility for the MySQL database in web, telecoms, cloud and big data workloads. This followed a series of sales, business development and analyst / programmer positions with both technology vendors and end-user companies.