Leaf in the Wild: YouGov Powers Market Research with Globally Distributed MongoDB

Mat Keep


Leaf in the Wild posts highlight real-world MongoDB deployments. Read other stories about how companies are using MongoDB for their mission-critical projects.

Local writes and global reads increase user engagement and transform customer experience

YouGov are one of the world’s leading market research organizations, used by governments, corporations and causes around the globe to track public opinion on a range of issues. Recently I had the chance to meet with Jason Coombs, Executive Technical Director at YouGov, to learn more about his experiences using MongoDB over the past 5 years.

Can you start by telling us about YouGov?

YouGov was founded in 2000 with a belief that if people can participate in decisions made by the institutions that serve them, then better decisions will be the result.

At the heart of our company is a global online community, where millions of people and thousands of political, cultural and commercial organizations engage in ongoing conversations around opinions, behaviours and brands. We combine this continuous stream of data with our deep research expertise to develop the technologies and methodologies that enable more collaborative decision-making.

Our suite of data products includes:

  • BrandIndex – the daily brand perception tracker
  • YouGov Omnibus – the fastest, most cost-effective way to obtain answers from both national and selected audience samples
  • Pulse – the tracker of actual online consumer behaviour across web and mobile channels
  • YouGov Profiles – a new tool for media planning, segmentation and forecasting

Please describe your application using MongoDB. What problem were you trying to solve? How does MongoDB help you solve that problem?

MongoDB is the database serving Gryphon, our flagship survey system. All our surveys are stored in MongoDB, with each document capturing a user’s responses and activity for a questionnaire within an interview session. This raw data then serves our data products. We ETL (extract, transform and load) the data into our custom-developed column store for analytics, and export relevant surveys to our clients.
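
To make the "one document per interview session" idea concrete, here is a minimal sketch in Python of what such a document might look like. The field names (`respondent_id`, `questionnaire_id`, `session_id`, `responses`, `activity`) and the sample values are assumptions for illustration only; the interview does not describe Gryphon's actual schema.

```python
from datetime import datetime, timezone


def new_interview_document(respondent_id, questionnaire_id, session_id):
    """Build an empty, hypothetical per-session survey document."""
    return {
        "respondent_id": respondent_id,
        "questionnaire_id": questionnaire_id,
        "session_id": session_id,
        "started_at": datetime.now(timezone.utc),
        "responses": {},   # question id -> answer
        "activity": [],    # ordered log of what the user did
    }


def record_answer(doc, question_id, answer):
    """Store an answer and log the event inside the same document."""
    doc["responses"][question_id] = answer
    doc["activity"].append({
        "event": "answered",
        "question_id": question_id,
        "at": datetime.now(timezone.utc),
    })
    return doc


# Hypothetical identifiers and answers, for illustration only.
doc = new_interview_document("r-1001", "q-brand-tracker", "s-42")
record_answer(doc, "q1", "Very favourable")
record_answer(doc, "q2", ["TV", "Online"])
```

Because answers can be strings, lists, or nested structures, a document like this needs no schema change when a questionnaire adds a new question type.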

We also use MongoDB for dozens of other applications within YouGov, but Gryphon is our core MongoDB use case.

What were you using before MongoDB? Did you consider other databases?

We moved to MongoDB in 2010 from Faststore, our own internally developed key-value database. We did consider a couple of NoSQL databases, and we also had a competing implementation in the company using Microsoft SQL Server, but that stack could not provide the speed of innovation or performance required by the Gryphon product.

I actually first came into contact with MongoDB at a Python conference in early 2010. At that time, I was looking for a database to handle my growing volumes of semi-structured and unstructured data, and it seemed using an XML-based database was the way to go. That was until I saw MongoDB in action, and a light just went on in my head!

Today, MongoDB is one of the two default databases within YouGov. If relational databases are appropriate, Postgres is the default. For large file support, semi-structured data, or rapid development, MongoDB is the standard.

It’s not just on the development side of the house that MongoDB has made its mark; operationally it also delivers terrific value. Replica sets give us fault resilience and zero downtime upgrades, while sharding gives us multi-data center, cross-region scale out to serve a geographically distributed audience. MongoDB Cloud Manager gives us deep visibility into database performance.

What were those early days back in 2010 with MongoDB like?

Well, it was certainly early days for you guys. It was all pre-journaling, pre-sharding, pre-replication, and so on. We did face a couple of early issues, but the MongoDB support team was fantastic.

While MongoDB has evolved considerably since then, as has our own expertise, we continue to rely on MongoDB support today, provided through MongoDB Enterprise Advanced. It is less about break/fix support, and more about proactive, consultative services and tools – such as planning upgrades or advice on schema design for new apps.

Please describe your MongoDB deployment.

We capture on average 2-4GB of new data every hour, though often at peak we ingest 3x that volume. MongoDB stores all survey data – both current and archive. So our database is continually growing.
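
As a rough back-of-the-envelope check, the figures above can be turned into daily and peak estimates. This is a sketch that assumes the 2-4GB average holds across a full 24-hour day and that "3x" applies to both ends of the range; the function name is hypothetical.

```python
HOURS_PER_DAY = 24
PEAK_MULTIPLIER = 3  # "at peak we ingest 3x that volume"


def ingest_estimate(low_gb_per_hour, high_gb_per_hour):
    """Return (low, high) ranges for daily volume and peak hourly volume."""
    return {
        "daily_gb": (low_gb_per_hour * HOURS_PER_DAY,
                     high_gb_per_hour * HOURS_PER_DAY),
        "peak_gb_per_hour": (low_gb_per_hour * PEAK_MULTIPLIER,
                             high_gb_per_hour * PEAK_MULTIPLIER),
    }


estimate = ingest_estimate(2, 4)
print(estimate["daily_gb"])          # (48, 96)
print(estimate["peak_gb_per_hour"])  # (6, 12)
```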

YouGov has a single global cluster of five shards, with two in the US and two in EMEA, and another with replicas spanning both regions. Each shard has between two and five replica set members, depending on how much maintenance and activity is occurring in that shard. MongoDB is pretty flexible at enabling content to be allocated to the various shards, while providing transparent access through a single interface. Using the technique of deploying a mongos query router on each application host, the configuration for every app is dead simple – just refer to the target database on localhost.
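
The "mongos on every application host" technique means each app's connection settings can be identical. A minimal sketch, assuming mongos listens on MongoDB's default port 27017 and using a hypothetical database name `gryphon`:

```python
from urllib.parse import urlsplit

DEFAULT_MONGOS_PORT = 27017  # MongoDB's default port; assumed here


def local_mongos_uri(database, port=DEFAULT_MONGOS_PORT):
    """Build a connection URI that targets the mongos running on this host."""
    return f"mongodb://localhost:{port}/{database}"


# "gryphon" is a hypothetical database name used for illustration.
uri = local_mongos_uri("gryphon")
parts = urlsplit(uri)
print(uri)  # mongodb://localhost:27017/gryphon
```

A URI like this can be passed to a driver, for example PyMongo's `MongoClient(uri)`. The application never needs to know which shards or regions hold the data, because the local mongos routes each operation.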

With data distributed by MongoDB’s replica sets, we ensure specific data is globally available, but can also be served regionally at low latency. Those shards in the MongoDB cluster are configured in a “write-local, read-global” pattern. Users are directed to a primary replica set member in their region, so the database delivers very low latency writes as the user responds to each survey question. That data is then replicated globally for rapid retrieval anywhere in the world. Other shards are maintained regionally. In all of this, MongoDB presents an abstract, unified, global interface to the data.
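
The "write-local, read-global" pattern can be illustrated with a small, self-contained model. This is a toy simulation, not YouGov's configuration: the shard names, region labels, and helper functions are all hypothetical. In a real cluster this placement would be expressed through MongoDB's own sharding and read preference settings, and the interview does not say which mechanisms YouGov uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    name: str
    primary_region: str     # where writes for this shard are accepted
    replica_regions: tuple  # other regions holding a readable copy


# Hypothetical layout: each region's shard replicates to the other region.
SHARDS = (
    Shard("shard-us", "US", ("EMEA",)),
    Shard("shard-emea", "EMEA", ("US",)),
)


def shard_for_write(user_region):
    """Write-local: pick the shard whose primary is in the user's region."""
    for shard in SHARDS:
        if shard.primary_region == user_region:
            return shard
    raise LookupError(f"no shard with a primary in {user_region!r}")


def region_serving_read(shard, reader_region):
    """Read-global: use a copy in the reader's region, else the primary."""
    if reader_region == shard.primary_region or reader_region in shard.replica_regions:
        return reader_region
    return shard.primary_region


write_shard = shard_for_write("EMEA")
print(write_shard.name)                         # shard-emea
print(region_serving_read(write_shard, "US"))   # US
```

In this model, a respondent in EMEA writes to a primary in EMEA, while a reader in the US is served from the replicated copy in the US rather than crossing regions.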

*YouGov Active/Active Multi-Datacenter MongoDB Deployment*

Are you using MongoDB 3.0? What are your thoughts on this latest release?

In all the years I’ve worked with the database, MongoDB 3.0 is the best release ever! With the compression in the WiredTiger storage engine, we are seeing a 70% reduction in storage size.

Given that we’ve moved to SSDs for storage, this saving is fantastic. That feature alone has pushed us to upgrade faster than we would have otherwise.

Also, document level concurrency control is dear to our heart, as we have many databases in our cluster, and many services sharing those databases. More granular concurrency control helps improve performance, both in terms of throughput, and in maintaining predictable latency, even in the upper percentiles.

Please describe your technology stack.

YouGov consists primarily of JSON-driven web apps and Python-backed web services. The cluster has been substantially upgraded to MongoDB 3.0. Applications are in Python, Node.js, Java, and .NET, with the majority deployed on Ubuntu Linux servers in Docker-like containers.

What do you use for managing your deployment?

We use Puppet for host provisioning and New Relic for application monitoring.

Our sysadmin team uses MongoDB Cloud Manager to identify and diagnose any database-related issues. We started using Cloud Manager (then MMS) about 18 months ago. The interface had been overhauled, and the scope of monitoring had extended beyond individual nodes to provide a complete and consolidated view of our MongoDB estate. And that is when it became really useful to our Ops team.

How are you measuring the impact of MongoDB on your business?

We have five corporate values - the first two of which are: “We Love Technology” and “We Are Entrepreneurial”. These two sit at the core of everything we do. MongoDB fits into that perfectly. Like us, it’s about challenging assumptions and taking a different approach to solving problems.

Right on the forefront of innovation - that’s where we want to be and that’s where we feel we are with MongoDB.

MongoDB has many times shown its superiority in performance, which directly impacts customer experience and engagement. When we moved the application from our home-grown Faststore database to MongoDB, performance issues due to operating at scale were eliminated, and the implementation became dramatically simpler. This simplicity of implementation means our developers can focus more time on producing higher value applications, rather than learning or building abstraction layers.

Faster time to market is another benefit MongoDB has provided. There is no doubt that there are some applications we would not have built had it not been for MongoDB. It eliminates the impedance developers and ops teams face as new apps are built and rolled out. Its flexible data model with a dynamic schema is clearly a part of that. But so is the driver support – we use both Python and .NET, so being able to use a native, idiomatic driver is a huge plus for developer productivity. It’s fast to get started – just download a binary and away you go.

Replica sets are simple to configure, so in minutes you have an enterprise-grade deployment ready to roll!

The best bit is that ease of development scales. The lack of impedance we have with MongoDB lets us get straight to solving problems. If that’s spinning up an initial instance of MongoDB, or if that’s working with a deployment that’s half a decade old - the intuitive nature of MongoDB means we’re never held back by our software.

What advice would you give someone who is considering using MongoDB for their next project?

MongoDB can be incredibly forgiving. It performs remarkably well even with poor schema design, until it needs to perform at scale. For any application that you expect to run at scale, the software engineering and implementation will need to take into account the practical realities and best practices of the database. Attend MongoDB sessions or take MongoDB University training to understand the architecture and gain some best practices for indexing and consistency management. Get a foundational understanding of MongoDB to avoid common pitfalls, and MongoDB will repay you tenfold with high performance and low maintenance.

Jason, thank you for taking the time to speak with me!


To learn more about building applications with a global reach, read our guide to multi-data center deployments with MongoDB.

About the Author - Mat

Mat is a director within the MongoDB product marketing team, responsible for building the vision, positioning and content for MongoDB’s products and services, including the analysis of market trends and customer requirements. Prior to MongoDB, Mat was director of product management at Oracle Corp. with responsibility for the MySQL database in web, telecoms, cloud and big data workloads. This followed a series of sales, business development and analyst / programmer positions with both technology vendors and end-user companies.