Elsevier runs MongoDB for high performance, Big Data scalability

Elsevier is a leading provider of science and health information, serving over 30 million scientists, students and health and information professionals worldwide. Headquartered in Amsterdam, Elsevier employs more than 7,000 people in 24 countries.

Unlike more traditional publishers, Elsevier distinguishes itself by combining content and data with analytics and technology in global platforms, sharing institutional skills, assets and resources across the company.

At the heart of Elsevier’s information strategy is its Smart Content initiative, which enables Elsevier products to overlay content with meaning, making it easier to discover the most relevant content and to find related content that other search engines would miss. In other words, Elsevier is not simply presenting information, but aims to help its users more readily understand this information.

It’s an ambitious strategy that required investment in a new infrastructure. A key component of the new Smart Content infrastructure is the Linked Data Repository (LDR), for which Elsevier turned to the MongoDB document database because of the flexibility and power of its data model and its ability to scale to meet the company’s demanding requirements.

The Problem

Elsevier’s Smart Content strategy began as an initiative to add more meaning to its content. Whereas the company had previously put the burden on its readers to read an article in order to discover what it was about, Elsevier wanted to expose the meaning of its content to its readers before they read the articles. Elsevier sought to do this by applying taxonomies and content enrichment technology and then exposing the resulting metadata in innovative ways to different applications and other systems.

At the core of Elsevier’s Smart Content metadata strategy is the Linked Data Repository, which is essentially a knowledge base of metadata about its content. LDR is a key piece of internal infrastructure that supports large-scale data integration and the construction of smart applications (apps that enable discovery and semantic search). The LDR:

  • Is used to store and structure metadata derived from content
  • Interlinks knowledge data with other related information sources like documents, images, videos, and data from external sources
  • Has been optimized for reliable high-volume read and write access
  • Provides service layer APIs for ease of integration with both suppliers and end-user products

The LDR, an extension to the Smart Content infrastructure, is built on MongoDB.

Why MongoDB?

As Elsevier annotated its content and extracted knowledge from it, the company needed a scalable way to store that annotated information and a mechanism for enriching its information assets further by integrating third-party assets and data sources. While Elsevier evaluated a range of SQL and NoSQL database technologies to underpin its LDR, MongoDB, a fast and flexible document database, proved to be a convenient fit.

For one thing, early on Elsevier didn’t know exactly what the system should look like, or how best to link data for innovative search and discovery services. It needed to explore possibilities without a huge upfront financial commitment. If the project were to fail, Elsevier wanted it to fail quickly and as inexpensively as possible.

The company therefore relied heavily on cloud and open-source technology such as MongoDB, as shown in this diagram:

image

It quickly became apparent, however, that the benefits of MongoDB extend well beyond its financial flexibility.

When Elsevier tags its content, a fair amount of semantic metadata is generated, which in turn is packaged as documents, called “satellites”. In essence, these satellites say, “Here are all the concepts that are contained in this content.” Searches can then be as easy as “Find me a journal that contains information on diabetes,” with MongoDB’s fast read-write performance making it possible to both store and retrieve the data quickly.
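To make the idea of a satellite concrete, here is a minimal sketch in Python. It is an illustration only: the field names (content_id, content_type, concepts, label, taxonomy) and the sample values are assumptions made for this example, not Elsevier’s actual schema, and the lookup runs over an in-memory list rather than a live MongoDB instance.

```python
# Hypothetical satellite documents. Every field name and value below is an
# illustrative assumption, not Elsevier's real schema or data.
satellites = [
    {
        "content_id": "journal-001",
        "content_type": "journal",
        "concepts": [
            {"label": "diabetes", "taxonomy": "example-taxonomy"},
            {"label": "insulin", "taxonomy": "example-taxonomy"},
        ],
    },
    {
        "content_id": "journal-002",
        "content_type": "journal",
        "concepts": [{"label": "asthma", "taxonomy": "example-taxonomy"}],
    },
]


def find_by_concept(docs, content_type, label):
    """Return ids of documents of the given type tagged with the concept label."""
    return [
        doc["content_id"]
        for doc in docs
        if doc["content_type"] == content_type
        and any(concept["label"] == label for concept in doc["concepts"])
    ]


# "Find me a journal that contains information on diabetes."
matches = find_by_concept(satellites, "journal", "diabetes")
print(matches)
```

Against a real MongoDB collection, the same lookup would likely be expressed through the pymongo driver as a filter such as {"content_type": "journal", "concepts.label": "diabetes"} passed to the collection’s find method, with an index on the concept field to keep reads fast. How Elsevier actually structures and indexes its satellites is not described here.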

Elsevier was initially concerned that MongoDB might not be able to handle the huge amounts of data generated in the LDR. For just one business unit, the company pulls metadata for over 20 million documents, or satellites, each month. This involves loading the documents, storing them, indexing them, and then enabling large volumes of searches against the database. With some assistance from 10gen in architecting the system, Elsevier has found that MongoDB meets or exceeds its performance requirements.

The Future of MongoDB at Elsevier

Because of this success, Elsevier is not only using MongoDB as a document store but also using the high-performance access to the data stored in MongoDB to build additional advanced semantic search and discovery capabilities with Solr and an RDF triple store. The reliability and performance of MongoDB provide opportunities to consider other future uses of the technology.

By using MongoDB as the database for its LDR, Elsevier is able to provide a platform that makes vast amounts of content easily discoverable and enables its users to gain new, meaningful insight from the knowledge data embedded within its information assets.
