The European Organisation for Nuclear Research, known as CERN, plays a leading role in fundamental studies of physics. It has been instrumental in many key global innovations and breakthroughs and today operates the world's largest particle physics laboratory, home to the Large Hadron Collider (LHC) nestled under the mountains between Switzerland and France. The LHC and its various experiments have received media attention following the discovery of a new particle strongly suspected to be the elusive Higgs Boson.
The Compact Muon Solenoid (CMS) is one of the two general purpose particle physics detectors operated at the LHC. It is designed to explore the frontiers of physics and provide physicists with the ability to look at the conditions presented in the early stage of our Universe. More then 3,000 physicists from 183 institutions representing 38 countries are involved in the design, construction and maintenance of the experiments.
Experiment of this magnitude requires a vast and complex distributed computing and data model. CMS spans more than a hundred data centres in a three-tier model and generates around 10 petabytes (PB) of data each year in real data, simulated data and meta-data.
This information is stored and retrieved from relational and non-relational data-sources, such as relational databases, document databases, blogs, wikis, ﬁle systems and customised applications.
At this scale, the information discovery within a heterogeneous, distributed environment becomes an important ingredient of successful data analysis. The data and associated meta-data are produced in variety of forms and digital formats. However, users want to be able to query different services and combine information from these varied sources. However, this vast and complex collection of data means they don’t necessarily know where to find the right information or have the domain knowledge to extract this data.
To overcome this hurdle, CMS's Data Management and Workflow Management (DMWM) created the Data Aggregation System (DAS), built on MongoDB to provide the ability to search and aggregate information across this complex data landscape.
Data and meta-data for CMS comes from many different sources and is distributed in a variety of digital formats. It is organised and managed by constantly evolving software using both relational and non-relational data sources.
The DAS provides a layer on top of the existing data-sources that allows researchers and other staff to query data via free text-based queries, and then aggregates the results from across distributed providers - while preserving their integrity, security policy and data-formats. The DAS then represents that data in defined format.
"The choice of an existing relational database was ruled out for several reasons – namely we didn’t require any transactions and data persistency in DAS, and as such can't have a pre-defined schema. Also the dynamic typing of stored meta-data objects was one of the requirements. Amongst other reasons, those arguments forced us to look for alternative IT solutions," explained Valentin Kuznetsov, a research associate from Cornell University who works at CMS.
"We considered a number of different options, including file-based and in memory caches, as well as key-value databases, but ultimately decided that a document database would best suits our needs. After evaluating several applications we chose MongoDB, due to its support of dynamic queries, full indexes, including inner objects and embedded arrays, as well as auto-sharding."
Free Form Queries
All DAS queries can be expressed in a free text-based form, either as a set of keywords or key-value pairs, where a pair can represent a condition. Users can query the system using a simple, SQL-like language, which is then transformed into the MongoDB query syntax, which is itself a JSON record.
"Due to the schema-less nature of the underlying MongoDB back-end we are able to store DAS records of any arbitrary structure, regardless of whether it's a dictionary, lists, key-value pairs etc. Therefore every DAS key has a set of attributes describing its JSON structure," added Kuznetsov.
Given the number of different data sources, types and providers that DAS connects to, it is imperative that the system itself is data agnostic and allows us to query and aggregate the meta-data information in customisable way.
The MongoDB architecture easily integrates with existing data-services while preserving their access, security policy and development cycles. This also provides a simple plug-and-play mechanism that makes it easy to add new data-services as they are implemented and configure DAS to connect to specific domains.
Caching for Data Providers
As well as providing a way for users to easily access a wide range of data sources in a simple and consistent manner, DAS uses MongoDB as a dynamic cache collating the information fed back from the data providers - which feed back in a variety of formats and file structures.
"When a user enters a query, it checks if the MongoDB database has the aggregation the user is asking for and, if it does, returns it, otherwise the system does the aggregation and saves it to MongoDB," said Kuznetsov.
"If the cache does not contain the requested query, the system contacts distributed data-providers that could have this information and queries them, gathering their results. It then merges all of the results, doing a sort of 'group by' operation based on predefined identifying key and inserts the aggregated information into the cache."
- The CMS DAS currently runs on a single eight-core server which processes all of the queries and caches the aggregated data.
- OS: Scientific Linux
- Server hardware configuration: 8-core CPU, 40GB RAM, 1TB storage (but data set usually around 50-100GB)
- Sharding: not yet, but being considered
- Application Language: Python
- Other database technologies: Aggregates data from a number of different databases including Oracle, PostGreSQL, CouchDB and MySQL
"DAS is used 24 hours a day, seven days a week, by CMS physicists, data operators and data managers at research facilities around the world. The average query may resolve into thousands of documents, each a few kilobytes in size. The performance of MongoDB has been outstanding, with a throughput of around 6,000 documents a second for raw cache population," concluded Kuznetsov.
"The ability to offer a free text query system that is fast and scalable, with a highly dynamic and scalable cache that is data agnostic provides an invaluable two-way translation mechanism. DAS helps CMS users to easily find and discovery information they need in their research and it represents one of the many tools which physicists use on a daily basis towards great discoveries. Without help from DAS information look-up will have taken orders of magnitude longer."
As the data collected by the various experiments grows, CMS is looking into horizontally scaling the system with sharding to meet demand. Similarly the team are spreading the word beyond CMS and out to other parts of CERN.
Furthermore, as this is a general grid aggregation/querying tool, there is also interest in applying it to problems outside of CERN. In his role at Cornell University Kuznetsov is working with post-graduates to develop the idea on similar projects for aggregating data from across distributed and varied data systems.
Industry: Science and Research
Location: New York
- Enables researchers to enter free form queries to scour terabytes of meta-data from a wide variety of sources and providers and aggregate that information in a consistent, data agnostic form
- Processes queries from more than 6,000 users across the globe, 24/7 and returns tens of thousands of documents every second
- Caches results with predefined expiry times to further speed up related searches and help plan pre-fetch strategies for common queries