Genomics England uses MongoDB to Power the Data Science Behind the 100,000 Genomes Project
MongoDB Enterprise Advanced Scales to Handle Hundreds of Terabytes of Complex Genetic Data to Power Life Saving and Groundbreaking Discoveries
London, UK – April 10, 2018 – Genomics England is using MongoDB (Nasdaq: MDB), the leading modern, general purpose data platform, to power the data science that makes the 100,000 Genomes Project possible and reduce the processing time for complex queries from hours to milliseconds, enabling scientists to discover new insights, faster.
Genomics England, a company owned by the UK government’s Department of Health and Social Care, runs the 100,000 Genomes Project. This flagship project is sequencing 100,000 whole genomes from patients with rare diseases and their families as well as patients with common cancers. Working with the UK’s National Health Service, the aim is to harness the power of whole genome sequencing technology to transform the way people are cared for. Patients may be offered a diagnosis where there wasn’t one before. In time, there is the potential of new and more effective personalized treatments for patients.
Patients’ genomic data is combined with their clinical data to enable interpretation and analysis. Approved researchers, clinicians and industry partners can also access the Project’s dataset – studying how best to use genomics in healthcare and how best to interpret the data to help patients.
The project is sequencing on average 1,000 genomes per week, producing about 10 terabytes of data per day. To manage this complex, massive and sensitive data set, Genomics England uses MongoDB Enterprise Advanced as an important component of its computing platform.
“Managing clinical and genomic data at this scale and complexity has presented interesting challenges,” said Augusto Rendon, Director of Bioinformatics at Genomics England.
“That’s why adopting MongoDB has been vital to getting the 100,000 Genomes Project off the ground. It has provided us with great flexibility to store and analyze these complex data sets together. This will ultimately help us to realize the benefits of the Project - delivering better diagnostic approaches for patients and new discoveries for the research community."
Two of the most important projects are Cellbase and OpenCGA (Computational Genomics Analysis). Cellbase is a data warehouse and open API that stores reference genomic data from public resources such as Ensembl, Clinvar, and Uniprot. The genomic information itself includes the specific annotations other researchers have made on cell mutations. The code for Cellbase is available on Github. By relying on MongoDB, Cellbase can typically run sophisticated queries in an average of 40 milliseconds or less, and complex aggregations in less than 1 second. This is down from six hours using previous filesystem-based querying and storage. Importantly, it can annotate about 20,000 variants per second, making it compatible with whole genome sequencing data throughput requirements, while also returning a rich set of annotations that helps scientists better understand the data.
OpenCGA aims to provide researchers and clinicians with a high-performance solution for genomic big data processing and analysis. The OpenCGA platform also includes much more detailed information on genomic material including private metadata and more specific background information on specific sequenced genomes. This means OpenCGA has the ability to process incredibly complex queries based on a huge variety of variables, from a single mutation right through to analyzing common mutations within a specific geographic region.
By using MongoDB, OpenCGA enables researchers to query data in any way they want. This is done by use of MongoDB’s secondary indexes – from compound indexes to query data across related attributes, text search facets to efficiently navigate and explore data sets, and sparse indexes to access highly variable data structures. Each collection can have 20 more secondary indexes to service multiple query patterns, including complex, ad-hoc queries.
Dev Ittycheria, President and CEO, MongoDB, concluded, “The 100,000 Genomes Project hits home for me in a very personal way as I recently lost my mother to cancer. I am extremely grateful that so many brilliant people are dedicating their time and energy to this important project. We are honored that MongoDB is playing an essential role as the underlying data platform to produce data science that is likely to change the lives of millions of people, including someone we may personally know, for the better. This is the kind of project that inspires us to do our best work every day.”
About Genomics England
Genomics England is a company owned by the Department of Health and Social Care, and was set up to deliver the 100,000 Genomes Project. This flagship project will sequence 100,000 whole genomes from NHS patients and their families. Genomics England has four main aims:
- to bring benefit to patients
- to create an ethical and transparent programme based on consent
- to enable new scientific discovery and medical insights
- to kickstart the development of a UK genomics industry
The project is focusing on patients with rare diseases, and their families, as well as patients with common cancers. For more information visit www.genomicsengland.co.uk.
MongoDB is the leading modern, general purpose database platform, designed to unleash the power of software and data for developers and the applications they build. Headquartered in New York, MongoDB has more than 5,700 customers in over 90 countries. The MongoDB database platform has been downloaded over 35 million times and there have been more than 850,000 MongoDB University registrations.