The Guide to openEHR Schema Modeling with MongoDB

Rajesh Rajagopalan

Disclosure: This article features valuable insights from MongoDB's experts, customers, or partners to further an understanding of MongoDB in specific industry use cases. Please note that while MongoDB does not validate the accuracy of the text and statements, this resource aims to provide practical knowledge for your reference.

The openEHR specification is a widely used standard for storing and managing electronic health records (EHRs). It offers a structured way of organizing clinical data that makes it easy to query and analyze. However, modeling and querying openEHR data presents unique challenges. In this blog post, we will explore the intricacies of openEHR schema modeling, examine some of its challenges, and outline potential solutions that we can implement using MongoDB.

Understanding openEHR specification and interoperability standards

Before diving into the complexities of openEHR schema modeling and querying, let’s first understand the openEHR specification and how it relates to other interoperability standards such as HL7 FHIR. These standards work together to enable seamless healthcare information exchange across various systems and applications. A clear understanding of these standards will provide a solid foundation for addressing the challenges we’ll encounter.

openEHR (open Electronic Health Record) is an open-source standard for the representation and management of electronic health records (EHRs). It is designed to provide a flexible and interoperable framework for the collection, storage, retrieval, and exchange of health data, regardless of the system or application used to generate or consume it.

Key building blocks of openEHR include:

  • Archetypes: Archetypes are structured, reusable models that define the content and structure of clinical information. An example is a Vital Signs Archetype which defines the structure and constraints for capturing vital signs measurements. Properties may include elements such as temperature, heart rate, blood pressure, and oxygen saturation.

  • Templates: Templates, on the other hand, are derived from archetypes and are used to define specific subsets of clinical information. An example is an Adult Vital Signs Template — a template derived from the Vital Signs Archetype and customized for adult patients. It includes a subset of vital sign elements specifically relevant to adult patients, such as temperature, heart rate, and blood pressure.

  • Compositions: A composition is an instance of a template that contains actual clinical data. An example is Patient Encounter Composition — a composition that represents a patient encounter, containing various clinical measurements and data. Properties include sections for patient demographics, symptoms, diagnoses, procedures, medications, and vital signs. Within the composition, the vital signs section would follow the structure defined by the Adult Vital Signs Template, containing actual vital signs measurements for a specific encounter.

Figure 1: openEHR - Archetype & template
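To make the relationship between the three building blocks concrete, here is a minimal Python sketch. The class names, the `make_composition` helper, and the element names are illustrative assumptions, not part of the openEHR reference model: an archetype lists every element a concept may carry, a template narrows that list, and a composition holds actual values that must conform to the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Archetype:
    """Reusable model: the full set of elements a clinical concept may carry."""
    name: str
    elements: frozenset


@dataclass(frozen=True)
class Template:
    """Use-case-specific subset of an archetype's elements."""
    name: str
    archetype: Archetype
    elements: frozenset

    def __post_init__(self):
        # A template may only narrow its archetype, never extend it.
        extra = self.elements - self.archetype.elements
        if extra:
            raise ValueError(f"not defined by archetype: {sorted(extra)}")


def make_composition(template, data):
    """Return a composition (actual clinical data) that conforms to the template."""
    unknown = set(data) - template.elements
    if unknown:
        raise ValueError(f"not allowed by template: {sorted(unknown)}")
    return {"template": template.name, "content": dict(data)}


vital_signs = Archetype(
    "Vital Signs",
    frozenset({"temperature", "heart_rate", "blood_pressure", "oxygen_saturation"}),
)
adult_vitals = Template(
    "Adult Vital Signs",
    vital_signs,
    frozenset({"temperature", "heart_rate", "blood_pressure"}),
)
# Hypothetical values recorded during one encounter.
encounter = make_composition(adult_vitals, {"temperature": 37.2, "heart_rate": 72})
```

In this sketch, passing `oxygen_saturation` to `make_composition` raises a `ValueError`, because the adult template deliberately leaves that element out.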

openEHR & HL7 FHIR

While openEHR defines an “information model” for modeling and persisting data in EMR systems, FHIR is an open-source standard for healthcare information exchange across EMRs and other systems. The diagram below can help you visualize how these standards and technologies work together in healthcare systems.

Figure 2: openEHR, FHIR, & HL7

openEHR schema modeling challenges

openEHR schema modeling poses several challenges due to its complex hierarchical structure and the need to handle diverse data types. Additionally, querying this data can be difficult due to the complex relationships between the different data elements. Many end-user queries run at the composition level, and each composition is made up of hundreds of fields. The usual workaround is to create multiple sets of indexes, which often results in performance bottlenecks.

Let’s examine some of the typical query patterns and demonstrate how these pose challenges for schema modeling and querying:

  • Complex Hierarchical Structure: The openEHR specification is built on a complex hierarchical structure that represents various healthcare concepts and their relationships. Translating this structure into a database schema can be challenging, as traditional relational databases may struggle to handle the dynamic and nested nature of openEHR data. Finding an efficient and flexible way to model this complex structure is essential for ensuring accurate representation and easy retrieval of data.

  • Data Versioning and Evolution: openEHR supports the concept of versioning and evolution, allowing for changes and updates to healthcare records over time. Modeling and querying evolving data can be complex, as it requires maintaining the history of changes and accommodating different versions of the schema. Ensuring data consistency, efficient versioning, and the ability to query historical data are critical considerations in openEHR schema modeling.

  • Performance and Scalability: Healthcare systems generate a vast amount of data, and efficient querying of openEHR records is crucial for timely analysis and decision-making. Designing a schema that allows for fast and scalable querying is a challenge, particularly when dealing with large datasets and complex query patterns. Optimizing query performance, indexing strategies, and data partitioning techniques are essential for ensuring a responsive and scalable system.

  • Query Patterns: openEHR data is queried based on various patterns, such as retrieving patient records, searching for specific diagnoses, or aggregating data for statistical analysis. Each query pattern may have different performance requirements and may involve traversing complex relationships within the hierarchical structure. Designing an efficient schema that can handle these query patterns and provide fast and accurate results is a significant challenge in openEHR schema modeling.

Learn more about how MongoDB works with any healthcare data standard in our whitepaper, What is Radical Interoperability

Typical openEHR schema model

As you can see in the hierarchical model below, the complexity of the openEHR specification can make it challenging to model and query clinical data. In the subsequent sections, we will explore potential solutions for openEHR schema modeling, including the Attribute Pattern and the Flat Hierarchy Pattern.

Figure 3: Part of Archetype data captured in JSON format

Archetype query language

Archetype Query Language (AQL) is a query language specifically designed for querying clinical data stored in openEHR-based electronic health record systems. It provides a standardized and powerful way to retrieve specific clinical information from structured data using archetypes and templates. AQL enables clinicians, researchers, and developers to express complex queries, filter data based on clinical criteria, and retrieve meaningful information for analysis and decision support. For example, to “get the latest 5 abnormal blood pressure values that were recorded in a health encounter for a specific patient”, you can write an AQL query as follows:

As you can see above, the highly nested and hierarchical schema model, while a flexible and extensible approach to representing clinical data, presents unique challenges in storage, retrieval, and performance.

Even a small volume of 250k documents in this format requires a significant amount of storage space. In addition, because the nesting hierarchy varies from document to document, typical indexing patterns are highly inefficient.
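A small Python sketch can illustrate why nesting makes indexing awkward. The `composition` document below is a hypothetical, heavily simplified stand-in for a real openEHR composition, and `leaf_paths` is a helper written for this illustration. It lists every dotted path that leads to a scalar value, which is what a conventional index would have to be declared on.

```python
def leaf_paths(node, prefix=""):
    """Yield the dotted path of every scalar value in a nested document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for item in node:
            # Dot notation can address array elements without a position,
            # so list indexes are not added to the path.
            yield from leaf_paths(item, prefix)
    else:
        yield prefix


# Hypothetical, heavily simplified composition (field names are assumptions).
composition = {
    "ehr_id": "ehr-001",
    "content": [{
        "archetype_node_id": "blood_pressure",
        "data": {"events": [{"data": {"items": [
            {"name": "Systolic", "value": {"magnitude": 152, "units": "mm[Hg]"}},
            {"name": "Diastolic", "value": {"magnitude": 95, "units": "mm[Hg]"}},
        ]}}]},
    }],
}

paths = sorted(set(leaf_paths(composition)))
depth = max(path.count(".") + 1 for path in paths)
```

Even in this toy document the magnitude sits seven levels deep, and the systolic and diastolic readings share the same path. The path alone cannot tell them apart, so a query has to match the item name and its value inside the same array element.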

We will explore possible solution options using MongoDB schema modeling patterns.

Addressing openEHR schema modeling challenges with MongoDB

MongoDB provides flexible and powerful features that can help address the challenges of openEHR schema modeling and querying.

Attribute pattern with standard index

One possible solution is to use the attribute pattern modeling style, simplifying the schema model and making it more predictable.

This approach allows for efficient indexing, enabling fast retrieval of data.
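As a rough sketch of the idea, the attribute pattern folds many differently named measurements into a single array of key/value sub-documents, so that one compound index covers all of them. The field names (`attrs`, `k`, `v`, `u`), the helper, and the sample values below are assumptions made for illustration, not the schema used in the evaluation described in this post.

```python
def to_attribute_pattern(patient_id, recorded_at, measurements):
    """Fold named measurements into one array of {k, v, u} sub-documents."""
    return {
        "patient_id": patient_id,
        "recorded_at": recorded_at,
        "attrs": [
            {"k": name, "v": value, "u": unit}
            for name, (value, unit) in measurements.items()
        ],
    }


# One compound multikey index serves every attribute, for example with
# pymongo: collection.create_index(ATTRIBUTE_INDEX)
ATTRIBUTE_INDEX = [("attrs.k", 1), ("attrs.v", 1)]

# Hypothetical sample values.
doc = to_attribute_pattern(
    "patient-001",
    "2023-05-01T09:30:00Z",
    {
        "systolic": (152, "mm[Hg]"),
        "diastolic": (95, "mm[Hg]"),
        "body_temperature": (37.2, "Cel"),
    },
)
```

Because every measurement has the same shape, adding a new vital sign needs no schema or index change. The trade-off is that the index holds one entry per attribute per document.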

The query for our example scenario to “get the latest 5 abnormal blood pressure values that were recorded in a health encounter for a specific patient” can be simplified as follows.
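The original query listing is not reproduced here. As a hedged illustration only, a filter for this scenario could look like the Python sketch below. It assumes an attribute-pattern document with an `attrs` array of `{k, v}` pairs plus `patient_id` and `recorded_at` fields, and it assumes "abnormal" means systolic of at least 140 or diastolic of at least 90. All of these names and thresholds are assumptions.

```python
def abnormal_bp_query(patient_id, systolic_min=140, diastolic_min=90):
    """Build a find() filter for abnormal readings of one patient."""
    return {
        "patient_id": patient_id,
        "$or": [
            # $elemMatch keeps the key and its value in the same array element.
            {"attrs": {"$elemMatch": {"k": "systolic", "v": {"$gte": systolic_min}}}},
            {"attrs": {"$elemMatch": {"k": "diastolic", "v": {"$gte": diastolic_min}}}},
        ],
    }


def latest_abnormal_bp(collection, patient_id, limit=5):
    """Run the query on a pymongo-style collection, newest readings first."""
    cursor = collection.find(abnormal_bp_query(patient_id))
    return list(cursor.sort("recorded_at", -1).limit(limit))


query = abnormal_bp_query("patient-001")
```

`latest_abnormal_bp` expects any object exposing pymongo's `find`, `sort`, and `limit` methods, so the filter can be inspected or unit tested without a running database.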

The above query is highly efficient when evaluated against a collection of 5 MM documents.

Additionally, storage improves significantly: 5 MM documents take 833 MB of compressed storage, and retrieval through the index is efficient.

However, the index for this pattern is large, about 3.5 GB for 5 MM documents, which may be a blocker for larger data sets. Can we improve on this? What are our options?


Flat hierarchy model with wildcard search

An alternative approach to addressing the challenges discussed above is to use a flattened document model together with a wildcard index. The same openEHR schema model can be represented as follows:

By using a flattened document data model, the complex hierarchical structure of openEHR schemas can be simplified for storage and retrieval efficiency.

While this simplifies the schema model, there are a few considerations. For example, units and dimensions, such as the fact that body temperature is recorded in Celsius, need to move to the application or configuration layer.
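One way to picture the flattening step is the Python sketch below. The `flatten` helper, the underscore-joined field names, and the sample document are assumptions for illustration. Each quantity collapses to its bare magnitude, and its unit is collected separately so that it can live in application configuration.

```python
def flatten(node, prefix="", sep="_"):
    """Return (flat_doc, units): quantities collapse to bare magnitudes."""
    flat, units = {}, {}
    for key, value in node.items():
        # Join with underscores, because dots in field names clash with
        # MongoDB's dot notation for nested fields.
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict) and set(value) == {"magnitude", "units"}:
            flat[name] = value["magnitude"]
            units[name] = value["units"]
        elif isinstance(value, dict):
            sub_flat, sub_units = flatten(value, name, sep)
            flat.update(sub_flat)
            units.update(sub_units)
        else:
            flat[name] = value
    return flat, units


# Hypothetical nested source document.
nested = {
    "patient_id": "patient-001",
    "vital_signs": {
        "body_temperature": {"magnitude": 37.2, "units": "Cel"},
        "blood_pressure": {
            "systolic": {"magnitude": 152, "units": "mm[Hg]"},
            "diastolic": {"magnitude": 95, "units": "mm[Hg]"},
        },
    },
}

flat_doc, unit_config = flatten(nested)
```

Here `flat_doc` is what would be stored, while `unit_config` is the kind of mapping the application layer would keep once, rather than repeating the unit in every document.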

In addition, the query on the above document model is simplified as below.
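The original query listing is not reproduced here. As an illustration only, a filter against a flattened document needs no array matching at all. The Python sketch below assumes top-level fields named `patient_id`, `systolic`, `diastolic`, and `recorded_at`, and thresholds of 140 and 90 for "abnormal". All of these are assumptions.

```python
def abnormal_bp_filter(patient_id, systolic_min=140, diastolic_min=90):
    """Build a find() filter over top-level fields of a flattened document."""
    return {
        "patient_id": patient_id,
        "$or": [
            {"systolic": {"$gte": systolic_min}},
            {"diastolic": {"$gte": diastolic_min}},
        ],
    }


def latest_abnormal_bp(collection, patient_id, limit=5):
    """Run the filter on a pymongo-style collection, newest readings first."""
    cursor = collection.find(abnormal_bp_filter(patient_id))
    return list(cursor.sort("recorded_at", -1).limit(limit))


flat_filter = abnormal_bp_filter("patient-001")
```

Each condition names a plain field, which keeps the filter short and lets an index be chosen per field.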

To make the query efficient, we create a wildcard index as follows.
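In MongoDB, a wildcard index is declared on the special `$**` path. A minimal pymongo-style sketch is shown below. The helper name is hypothetical, and `collection` is assumed to be a `pymongo.collection.Collection` or any object with a compatible `create_index` method.

```python
# "$**" asks MongoDB to index every field in each document.
WILDCARD_INDEX = [("$**", 1)]


def create_wildcard_index(collection):
    """Create a wildcard index covering all fields of the flattened documents."""
    return collection.create_index(WILDCARD_INDEX)
```

A wildcard index behaves like a set of single-field indexes, so it suits documents whose field names vary or are not known in advance. It does not replace a purpose-built compound index for a fixed, well-known query shape.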

With the above index in place, the query is significantly more efficient as we can see in the explain plan below.

Compared with the standard openEHR spec, the flat hierarchy model takes only 839 MB of compressed storage for 5 MM documents.

The wildcard index is also significantly lighter than the attribute pattern's index. For the same volume of documents, the index size is only 705 MB.

Can we improve upon this further?

Flat hierarchy model with Atlas Search

MongoDB Atlas Search brings the power of Apache Lucene to MQL. You can simplify the process of indexing and search significantly.

You start by creating an Atlas Search index as follows:
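The original index definition is not reproduced here. As a hedged sketch, an Atlas Search index definition is a JSON document. The Python version below uses dynamic mappings so that every flattened field is indexed automatically, and additionally maps a `patient_id` field as a `token` so it can be matched exactly. The field name and the choice of mapping are assumptions.

```python
import json

# Sketch of an Atlas Search index definition (field names are assumptions).
SEARCH_INDEX_DEFINITION = {
    "mappings": {
        # Index every field automatically using its default type mapping.
        "dynamic": True,
        # Map the identifier as a token for exact, un-analyzed matching.
        "fields": {"patient_id": {"type": "token"}},
    }
}

# JSON form, suitable for pasting into a JSON index editor.
definition_json = json.dumps(SEARCH_INDEX_DEFINITION, indent=2)
```

Dynamic mappings are convenient when field names vary, at the cost of indexing fields that are never searched. A static mapping that lists only the queried fields would typically produce a smaller index.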

Atlas takes care of the various steps leading up to making a Lucene search index available. Once the index is in place, you can query using the $search operator. The query we have seen in the solution options above can be constructed as follows.
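The original pipeline is not reproduced here. The Python sketch below shows one way such an aggregation pipeline could be assembled. It assumes flattened fields named `patient_id`, `systolic`, `diastolic`, and `recorded_at`, an index named `default`, thresholds of 140 and 90, and an Atlas Search version in which `equals` accepts string values on `token`-mapped fields. All of these are assumptions.

```python
def abnormal_bp_search(patient_id, index="default",
                       systolic_min=140, diastolic_min=90, limit=5):
    """Build an aggregation pipeline that starts with a $search stage."""
    return [
        {"$search": {
            "index": index,
            "compound": {
                # filter clauses restrict results without affecting the score.
                "filter": [
                    {"equals": {"path": "patient_id", "value": patient_id}},
                ],
                # At least one of the two range clauses must match.
                "should": [
                    {"range": {"path": "systolic", "gte": systolic_min}},
                    {"range": {"path": "diastolic", "gte": diastolic_min}},
                ],
                "minimumShouldMatch": 1,
            },
        }},
        {"$sort": {"recorded_at": -1}},
        {"$limit": limit},
    ]


pipeline = abnormal_bp_search("patient-001")
```

With pymongo, the pipeline would be passed to `collection.aggregate(pipeline)`. Note that `$search` must be the first stage of the pipeline.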

The data storage is identical to the flat hierarchy schema, with only the index built using Atlas Search. The search index size is similar to the wildcard index. However, the search index provides significantly more functionality.

Keep in mind that this capability is only available in the cloud, with MongoDB Atlas.

We have seen multiple solution options as detailed above. You can choose one that suits your application as well as infrastructure requirements.

Design Validation

Evaluating candidate schema models and their efficiency typically requires multiple iterations, including validation of non-functional requirements (NFRs) at production-scale data volumes. This helps us to:

  • Better understand fit of MongoDB to the particular use case

  • Understand MongoDB sizing expectations

  • Demonstrate MongoDB performance

  • Understand performance of specific queries

  • Fine tune MongoDB schema for our needs

Current tools only replicate sample data or generate random data, which makes them unsuitable for the above purposes: the indexes become skewed, and query performance does not come close to real-world behavior. Completely random data also does not provide an accurate view of the MongoDB sizing that will be required.
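To illustrate the difference between purely random and close-to-real-world data, the Python sketch below draws vital signs from distributions clustered around plausible centers instead of uniform noise. It is a generic illustration, not the tool described in the next section, and the field names, means, and spreads are assumptions rather than clinical reference values.

```python
import random


def generate_vitals(n, n_patients=1000, seed=42):
    """Yield n flat documents with values clustered around plausible centers."""
    rng = random.Random(seed)  # seeded, so every run produces the same data
    for _ in range(n):
        yield {
            # A bounded patient pool gives each patient several readings.
            "patient_id": f"patient-{rng.randrange(n_patients):05d}",
            "body_temperature": round(rng.gauss(36.8, 0.4), 1),
            "systolic": round(rng.gauss(122, 15)),
            "diastolic": round(rng.gauss(78, 10)),
        }


sample = list(generate_vitals(1000))
```

Because values repeat and cluster the way measurements tend to, index key distribution and compression behave more like they would on production data than with uniformly random values.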

Test data generator

Test data generator and Performance testing tools are a solution accelerator from PeerIslands that helps customers generate large volume, customizable, close-to-real-world test data with specific customer schemas.

We have used the test data generator to generate a data set of 5 million documents for both the attribute and flat hierarchy design patterns.

The test data generator takes a configuration file and quickly generates large volumes of data, as shown in the flat hierarchy schema model below.

Generating test data for the original openEHR spec requires a more complex configuration, such as the one below.

Results and comparison

We generated 5 MM data items for each type of design pattern (attribute and flat hierarchy). For the base openEHR spec, we generated 250k items, given the size and time requirements.

The figure below provides a comparison between the data and index sizes for each design approach:

In addition, the data and index sizes for Atlas Search are shown below for a 214k dataset.

The attribute pattern performs significantly better than the base openEHR spec. However, it is limited by its index size.

The flat hierarchy data model performs the best overall with a wildcard index.
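The sizes quoted earlier in this post can be put side by side with a few lines of Python. The figures come from the text above, and the only assumption is that "about 3.5 GB" is treated as 3500 MB.

```python
DOCS = 5_000_000

# Compressed data size and index size in MB, as quoted in this post.
SIZES_MB = {
    "attribute pattern": {"data": 833, "index": 3500},  # "about 3.5 GB"
    "flat hierarchy + wildcard": {"data": 839, "index": 705},
}


def index_to_data_ratio(sizes):
    """Index size expressed as a multiple of the compressed data size."""
    return sizes["index"] / sizes["data"]


ratios = {name: index_to_data_ratio(s) for name, s in SIZES_MB.items()}
index_reduction = (
    SIZES_MB["attribute pattern"]["index"]
    / SIZES_MB["flat hierarchy + wildcard"]["index"]
)
bytes_per_doc = {name: s["data"] * 1_000_000 / DOCS for name, s in SIZES_MB.items()}
```

On these numbers, the attribute pattern's index is roughly 4.2 times the size of its data, while the wildcard index is about 0.84 times the size of its data. That makes the wildcard index close to five times smaller for nearly identical data storage.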

When you can use the cloud, you have the option of using Atlas Search. While the overall index size is not any better than the wildcard index on the flat hierarchy collection, the Lucene search index provides a significantly expanded feature set for query and retrieval.

Final thoughts

The MongoDB document data model provides a powerful and intuitive approach to structuring and interacting with healthcare data (read openEHR), which is often complex and variable. It closely aligns with how you think and code, allowing you to store and retrieve data of any shape and form. The powerful query engine and indexing capabilities further enhance its versatility, enabling you to develop complex query patterns and optimize performance for your specific application requirements.

Choosing the appropriate modeling approach depends on the specific requirements of the application, query patterns, and performance considerations.

Both the Attribute Pattern and the Flat Hierarchy models offer viable solutions for openEHR data storage in MongoDB, providing flexibility and performance optimizations tailored to different use cases. Additionally, MongoDB Atlas Search introduces powerful search capabilities for enhanced query and retrieval functionalities.

While we looked into schema modeling and querying solutions for openEHR data in MongoDB, other topics matter when developing production-scale applications and environments where you would like both engineering and operations teams to be more productive. We will explore the following topics in a future blog post.

  • Simplifying openEHR queries: A DSL based approach to convert existing AQL to MQL and using Generative AI to build MQL based on natural language prompts.

  • Strategies for organizing openEHR data such as Multi-tenancy by clinic and horizontal scaling using sharding.

  • Generate production-scale data volume persisted in a Sharded MongoDB cluster.

  • Run NFR validation on dedicated production grade infrastructure and compare performance of various approaches discussed.

We hope this comprehensive guide has provided valuable insights into openEHR schema modeling and querying challenges, as well as potential solutions using MongoDB.
