The Guide to openEHR Schema Modeling with MongoDB
Disclosure: This article features insights from MongoDB's experts, customers, and partners to further an understanding of MongoDB in specific industry use cases. Please note that while MongoDB does not validate the accuracy of the text and statements, this resource aims to provide practical knowledge for your reference.

The openEHR specification is a widely used standard for storing and managing electronic health records (EHRs). It offers a structured way of organizing clinical data that makes the data easy to query and analyze. However, implementing schema modeling and querying for openEHR data can present unique challenges. In this blog post, we will explore the intricacies of openEHR schema modeling, some of its challenges, and potential solutions we can implement using MongoDB.

Understanding the openEHR specification and interoperability standards

Before diving into the complexities of openEHR schema modeling and querying, let's first understand the openEHR specification and how it relates to other interoperability standards such as HL7 FHIR. These standards work together to enable seamless healthcare information exchange across systems and applications, and a clear understanding of them provides a solid foundation for addressing the challenges we'll encounter.

openEHR (open Electronic Health Record) is an open standard for the representation and management of electronic health records. It is designed to provide a flexible and interoperable framework for the collection, storage, retrieval, and exchange of health data, regardless of the system or application used to generate or consume it.

Key building blocks of openEHR include:

- Archetypes: structured, reusable models that define the content and structure of clinical information. An example is a Vital Signs archetype, which defines the structure and constraints for capturing vital signs measurements.
Properties may include elements such as temperature, heart rate, blood pressure, and oxygen saturation.

- Templates: derived from archetypes, templates define specific subsets of clinical information for a use case. An example is an Adult Vital Signs template, derived from the Vital Signs archetype and customized for adult patients. It includes the subset of vital sign elements relevant to adult patients, such as temperature, heart rate, and blood pressure.
- Compositions: a composition is an instance of a template that contains actual clinical data. An example is a Patient Encounter composition, which represents a patient encounter and contains the clinical measurements and data captured during it: sections for patient demographics, symptoms, diagnoses, procedures, medications, and vital signs. Within the composition, the vital signs section follows the structure defined by the Adult Vital Signs template and holds the actual measurements for a specific encounter.

Figure 1: openEHR - Archetype & template

openEHR & HL7 FHIR

While openEHR defines an "information model" for modeling and persisting data in EMR systems, FHIR is an open standard for healthcare information exchange across EMRs and other systems. The diagram below shows how these standards and technologies work together in healthcare systems.

Figure 2: openEHR, FHIR, & HL7

openEHR schema modeling challenges

openEHR schema modeling poses several challenges due to its complex hierarchical structure and the need to handle diverse data types. Querying this data is also difficult because of the complex relationships between the different data elements. Many end-user queries operate at the composition level, and each composition is made up of hundreds of fields. This is typically addressed by creating multiple sets of indexes, often resulting in performance bottlenecks.
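To make the nesting concrete before we look at specific query patterns, here is a hypothetical, heavily simplified sketch of a composition's data. The field names are illustrative only; real openEHR compositions are far deeper and address nodes by archetype node IDs (at00xx):

```python
# Hypothetical, much-simplified composition fragment (not spec-conformant).
composition = {
    "composition_id": "enc-2024-0001",
    "archetype_id": "openEHR-EHR-COMPOSITION.encounter.v1",
    "context": {"start_time": "2024-01-15T09:30:00Z"},
    "content": [
        {
            "archetype_id": "openEHR-EHR-OBSERVATION.blood_pressure.v2",
            "data": {
                "events": [
                    {
                        "time": "2024-01-15T09:31:00Z",
                        "items": [
                            {"name": "systolic", "value": {"magnitude": 142, "units": "mm[Hg]"}},
                            {"name": "diastolic", "value": {"magnitude": 91, "units": "mm[Hg]"}},
                        ],
                    }
                ]
            },
        }
    ],
}

# Even this toy example needs several levels of traversal to reach a reading:
systolic = composition["content"][0]["data"]["events"][0]["items"][0]["value"]["magnitude"]
print(systolic)  # 142
```

Every query against a schema shaped like this has to navigate a similar path, which is what makes indexing and retrieval hard at scale.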
Let's examine the typical challenges this poses for schema modeling and querying:

- Complex hierarchical structure: The openEHR specification is built on a complex hierarchical structure that represents healthcare concepts and their relationships. Translating this structure into a database schema can be challenging, as traditional relational databases may struggle with the dynamic, nested nature of openEHR data. Finding an efficient and flexible way to model this structure is essential for accurate representation and easy retrieval of data.
- Data versioning and evolution: openEHR supports versioning and evolution, allowing changes and updates to healthcare records over time. Modeling and querying evolving data is complex, as it requires maintaining the history of changes and accommodating different versions of the schema. Data consistency, efficient versioning, and the ability to query historical data are critical considerations in openEHR schema modeling.
- Performance and scalability: Healthcare systems generate vast amounts of data, and efficient querying of openEHR records is crucial for timely analysis and decision-making. Designing a schema that allows fast, scalable querying is challenging, particularly with large datasets and complex query patterns. Optimizing query performance, indexing strategies, and data partitioning techniques is essential for a responsive and scalable system.
- Query patterns: openEHR data is queried in various patterns, such as retrieving patient records, searching for specific diagnoses, or aggregating data for statistical analysis. Each pattern may have different performance requirements and may involve traversing complex relationships within the hierarchical structure.
Designing an efficient schema that can handle these query patterns and return fast, accurate results is a significant challenge in openEHR schema modeling.

Learn more about how MongoDB works with any healthcare data standard in our whitepaper, What is Radical Interoperability.

Typical openEHR schema model

As the hierarchical model below shows, the complexity of the openEHR specification can make it challenging to model and query clinical data. In the subsequent sections, we will explore potential solutions for openEHR schema modeling, including the attribute pattern and the flat hierarchy pattern.

Figure 3: Part of the archetype data captured in JSON format

Archetype Query Language

Archetype Query Language (AQL) is a query language specifically designed for querying clinical data stored in openEHR-based electronic health record systems. It provides a standardized and powerful way to retrieve specific clinical information from structured data using archetypes and templates. AQL enables clinicians, researchers, and developers to express complex queries, filter data based on clinical criteria, and retrieve meaningful information for analysis and decision support. For example, to "get the latest 5 abnormal blood pressure values that were recorded in a health encounter for a specific patient," you can write an AQL query.

This highly nested, hierarchical schema model, while flexible and extensible for representing clinical data, presents unique challenges in storage, retrieval, and performance. Even a modest volume of 250k documents in this format requires a significant amount of storage space. In addition, given the variable nesting hierarchy, typical indexing patterns are highly inefficient. Below, we will explore possible solution options using MongoDB schema modeling patterns.
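The AQL for the example scenario above might look something like the sketch below. The structure follows the standard blood pressure archetype examples from the AQL specification, but the specific node IDs and the 140 mmHg threshold are illustrative assumptions, not taken from the original article:

```
SELECT
    bp/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude AS systolic,
    bp/data[at0001]/events[at0006]/data[at0003]/items[at0005]/value/magnitude AS diastolic,
    c/context/start_time/value AS recorded_at
FROM EHR e [ehr_id/value = $patientEhrId]
    CONTAINS COMPOSITION c [openEHR-EHR-COMPOSITION.encounter.v1]
        CONTAINS OBSERVATION bp [openEHR-EHR-OBSERVATION.blood_pressure.v2]
WHERE bp/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude >= 140
ORDER BY c/context/start_time/value DESC
LIMIT 5
```

Note how every path in the query mirrors the nesting of the archetype: this is exactly the hierarchy the persistence layer must be able to traverse efficiently.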
Addressing openEHR schema modeling challenges with MongoDB

MongoDB provides flexible and powerful features that can help address the challenges of openEHR schema modeling and querying.

Attribute pattern with a standard index

One possible solution is the attribute pattern, which simplifies the schema model and makes it more predictable. This approach allows for efficient indexing, enabling fast retrieval of data, and it simplifies the query for our example scenario ("get the latest 5 abnormal blood pressure values that were recorded in a health encounter for a specific patient").

The resulting query is highly efficient when evaluated against a collection of 5 million documents. Storage is also significantly improved: 833 MB of compressed storage for 5 million documents, with efficient index-based retrieval. However, the index size for this pattern is large (about 3.5 GB for 5 million documents), which may be a blocker for larger data sets. Can we improve on this? What are our options?

Flat hierarchy model with a wildcard index

An alternative approach to the challenges discussed above is to flatten the document model and use a wildcard index. By using a flattened document data model, the complex hierarchical structure of openEHR schemas can be simplified for storage and retrieval efficiency. There are a few considerations: dimensions need to move to the application or configuration layer, for example the fact that body temperature is recorded in Celsius. The query over the flattened document model is correspondingly simpler, and to make it efficient we create a wildcard index on the collection.
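A hedged sketch of what the attribute pattern could look like here, in pymongo-style Python. The field names (patient_id, attributes, k, v, u) are assumptions for illustration, not part of the openEHR spec or the original article:

```python
# Attribute pattern: variable clinical fields become an array of
# {k: name, v: value, u: units} entries, so one compound index can
# cover queries on any attribute.
encounter_doc = {
    "patient_id": "pt-001",
    "encounter_time": "2024-01-15T09:30:00Z",
    "attributes": [
        {"k": "systolic_bp", "v": 142, "u": "mm[Hg]"},
        {"k": "diastolic_bp", "v": 91, "u": "mm[Hg]"},
        {"k": "heart_rate", "v": 80, "u": "/min"},
    ],
}

# "Latest 5 abnormal blood pressure values for a patient" as a
# pymongo-style filter plus sort/limit:
query = {
    "patient_id": "pt-001",
    "attributes": {"$elemMatch": {"k": "systolic_bp", "v": {"$gte": 140}}},
}
sort_spec = [("encounter_time", -1)]
# results = collection.find(query).sort(sort_spec).limit(5)

# A single compound index serves every attribute, which keeps indexing
# predictable no matter how many distinct clinical fields appear:
# collection.create_index([("patient_id", 1), ("attributes.k", 1), ("attributes.v", 1)])

# Pure-Python check that the sample document satisfies the filter:
threshold = query["attributes"]["$elemMatch"]["v"]["$gte"]
matches = any(a["k"] == "systolic_bp" and a["v"] >= threshold
              for a in encounter_doc["attributes"])
print(matches)  # True
```

The design choice here is that the index key set stays fixed even as archetypes evolve, which is what makes the pattern attractive for openEHR's variable content.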
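A sketch of the flat hierarchy idea, with a small helper showing how a nested composition could collapse into flat field names. The joining convention (underscores) and all field names are illustrative assumptions:

```python
# Collapse a nested document into a single level of underscore-joined
# field names, so the variable openEHR hierarchy becomes one flat layer.
def flatten(doc, prefix=""):
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

nested = {
    "patient_id": "pt-001",
    "vitals": {"blood_pressure": {"systolic": 142, "diastolic": 91}},
}
flat_doc = flatten(nested)
# {'patient_id': 'pt-001',
#  'vitals_blood_pressure_systolic': 142,
#  'vitals_blood_pressure_diastolic': 91}

# The query over the flattened model is a plain equality/range filter
# (note that units such as Celsius now live in the app/config layer):
query = {"patient_id": "pt-001", "vitals_blood_pressure_systolic": {"$gte": 140}}

# A wildcard index covers queries on any flattened field:
# collection.create_index([("$**", 1)])
```

Because the wildcard index key is "$**", a single index definition covers every flattened field without enumerating them in advance.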
With the index in place, the query is significantly more efficient, as the explain plan confirms. Comparing storage size against the base openEHR spec model, the flat hierarchy model takes only 839 MB of compressed storage for 5 million documents, and the wildcard index is also much lighter: for the same volume of documents, the index size is only 705 MB. Can we improve upon this further?

Flat hierarchy model with Atlas Search

MongoDB Atlas Search brings the power of Apache Lucene to MQL and significantly simplifies indexing and search. You start by creating an Atlas Search index; Atlas takes care of the steps needed to make a Lucene search index available. Once the index is in place, you can query using the $search operator, expressing the same query we have seen in the earlier solution options.

The data storage is identical to the flat hierarchy schema; only the index is built using Atlas Search. The search index size is similar to the wildcard index, but the search index provides significantly greater functionality and capabilities. Keep in mind that this capability is only available in the cloud with MongoDB Atlas.

We have seen multiple solution options. You can choose the one that suits your application as well as your infrastructure requirements.

Design validation

Evaluating candidate schema models and their efficiency typically requires multiple iterations, including NFR validation at production-scale data volume. This helps us to:

- Better understand the fit of MongoDB for the particular use case
- Understand MongoDB sizing expectations
- Demonstrate MongoDB performance
- Understand the performance of specific queries
- Fine-tune the MongoDB schema for our needs

Current tools only replicate sample data or generate random data, which is not suitable for these purposes: the indexes get skewed, and query performance is not close to real-world behavior.
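A sketch of what the Atlas Search approach could look like, again pymongo-style. The index definition uses dynamic mappings; the operator names (compound, phrase, range) come from the Atlas Search documentation, while the field paths are the illustrative flat-model names assumed earlier, not taken from the article:

```python
# Atlas Search index definition (created via the Atlas UI or API);
# dynamic mappings index every field of the flat document automatically:
search_index = {"mappings": {"dynamic": True}}

# The "latest 5 abnormal blood pressure values for a patient" query
# as an aggregation pipeline using the $search stage:
pipeline = [
    {
        "$search": {
            "index": "default",
            "compound": {
                "must": [
                    {"phrase": {"path": "patient_id", "query": "pt-001"}},
                    {"range": {"path": "vitals_blood_pressure_systolic", "gte": 140}},
                ],
            },
        }
    },
    {"$sort": {"encounter_time": -1}},
    {"$limit": 5},
]
# results = collection.aggregate(pipeline)
```

Beyond this query, the same Lucene index supports fuzzy matching, scoring, faceting, and full-text operators that a wildcard index cannot provide.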
Completely random data also does not provide an accurate view of the MongoDB sizing that will be required.

Test data generator

The test data generator and performance testing tools are a solution accelerator from PeerIslands that helps customers generate large-volume, customizable, close-to-real-world test data against specific customer schemas. We used the test data generator to produce a 5-million-document data set for both the attribute and flat hierarchy design patterns. The generator takes a configuration file and quickly generates large volumes of data for the flat hierarchy schema model; generating test data for the original openEHR spec requires a more complex configuration.

Results and comparison

We generated 5 million data items for each design pattern (attribute and flat hierarchy). For the base openEHR spec, we generated 250k items, given its size and time requirements. The figure below provides a comparison of the data and index sizes for each design approach; the data and index sizes for Atlas Search are shown for a 214k data set.

The attribute pattern performs significantly better than the base openEHR spec but is limited by its index size. The flat hierarchy data model with a wildcard index performs the best overall. When you can use the cloud, you also have the option of Atlas Search: while its overall index size is no better than the wildcard index on the flat hierarchy collection, the Lucene search index provides a significantly expanded feature set for query and retrieval.

Final thoughts

The MongoDB document data model provides a powerful and intuitive approach to structuring and interacting with healthcare data such as openEHR, which is often complex and variable. It closely aligns with how you think and code, allowing you to store and retrieve data of any shape and form.
The powerful query engine and indexing capabilities further enhance its versatility, enabling you to develop complex query patterns and optimize performance for your specific application requirements. Choosing the appropriate modeling approach depends on the specific requirements of the application, its query patterns, and performance considerations. Both the attribute pattern and the flat hierarchy model offer viable solutions for openEHR data storage in MongoDB, providing flexibility and performance optimizations tailored to different use cases. Additionally, MongoDB Atlas Search introduces powerful search capabilities for enhanced query and retrieval functionality.

While we looked into schema modeling and querying solutions for openEHR data in MongoDB, other topics matter when developing production-scale applications and environments where both engineers and operations teams need to be productive. We will be exploring the following topics in a future blog:

- Simplifying openEHR queries: a DSL-based approach to convert existing AQL to MQL, and using generative AI to build MQL from natural language prompts.
- Strategies for organizing openEHR data, such as multi-tenancy by clinic and horizontal scaling using sharding.
- Generating production-scale data volumes persisted in a sharded MongoDB cluster.
- Running NFR validation on dedicated production-grade infrastructure and comparing the performance of the approaches discussed.

We hope this guide has provided valuable insights into openEHR schema modeling and querying challenges, as well as potential solutions using MongoDB.

References:

- Introduction to openEHR
- HL7 FHIR & openEHR
- Choosing the standard that is right for you
- MongoDB Wildcard Index
1Data - PeerIslands Data Sync Accelerator
Today's enterprises are in the midst of digital transformation, but they're hampered by monolithic, on-prem legacy applications that lack the speed, agility, and responsiveness required for digital applications. To make the transition, enterprises are migrating to the cloud. MongoDB has partnered with PeerIslands to develop 1Data, a reference architecture and solution accelerator that helps users with cloud modernization. This post details the challenges enterprises face with legacy systems and walks through how working with 1Data helps organizations expedite cloud adoption.

Modernization Trends

As legacy systems become unwieldy, enterprises are breaking them down into microservices and adopting cloud-native application development. Monolith-to-microservices migration is complex but provides value across multiple dimensions, including:

- Development velocity
- Scalability
- Cost-of-change reduction
- Ability to build multiple microservice databases concurrently

One common approach for teams adopting and building out microservices is to use domain-driven design to break the overall business domain into bounded contexts first. They also often use the Strangler Fig pattern to reduce overall risk, migrate incrementally, and then decommission the monolith once all required functionality has migrated. While most teams find this approach works well for application code, it is particularly challenging to break monolithic databases into databases that meet the specific needs of each microservice. There are several factors to consider during the transition:

- Duration: How long will the transition to microservices take?
- Data synchronization: How much and what types of data need to be synchronized between monolith and microservice databases?
- Data translation in a heterogeneous schema environment: How are the same data elements processed and stored differently?
- Synchronization cadence: How much data needs syncing, and how often (real time, nightly, etc.)?
- Data anti-corruption layer: How do you ensure the integrity of transaction data and prevent the new data from corrupting the old?

Simplifying Migration to the Cloud

Created by PeerIslands and MongoDB, 1Data helps enterprises address the challenges detailed above, letting you migrate and synchronize your data with confidence:

- Schema migration tool: Convert a legacy DB schema and related components automatically to your target MongoDB instance, using the GUI-based data mapper to track errors.
- Real-time data sync pipeline: Sync data between monolith and microservice databases in near real time with enterprise-grade components.
- Conditional data sync: Define how to slice the data you're planning to sync.
- Data cleansing: Translate data as it's moved.
- DSLs for data transformation: Apply domain-specific business rules for the MongoDB documents you want to create from your various aggregated source system tables. This layer also acts as an anti-corruption layer.
- Data auditing: Independently verify data sync between your source and target systems.
- Go beyond the database: Synchronize data from APIs, webhooks, and events.
- Bidirectional data sync: Replicate key microservice database updates back to the monolithic database as needed.

Get Started with Real-Time Data Synchronization

With the initial version of 1Data, PeerIslands addresses the core functionality of real-time data sync between source and target systems. Here's a view of the logical architecture:

- Source system: The source system can be a relational database like Oracle, where we rely on CDC, or other sources such as events, APIs, or webhooks.
- Data capture & streaming: Captures the required data from the source system and converts it into data streams, using either off-the-shelf DB connectors or custom connectors depending on the source type. 1Data implements data sharding and throttling in this phase, which enables data synchronization at scale.
- Data transformation:
The core of the accelerator, when we convert the source data streams into target MongoDB document schemas. We use LISP-based Domain Specific Language to enable simple, rule-based data transformation, including user-defined rules. Data Sink & Streaming. Captures the data streams that need to be updated into the MongoDB database through stream consumers. The actual update into the target DB is done through sink connectors. Target system. The MDB database used by the microservices. Auditing. Most data that gets migrated is enterprise-critical; 1Data audits the entire data synchronization process for missed data and incorrect updates. Two-way sync. The logical architecture enables data synchronization from the MongoDB database back to the source database. We used MongoDB, Confluent Kafka and Debezium to implement this initial version of 1Data: The technical architecture is cloud agnostic, and can be deployed on-prem as well. We’ll be customizing it for key cloud platforms as well as fleshing out specific architectures to adopt for common data sync scenarios. Conclusion The 1Data solution accelerator lends itself to multiple use cases, from single view to legacy modernization. Please reach out to us for technical details and implementation assistance, and watch this space as we develop the 1Data accelerator further.
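As an appendix to the pipeline described above, here is a minimal sketch of how the source and sink ends of such a Kafka Connect pipeline might be configured with the off-the-shelf Debezium and MongoDB connectors. The hostnames, table names, and topic names are hypothetical, and a production Debezium Oracle setup needs additional options (schema history topic, credentials handling, etc.); consult the connector documentation for the full option set:

```json
[
  {
    "name": "legacy-oracle-source",
    "config": {
      "connector.class": "io.debezium.connector.oracle.OracleConnector",
      "database.hostname": "legacy-db.example.internal",
      "database.port": "1521",
      "database.user": "cdc_user",
      "database.password": "********",
      "table.include.list": "APP.CUSTOMERS",
      "topic.prefix": "legacy"
    }
  },
  {
    "name": "mongodb-sink",
    "config": {
      "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
      "connection.uri": "mongodb://mongo.example.internal:27017",
      "database": "microservice_db",
      "collection": "customers",
      "topics": "legacy.APP.CUSTOMERS"
    }
  }
]
```

The transformation layer described earlier would sit between these two connectors, consuming the source topics and producing documents shaped for the microservice's collection.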