Jay Runkel


November 28, 2023

Bosch IoT and the Importance of Application-Driven Analytics

For Kai Hackbarth and his team at Bosch, the ability now exists to bring real-time analytics that can handle lots and lots of data into their applications.

“Without data, it’s just trial and error.” So says Kai Hackbarth. He’s a senior technology evangelist at Bosch Global Software with over 22 years of industry experience in the Internet of Things. “Be it automotive, be it industrial, be it smart-home, we’ve done it all,” Hackbarth said.

Except for tires. That’s his latest focus and technology challenge. “Sounds maybe a bit simple,” Hackbarth said, “but if you think about it more deeply, [it’s really complex].”

Because, as it turns out, tires can collect a lot of different pieces of data that can tell a system a lot about what’s going on with many different parts of the car at any given moment. “Pressure, temperature, accelerometer,” Hackbarth said. “And then you also have other data from the car that’s critical for safety and sustainability.”

But to be of any value, that data needs to be analyzed as close to the source as possible and in real time. Why? “It’s safety-critical,” Hackbarth said. “If you send all the raw data to the cloud, this consumes a lot of costs.” Chief among those costs: time. To react to an issue, that data cannot be historical, because historical data about a tire that’s losing pressure or hydroplaning isn’t helpful to the applications inside the car that need to respond to these developments as they’re happening.

And thankfully for Hackbarth and his team at Bosch, the ability to bring real-time analytics that can handle lots and lots of data into their applications now exists.

Smarter applications start with built-in analytics

Traditionally, applications and analytics have existed as two separate classes of workloads with different requirements, such as read and write access patterns, concurrency, and latency. As a result, businesses have usually deployed purpose-built data stores, including databases for applications and data warehouses for analytics, and piped or duplicated the data between them.

That’s been fine when analytics don’t need to affect how an application responds in real time. But most customers now expect applications to take intelligent actions in the moment, rather than after the fact.

The same principle applies to Bosch’s tire project. The applications inside the car that can autonomously brake when approaching another vehicle too fast, or slow down when a tire senses that it’s hydroplaning, also need to analyze all the data from all the sensors in real time.

This process of building real-time analytics into applications is known as “application-driven analytics,” and it’s how applications get smarter, be they e-commerce apps on your phone, safety apps in your car, or those that monitor rocket launches.

The question for many development teams, though, is how do you build this capability into your applications easily? For a long time, that’s been a difficult question to answer.
A platform for building real-time analytics into your apps

“From my history [of doing this for] 22 years,” Hackbarth says, “we never had the capabilities to do this before.”

Previously, teams everywhere, not just at Bosch, had to do a lot of custom engineering work to run real-time analytics close to the source, including:

- Stitching together multiple databases to handle different data structures (documents, tables, time series measurements, key-value, graph), each accessed with its own unique query API.
- Building ETL data pipelines to transform data into the required analytics formats and tier it from the live database to lower-cost object storage.
- Spinning up a federated query engine to work across each data tier, again using its own unique query API.
- Integrating serverless functions to react to real-time data changes.
- Standing up their own API layers to expose data to consuming applications.

All of which result in a multitude of operational and security models to deal with, a ton of data integration work, and lots of data duplication.

But that’s no longer the case. There are now data platforms that bring operational and analytical workloads together into one, and that, in turn, allow you to combine live operational data with real-time analysis.

MongoDB Atlas, the platform with which I am most familiar (since I work at MongoDB), allows developers to do this by providing an integrated set of data and application services that fit their workflows. Developers can land data of any structure, index, query, and analyze it in any way they want, and then archive it. And they can do all of this with a unified API and without having to build their own data pipelines or duplicate data.

This is the platform on which the Bosch team continues to build its solutions. “Without data,” Hackbarth says, “it’s just trial and error.” But now, with data and with a single platform to build real-time analytics into their applications, it’s something concrete, something responsive, and something actionable. It’s something smart. Especially in the most critical of use cases that Bosch is working on. Like tires.

If you’d like to learn more about how to build application-driven analytics into your applications, check out our three-part livestream demo showing how to build application-driven analytical applications in MongoDB Atlas. During this three-part series, we build real-time analytical capabilities for a simulated system for managing rocket launches.

- Part One covers the basics: building complex analytical queries using MongoDB’s aggregation framework and building visualizations using Charts.
- Part Two highlights some additional Atlas capabilities that are often invaluable when building app-driven analytics: Atlas Search, Atlas Triggers, and embedding Charts-generated visualizations into an application UI.
- Part Three focuses on how to use Atlas Data Federation, Atlas Data Lake, and the Atlas SQL Interface to perform analytics using large sets of historical data and federated queries across multiple data sources.

The first livestream is on March 15 at 10 a.m., with the following two sessions on March 22 and March 29.
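To make the idea of application-driven analytics a little more concrete, here is a minimal sketch of the kind of in-application query described above, written with PyMongo against a hypothetical collection of tire sensor readings. The cluster address, collection, field names, and thresholds are illustrative assumptions, not Bosch's actual schema.

    from datetime import datetime, timedelta
    from pymongo import MongoClient

    # Hypothetical Atlas cluster and collection of tire sensor readings.
    client = MongoClient("mongodb+srv://user:password@cluster0.example.mongodb.net")
    readings = client["vehicles"]["tire_readings"]

    # Average pressure per tire over the last 60 seconds, flagging tires whose
    # average pressure fell below an assumed low-pressure threshold.
    cutoff = datetime.utcnow() - timedelta(seconds=60)
    pipeline = [
        {"$match": {"vehicleId": "veh-123", "timestamp": {"$gte": cutoff}}},
        {"$group": {
            "_id": "$tirePosition",
            "avgPressure": {"$avg": "$pressureKpa"},
            "avgTemp": {"$avg": "$temperatureC"},
        }},
        {"$match": {"avgPressure": {"$lt": 200}}},  # illustrative threshold in kPa
    ]
    for tire in readings.aggregate(pipeline):
        print(f"Low pressure on {tire['_id']}: {tire['avgPressure']:.1f} kPa")

The same aggregation pipeline syntax is what Part One of the livestream series builds on, applied there to the rocket-launch data set instead.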

April 19, 2023

The Best Solutions Architects Work At MongoDB

Despite the bravado in the title, the purpose of this article is not to say that MongoDB Solutions Architects (SAs) are better than those working at other organizations. Rather, this article argues that the unique challenges encountered by SAs at MongoDB imply that successful MongoDB SAs are some of the best in the business. This assertion is derived from the unique challenges of supporting both MongoDB customers and the MongoDB sales organization, and from the breadth and depth of skills and knowledge required to be successful. To see why this is the case, let’s explore the role of an SA at MongoDB and the wide range of skills a Solutions Architect must master.

A MongoDB SA (sometimes called a Sales Engineer in other organizations) is an engineer who supports the sales organization. The role is multi-faceted. A Solutions Architect must have:

- In-depth technical knowledge to both understand a customer’s technical challenges and articulate how MongoDB addresses them
- Communication skills to present technical concepts in a clear and concise manner while tactfully dealing with skeptics and those more familiar with other technologies
- Sales skills to engage a prospect to learn their business challenges and the technical capabilities required to address those challenges
- Design and troubleshooting skills to assist prospects with designing solutions to complex problems and getting them back on track when things go wrong

The description above may make the MongoDB Solutions Architect role sound like other similar roles, but there are unique features of MongoDB (the product) and its competitive situation that make this role extremely challenging. We will explore this in the sections below.

Technology

While the strength of MongoDB and a major factor in its success has been the ease with which it can be adopted by developers, MongoDB is a complex product. Presenting MongoDB, answering questions, brainstorming designs, and helping resolve problems requires a wide range of knowledge, including:

- The MongoDB query language
- Application development with MongoDB’s drivers in 10+ different programming languages
- Single and multi-data center architectures for high availability
- Tuning MongoDB to achieve the required level of performance, read consistency, and write durability
- Scaling MongoDB to manage TBs of data and thousands of queries per second
- Estimating the size of a cluster (or the cloud deployment costs) required to meet application requirements
- Best practices for MongoDB schema design and how to design the best MongoDB schema for a given application
- MongoDB Enterprise operations tools: Ops Manager, Compass, etc.
- Atlas: MongoDB’s database-as-a-service offering
- MongoDB’s various connectors: BI, Spark, and Hadoop
- Migration strategies from RDBMSs (and other databases) to MongoDB

This is a lot to know, and there is a lot of complexity. In addition to the core knowledge listed above, knowledge of the internal workings of MongoDB is essential when working on designs for applications with high performance and scalability requirements. Therefore, most Solutions Architects understand MongoDB’s internal architecture, such as how the WiredTiger storage engine works or how a MongoDB cluster manages connections. To make the SA role even more challenging, organizations often choose MongoDB after failing with some other technology.
(Maybe their RDBMS didn’t scale or was too difficult to expand to handle new sources of data, or Hadoop processing did not meet real-time requirements, or some other NoSQL solution did not provide the required query expressiveness and secondary indexes.) This means that MongoDB is often used for bleeding-edge applications that have never been built before. One of the roles of an SA is to understand the application requirements and help the application team come up with an initial design that will ensure their success.¹

It is probably obvious to experienced SAs, but SAs need to understand the capabilities, strengths, and weaknesses of all competing and tangential solutions as well. MongoDB’s biggest competitors are Oracle, Amazon, and Microsoft, all of whom are constantly evolving their product offerings and marketing strategies. An SA must always keep their knowledge up to date as the market evolves.

Communication

Being a great technologist is not enough. An SA spends at least as much time communicating with customers as they do working with technology. Communication sometimes takes the form of a standard presentation or demo, but it most often entails detailed technical conversations about how MongoDB works or how MongoDB can be used to address a particular problem. Concise technical explanations that address customer questions using language tailored to their particular situation and frame of reference are the hallmark of an SA.

MongoDB SAs have to be comfortable communicating with a wide range of people, not just development teams. They must engage operations, line-of-business stakeholders, architects, and technology executives in sales discovery conversations and present the technical aspects of MongoDB of most concern at the appropriate level of detail. For example, an SA must be able to provide technology executives with an intuitive feel for why their development teams will be significantly more productive with MongoDB or will be able to deploy a solution that can meet scalability and performance requirements unattainable with previous technology approaches. Similarly, an SA must learn an operations team’s unique challenges related to managing MongoDB and describe how tools like Ops Manager and Atlas address these requirements. Public speaking skills are also essential. Solutions Architects deliver webinars, speak at conferences, write blog posts, and lead discussions at MongoDB User Groups (MUGs).

Sales

An SA is a member of the sales organization, and “selling” is a big part of the role. Selling involves many aspects. First, SAs assist the MongoDB Account Executives with discovery and qualification. They engage the customer in conversations to understand what their current problems are, their desired solution, the business benefits of the solution, the technical capabilities required to implement this solution, and how they'll measure success. After every customer conversation, SAs work with their Account Executives to refine their understanding of the customer’s situation and identify information that they want to gather at future meetings. Once the required technical capabilities are understood, it is the SA’s role to lead the sales activities that prove to the customer that (1) MongoDB meets all their required capabilities and (2) MongoDB meets these capabilities better than competing solutions. Most of the time this is accomplished via customer conversations, presentations, demonstrations, and design brainstorming meetings.
Finally, customers sometimes want to test or validate that MongoDB will meet their required technical capabilities. This is often in the form of a proof of concept (POC) that might test MongoDB performance or scalability, the ease of managing MongoDB clusters with its operations tools, or whether MongoDB’s BI Connector provides seamless connectivity with industry-standard BI tools, such as Tableau, Qlik, etc. SAs lead these POC efforts. They work with prospects to define and document the scope and success criteria and work with the prospect during the course of a POC to ensure success.

Design and Troubleshooting

I alluded to this in the “Technology” section: helping prospects with creative problem solving distinguishes SAs at MongoDB. Organizations will choose MongoDB if they believe and understand how they will be successful with it. Imparting this understanding (a big part of the Solutions Architect’s role) is typically done by helping an organization through some of the more thorny design challenges and implementation decisions. Organizations will choose MongoDB when they understand the framework of a good MongoDB design for their use case and believe all their design requirements will be met. Designing a solution is not a yes-or-no question that can be researched in the documentation; the answer is found through deep technical knowledge, careful analysis, and tradeoffs among many competing requirements. The best answer is often found through a collaborative process with the customer. SAs often lead these customer discussions, research solutions to the most challenging technical problems, and help craft the resulting design.

Solutions Architects are also a source of internal innovation at MongoDB. Since Solutions Architects spend a significant amount of time speaking with customers, they are the first to realize when marketing or technical material is not resonating with customers or is simply difficult to understand. The pressure of short timelines and the desire to be successful often result in innovative messaging and slides that are frequently adopted by MongoDB’s Product Marketing organization. Similar innovation often occurs with respect to MongoDB feature requests and enhancements. SAs are continually working with customers to help them solve problems, and they quickly identify areas where MongoDB enhancements would provide significant value. The identification of these areas and specific recommendations from SAs on what product enhancements are required have played a big role in focusing the feature set of future MongoDB releases.

Project Management

Lastly, SAs often support a number of Account Executives and work on several dozen sales opportunities per quarter. This means that SAs are working on a large number of opportunities simultaneously and must be highly organized to ensure that they are prepared for each activity and complete every follow-up item in a timely manner. It is not possible for an SA manager to track or completely understand every sales opportunity, so SAs must be self-motivated and manage all their own activities.

Summary

Solutions Architecture at MongoDB is a challenging and rewarding role. The wide range of technical knowledge plus the sales and communication skills required to be successful is common to SA roles. When you combine this with the need for SAs to design innovative solutions to complex (often previously unsolvable) problems, SAs have the set of skills and the track record of success that make them the “best” in the business.
If you want to join the best, check out the MongoDB Careers page.

About the Author - Jay Runkel

Jay Runkel is a principal solutions architect at MongoDB. For over 5 years, Jay has worked with Fortune 500 companies to architect enterprise solutions using non-relational document databases. Before MongoDB, Jay was a key team member at MarkLogic and Venafi, where he worked with financial services, medical, and media organizations to develop operational systems for analytics and custom publishing. He also has experience guiding large financial institutions, retailers, and health care and insurance organizations to secure, protect, and manage their encryption assets. Jay has a BS in Applied Mathematics from Carnegie Mellon and a Master's in Computer Science from the University of Michigan.

1. My favorite part of the job is to get locked in a conference room and whiteboard for 4 hours with a development team to brainstorm the MongoDB solution/design for a particular use case. The most valuable end product of this session is not the design, but the development team's belief that they will be successful with MongoDB and that the development process will be easier than they expected.

February 5, 2018

Active-Active Application Architectures with MongoDB

Introduction

Determining the best database for a modern application to be deployed across multiple data centers requires careful evaluation to accommodate a variety of complex application requirements. The database will be responsible for processing reads and writes in multiple geographies, replicating changes among them, and providing the highest possible availability, consistency, and durability guarantees. But not all technology choices are equal. For example, one database technology might provide a higher guarantee of availability while providing lower data consistency and durability guarantees than another technology. The tradeoffs made by an individual database technology will affect the behavior of the application upon which it is built.

Unfortunately, there is limited understanding among many application architects as to the specific tradeoffs made by various modern databases. The popular belief appears to be that if an application must accept writes concurrently in multiple data centers, then it needs to use a multi-master database, where multiple masters are responsible for a single copy or partition of the data. This is a misconception, and it is compounded by a limited understanding of the (potentially negative) implications this choice has on application behavior.

To provide some clarity on this topic, this post will begin by describing the database capabilities required by modern multi-data center applications. Next, it will describe the categories of database architectures used to realize these requirements and summarize the pros and cons of each. Finally, it will look at MongoDB specifically and describe how it fits into these categories, listing some of the specific capabilities and design choices offered by MongoDB that make it suited for global application deployments.

Active-Active Requirements

When organizations consider deploying applications across multiple data centers (or cloud regions), they typically want to use an active-active architecture. At a high level, this means deploying an application across multiple data centers where application servers in all data centers are simultaneously processing requests (Figure 1). This architecture aims to achieve a number of objectives:

- Serve a globally distributed audience by providing local processing (low latencies)
- Maintain always-on availability, even in the face of complete regional outages
- Provide the best utilization of platform resources by allowing server resources in multiple data centers to be used in parallel to process application requests

Figure 1 - Active-Active Application Architecture

An alternative to an active-active architecture is an active-disaster recovery (also known as active-passive) architecture, consisting of a primary data center (region) and one or more disaster recovery (DR) regions (Figure 2). Under normal operating conditions, the primary data center processes requests and the DR center is idle. The DR site only starts processing requests (becomes active) if the primary data center fails. (Under normal conditions, data is replicated from the primary to the DR sites, so that the DR sites can take over if the primary data center fails.) The definition of an active-active architecture is not universally agreed upon. Often, it is also used to describe application architectures that are similar to the active-DR architecture described above, with the distinction being that the failover from primary to DR site is fast (typically a few seconds) and automatic (no human intervention required).
In this interpretation, an active-active architecture implies that application downtime is minimal (near zero).

Figure 2 - Active-DR architecture

A common misconception is that an active-active application architecture requires a multi-master database. This is not only false, but using a multi-master database means relaxing requirements that most data owners hold dear: consistency and data durability. Consistency ensures that reads reflect the results of previous writes. Data durability ensures that committed writes will persist permanently: no data is lost due to the resolution of conflicting writes or node failures. Both these database requirements are essential for building applications that behave in the predictable and deterministic way users expect.

To address the multi-master misconception, let’s start by looking at the various database architectures that could be used to achieve an active-active application, and the pros and cons of each. Once we have done this, we will drill into MongoDB’s architecture and look at how it can be used to deploy an active-active application architecture.

Database Requirements for Active-Active Applications

When designing an active-active application architecture, the database tier must meet four architectural requirements (in addition to standard database functionality: a powerful query language with rich secondary indexes, low latency access to data, native drivers, comprehensive operational tooling, etc.):

- Performance - Low latency reads and writes. This typically means processing reads and writes on nodes in a data center local to the application.
- Data durability - Implemented by replicating writes to multiple nodes so that data persists when system failures occur.
- Consistency - Ensuring that readers see the results of previous writes, that readers to various nodes in different regions get the same results, etc.
- Availability - The database must continue to operate when nodes, data centers, or network connections fail. In addition, the recovery from these failures should be as short as possible; a typical requirement is a few seconds.

Due to the laws of physics, e.g., the speed of light, it is not possible for any database to completely satisfy all these requirements at the same time, so the important consideration for any engineering team building an application is to understand the tradeoffs made by each database and select the one that provides for the application’s most critical requirements. Let’s look at each of these requirements in more detail.

Performance

For performance reasons, it is necessary for application servers in a data center to be able to perform reads and writes to database nodes in the same data center, as most applications require millisecond (a few to tens) response times from databases. Communication among nodes across multiple data centers can make it difficult to achieve performance SLAs. If local reads and writes are not possible, then the latency associated with sending queries to remote servers significantly impacts application response time. For example, customers in Australia would not expect to have a far worse user experience than customers in the eastern US, where the e-commerce vendor's primary data center is located. In addition, the lack of network bandwidth between data centers can also be a limiting factor.

Data Durability

Replication is a critical feature in a distributed database.
The database must ensure that writes made to one node are replicated to the other nodes that maintain replicas of the same record, even if these nodes are in different physical locations. The replication speed and data durability guarantees provided will vary among databases and are influenced by:

- The set of nodes that accept writes for a given record
- The situations when data loss can occur
- Whether conflicting writes (two different writes occurring to the same record in different data centers at about the same time) are allowed, and how they are resolved when they occur

Consistency

The consistency guarantees of a distributed database vary significantly. This variance depends upon a number of factors, including whether indexes are updated atomically with data, the replication mechanisms used, how much information individual nodes have about the status of corresponding records on other nodes, etc.

The weakest level of consistency offered by most distributed databases is eventual consistency. It guarantees only that, if all writes are stopped, the value for a record across all nodes in the database will eventually coalesce to the same value. It provides few guarantees about whether an individual application process will read the results of its write, or whether the value read is the latest value for a record.

The strongest consistency guarantee that can be provided by distributed databases without severe impact to performance is causal consistency. As described by Wikipedia, causal consistency provides the following guarantees:

- Read Your Writes: preceding write operations are indicated and reflected by the following read operations.
- Monotonic Reads: an up-to-date, increasing set of write operations is guaranteed to be indicated by later read operations.
- Writes Follow Reads: write operations follow and come after the reads by which they are influenced.
- Monotonic Writes: write operations must go after other writes that reasonably should precede them.

Most distributed databases will provide consistency guarantees between eventual and causal consistency. The closer to causal consistency, the more an application will behave as users expect; e.g., queries will return the values of previous writes, data won’t appear to be lost, and data values will not change in non-deterministic ways.

Availability

The availability of a database describes how well the database survives the loss of a node, a data center, or network communication. The degree to which the database continues to process reads and writes in the event of different types of failures and the amount of time required to recover from failures will determine its availability. Some architectures will allow reads and writes to nodes isolated from the rest of the database cluster by a network partition, and thus provide a high level of availability. Also, different databases will vary in the amount of time it takes to detect and recover from failures, with some requiring manual operator intervention to restore a healthy database cluster.

Distributed Database Architectures

There are three broad categories of database architectures deployed to meet these requirements:

- Distributed transactions using two-phase commit
- Multi-master, sometimes also called “masterless”
- Partitioned (sharded) databases with multiple primaries, each responsible for a unique partition of the data

Let’s look at each of these options in more detail, as well as the pros and cons of each.
Distributed Transactions with Two-Phase Commit

A distributed transaction approach updates all nodes containing a record as part of a single transaction, instead of writes being made to one node and then (asynchronously) replicated to other nodes. The transaction guarantees that all nodes will receive the update, or the transaction will fail and all nodes will revert back to the previous state if there is any type of failure. A common protocol for implementing this functionality is called a two-phase commit.

The two-phase commit protocol ensures durability and multi-node consistency, but it sacrifices performance. It requires two phases of communication among all the nodes involved in the transaction, with requests and acknowledgments sent at each phase of the operation to ensure every node commits the same write at the same time. When database nodes are distributed across multiple data centers, this often pushes query latency from the millisecond range to the multi-second range. Most applications, especially those where the clients are users (mobile devices, web browsers, client applications, etc.), find this level of response time unacceptable.

Multi-Master

A multi-master database is a distributed database that allows a record to be updated in one of many possible clustered nodes. (Writes are usually replicated so records exist on multiple nodes and in multiple data centers.) On the surface, a multi-master database seems like the ideal platform to realize an active-active architecture. It enables each application server to read and write to a local copy of the data with no restrictions. It has serious limitations, however, when it comes to data consistency.

The challenge is that two (or more) copies of the same record may be updated simultaneously by different sessions in different locations. This leads to two different versions of the same record, and the database, or sometimes the application itself, must perform conflict resolution to resolve this inconsistency. Most often, a conflict resolution strategy such as "most recent update wins" or "the record with the larger number of modifications wins" is used, since performance would be significantly impacted if some more sophisticated resolution strategy were applied. This also means that readers in different data centers may see different and conflicting values for the same record between the time the writes are applied and the completion of the conflict resolution mechanism.

For example, let’s assume we are using a multi-master database as the persistence store for a shopping cart application, and this application is deployed in two data centers: East and West. At roughly the same time, a user in San Francisco adds an item to his shopping cart (a flashlight) while an inventory management process in the East data center invalidates a different shopping cart item (a game console) for that same user in response to a supplier notification that the release date has been delayed (see times 0 to 1 in Figure 3). At time 1, the shopping cart records in the two data centers are different. The database will use its replication and conflict resolution mechanisms to resolve this inconsistency, and eventually one of the two versions of the shopping cart (see time 2 in Figure 3) will be selected. Using the conflict resolution heuristics most often applied by multi-master databases (last update wins or most updated wins), it is impossible for the user or application to predict which version will be selected.
In either case, data is lost and unexpected behavior occurs. If the East version is selected, then the user’s selection of a flashlight is lost, and if the West version is selected, the game console is still in the cart. Either way, information is lost. Finally, any other process inspecting the shopping cart between times 1 and 2 is going to see non-deterministic behavior as well. For example, a background process that selects the fulfillment warehouse and updates the cart shipping costs would produce results that conflict with the eventual contents of the cart. If the process is running in the West and alternative 1 becomes reality, it would compute the shipping costs for all three items, even though the cart may soon have just one item, the book.

Figure 3 - Example inconsistency in multi-master database

The set of use cases for multi-master databases is limited to the capture of non-mission-critical data, like log data, where the occasional lost record is acceptable. Most use cases cannot tolerate the combination of data loss resulting from throwing away one version of a record during conflict resolution, and the inconsistent reads that occur during this process.

Partitioned (Sharded) Database

A partitioned database divides the database into partitions, called shards. Each shard is implemented by a set of servers, each of which contains a complete copy of the partition’s data. What is key here is that each shard maintains exclusive control of its partition of the data. At any given time, for each shard, one server acts as the primary and the other servers act as secondary replicas. Reads and writes are issued to the primary copy of the data. If the primary server fails for any reason (e.g., hardware failure, network partition), one of the secondary servers is automatically elected to primary.

Each record in the database belongs to a specific partition and is managed by exactly one shard, ensuring that it can only be written to the shard’s primary. The mapping of records to shards and the existence of exactly one primary per shard ensures consistency. Since the cluster contains multiple shards, and hence multiple primaries (multiple masters), these primaries may be distributed among the data centers to ensure that writes can occur locally in each data center (Figure 4).

Figure 4 - Partitioned database

A sharded database can be used to implement an active-active application architecture by deploying at least as many shards as data centers and placing the primaries for the shards so that each data center has at least one primary (Figure 5). In addition, the shards are configured so that each shard has at least one replica (copy of the data) in each of the data centers. For example, the diagram in Figure 5 depicts a database architecture distributed across three data centers: New York (NYC), London (LON), and Sydney (SYD). The cluster has three shards, where each shard has three replicas.

- The NYC shard has a primary in New York and secondaries in London and Sydney
- The LON shard has a primary in London and secondaries in New York and Sydney
- The SYD shard has a primary in Sydney and secondaries in New York and London

In this way, each data center has secondaries from all the shards, so the local app servers can read the entire data set, and a primary for one shard, so that writes can be made locally as well.

Figure 5 - Active-Active architecture with sharded database

The sharded database meets most of the consistency and performance requirements for a majority of use cases.
Performance is great because reads and writes happen to local servers. When reading from the primaries, consistency is assured since each record is assigned to exactly one primary. This option requires architecting the application so that users/queries are routed to the data center that manages the data (contains the primary) for the query. Often this is done via geography. For example, if we have two data centers in the United States (New Jersey and Oregon), we might shard the data set by geography (East and West) and route traffic for East Coast users to the New Jersey data center, which contains the primary for the Eastern shard, and route traffic for West Coast users to the Oregon data center, which contains the primary for the Western shard.

Let’s revisit the shopping cart example using a sharded database. Again, let’s assume two data centers: East and West. For this implementation, we would shard (partition) the shopping carts by their shopping cart ID plus a data center field identifying the data center in which the shopping cart was created. The partitioning (Figure 6) would ensure that all shopping carts with a DataCenter field value of “East” would be managed by the shard with the primary in the East data center. The other shard would manage carts with the value of “West”. In addition, we would need two instances of the inventory management service, one deployed in each data center, with responsibility for updating the carts owned by the local data center.

Figure 6 - Shard key partitioning for shopping cart example

This design assumes that there is some external process routing traffic to the correct data center. When a new cart is created, the user’s session will be routed to the geographically closest data center and then assigned a DataCenter value for that data center. For an existing cart, the router can use the cart’s DataCenter field to identify the correct data center.

From this example, we can see that the sharded database gives us all the benefits of a multi-master database without the complexities that come from data inconsistency. Application servers can read and write from their local primary, but because each cart is owned by a single primary, no inconsistencies can occur. In contrast, multi-master solutions have the potential for data loss and inconsistent reads.

Database Architecture Comparison

The pros and cons of how well each database architecture meets active-active application requirements are summarized in Figure 7. In choosing between multi-master and sharded databases, the decision comes down to whether or not the application can tolerate potentially inconsistent reads and data loss. If the answer is yes, then a multi-master database might be slightly easier to deploy. If the answer is no, then a sharded database is the best option. Since inconsistency and data loss are not acceptable for most applications, a sharded database is usually the best option.

MongoDB Active-Active Applications

MongoDB is an example of a sharded database architecture. In MongoDB, the construct of a primary server and set of secondary servers is called a replica set. Replica sets provide high availability for each shard, and a mechanism called zone sharding is used to configure the set of data managed by each shard. Zone sharding makes it possible to implement the geographical partitioning described in the previous section.
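As a rough illustration of how the DataCenter-based partitioning above could be configured, here is a sketch that issues MongoDB's standard sharding admin commands through PyMongo. The database, collection, shard, and zone names are assumptions for the example, not a prescribed setup.

    from pymongo import MongoClient
    from bson.min_key import MinKey
    from bson.max_key import MaxKey

    # Connect to the mongos router of a hypothetical sharded cluster.
    client = MongoClient("mongodb://mongos.example.net:27017")
    admin = client.admin

    # Shard the carts collection on (DataCenter, cartId).
    admin.command("enableSharding", "shop")
    admin.command("shardCollection", "shop.carts",
                  key={"DataCenter": 1, "cartId": 1})

    # Tag each shard with a geographic zone...
    admin.command("addShardToZone", "shardEast", zone="EAST")
    admin.command("addShardToZone", "shardWest", zone="WEST")

    # ...and route each DataCenter value to its zone, so carts created in the
    # East are owned by the shard whose primary sits in the East data center.
    admin.command("updateZoneKeyRange", "shop.carts",
                  min={"DataCenter": "East", "cartId": MinKey()},
                  max={"DataCenter": "East", "cartId": MaxKey()},
                  zone="EAST")
    admin.command("updateZoneKeyRange", "shop.carts",
                  min={"DataCenter": "West", "cartId": MinKey()},
                  max={"DataCenter": "West", "cartId": MaxKey()},
                  zone="WEST")

Which physical data center actually hosts each zone's primary is controlled separately, through replica set member placement and election priorities.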
The details of how to accomplish this are described in the “MongoDB Multi-Data Center Deployments” white paper and the zone sharding documentation, but MongoDB operates as described in the “Partitioned (Sharded) Database” section.

Numerous organizations use MongoDB to implement active-active application architectures. For example:

- eBay has codified the use of zone sharding to enable local reads and writes as one of its standard architecture patterns.
- YouGov deploys MongoDB for their flagship survey system, called Gryphon, in a “write local, read global” pattern that facilitates active-active multi-data center deployments spanning data centers in North America and Europe.
- Ogilvy & Mather uses MongoDB as the persistence store for its core auditing application. Their sharded cluster spans three data centers in North America and Europe, with active data centers in North America and mainland Europe and a DR data center in London. This architecture minimizes write latency and also supports local reads for centralized analytics and reporting against the entire data set.

In addition to the standard sharded database functionality, MongoDB provides fine-grained controls for write durability and read consistency that make it ideal for multi-data center deployments. For writes, a write concern can be specified to control write durability. The write concern enables the application to specify the number of replica set members that must apply the write before MongoDB acknowledges the write to the application. By providing a write concern, an application can be sure that when MongoDB acknowledges the write, the servers in one or more remote data centers have also applied the write. This ensures that database changes will not be lost in the event of a node or data center failure.

In addition, MongoDB addresses one of the potential downsides of a sharded database: less than 100% write availability. Since there is only one primary for each record, if that primary fails, there is a period of time when writes to the partition cannot occur. MongoDB combines extremely fast failover times with retryable writes. With retryable writes, MongoDB provides automated support for retrying writes that have failed due to transient system errors such as network failures or primary elections, significantly simplifying application code.

The speed of MongoDB's automated failover is another distinguishing feature that makes MongoDB ideally suited for multi-data center deployments. MongoDB is able to fail over in 2-5 seconds (depending upon configuration and network reliability) when a node or data center fails or a network split occurs. (Note that secondary reads can continue during the failover period.) After a failure occurs, the remaining replica set members will elect a new primary, and MongoDB's driver, upon which most applications are built, will automatically identify this new primary. The recovery process is automatic, and writes continue after the failover process completes.

For reads, MongoDB provides two capabilities for specifying the desired level of consistency. First, when reading from secondaries, an application can specify a maximum staleness value (maxStalenessSeconds). This ensures that the secondary's replication lag from the primary cannot be greater than the specified duration, and thus guarantees the currency of the data being returned by the secondary. In addition, a read can also be associated with a ReadConcern to control the consistency of the data returned by the query.
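A short PyMongo sketch of the durability and consistency controls just described; the cluster address and collection are placeholders, and the staleness window is an arbitrary example value:

    from pymongo import MongoClient
    from pymongo.read_concern import ReadConcern
    from pymongo.read_preferences import SecondaryPreferred
    from pymongo.write_concern import WriteConcern

    # Hypothetical multi-data-center cluster; retryable writes smooth over failovers.
    client = MongoClient("mongodb://cluster.example.net:27017", retryWrites=True)
    db = client["shop"]

    # Durability: acknowledge the write only after a majority of replica set
    # members (which can be placed in different data centers) have applied it.
    carts_write = db.get_collection("carts", write_concern=WriteConcern(w="majority"))
    carts_write.insert_one({"cartId": "c1", "DataCenter": "East", "items": []})

    # Consistency: secondary reads may lag the primary by at most 120 seconds
    # and return only majority-committed data that cannot be rolled back.
    carts_read = db.get_collection(
        "carts",
        read_preference=SecondaryPreferred(max_staleness=120),
        read_concern=ReadConcern("majority"),
    )
    print(carts_read.find_one({"cartId": "c1"}))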
For example, a ReadConcern of majority tells MongoDB to only return data that has been replicated to a majority of nodes in the replica set. This ensures that the query is only reading data that will not be lost due to a node or data center failure, and gives the application a consistent view of the data over time.

MongoDB 3.6 also introduced causal consistency – guaranteeing that every read operation within a client session will always see the previous write operation, regardless of which replica is serving the request. By enforcing strict, causal ordering of operations within a session, causal consistency ensures every read is always logically consistent, enabling monotonic reads from a distributed system – guarantees that cannot be met by most multi-node databases. Causal consistency allows developers to maintain the benefits of strict data consistency enforced by legacy single-node relational databases, while modernizing their infrastructure to take advantage of the scalability and availability benefits of modern distributed data platforms.

Conclusion

In this post we have shown that sharded databases provide the best support for the replication, performance, consistency, and local-write, local-read requirements of active-active applications. The performance of distributed transaction databases is too slow, and multi-master databases do not provide the required consistency guarantees. In addition, MongoDB is especially suited for multi-data center deployments due to its distributed architecture, fast failover, and the ability for applications to specify desired consistency and durability guarantees through read and write concerns.

View MongoDB Architect Hub

December 8, 2017

Making HIPAA Compliant Applications with MongoDB

Editor's note: the content in this blog was revised to include new features available in MongoDB 3.2.

Introduction to Security Compliance

As a Solutions Architect at MongoDB, I am often asked a large number of questions by organizations considering using MongoDB as the database layer for an application managing data covered by privacy regulations such as HIPAA, SOX, or PCI. Medical and insurance providers, financial service organizations, and retailers most often ask these types of questions. While I am not an expert on any of these regulations, I have a fair amount of experience answering the questions pertinent to the database layer of such applications.

MongoDB users in a number of industries run applications that must comply with regulations that typically exist to protect sensitive information about individuals:

- Health insurance organizations want to create a “single view” of the patient that aggregates all records, claims, and billing information.
- Retailers want to aggregate customer purchases, payments, and invoices across multiple channels.
- Financial service organizations wish to centrally manage all customer financial information.

In all cases, various government regulations specify how to control and secure access to customer information. Regulations apply to the entire application stack, and one must configure all the layers in tandem to provide the appropriate level of data security. MongoDB by itself does not pose an issue regarding HIPAA or SOX compliance. Properly configured, you can use MongoDB to provide the persistence layer of an application that complies with these regulations.

When organizations design the database layer of an application that must protect customer data and comply with the previously mentioned regulations, they typically require that their database layer meet the following requirements:

1. Authentication - The database must securely authenticate users who will have access to data. Most often the organization's existing directory servers, such as Active Directory, LDAP, or Kerberos, control user access.
2. Authorization - The database must control access to customer information by assigning roles and privileges to users. For example, an insurance provider that manages insurance policies for a large number of companies must ensure that each company's administrators only have access to their own employees' data and not to data for other companies' employees.
3. Auditing - The database must provide auditing so that any changes made to the authorization settings above (#2) can be identified immediately and validated to ensure that the changed user roles or privileges remain in compliance with the organization's policies and governing regulations.
4. Encryption - The database must encrypt data. This includes data at rest in the file system and data moving from the application layer to the database layer or among database components. Encryption ensures that a malicious actor cannot bypass the database controls and access information directly.

In the remainder of this blog post, we will examine how MongoDB Enterprise addresses these requirements.

MongoDB Enterprise

MongoDB is available in two editions: MongoDB and MongoDB Enterprise, a commercial edition of the database. MongoDB provides all the core developer productivity, high-availability, and scalability features that we all love about MongoDB.
While the manual provides a security checklist for ensuring your MongoDB deployment is secure, there are additional measures that many users wish to take in the areas of authentication, authorization, and auditing that are not included with MongoDB. MongoDB Enterprise expands the security capabilities of MongoDB with features we will discuss in this blog post.

Authentication - Establishing User Identity

Authentication ensures the identity of the users accessing the database. It ensures that only designated users have access to the database. Most organizations concerned about regulatory compliance insist on integrating database authentication with their organization-wide identity management system (e.g., Active Directory). This enables user access to be defined (or revoked) in a central repository and immediately enforced across all systems, including MongoDB. MongoDB Enterprise provides support for a number of authentication protocols, enabling MongoDB to integrate with identity management systems. MongoDB provides support for LDAP, Kerberos, and Active Directory (Active Directory integration is achieved via Kerberos). Company security infrastructure and certificate authorities can authenticate users via x.509 client certificates.

Authorization - Controlling Access to Sensitive Data

Once the system has confirmed a user's identity, it must next determine which data that person (or application) should have access to and what actions that user is permitted to perform. From a compliance perspective, the system must provide the appropriate controls so each user only has access to the data to which they are supposed to have access. Sometimes access is controlled at the user level; for example, John Smith has access only to his own medical records. More often it is controlled at the role level; for example, all doctors have the ability to modify patient records, but associates in the billing department may only review those records.

Fortunately, MongoDB's security model makes this type of security configuration straightforward. MongoDB provides a library of fine-grained privileges that can be combined to define a user role, and roles are then assigned to users. A user can perform any of the operations and access the data defined by the union of the privileges associated with their roles. For example, in the patient record management example above, roles such as a doctor role with read and write access to patient records and a billing role with read-only access may be defined (a brief sketch of creating such roles appears below).

Auditing - Ensuring That the System Remains Secure

On the surface, it would appear that if the system performs authentication and authorization properly, we have a secure system. We have effective controls to validate that users are who they say they are. We can grant them the appropriate access privileges so users can only access the data relevant to their role. However, organizations constantly evolve. People change roles. New people get hired and others leave. Therefore, the security settings described above are constantly updated. Just because they are compliant on day one doesn't mean they are still compliant 3, 6, or 12 months later. To be compliant with most regulations, an organization must constantly monitor its security configuration. A large and complex security configuration in MongoDB can have many collections, users, and roles, and performing a complete audit each time data or users change can be extremely time-consuming and expensive.
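Returning briefly to the authorization roles mentioned above, here is a minimal sketch of how such roles might be created with MongoDB's createRole and createUser commands, issued here through PyMongo; the database, collection, role, and user names are illustrative assumptions, not a recommended configuration.

    from pymongo import MongoClient

    # Connect with a user-administration account (credentials omitted here).
    client = MongoClient("mongodb://localhost:27017")
    db = client["ehr"]  # hypothetical medical records database

    # Doctors may read and modify patient records.
    db.command("createRole", "doctorRole",
               privileges=[{
                   "resource": {"db": "ehr", "collection": "patientRecords"},
                   "actions": ["find", "insert", "update"],
               }],
               roles=[])

    # Billing associates may only review them.
    db.command("createRole", "billingRole",
               privileges=[{
                   "resource": {"db": "ehr", "collection": "patientRecords"},
                   "actions": ["find"],
               }],
               roles=[])

    # Roles are then assigned when users are created (or granted later).
    db.command("createUser", "jsmith", pwd="changeMe",
               roles=[{"role": "doctorRole", "db": "ehr"}])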
To make the process of ensuring ongoing regulatory compliance manageable, organizations require their database layer to provide an audit log. The audit log provides a history of all changes made to security settings. The security operations team can then monitor this log and immediately validate that changes to users and roles remain compliant with the relevant regulations.

Sometimes, organizations will want an additional layer of security by auditing CRUD operations. Monitoring CRUD operations enables the security operations team to see the type of data accessed by different user roles. For example, a security team may believe that the database allows billers to have read-only access to patient records, but an audit of the CRUD operations may show that in certain situations they can delete records as well. The team may use the log to identify holes in the security settings. MongoDB Enterprise audits both security changes and CRUD operations. MongoDB can monitor a long list of security configuration changes, including access permissions to databases and collections, access to DBA operations on databases and collections like index management, and operations team functions like the administration of replica sets and shards. MongoDB Enterprise also supports CRUD operation auditing. Complex filters can be specified to limit the amount of information written to the log and to ensure only the relevant events are audited.

Encryption - Ensuring That Data Cannot Be Accessed Outside the Database

Encryption ensures that a rogue user or application cannot access data by monitoring traffic between the application and the database or by reading files directly. SSL is widely used for encrypting data in transit between the database and the application. Database-native and third-party encryption solutions may also be used to prevent the direct reading of database files, commonly called “data at rest.” Any encryption solution is only effective if the encryption keys are securely managed. Securely managing keys typically involves deploying a central key management application that manages the process of tracking, securing, and rotating keys.

MongoDB supports all of these encryption approaches. MongoDB Enterprise provides built-in support for SSL and, with the release of MongoDB 3.2, MongoDB Enterprise provides encryption at rest via its Encrypted Storage Engine (the WiredTiger storage engine with encryption enabled). To enhance security and to simplify key management, MongoDB Enterprise supports the KMIP protocol, the protocol used by most key management platforms to administer application encryption keys. Finally, MongoDB Enterprise continues to work with encryption-at-rest solutions from Vormetric and IBM Guardium.

Summary

If you are considering using MongoDB for an application that must satisfy a privacy/security regulation like HIPAA or PCI, you should feel very comfortable doing so. MongoDB Enterprise provides the authentication, authorization, auditing, and encryption capabilities required to ensure that your application will be compliant. Read our Security Architecture Guide for additional requirements, features, and a security configuration checklist: Read the Security Guide
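As a small closing illustration of the encryption-in-transit point above, this is roughly what a TLS-enabled driver connection looks like with PyMongo; the hostname and certificate paths are placeholders.

    from pymongo import MongoClient

    # Encrypt all traffic between the application and the database. The CA file
    # verifies the server's certificate; the client certificate is optional and
    # also enables x.509 authentication.
    client = MongoClient(
        "mongodb://db.example.internal:27017",
        tls=True,
        tlsCAFile="/etc/ssl/mongodb-ca.pem",
        tlsCertificateKeyFile="/etc/ssl/app-client.pem",
    )
    print(client.admin.command("ping"))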
About Jay Runkel

Jay Runkel is a Solutions Architect Manager at MongoDB and has been working with Fortune 500 companies to architect enterprise solutions using NoSQL document databases for seven years. Before MongoDB, Runkel was a Principal Technologist at MarkLogic, where he worked with financial service, medical, and media organizations to develop operational systems for analytics and custom publishing. Runkel was also a Sales Engineering Manager at Venafi, where he assisted large financial institutions, retailers, and health care and insurance organizations in improving security by securing, protecting, and managing their encryption assets. Runkel has also held various positions developing automated underwriting, product information management, and CRM solutions. Runkel has a BS in Applied Mathematics from Carnegie Mellon and a Master's in Computer Science from the University of Michigan.

January 7, 2015

Building MongoDB Applications with Binary Files Using GridFS: Part 2

How GridFS Works

In part 1 we looked at some of the use cases that are appropriate for GridFS. In this post we take a closer look at how GridFS works and how you can use it in your apps.

GridFS stores large binary files by breaking the files into smaller files called “chunks” and saving them in MongoDB. It essentially saves you, the application developer, from having to write all the code to break large files up into chunks, save all the individual chunks into MongoDB, and then, when retrieving the files, combine all the chunks back together. GridFS gives you all of this functionality for free.

The way GridFS works is shown in Figure 1. GridFS uses two collections to save a file to a database: fs.files and fs.chunks. (The default prefix is “fs”, but you can rename it.) The fs.chunks collection contains the binary file broken up into 255KB chunks. The fs.files collection contains the metadata for the document.

Figure 1 - GridFS Structure

An Example

Let’s work through an example to see how this works. I wrote a small Perl program to load a set of videos my wife had taken on her iPhone of my son’s baseball games and my daughter’s gymnastics meets. Figure 2 shows the core of the program (you can download the entire program from github). The program performs the following steps:

Establishes a connection to MongoDB and creates a database object for the “digital” database into which we want to load the files
Creates a GridFS object for the database to gain access to the GridFS API
Reads the Movies directory on my laptop and finds all the .MOV files
Inserts each .MOV file into MongoDB with three metadata fields: `filename`, `content-type`, and `author`

my $movieDir = "/Users/jayrunkel/Movies";

my $client = MongoDB::MongoClient->new(host => 'localhost', port => 27017);
my $database = $client->get_database( 'digital' );
my $grid = $database->get_gridfs;

my @vidFiles = getFiles($movieDir);

$grid->drop();
foreach my $file (@vidFiles) {
    my $fh = IO::File->new("$movieDir/$file", "r");
    $grid->insert($fh, {"filename" => $file, "content-type" => "video/quicktime", "author" => "deb"});
}

Figure 2 - Loading binary files using GridFS

As the files are inserted into the database, MongoDB calculates the MD5 checksum of the file and compares it to the checksum calculated by the file system. If they don’t match, an error condition occurs. My program will simply die, but it could handle the error and do something more intelligent.

Going to the mongo shell and looking at the digital database shows that there are two collections: fs.chunks and fs.files, just as we expect.

> use digital
> show collections
fs.chunks
fs.files
system.indexes

Let’s first look at the fs.files collection (Figure 3). This collection contains one document for each of the MOV files we loaded into MongoDB. Each document contains the metadata for the corresponding file. This metadata consists of the GridFS-defined fields (length, chunkSize, and uploadDate) plus the metadata fields I supplied when I inserted the MOV files: content-type, author, and filename. The length field is the size of the document in bytes, the chunkSize field is the size of each chunk in bytes, and the uploadDate is the date the document was first inserted into MongoDB. You can define any metadata fields as appropriate and, due to the flexibility of MongoDB, each document can have the specific set of fields that make sense.
There isn’t a requirement that all files have the same set of metadata fields, and you can feel free to add or modify fields as your application evolves.

{
    "_id": ObjectId("546fb304f9781507293f9891"),
    "length": NumberLong("834284"),
    "content-type": "video/quicktime",
    "uploadDate": ISODate("2014-11-21T21:47:48Z"),
    "chunkSize": NumberLong("261120"),
    "filename": "IMG_0425.MOV",
    "author": "deb"
}

Figure 3 - fs.files collection

Each document in the fs.files collection is associated with a set of documents in the fs.chunks collection (Figure 4). The fs.chunks documents are numbered from 0 to n (see the “n” field), with the binary data for each chunk stored in the “data” field. Each chunk document is related to its corresponding document in the fs.files collection by the files_id field, which holds the _id of that fs.files document. Here are the first two chunks associated with the document whose ObjectId is 546fb304f9781507293f9896.

{
    "_id": ObjectId("546fb304f9781507293f98a6"),
    "files_id": ObjectId("546fb304f9781507293f9896"),
    "data": BinData(0, "+QWU+Pwf1u03d..."),
    "n": NumberLong("0")
}
{
    "_id": ObjectId("546fb304f9781507293f9898"),
    "files_id": ObjectId("546fb304f9781507293f9896"),
    "data": BinData(0, "EgKwOoOAOYEwFoi..."),
    "n": NumberLong("1")
}
...

Figure 4 - fs.chunks collection

Because the goal of the GridFS API is to make it easy and efficient to retrieve and reassemble the binary document, GridFS automatically creates a compound index on the fs.chunks collection on the files_id and n fields. This enables MongoDB to quickly identify all the chunks associated with a particular file, sort them from 0 to n using the n field, and return them to the driver so that they can be assembled back into the original document. To demonstrate this, here is the output of running getIndexes on the fs.chunks collection. As you can see, the Perl driver automatically created the compound index when I inserted the first document into my database.

> db.fs.chunks.getIndexes()
[
    {
        "v": 1,
        "key": { "_id": 1 },
        "name": "_id_",
        "ns": "digital.fs.chunks"
    },
    {
        "v": 1,
        "unique": true,
        "key": { "files_id": NumberLong("1"), "n": NumberLong("1") },
        "name": "files_id_1_n_1",
        "ns": "digital.fs.chunks"
    }
]

Figure 5 - fs.chunks indexes

How do you use the GridFS API to build an application?

Using the GridFS API to build an application is simple. GridFS does all the hard work. All you need to do is insert your binary documents into MongoDB using the GridFS API. When you insert the documents, you can provide the document metadata, and you can update this metadata during the life of the document in your application using the same update queries you use to update other documents in MongoDB.

Retrieving and deleting documents is simple as well. The GridFS API provides methods for retrieving a document, as a whole or in parts, which can be very useful for “skipping” to the middle of an audio or video file. This function combines the chunk documents to recreate the original file and returns a file object to your application. The delete method removes the documents from the fs.files and fs.chunks collections so that the original document is removed from MongoDB.

That is all there is to it. Use the GridFS API to insert, retrieve, and remove binary documents to and from MongoDB. GridFS does all the complex work, leaving you to focus on the rest of your application.
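To make that concrete, here is a minimal sketch of retrieval and removal using the same legacy Perl driver as the loading program above. The output path is a placeholder, and the method names (find_one, print, remove) follow that driver’s GridFS interface, so check the documentation for the driver version you are using.

use MongoDB;
use IO::File;

# Connect and get the GridFS object for the "digital" database.
my $client = MongoDB::MongoClient->new(host => 'localhost', port => 27017);
my $grid = $client->get_database('digital')->get_gridfs;

# Look up the fs.files document by its metadata.
my $file = $grid->find_one({"filename" => "IMG_0425.MOV"});
die "file not found" unless defined $file;

# Reassemble the chunks and stream the original binary back to disk.
my $out = IO::File->new("/tmp/IMG_0425.MOV", "w");
$file->print($out);
$out->close;

# Remove the file: deletes the fs.files document and its fs.chunks documents.
$grid->remove({"filename" => "IMG_0425.MOV"});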
And if you’re interested in learning more about the architecture of MongoDB, download our guide: Download the Architecture Guide

About Jay Runkel

Jay Runkel is a Solutions Architect Manager at MongoDB and has been working with Fortune 500 companies to architect enterprise solutions using NoSQL document databases for over five years. Before MongoDB, Runkel was a Principal Technologist at MarkLogic where he worked with Financial Service, Medical, and Media organizations to develop operational systems for analytics and custom publishing. Runkel has also recently been a Sales Engineering Manager at Venafi where he helped large financial institutions, retailers, health care and insurance organizations improve security by securing, protecting, and managing their encryption assets. Runkel has also held various positions developing automated underwriting, product information management, and CRM solutions. Runkel has a BS in Applied Mathematics from Carnegie Mellon and a Masters in Computer Science from the University of Michigan.

December 5, 2014

Building MongoDB Applications with Binary Files Using GridFS: Part 1

Use Cases

This is a two-part series that explores GridFS, a powerful feature in MongoDB for managing large files. In this first part, we discuss use cases appropriate for GridFS, and in part 2 we discuss how GridFS works and how to use it in your apps.

In my position at MongoDB, I speak with many teams that are building applications that manage large files (videos, images, PDFs, etc.) along with supporting information that fits naturally into JSON documents. Content management systems are an obvious example of this pattern, where it is necessary to manage binary artifacts as well as all the supporting information about those artifacts (e.g., author, creation date, workflow state, version information, classification tags, etc.).

MongoDB manages data as documents, with a maximum size of 16MB. So what happens when your image, video, or other file exceeds 16MB? GridFS is a specification, implemented by all of the MongoDB drivers, that manages large files and their associated metadata as a group of smaller documents. When you query for a large file, GridFS automatically reassembles the smaller pieces into the original large file.

Content management is just one of the many uses of GridFS. For example, McAfee packages analytics and incremental updates as binary files in MongoDB for efficient delivery to customers. Pearson stores student data in GridFS and leverages MongoDB’s replication to distribute data and keep it synchronized across facilities.

There are also many other systems with this requirement, many of which are in healthcare. Hospitals and other care-providing organizations want to centralize patient record information to provide a single view of the patient, making this information more accessible to doctors, care providers, and patients themselves. The central system improves the quality of patient care (care providers get a more complete and global view of the patient’s health status) as well as the efficiency of providing care. A typical patient health record repository includes general information about the patient (name, address, insurance provider, etc.) along with all the various types of medical records (office visits, blood tests and labs, medical procedures, etc.). Much of this information fits nicely into JSON documents, and the flexibility of MongoDB makes it easy to accommodate variability in content and structure.

Healthcare applications also manage large binary files such as radiology tests, x-ray and MRI images, CAT scans, or even legacy medical records created by scanning paper documents. It is not uncommon for a large hospital or insurance provider to have 50-100 TB of patient data, as much as 90% of which is large binary files. When building an application like this, organizations usually have to decide whether to store the binary data in a separate repository or along with the metadata. With MongoDB, there are a number of compelling reasons for storing the binary data in the same system as the rest of the information.
They include:

The resulting application will have a simpler architecture: one system for all types of data;
Document metadata can be expressed using the rich, flexible document structure, and documents can be retrieved using all the flexibility of MongoDB’s query language;
MongoDB’s high availability (replica sets) and scalability (sharding) infrastructure can be leveraged for binary data as well as the structured data;
One consistent security model for authenticating and authorizing access to the metadata and files;
GridFS doesn’t have the limitations of some filesystems, like the number of documents per directory, or file naming rules.

Fortunately, GridFS makes working with large binary files in MongoDB easy. In part two, I will walk through in detail how you use GridFS to store and manage binary content in MongoDB. In the meantime, if you’re looking to learn more, download our architecture guide: Download the Architecture Guide

About Jay Runkel

Jay Runkel is a Solutions Architect Manager at MongoDB and has been working with Fortune 500 companies to architect enterprise solutions using NoSQL document databases for over five years. Before MongoDB, Runkel was a Principal Technologist at MarkLogic where he worked with Financial Service, Medical, and Media organizations to develop operational systems for analytics and custom publishing. Runkel has also recently been a Sales Engineering Manager at Venafi where he assisted large financial institutions, retailers, health care and insurance organizations to improve security by securing, protecting, and managing their encryption assets. Runkel has also held various positions developing automated underwriting, product information management, and CRM solutions. Runkel has a BS in Applied Mathematics from Carnegie Mellon and a Masters in Computer Science from the University of Michigan.

Read Part 2 >>

December 5, 2014