5 Key Questions for App-Driven Analytics
Note: This article originally appeared in The New Stack.

Data that powers applications and data that powers analytics typically live in separate domains of the data estate, mainly because they serve different strategic purposes: applications engage customers, while analytics deliver insight. The two classes of workloads have different requirements, such as read and write access patterns, concurrency, and latency, so organizations typically deploy purpose-built databases and duplicate data between them to satisfy the unique requirements of each use case.

As distinct as these systems are, they're also highly interdependent in today's digital economy. Application data is fed into analytics platforms, where it's combined and enriched with other operational and historical data; supplemented with business intelligence (BI), machine learning (ML), and predictive analytics; and sometimes fed back to applications to deliver richer experiences. Picture, for example, an ecommerce system that segments users by demographic data and past purchases, then serves relevant recommendations when they next visit the website.

The process of moving data between the two types of systems is here to stay. But, today, that's not enough. The current digital economy, with the seamless user experiences customers have come to expect, requires that applications also become smarter, autonomously taking intelligent actions in real time on our behalf. Along with smarter apps, businesses want insights faster so they know what is happening "in the moment."

To meet these demands, we can no longer rely only on copying data out of our operational systems into centralized analytics stores. Moving data takes time and creates too much separation between application events and analytical actions. Instead, analytics processing must be "shifted left" to the source of the data: to the applications themselves.
We call this shift application-driven analytics. And it's a shift that both developers and analytics teams need to be ready to embrace.

Find out why the MongoDB Atlas developer data platform was recently named a Leader in the Forrester Wave: Translytical Data Platforms, Q4 2022.

Defining required capabilities

Embracing the shift is one thing; having the capabilities to implement it is another. In this article, we break down the capabilities required to implement application-driven analytics into the following five critical questions for developers:

How do developers access the tools they need to build sophisticated analytics queries directly into their application code?
How do developers make sense of voluminous streams of time series data?
How do developers create intelligent applications that automatically react to events in real time?
How do developers combine live application data in hot database storage with aged data in cooler cloud storage to make predictions?
How can developers bring analytics into applications without compromising performance?

To take a deeper dive into app-driven analytics, including specific requirements for developers compared with data analysts and real-world success stories, download our white paper: Application-Driven Analytics.

1. How do developers access the tools they need to build sophisticated analytics queries directly into their application code?

To unlock the latent power of application data that exists across the data estate, developers rely on the ability to perform CRUD operations, sophisticated aggregations, and data transformations. The primary tool for delivering these capabilities is an API that allows them to query data any way they need, from simple lookups to building more sophisticated data processing pipelines. Developers need that API implemented as an extension of their preferred programming language so they can remain "in the zone" as they work through problems in a flow state.
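To make this concrete, here is a minimal sketch of such a query, expressed as a MongoDB aggregation pipeline in application code. The collection, field names, and business logic are hypothetical illustrations, not taken from any specific application:

```javascript
// Hypothetical analytics query built directly into application code:
// top customers by spend over the last 30 days.
const thirtyDaysAgo = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);

const pipeline = [
  // Filter to recent, completed orders
  { $match: { status: "complete", orderDate: { $gte: thirtyDaysAgo } } },
  // Aggregate total spend per customer
  { $group: { _id: "$customerId", totalSpend: { $sum: "$amount" } } },
  // Rank and keep the top ten
  { $sort: { totalSpend: -1 } },
  { $limit: 10 },
];

// With a driver, this would run as something like:
//   const top = await db.collection("orders").aggregate(pipeline).toArray();
console.log(pipeline.map((stage) => Object.keys(stage)[0]).join(" -> "));
// prints "$match -> $group -> $sort -> $limit"
```

Because the pipeline is ordinary application-language data structures, it can be composed, tested, and versioned alongside the rest of the codebase.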
Alongside a powerful API, developers need a versatile query engine and indexing that returns results as efficiently as possible. Without indexing, the database engine must scan every record to find a match; with indexing, it can find relevant results faster and with less overhead.

Once developers start interacting with the database systematically, they need tools that give them visibility into query performance so they can tune and optimize. Powerful tools like MongoDB Compass let users monitor real-time server and database metrics as well as visualize performance issues. Additionally, a column-oriented representation of data can power in-app visualizations and analytics on top of transactional data. Other MongoDB Atlas tools can make performance recommendations, such as index and schema suggestions, to further streamline database queries.

2. How do you make sense of voluminous streams of time series data?

Time series data is typical of many modern applications. Internet of Things (IoT) sensor data, financial trades, clickstreams, and logs all enable businesses to surface valuable insights. To help, MongoDB developed the highly optimized time series collection type and clustered indexes. Built on a highly compressible columnar storage format, time series collections can reduce storage and I/O overhead by as much as 70%.

Developers need the ability to query and analyze this data across rolling time windows while filling any gaps in incoming data. They also need a way to visualize this data in real time to understand complex trends. Another key requirement is a mechanism that automates the management of the time series data lifecycle. As data ages, it should be moved out of hot storage to avoid congestion on live systems; however, that data still has value, especially in aggregated form for historical analysis.
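As a sketch of how this fits together, a time series collection can be created from mongosh with options like the ones below. The collection name, fields, and retention period are illustrative assumptions, not part of the article:

```javascript
// Hypothetical IoT example: one document per sensor reading.
// timeField is required; metaField and granularity tune the internal
// columnar bucketing; expireAfterSeconds ages raw data out of hot storage.
db.createCollection("sensorReadings", {
  timeseries: {
    timeField: "ts",        // timestamp of each measurement
    metaField: "sensorId",  // identifies the data source
    granularity: "minutes", // expected interval between measurements
  },
  expireAfterSeconds: 60 * 60 * 24 * 90, // remove raw readings after ~90 days
});
```

Note that expireAfterSeconds simply deletes aged measurements; keeping them available more cheaply is a tiering problem, discussed next.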
So, organizations need a systematic way of tiering that data into low-cost object storage while maintaining the ability to access and query it for the insights it can surface.

3. How do you create intelligent applications that automatically react to events in real time?

Modern applications must be able to continuously analyze data in real time as they react to live events. Dynamic pricing in a ride-hailing service, recalculating delivery times in a logistics app due to changing traffic conditions, triggering a service call when a factory machine component starts to fail, or initiating a trade when stock markets move: these are just a few examples of in-app analytics that require continuous, real-time data analysis.

MongoDB Atlas has a host of capabilities to support these requirements. With change streams, for example, all database changes are published to an API, notifying subscribing applications when an event matches predefined criteria. Atlas triggers and functions can then automatically execute application code in response to the event, allowing you to build reactive, real-time, in-app analytics.

4. How do you combine live application data in hot database storage with aged data in cooler cloud storage to make predictions?

Data is increasingly distributed across different applications, microservices, and even cloud providers. Some of that data, such as newly ingested time series measurements or orders made in your ecommerce store, resides in hot database storage. Other data sets consist of older data that might be archived in lower-cost cloud object storage. Organizations must be able to query, blend, and analyze fresh data coming in from microservices and IoT devices along with cooler data, APIs, and third-party data sources residing in object stores, in ways not possible with regular databases.
The ability to bring all key data assets together is critical for understanding trends and making predictions, whether that's handled by a human or as part of a machine learning process.

5. How can you bring analytics into your applications without compromising their performance?

Live, customer-facing applications need to serve many concurrent users while ensuring low, predictable latency, and do it consistently at scale. Any slowdown degrades the customer experience and drives customers toward competitors. In one frequently cited study, Amazon found that just 100 milliseconds of extra load time cost it 1% in sales. So, it's critical that analytics queries on live data don't affect app performance.

A distributed architecture can help you enforce isolation between the transactional and analytical sides of an application within a single database cluster. You can also use sophisticated replication techniques to move data to systems that are fully isolated but look like a single system to the app.

Next steps to app-driven analytics

As application-driven analytics becomes pervasive, the MongoDB Atlas developer data platform unifies the core data services needed to make smarter apps and improved business visibility a reality. Atlas does this by seamlessly bridging the traditional divide between transactional and analytical workloads in an elegant, integrated data architecture. With MongoDB Atlas, you get a single platform managing a common data set for both developers and analysts. With its flexible document data model and unified query interface, the Atlas platform minimizes data movement and duplication and eliminates data silos and architectural complexity, while unlocking analytics faster and at lower cost on live operational data. It does all this while meeting the most demanding requirements for resilience, scale, and data privacy.
For more information about how to implement app-driven analytics and how the MongoDB developer data platform gives you the tools needed to succeed, download our white paper, Application-Driven Analytics.
5 Steps to Replacing Elasticsearch and Solr with Atlas Search
What do a global auto manufacturer, a multinational media and entertainment company, and a challenger bank have in common? They have all made the switch from Elasticsearch to MongoDB Atlas Search to simplify their technology stack and ship application search faster. But what problems were they solving, and how did they migrate? Our new 5-step guide takes you through why they switched and how they did it.

The need for application search

Type almost anything into a search bar on sites like Google, Amazon, and Netflix and you are instantly presented with relevant results. Whether you make a typo or enter a partial search term, the search engine figures out what you are looking for. Results are returned conveniently sorted by relevance and are easy to navigate with features like highlighting, filters, and counts. Everyone now expects these same fast, intuitive search experiences in every application they use, whether at home or at work. Creating these experiences is hard, however, with the burden falling on the developers and ops teams who have to build and run the underlying systems.

The pain of building application search

MongoDB has always focused on accelerating and simplifying how developers build with data for any class of application. From our very earliest releases, we saw developers needing to expose the application data stored in their database to search and information discovery. For simple use cases, where it was enough to just match text in a field, developers could use the basic text search operators and index built into the MongoDB database. However, these lacked the much more sophisticated speed and relevance tuning features offered by dedicated search engines, typically built on top of Apache Lucene. As a result, many developers ended up bolting an external search engine such as Elasticsearch or Apache Solr onto their database. Elasticsearch and Solr were (and remain) popular and proven.
However, as Figure 1 shows, they introduced a huge amount of complexity to the application stack, reducing developer velocity while driving up risk, complexity, and cost.

Figure 1: The pain of bolting a search engine onto your database

Working with the MongoDB community, our product designers and engineers ideated on ways to make building application search easier for developers, without compromising on the key features they needed. The result is MongoDB Atlas Search.

What is Atlas Search and why switch to it?

Atlas Search embeds a fully managed Apache Lucene search index directly alongside the database and automatically synchronizes data between them. By integrating the database, search engine, and sync pipeline into a single, fully managed platform, you compress three systems into one and simplify your technology stack. Engineering teams and application owners have reported development velocity improvements of 30% to 50% after adopting Atlas Search. This is because they get to:

Eliminate the synchronization tax. Data is automatically and dynamically synced from the Atlas database to the Atlas Search indexes. Teams avoid having to stand up and manage their own sync mechanism, write custom transformation logic, or remap search indexes as their database schema evolves. They escape the 10% of engineering cycles typically lost to manually recovering from sync failures, investing that time to innovate for their users instead. (1)

Ship new features faster. They work with a single, unified API across both database and search operations, simplifying query development. There is no more context switching between multiple query languages, and with a single driver, build dependencies are streamlined so teams release faster. They can test queries and preview results with interactive tools to fine-tune performance and scoring before deploying them directly into application code.

Remove operational heavy lifting.
The fully managed Atlas platform automates provisioning, replication, patching, upgrades, scaling, security, and disaster recovery while providing deep performance visibility into both database and search. By working with a single system, teams avoid an exponential increase in the number of components they need to design, test, secure, monitor, and maintain.

Figure 2: Dramatic architectural simplification with integrated database, sync, and search in MongoDB Atlas

5 steps to make the switch to Atlas Search

The benefits Atlas Search provides have led engineering teams across all industry sectors and geographies to make the switch from bolt-on search engines. Through the experience gained working with these teams, we have put together a repeatable 5-step methodology for replacing Elasticsearch and Solr. The guide steps you through how to:

1. Qualify target workloads for Atlas Search.
2. Migrate your indexes to Atlas Search.
3. Migrate your queries to Atlas Search.
4. Validate and relevance-tune your Atlas Search queries and indexes.
5. Size and deploy your Atlas Search infrastructure.

Figure 3: 5-step methodology for replacing Elasticsearch and Solr with Atlas Search

The guide wraps up with examples of customers that have made the switch and guidance on how to get started with Atlas Search.

What's next?

You can get started today by downloading the 5-step guide to replacing Elasticsearch and Solr with Atlas Search. The guide is designed to help you plan and execute your migration project. MongoDB's Professional Services team is also available as a trusted delivery partner; we can help you with any of the steps in the methodology or throughout your entire journey to Atlas Search. If you want to dig deeper into Atlas Search, spin it up at no cost on the Atlas Free Tier.
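To illustrate the unified API point above: full-text search and ordinary database operations live in one aggregation pipeline, issued through the regular driver. The index name, fields, and query below are hypothetical:

```javascript
// One pipeline, one driver: a Lucene-backed full-text search stage
// followed by ordinary database stages.
const searchPipeline = [
  {
    $search: {
      index: "default", // Atlas Search index (assumed name)
      text: {
        query: "waterproof jacket",
        path: ["name", "description"],
        fuzzy: {}, // tolerate typos
      },
    },
  },
  { $match: { inStock: true } }, // plain database filter
  { $limit: 20 },
  { $project: { name: 1, price: 1, score: { $meta: "searchScore" } } },
];

// Runs as: db.collection("products").aggregate(searchPipeline)
console.log(Object.keys(searchPipeline[0])[0]); // prints "$search"
```

There is no second query language to learn and no separate search client to keep in sync with the driver.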
You can follow along with reference materials and tutorials in the Atlas Search documentation using our sample data sets, or load your own data for experimentation within your own sandbox. Welcome to a world where application search is, at last, simplified!

Download the 5-step Guide Now!

1. Based on interviews with engineering teams that have replaced bolt-on search engines and the associated sync mechanism.
Scale Out Without Fear or Friction: Live Resharding in MongoDB
Live resharding was one of the key enhancements delivered in our MongoDB 5.0 Major Release. With live resharding, you can change the shard key for your collection on demand as your application evolves, with no database downtime or complex data migrations. In this blog post, we will cover:

Product developments that have made sharding more flexible
What you had to do before MongoDB 5.0 to reshard your collection, and how that changed with 5.0 live resharding
Guidance on the performance and operational considerations of live resharding

Before that, we should discuss why you should shard at all, and the importance of selecting a good shard key, even though live resharding gives you the flexibility to change it at any time. Go ahead and skip the next couple of sections if you are already familiar with sharding!

Why Shard your Database?

Sharding enables you to distribute your data across multiple nodes. You do that to:

Scale out horizontally: accommodate growing data or application load by sharding once your application starts to approach the capacity limits of a single replica set.
Enforce data locality: for example, pinning data to shards provisioned in specific regions so that the database delivers low-latency local access and maintains data sovereignty for regulatory compliance.

Sharding is the best way to scale databases, and MongoDB was developed to support sharding natively. Sharding in MongoDB is transparent to your applications, and it's elastic, so you can add and remove shards at any time.

The Importance of Selecting a Good Shard Key

MongoDB's native sharding has always been highly flexible: you can select any field or combination of fields in your documents to shard on. This means you can select the shard key best suited to your application's requirements. The choice of shard key is important, as it defines how data is distributed across the available shards.
Ideally, you want to select a shard key that:

Gives you low-latency, high-throughput reads and writes by matching data distribution to your application's data access patterns.
Evenly distributes data across the cluster so you avoid any one shard taking most of the load (i.e., a "hot shard").
Provides linear scalability as you add more shards in the future.

While you have the flexibility to select any field(s) of your documents as your shard key, it was previously difficult to change the shard key later on. This made some developers fearful of sharding. If you chose a shard key that didn't work well, or if application requirements changed and the shard key no longer suited the new access patterns, the impact on performance could be significant. At the time, no other mainstream distributed database allowed users to change shard keys, but we wanted to give users this ability.

Making Shard Keys More Flexible

Over the past few releases, MongoDB engineers have been working to give users more sharding flexibility:

MongoDB 4.2 introduced the ability to modify a shard key's value. Under the covers, the modification process uses a distributed, multi-document ACID transaction to change the placement of a document in a sharded cluster. This is useful when you want to rehome a document to a different geographic region or age data out to a slower storage tier.
MongoDB 4.4 went further with the ability to refine the shard key for a collection by adding a suffix to an existing key.

Both of these enhancements made sharding more flexible, but they didn't help if you needed to reshard your collection using an entirely different shard key.

Manual Resharding: Before MongoDB 5.0

Resharding a collection was a manual, complex process that could only be achieved through one of two approaches:

Dumping the entire collection and then reloading it into a new collection with the new shard key.
This is an offline process, so your application is down until data reloading is complete. For example, it could take several days to dump and reload a 10 TB+ collection on a three-shard cluster.

Undergoing a custom migration that involved writing all the data from the old cluster to a new cluster with the resharded collection. You had to write the query routing and migration logic, and then constantly check the migration progress to ensure all data had been successfully migrated. Custom migrations entail less downtime, but they come with a lot of overhead. They are highly complex, labor-intensive, risky, and expensive (as you have to run two clusters side by side). It took one MongoDB user three months to complete the live migration of 10 billion documents.

How this Changed with MongoDB 5.0: Live Resharding

We made manual resharding a thing of the past with MongoDB 5.0. With 5.0, you just run the reshardCollection command from the shell, point at the database and collection you want to reshard, specify the new shard key, and let MongoDB take care of the rest:

db.adminCommand({
  reshardCollection: "<database>.<collection>",
  key: <shardkey>
})

When you invoke the reshardCollection command, MongoDB clones your existing collection into a new collection with the new shard key, then starts applying all new oplog updates from the existing collection to the new collection. This enables the database to keep pace with incoming application writes. When all oplog updates have been applied, MongoDB automatically cuts over to the new collection and removes the old collection in the background.

Let's walk through an example where live resharding would really help a user:

The user has an orders collection. In the past, they needed to scale out and chose the order_id field as the shard key. Now they realize that they have to regularly query each customer's orders to quickly display order history. This query does not use the order_id field.
To return the results of such a query, all shards need to provide data. This is called a scatter-gather query. It would be more performant and scalable to localize each customer's orders to a shard, avoiding scatter-gather, cross-shard queries. They realize that the optimal shard key would be "customer_id: 1, order_id: 1" rather than just order_id. With MongoDB 5.0's live resharding, the user can just run the reshard command, and MongoDB will reshard the orders collection for them using the new shard key, without bringing down the database or the application.

Watch our short Live Resharding talk from MongoDB.Live 2021 to see a demo with this exact example. Not only can you change the field(s) for a shard key, you can also revisit your sharding strategy, changing between range, hash, and zones.

Live Resharding: Performance and Operational Considerations

Even with the flexibility live resharding gives you, it is still important to properly evaluate your shard key selection. Our documentation provides guidance to help you make the best choice of shard key. Of course, live resharding makes it much easier to change that key should your original choice prove suboptimal, or should your application change in a way you hadn't anticipated. If you find yourself in this situation, it is essential to plan for live resharding.

What do you need to be thinking about before resharding?

Make sure you have sufficient storage capacity available on each node of your cluster. Since MongoDB temporarily clones your existing collection, spare storage capacity needs to be at least 1.2x the size of the collection you are going to reshard. This is because we need 20% more storage to buffer writes that occur during the resharding process.
For example, if the size of the collection you want to reshard is 2 TB compressed, you should have at least 2.4 TB of free storage in the cluster before starting the resharding operation.

While the resharding process is efficient, it still consumes additional compute and I/O resources. You should therefore make sure you are not consistently running the database at or close to peak system utilization. If you see CPU usage in excess of 80% or I/O usage above 50%, scale your cluster up to larger instance sizes before resharding. Once resharding is done, it's fine to scale back down to regular instance sizes.

Before you run resharding, you should update any queries that reference the existing shard key to include both the current shard key and the new shard key. When resharding is complete, you can remove the old shard key from your queries. Review the resharding requirements documentation for a full rundown of the key factors to consider before resharding your collection.

What should you expect during resharding?

The total duration of the resharding process depends on the number of shards, the size of your collection, and the write load to your collection. For a constant data size, the more shards, the shorter the resharding duration. In a simple proof of concept on MongoDB Atlas, a 100 GB collection took just 2 hours 45 minutes to reshard on a 4-shard cluster and 5 hours 30 minutes on a 2-shard cluster. The process scales linearly with data size and number of shards, so a 1 TB collection will take 10 times longer to reshard than a 100 GB collection. Of course, your mileage may vary based on the read/write ratio of your application along with the speed and quality of your underlying hardware infrastructure.

While resharding is in flight, you should expect the following impacts to application performance:

The latency and throughput of reads against the collection being resharded will be unaffected.
Even though we are writing to the existing collection and then applying oplog entries to both its replicas and to the cloned collection, you should expect negligible impact to write latency given enough spare CPU. If your cluster is CPU-bound, expect a latency increase of 5 to 10% during the cloning phase and 20 to 50% during the applying phase (*). As long as you meet the aforementioned capacity requirements, the latency and throughput of operations to other collections in the database won't be impacted.

(*) Note: If you notice unacceptable write latencies to your collection, we recommend you stop resharding, increase your shard instance sizes, and then run resharding again. The abort and cleanup of the cloned collection are instantaneous.

If your application has periods of lower traffic, reshard your collection during those times if possible.

All of your existing isolation, consistency, and durability guarantees are honored while resharding is running. The process itself is resilient and crash-safe, so if any shard undergoes a replica set election, there is no impact to resharding; it simply resumes when the new primary has been elected.

You can monitor resharding progress with the $currentOp pipeline stage, which reports an estimate of the time remaining to complete the resharding operation. You can also abort the resharding process at any time.

What happens after resharding is complete?

When resharding is done and the two collections are in sync, MongoDB automatically cuts over to the new collection and removes the old collection for you, reclaiming your storage and returning latency back to normal. By default, cutover takes up to two seconds, during which the collection does not accept writes, so your application will see a short spike in write latency. Any writes that time out are automatically retried by our drivers, so exceptions are not surfaced to your users.
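As a sketch of the $currentOp monitoring mentioned above, the pipeline below surfaces in-flight resharding operations and the server's time estimate. The projected field names follow the documented resharding output, but treat the shape as an assumption to verify against your server version:

```javascript
// Pipeline to surface in-flight resharding operations and their
// estimated time remaining. Run it with:
//   db.getSiblingDB("admin").aggregate(reshardingProgress)
const reshardingProgress = [
  { $currentOp: { allUsers: true, localOps: false } },
  // Keep only operations that originated from a reshardCollection command
  { $match: { type: "op", "originatingCommand.reshardCollection": { $exists: true } } },
  // ns is the namespace being resharded; the estimate is in seconds
  { $project: { ns: 1, remainingOperationTimeEstimatedSecs: 1 } },
];

console.log(reshardingProgress.length); // prints 3
```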
The cutover interval is tunable. Resharding completes sooner if you raise the interval above the two-second default, with the trade-off that the period of write unavailability will be longer. If you dial it down below two seconds, the window of write unavailability will be shorter, but the resharding process will take longer to complete, and the odds of the window ever being short enough to cut over will diminish. You can block writes early to force resharding to complete by issuing the commitReshardCollection command. This is useful if the current estimated time to complete the resharding operation is an acceptable duration for your collection to block writes.

What you Get with Live Resharding

Live resharding is available wherever you run MongoDB, whether that's in our fully managed Atlas data platform in the cloud, with Enterprise Advanced, or with the Community Edition of MongoDB. To recap how you benefit from live resharding:

Evolve with your apps with simplicity and resilience: As your applications evolve or as you need to improve on the original choice of shard key, a single command kicks off resharding. The process is automated, resilient, and non-disruptive to your application.

Compress weeks or months into minutes or hours: Live resharding is fully automated, so you eliminate disruptive and lengthy manual data migrations. To make scaling out even easier, you can evaluate the effectiveness of different shard keys in dev/test environments before committing your choice to production. Even then, you can change your shard key whenever you want.

Extend flexibility and agility across every layer of your application stack: You have seen how MongoDB's flexible document data model instantly adapts as you add new features to your app. With live resharding, you get that same flexibility when you shard. New features or new requirements? Simply reshard as and when you need to.
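Putting the orders example from earlier into concrete form, the full command document might look like this (the database name is an illustrative assumption; the field names are the ones used in the example above):

```javascript
// Command document for resharding the orders collection onto a compound
// shard key; submit it with db.adminCommand(reshardCmd) against a mongos.
const reshardCmd = {
  reshardCollection: "ecommerce.orders",  // assumed <database>.<collection>
  key: { customer_id: 1, order_id: 1 },   // new compound shard key
};

console.log(JSON.stringify(reshardCmd.key));
// prints {"customer_id":1,"order_id":1}
```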
Summary

Live resharding is a huge step forward in the state of distributed systems, and it is just the start of an exciting, fast-paced MongoDB roadmap that will make sharding even easier, more flexible, and more automated. If you want to dig deeper, take a look at the Live Resharding session recording from our developer conference and review the resharding documentation. To learn more about MongoDB 5.0 and our new Rapid Releases, download our guide to what's new in MongoDB.
Data and the European Landscape: 3 Trends for 2022
The past two years have brought massive changes for IT leaders: large and complex cloud migrations; unprecedented numbers of people suddenly working, shopping, and learning from home; and a burst in demand for digital-first experiences. Like everyone else, we are hoping that 2022 isn't so disruptive (fingers crossed!), but our customer conversations in Europe lead us to believe the new year will bring new business priorities. We're already noticing changes in conversations around vendor lock-in, thanks to the Digital Markets Act; a new enthusiasm for combining operational and analytical data to drive new insights faster; and a more strategic embrace of sustainability. Here's how we see these trends playing out in 2022.

Digital Markets Act draws new attention to cloud vendor lock-in in Europe

We've heard plenty about the European Commission's Digital Markets Act, which, in the name of ensuring fair and open digital markets, would place new restrictions on companies deemed to be digital "gatekeepers" in the region. That discussion will be nothing compared to the vigorous debate we expect once the EU begins the very tricky political business of determining exactly which companies fall under the act. If the EU sets the bar for revenues, users, and market size high enough, it's possible that the regulation will end up affecting only Facebook, Amazon, Google, Apple, and Microsoft. But a European group representing 2,500 CIOs and almost 700 organisations is now pushing to have the regulation encompass more software companies. Their main concern centers on "distorted competition" in cloud infrastructure services and a worry that companies are being locked into a single cloud vendor. One trend pushing back on cloud vendor lock-in that will likely accelerate in 2022 is the embrace of multi-cloud strategies.
We expect to see more organisations in the region pursuing multi-cloud environments as a means to improve business continuity and agility while being able to access best-of-breed services from each cloud provider. As we have always said: "It's fine to date your cloud provider, but don't ever marry them."

The convergence of operational and analytical data

The processing of operational and analytical data is almost always handled in different data systems, each tuned to its use case and managed by separate teams. But because that data lives in separate places, it's almost impossible for organisations to generate insights and automate actions in real time, against live data. We believe 2022 is the year we'll see a critical mass of companies in the region make significant progress toward a convergence of their operational and analytical data. We're already starting to see some of the principles of microservices in operational applications, such as domain ownership, applied to analytics as well. We're hearing about this from many of our customers locally, who are looking at MongoDB as a data platform that allows them to query both real-time and historical data using a unified platform and a single query API. As a result, the applications they build become more intelligent and contextual for their users, while avoiding dependencies on centralized analytics teams that would otherwise slow down how quickly new, data-driven experiences can be released.

Sustainability drives local strategic IT choice

Technology always has some environmental cost. Sometimes that cost is obvious, such as the energy needs and emissions associated with Bitcoin mining. More often, though, the environmental costs are well hidden. The European Green Deal commits the European Union to reducing emissions by 55% by 2030, with a focus on sustainable industry. With the U.N.
Climate Change Conference (COP26) recently completed in Glasgow, and coming off the hottest European summer on record, climate issues have become top of mind. That means our customers are increasingly looking to make their technical operations more sustainable — including in their choice of cloud provider and data centers. According to research from IDC, more than 20% of CxOs say that sustainability is now important in selecting a strategic cloud service provider, and some 29% of CxOs are including sustainability in their RFPs for cloud services. Most interesting of all, 26% say they are willing to switch to providers with better sustainability credentials. Historically, it’s been difficult to make a switch like that. That’s part of the reason we built MongoDB Atlas: to give our customers the flexibility to run in any region, with any of the three largest cloud providers, to make it easy to switch between them, and even to run a single database cluster across them. Publicly available information about the footprint of individual regions and even single data centers will make it simpler for companies to make informed decisions. Already, at least one cloud platform has added indicators to the regions with the lowest carbon footprint. So while we hope 2022 will not be as disruptive as the years gone by, it will still bring seminal changes to our industry. These changes will also prompt organisations toward more agile, cohesive, and sustainable data platform strategies as they seek to gain competitive advantage and exceed customer expectations.

Source: IDC, European Customers Engage Services Providers at All Stages of Their Cloud Journey, IDC Survey Spotlight, Doc #EUR248484021, Dec 2021
100x Faster Facets and Counts with MongoDB Atlas Search: Public Preview
Today we’ve released one of the most powerful features of Atlas Search in public preview, ready for your evaluation: lightning-fast facets and counts over large data sets. Faceted search allows users to filter and quickly navigate search results by categories and see the total number of results per category for at-a-glance statistics. With the new facet operator, facet and count operations are pushed down into Atlas Search’s embedded Lucene index and processed locally — taking advantage of 20+ years of Lucene optimizations — before the faceted result set is returned to the application. This means that facet-heavy workloads such as ecommerce product catalogs and content libraries now run up to 100x faster.

The power of facets and counts in full-text search

Faceting is a popular search and analytics capability that allows an application to group information into related categories by applying filters to query results. Users can narrow their search results by simply selecting a facet value as a filter criterion. They can intuitively explore complex data sets, with fast and convenient navigation to quickly drill into the data that interests them most. A common use of faceting is navigating product catalogs. With travel starting to reopen, let's take a travel site as an example. Using faceted search, the site can present vacation options by destination region, trip type (e.g., hotel, self-catering, beach, ski, city break), price band, season, and more, enabling users to quickly navigate to the category that is most relevant to them. Facets also enable fast results counting. Extending our travel site example, business analysts can use facets to quickly compare sales statistics by counting the number of trips sold by region and season. Prior to the new facet operator, the only way Atlas Search could facet and count data was to return the entire result set to MongoDB’s internal $facet aggregation pipeline stage.
While that was acceptable for smaller data sets, it became slow when the result set exceeded tens of thousands of documents. This all changes now that operations are pushed down to Atlas Search’s embedded and optimized Lucene library in a single $search pipeline stage. In our internal testing on a collection of one million documents, the new Atlas Search faceting improves performance by 100x.

How to use faceting in Atlas Search

Our new Atlas Search facets tutorial will help you get started. It describes how to create an index with a facet definition on string, date, and numeric fields in the sample_mflix.movies collection, and then run an Atlas Search query against those fields for results grouped by values for the string field and by ranges for the date and numeric fields, including the count for each of those groups. To use Atlas Search facets, you must be running your Atlas cluster on MongoDB 4.4.11 and above or MongoDB 5.0.4 and above. These clusters must be running on the M10 tier or higher. Facets and counts currently work on non-sharded collections; support for sharded collections is scheduled for next year.

The power of Atlas Search in a unified data platform in the cloud

MongoDB Atlas Search makes it easy to build fast, relevant full-text search on top of your data in the cloud. A couple of API calls or clicks in the Atlas UI, and you instantly expose your data to sophisticated search experiences that boost engagement and improve satisfaction with your applications. Your data is immediately more discoverable, usable, and valuable. By embedding the Apache Lucene library directly alongside your database, data is automatically synchronized with the search index; developers get to work with a single API; there is no separate system to run and pay for; and everything is fully managed for you on any cloud you choose.

Figure 1: Rather than bolting on a separate search engine to your database, Atlas Search provides a fully integrated platform.
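As a concrete illustration of the faceting flow described above, the sketch below builds a facet-enabled index definition and a faceted query as plain Python dictionaries, loosely following the sample_mflix.movies example from the facets tutorial. The field names, year boundaries, and facet names here are illustrative; actually running the query would require pymongo, an Atlas cluster on a supported tier, and the index created in advance.

```python
# Index definition: facet-enabled mappings for a string, a date, and a
# numeric field (facet field types as documented for Atlas Search).
index_definition = {
    "mappings": {
        "dynamic": False,
        "fields": {
            "genres": [{"type": "stringFacet"}],
            "released": [{"type": "dateFacet"}],
            "year": [{"type": "numberFacet"}],
        },
    }
}

# A $searchMeta stage that filters by a year range, then groups the
# results by genre and by year buckets, returning a count per bucket.
facet_stage = {
    "$searchMeta": {
        "facet": {
            "operator": {"range": {"path": "year", "gte": 1990, "lte": 2020}},
            "facets": {
                "genresFacet": {"type": "string", "path": "genres"},
                "yearFacet": {
                    "type": "number",
                    "path": "year",
                    "boundaries": [1990, 2000, 2010, 2020],
                },
            },
        }
    }
}

# Against a live cluster, the query would run as:
#   db.movies.aggregate([facet_stage])
```

Because the facet work happens inside the Lucene index, only the bucket names and counts travel back to the application, rather than the full result set that the old $facet-stage approach required.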
Atlas Search provides the power you get with Lucene — including faceted navigation, autocomplete, fuzzy search, built-in analyzers, highlighting, custom scoring, and synonyms — combined with the productivity you get from MongoDB. As a result, developers can ship search applications and new features 30%+ faster.

Next steps

You can try out Atlas Search with the public preview of lightning-fast facets and counts today: If you are new to Atlas Search, simply spin up a cluster (M10 tier or above) and get started with our Atlas Search facets tutorial. If you are already using Atlas Search on M10 tiers and above, update your indexes to use the facet field mapping, and then start querying! Your data remains searchable while it is being re-indexed. If you want to dig into the use cases you can serve with Atlas Search — along with users who are already taking advantage of it today — download our new Atlas Search whitepaper.

Safe Harbor

The development, release, and timing of any features or functionality described for our products remains at our sole discretion. This information is merely intended to outline our general product direction; it should not be relied on in making a purchasing decision, nor is it a commitment, promise, or legal obligation to deliver any material, code, or functionality.
The Top 5 Data Trends Driving Competitive Advantage Today — and Tomorrow
The latest market research from Cloudflight, a leading analyst firm based in Europe, identified 12 major technology trends for the current year. The research found a radical shift in cloud adoption and an acceleration toward digital as people, society, the economy, and the environment all responded to the coronavirus pandemic. During a recent webinar, Dr. Stefan Ried (Cloudflight) and Mat Keep (MongoDB) shared key industry insights and explored five of the most prevalent trends in detail. The session found that, as the need for technological innovation grows, a company’s competitive advantage is increasingly tied to how well it can build software around its most important asset: data. In this post, Dr. Stefan Ried breaks down those five key trends and analyzes how businesses can drive data innovation to stay ahead of the field. Mat Keep then offers practical next steps to get started as data is increasingly managed in the cloud.

Trend 1: Data becomes the differentiator — even beyond software

Initially, many startups disrupted the incumbents in their industries with innovation based on software. All the while, non-digital-native enterprises caught up. Now data has become more important than software algorithms. Here’s an example: Imagine a traditional automotive company. The business could purchase components and software from a supplier to implement autonomous driving in its cars, but without enough learning data from every region, its cars wouldn’t drive reliably. In this case — and many more — the automotive firm cannot just buy a software competitive advantage off the shelf. Instead, it must build that advantage, and build it using data. That’s why data is quickly becoming the differentiator in all industries and why delivering a modern customer experience is increasingly reliant on this underlying infrastructure.

Software Stack Eruption (Source: Cloudflight 2020)

The above image illustrates just how the tech stack is evolving.
Data quality is quickly becoming the outstanding differentiator compared to software algorithms. That’s why we consider the access, ownership, and quality of data to be the mountain of innovation in this decade and beyond.

Trend 2: Europe embraces various cloud scenarios

Cloud adoption in Europe has always been behind that of the United States. One reason is obvious: data sovereignty and compliance concerns. It would be an intriguing thought experiment to reflect on how U.S. public cloud adoption would have developed over the past 10 years if the only strong and innovative providers were European or even Chinese companies. Europe, however, is now at an important inflection point. Global hyperscalers have finally addressed these national privacy issues, and platform service providers, including MongoDB with MongoDB Atlas, have significantly increased support for these privacy requirements with technical features such as client-side encryption and operational SLAs. This achievement enables enterprises and even public government agencies across Europe to embrace all three basic types of cloud scenarios: lift and shift, moving existing legacy workloads without any change to new IaaS landscapes in the cloud; modernization, decomposing existing application stacks into cloud-native services such as a DBaaS, since modernized workloads can leverage public cloud PaaS stacks much better than monolithic legacy stacks; and new development of cloud-native applications, building modern applications with less code and more orchestration of many PaaS services.

Trend 3: Hybrid cloud is the dominant cloud choice, and multicloud will come next

Nearly 50 percent of participants in our recent webinar said hybrid cloud is their current major deployment model. These organizations use different public and private clouds for different workloads. Just 20 percent of the attendees still restrict activities to a single cloud provider.
Another equally sized group claimed the exact opposite approach, with multicloud environments where a single workload may use a mixture of cloud sources or may be deployed across different providers to reach multiple regions. See below.

Embracing the Cloud webinar poll results (June 2021)

The increasing adoption of these real multicloud scenarios is yet another major trend we will see for many years. Less experienced customers may be wary of the complexity of using multiple cloud providers, but independent vendors offer management of a full service domain across multiple providers. MongoDB Atlas offers this platform across AWS, Azure, and GCP, and paves the road for real multicloud adoption and innovation.

Trend 4: Cloud-native is taking off with innovative enterprises

In many client engagements, Cloudflight sees a strong correlation between new business models driven by digital products and cloud-native architectures. Real innovation happens when differentiated business logic meets the orchestration of a PaaS offering. That’s why car OEMs do not deploy packaged asset-lifecycle-management systems but instead develop their own digital twins for the emerging fleet of millions of digitized vehicles. These PaaS architectures follow an API-first, service-oriented paradigm leveraging a lot of open-source software. Most of this open-source software is commercially managed by hyperscalers and their partner vendors to make it accessible and highly available without deep knowledge of the service itself. The approach enables very fast productive operation of new digital products. If compliance requires it, however, customers may operate the same open-source services on their own again. Once your product becomes extremely successful and you’re dealing with data volumes far beyond one petabyte, you may also reconsider self-operation for cost reasons.
This is because there is no operational lock-in to a specific service provider, and you may become an “operations pro” on your own.

Trend 5: Digital twins become cloud drivers in many industries

Many people still associate the term “cloud computing” with virtualized compute-and-storage services. Yet cloud computing is far more. The PaaS level has become increasingly attractive with prepackaged cloud-native services. It has been on the market for many years, but its perception and adoption — especially in Europe — are still behind its potential. Based on today’s PaaS services, cloud providers and their partners are already extending their offerings to higher levels. The space of digital twins, along with AI, is a clear opportunity here. There are offerings for each of the three major areas of digital twins: In modern automated manufacturing (Industry 4.0), production twins are created when a product is ordered, and they make production-relevant information (such as individual configurations) available to all manufacturing steps along the supply chain. Once the final product is delivered, the requirements for interactions and data models change significantly for these post-production-lifecycle twins. Finally, simulation twins are a smart approach to testing machine learning applications. Take, for example, the autonomous driving challenge: Instead of testing the ongoing iterations of driving “knowledge” on a physical vehicle, running virtual simulation twins is much preferred and safer than experiments in real traffic situations.

Production, post-production, and simulation twins (Source: Cloudflight)

Beyond manufacturing and automotive, there are many verticals in which digital twins make sense. Health care is a clear and obvious example in which real-life experiments may not always be the best approach. Success here depends mostly on the cooperation between technology vendors and the industry-specific digital twin ecosystems.
In Summary

Each of the five trends discussed centers on, or closely relates to, cloud-native data management. A traditional database may be able to run for specific purposes on cloud infrastructure, but only a modern cloud-native data platform is able to serve both the migration of legacy applications and the development of multiple new cloud-native applications.

Next Steps

Where and how can companies get started on a path to using data as a driver of competitive advantage? Mat Keep, Senior Director of Products at MongoDB, takes us through how best to embrace this journey. As companies move to embrace the cloud, they face an important choice. Do they lift and shift, moving existing applications to run in the cloud on the same architecture and technologies used on premises? Or do they transform (modernize), rearchitecting applications to take advantage of new cloud-native capabilities such as elasticity, redundancy, global distribution, and managed services? Lift and shift is often seen as an easier and more predictable path, since it reuses a lot of the technology you use on premises — albeit now running in the cloud — presenting both the lowest business risk and the least internal cultural and organizational resistance. It can be the right path in some circumstances, but we need to define what those circumstances are. For your most critical applications, lift and shift rarely helps you move the business forward. You will be unable to fully exploit new cloud-native capabilities that enable your business to build, test, and adapt faster. The reality we all face is that every application is different, so there is no simple or single “right” answer to choosing lift and shift versus transformation. In some cases, lift and shift can be the right first step, helping your teams gain familiarity with operating in the cloud before embarking on a fuller transformation as they see everything the cloud has to offer.
This can also be a risk, however, if your teams believe they are done with the cloud journey and don’t then progress beyond that first step. To help business and technology leaders make the right decisions as they embrace the cloud, we have created an Executive Perspective for Lift and Shift Versus Transformation. The perspective presents best practices that can help prioritize your efforts and mobilize your teams. Drawing on our work with more than 25,000 customers, including more than 50 percent of the Fortune 100, the paper shares the evaluation frameworks we have built, which can be used to navigate the right path for your business, along with the cultural transformations your teams need to make along the way.

Embracing the Cloud: Assessment Framework

Toyota Material Handling in Northern Europe has recently undergone its own cloud journey. As the team evolved its offerings for Industry 4.0, it worked with MongoDB as part of its transformation. Moving from monolithic applications and aging relational databases running on premises to microservices deployed on a multicloud platform, the company completed its migration in just four months. It reduced costs by more than 60 percent while delivering an agile, resilient platform to power its smart factory business growth. To learn more about cloud trends and the role of data in your cloud journey, tune in to the on-demand webinar replay.
Client-Side Field Level Encryption is now on Azure and Google Cloud
We’re excited to announce expanded key management support for Client-Side Field Level Encryption (FLE). Initially released last year with Amazon’s Key Management Service (KMS), native support for Azure Key Vault and Google Cloud KMS is now available in beta, with support for our C#/.NET, Java, and Python drivers. More drivers will be added in the coming months. Client-Side FLE provides among the strongest levels of data privacy available today. By expanding our native KMS support, it is even easier for organizations to further enhance the privacy and security of sensitive and regulated workloads, with multi-cloud support across ~80 geographic regions.

My databases are already encrypted. What can I do with Client-Side Field Level Encryption?

What makes Client-Side FLE different from other database encryption approaches is that the process is totally separated from the database server. Encryption and decryption are instead handled exclusively within the MongoDB drivers in the client, before sensitive data leaves the application and hits the network. As a result, all encrypted fields sent to the MongoDB server — whether resident in memory, in system logs, at rest in storage, or in backups — are rendered as ciphertext. Neither the server nor any administrators managing the database or cloud infrastructure staff have access to the encryption keys. Unless an attacker has a compromised DBA password, privileged network access, AND a stolen client encryption key, the data remains protected, securing it against sophisticated exploits.
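The security model described above — the server only ever handles ciphertext, and the keys never leave the client — can be illustrated with a deliberately simplified sketch. The toy keystream cipher below (SHA-512 in counter mode, standard library only) is emphatically not MongoDB's AEAD AES-256-CBC scheme; it exists only to show why data encrypted client-side stays unreadable to anyone who holds the stored bytes but not the key.

```python
import hashlib
import secrets

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    # Derive a keystream by hashing key + counter with SHA-512, then XOR
    # with the data. Illustrative only -- do not use as a real cipher.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha512(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ k for d, k in zip(data, stream))

toy_decrypt = toy_encrypt  # XOR keystream ciphers are their own inverse

# The client generates a key; the "server" only ever sees the ciphertext.
client_key = secrets.token_bytes(32)
stored_on_server = toy_encrypt(client_key, b"alice@example.com")

# Only the key holder can recover the plaintext.
assert toy_decrypt(client_key, stored_on_server) == b"alice@example.com"
```

Whatever an attacker exfiltrates from memory, logs, storage, or backups, they hold only `stored_on_server`; without `client_key`, the field stays opaque.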
MongoDB’s Client-Side FLE complements existing network and storage encryption to protect the most highly classified, sensitive fields of your records without developers needing to write additional, highly complex encryption logic application-side, without compromising your ability to query encrypted data, and without significantly impacting database performance. By securing data with Client-Side FLE, you can move to managed services in the cloud with greater confidence. This is because the database only works with encrypted fields, and you control the encryption keys, rather than having the database provider manage them for you. This additional layer of security enforces an even finer-grained separation of duties between those who use the database and those who administer and manage it. You can also more easily comply with “right to erasure” mandates in modern privacy legislation such as the GDPR and the CCPA. When a user invokes their right to erasure, you simply destroy the associated field encryption key, and the user’s Personally Identifiable Information (PII) is rendered unreadable and irrecoverable to anyone.

Client-Side FLE Implementation

Client-Side FLE is highly flexible. You can selectively encrypt individual fields within a document, multiple fields within the document, or the entire document. Each field can be optionally secured with its own key and decrypted seamlessly on the client. To see how Client-Side FLE works, take a look at this handy animation. Client-Side FLE uses standard NIST FIPS-certified encryption primitives, including AES at the 256-bit security level in authenticated CBC mode: the AEAD AES-256-CBC encryption algorithm with an HMAC-SHA-512 MAC. Data encryption keys are protected by strong symmetric encryption with standard wrapping Key Encryption Keys, which can be natively integrated with external key management services backed by FIPS 140-2 validated Hardware Security Modules (HSMs).
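To make the configuration shape concrete, the sketch below builds a KMS provider configuration covering the newly supported Azure Key Vault and Google Cloud KMS options, plus a JSON schema map marking a single field for client-side encryption. Every credential, namespace, and key id here is a placeholder; in a real deployment these structures would be handed to the driver's automatic-encryption options alongside a key vault collection, as detailed in the key management documentation.

```python
# KMS provider configuration -- every credential value is a placeholder.
kms_providers = {
    "azure": {
        "tenantId": "<azure-tenant-id>",
        "clientId": "<azure-client-id>",
        "clientSecret": "<azure-client-secret>",
    },
    "gcp": {
        "email": "<service-account-email>",
        "privateKey": "<base64-encoded-private-key>",
    },
}

# Schema map: encrypt the (illustrative) "ssn" field in the hypothetical
# hr.employees collection. The deterministic algorithm keeps the field
# queryable by equality; "<key-uuid>" stands in for a data encryption key
# id stored in the key vault collection.
schema_map = {
    "hr.employees": {
        "bsonType": "object",
        "properties": {
            "ssn": {
                "encrypt": {
                    "bsonType": "string",
                    "algorithm": "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic",
                    "keyId": ["<key-uuid>"],
                }
            }
        },
    }
}
```

Because each field can reference its own `keyId`, a per-user data key makes the right-to-erasure flow above a one-step operation: delete that key from the key vault and every field it protected becomes permanently unreadable.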
Initially this was with Amazon’s KMS; now Azure Key Vault and Google Cloud KMS are supported in beta. Alternatively, you can use remote secure web services to consume an external key, or a secrets manager such as HashiCorp Vault.

Getting Started

To learn more, download our Guide to Client-Side FLE. The guide provides an overview of how Client-Side FLE is implemented, use cases for it, and how it complements existing encryption mechanisms to protect your most sensitive data. Review the Client-Side FLE key management documentation for more details on how to configure your chosen KMS.

Safe Harbor

The development, release, and timing of any features or functionality described for our products remains at our sole discretion. This information is merely intended to outline our general product direction; it should not be relied on in making a purchasing decision, nor is it a commitment, promise, or legal obligation to deliver any material, code, or functionality.