MongoDB and BigID Deliver Scalable Data Intelligence for Enterprise Data
Data never sleeps. Every time someone clicks on a site or signs up for a newsletter, bytes are created or tracked, more than 2.5 quintillion of them every day. All that data is a key business driver and the foundation for decision-making, but most enterprises aren’t even aware of everything they’re collecting, or where it’s stored. As global protection and privacy regulations evolve and the sheer amount of data grows, they’re struggling to keep up. As consumers, we’d like to think our data is safe. Companies like BigID are helping businesses discover and manage sensitive, personal, and critical data across the entire data ecosystem, and take action for privacy, protection, and perspective.

I had the opportunity to discuss the BigID platform and the data technology behind it with Eyal Sacharov, BigID’s VP Research & Chief Architect, and Oren Ashkenazy, Head of DevOps.

AD: Eyal, Oren, thanks for joining me for this discussion. Can you tell us a bit about yourselves and the genesis of the company?

Eyal Sacharov: I led the R&D team at the founding of BigID. As part of that team, I was responsible for the design and implementation of the product, including the evaluation of relevant technologies. My role has now evolved to BigID's Chief Architect and VP Research, where I oversee the data science team, product integration and customer-driven technology innovation.

Oren Ashkenazy: I’m Head of DevOps at BigID, and have a real passion for development technologies and scalable architecture. The genesis of BigID in 2016 was based on the realization that enterprises were struggling to safeguard the sensitive data they collect and process on individuals, on customers, on employees, on clients, even as they looked to transition to being data-driven organizations.
Founded by Dimitri Sirota and Nimrod Vax, who came at the challenge with decades of experience in enterprise information security and access management, BigID saw the opportunity to provide an accounting-like framework to help organizations better understand, protect, and responsibly utilize the data they collect and process.

AD: For anyone who isn’t familiar with BigID yet, could you describe why you set out to build this and the problem it’s solving?

ES & OA: Our initial intent was to address the challenge that you can’t protect or govern what you can’t find. This challenge became particularly acute when global privacy and protection regulations emerged, especially the EU GDPR. GDPR required enterprises to understand and report on not just what data they have, but whose it is and how it’s being collected and processed. The requirements for enterprises to be more transparent and accountable in how they collect and use personal information have made the demand for technology like BigID’s data intelligence platform more urgent. When we were starting BigID, we focused on extending beyond traditional discovery approaches built for finding and enumerating a specific set of identifiers (with highly variable degrees of accuracy and efficiency). We looked to go one level deeper and measure how connected and related that data is to a specific person: that’s what defines it as personal. Gaining that understanding of what is personal data or information is not only important for operationalizing at scale and automating privacy requirements for data access rights and accountability, but crucial for any enterprise that wants to maintain brand trust. Enterprises are building their future on data. Customers, consumers, partners, and employees need to be assured that they can trust enterprises to not only safeguard their data, but to use it in ways that are consistent with their expectations.
AD: How would you describe the platform and the unique advantages that BigID gives its customers?

ES & OA: We created the first purpose-built platform for the kind of data discovery required for privacy. Built on a microservices architecture that leverages correlation and machine learning for inference and model augmentation, the platform can scale to support the large volumes of data that enterprises are collecting and processing. By doing so, we also fashioned the first platform that was able to capture context around data, for example to whom it belonged, whether there was an associated permission, or who had access to it. Some of these are essential for privacy. But they also play a larger role in providing deeper insights into the what, where, who, why, and when of how data was collected, processed, and shared. As the BigID foundation has evolved to encompass cataloging, classification, and correlation, the platform has delved beyond discovery alone. Today, BigID is the most comprehensive platform in the market to provide organizations insight and intelligence on their most important assets: the sensitive, personal, and critical data that they collect and process.

AD: How did you land on MongoDB to help you solve these challenges?

ES & OA: As we were architecting our platform, we wanted to ensure that we could scale effectively both horizontally and vertically, and to ensure high performance for indexing and lookups to support our correlation and ML-based approach. From a technology perspective, we were looking for a modern database system that would be easy to install and maintain, and that wouldn't require our customers and partners to build in-house expertise in order to deploy. MongoDB appealed to us as an option because of the well-balanced, general-purpose approach.
It has a good set of capabilities that we couldn’t find with other databases, such as the aggregation framework, solid graph capabilities, lookups, and full text indexing. This allows us to use it for various microservices, both as a standard persistence layer as well as a more advanced computational server. For example, we make extensive use of the aggregation framework across the product to support what we describe as our discovery-in-depth capabilities that span correlation, multiple forms of classification, and cataloging. Native support for JSON was compelling in this regard as well. Scaling features, such as replica sets and sharding, also gave us confidence we could address customer requirements. The flexible schema for documents suited our collection requirements. Our intent from the outset was to support the full range of data sources we encountered at the customer level; the flexible schema allows us to adjust collection for each of the data connectors based on the specific set of fields for each data source. From a deployment perspective, we liked the ease of installing. In particular, MongoDB is simple to deploy with Docker, which is important for us as BigID itself is Dockerized. Many of our customers have their own MongoDB servers, which makes our deployment easier. MongoDB also offers a range of extended tools with their commercial products, such as Encrypted Storage and LDAP support. These add value for our customers and ensure that their most stringent requirements are met by the creator of the database itself. We were also confident that the rate of adoption and large developer community would ensure a robust roadmap and support. And our experience with support services and professional consulting in the initial stages was top-notch.

AD: What advice would you give someone who is considering using MongoDB for their next product?

ES & OA: We recommend that you explore and evaluate MongoDB in detail.
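As a starting point for that kind of exploration, the capabilities mentioned in this conversation (full text search, lookups, graph traversal, and aggregation) can be combined in a single aggregation pipeline. The following is a minimal sketch and not BigID's implementation: the collection names (`findings`, `data_sources`), every field name, and the helper `build_correlation_pipeline` are hypothetical. The pipeline is built from plain Python dicts, so the snippet runs without a database.

```python
from typing import Any


def build_correlation_pipeline(search_term: str) -> list[dict[str, Any]]:
    """Build an aggregation pipeline over a hypothetical `findings` collection."""
    return [
        # Full text search. A $match using $text must be the first stage
        # and requires a text index on the collection.
        {"$match": {"$text": {"$search": search_term}}},
        # Join each finding to its (hypothetical) data source document.
        {"$lookup": {
            "from": "data_sources",
            "localField": "source_id",
            "foreignField": "_id",
            "as": "source",
        }},
        # $lookup yields an array; flatten it to one document per match.
        {"$unwind": "$source"},
        # Follow links between related findings, up to two extra hops.
        {"$graphLookup": {
            "from": "findings",
            "startWith": "$related_ids",
            "connectFromField": "related_ids",
            "connectToField": "_id",
            "as": "connected",
            "maxDepth": 2,
        }},
        # Count matching findings per source type.
        {"$group": {"_id": "$source.type", "count": {"$sum": 1}}},
    ]


pipeline = build_correlation_pipeline("passport")
```

With a driver such as PyMongo, a list like this would typically be passed to `Collection.aggregate(pipeline)`; evaluating it against realistic data volumes is the kind of PoC work recommended in this answer.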
MongoDB is a very strong platform with some powerful features and tools that should be mastered. Make sure to conduct a comprehensive PoC for various use cases relevant for your next product. And in any case, invest in designing the data model upfront to ensure that the appropriate considerations and priorities are in scope. Consider factors like how frequently data updates will be performed, concurrency requirements, and the relative distribution of read-intensive vs. write-intensive operations.

AD: Where have you deployed MongoDB? On-premises, in the cloud, via MongoDB Atlas?

ES & OA: We’ve deployed largely on-prem, but a growing number of our customers are opting for MongoDB Atlas. Another advantage of MongoDB is that we can maintain a common persistence layer across on-prem and cloud instances. MongoDB Atlas is also an integral component of the SaaS service we are planning to launch later this year. Having already committed to MongoDB for our core design, Atlas was an easy choice. It stands out not only for the depth of its cloud service platform support, but also for the assurance of version compatibility as new releases roll out. MongoDB’s native monitoring tools are even more powerful for us in optimizing uptime and performance for a service, relative to the value they provide when BigID is deployed in the customers’ data centers or cloud. MongoDB’s real-time monitoring also provides us with live information on the cluster performance, from hottest collection to frequently used resources to the number of open connections and levels of network utilization. As part of database tuning, the Atlas profile tool is a great feature that helps us to find slow queries that need to be optimized and to detect performance bottlenecks.

AD: What tools are you using to deploy and monitor MongoDB?

ES & OA: Monitoring our MongoDB clusters is an essential part of being production-ready and ensuring our systems are operating at optimal health.
We monitor performance spikes to get a solid view of all of the metrics for the underlying hardware and operating system. As for tooling, for deployment we are using the HashiCorp Terraform MongoDB Atlas Provider, which is the official plugin approved and tested by HashiCorp. It enables us to add and create MongoDB Atlas clusters to our Terraform environment and then peer them to the BigID application running on Amazon EKS (Elastic Kubernetes Service).

AD: Thanks for an informative conversation, Eyal and Oren! We appreciate your time and your partnership.

To learn more about our joint solution, or if you’re evaluating MongoDB Atlas, reach out to ISV@MongoDB.com for more information.

About BigID: BigID’s data intelligence platform enables organizations to know their enterprise data and take action for privacy, protection, and perspective. By applying advanced machine learning and deep data insight, BigID transforms data discovery and data intelligence to address data privacy, security, and governance challenges across all types of data, in any language, at petabyte-scale, across the data center and the cloud. BigID has raised $146 million in funding since its founding in 2016 and has been recognized for its data intelligence innovation as a 2019 World Economic Forum Technology Pioneer, named to the 2020 Forbes Cloud 100, a Business Insider 2020 AI Startup to Watch, and an RSA Innovation Sandbox winner. Find out more at http://bigid.com or visit us at http://bigid.com/demo to schedule a demo.

About MongoDB: MongoDB is the leading modern, general purpose database platform, designed to unleash the power of software and data for developers and the applications they build. Headquartered in New York, MongoDB has more than 18,400 customers in over 100 countries. The MongoDB database platform has been downloaded over 110 million times and there have been more than one million MongoDB University registrations.
Nuxeo Achieves 11-Billion-Object Benchmark on AWS in Partnership with MongoDB Atlas
Pushing the Limits to Demonstrate Nuxeo's Ability to Manage the Largest, Most Complex Content Management Requirements

Case Study

Executive Summary

Having already proven that their Content Services Platform could handle nearly unlimited workloads, Nuxeo set out to smash their established benchmark of 1 billion documents. The challenge: load 11 billion objects while maintaining the highest levels of system performance. MongoDB, Nuxeo’s data platform partner, helped make it possible with powerful tools capable of managing hundreds of metadata tags per object across billions of objects. In addition, Nuxeo leveraged the AWS cloud infrastructure and Amazon Elasticsearch Service (both key components of its Nuxeo Cloud offering) to achieve elastic scalability and the highest levels of indexing and search performance.

Customers Expect Content Management Without Limitations

Global companies use Nuxeo to build applications that manage enormous volumes of digital content including scanned images, documents, PDFs, and even rich-media assets like high-resolution photos and video. Nuxeo supports high levels of complexity including customization, layers of security, and sophisticated metadata as well as complicated workflows and business processes. This enables Nuxeo customers to put their content to work solving complex business problems and delivering unique content-enabled solutions. David Woolston, VP Business Development at Nuxeo, describes Nuxeo as a content services platform on steroids. “One thing that drives Nuxeo’s success is our ability to scale up to handle the most complex workloads and largest repositories of documents,” explained Woolston. The secret is their relationship with AWS and MongoDB Atlas, the global cloud database service. As a disruptive player in the content services market, Nuxeo has differentiated itself through technology innovation. The Nuxeo Platform operates on the AWS Cloud, which allows for unparalleled flexibility and scalability.
Nuxeo also uses the managed database service, MongoDB Atlas. Atlas is capable of managing hundreds of metadata tags across literally billions of Nuxeo objects, storing them securely and making them easily digestible and queryable in JSON-like documents. This means the Nuxeo team can focus on building new content services and platform capabilities rather than managing a database.

Challenge Accepted

“The biggest companies in the world are coming to us with their largest workloads and saying, ‘we know you can handle this.’ We believe our technology can scale almost endlessly with MongoDB Atlas and AWS,” said Woolston. “With a 1-billion-object benchmark already completed, we really wanted to push the limits and prove it out with an 11-billion-object benchmark.” The idea was to test Nuxeo from an optimal application and configuration perspective, not to simply throw money at more hardware. “In order to solve our customers’ complex challenges, Nuxeo provides an extremely robust platform. The deployment includes Elasticsearch, MongoDB Atlas, and all the bells and whistles of the Nuxeo Platform itself,” said Joe Quinto, Senior Program Manager for Nuxeo Cloud. “We wanted to push the boundaries of every element: stress as many components as we could. Effectively managing 11 billion documents was our yardstick. The goal was not just to hit the ceiling, but to break through it.”

Dynamic Testing with a Two-Phase Approach

Nuxeo adopted a two-phase approach for its benchmarking exercise. In the first phase, the Nuxeo team used a single Nuxeo repository configured with MongoDB Atlas and Elasticsearch.
The point of the exercise was to test the practical limits of a single-repository approach and also to illustrate the inherent advantages of a NoSQL solution like MongoDB. In the first phase, the team was able to successfully scale a single Nuxeo repository to 3 billion objects with no database sharding, a feat that’s virtually impossible with SQL-based technologies. In the second phase of the project, the Nuxeo team employed a multirepository approach and made use of multiple instances of MongoDB Atlas and the Amazon Elasticsearch Service as well as MongoDB sharding to efficiently scale to over 11 billion objects. For both phases of the benchmark the team used an actual Nuxeo Cloud deployment. This was not a highly orchestrated lab exercise. In both phases of the benchmarking project, the team continuously tested to ensure that the Nuxeo Platform and underlying technologies would scale and perform at a level that would meet enterprise customers’ high expectations. The team monitored metrics to determine if, as the repository grew, users could continue to import new objects and metadata at an extremely high rate of ingestion. Since this was a real-world exercise, Nuxeo employed its default Ingestion Pipeline throughout the benchmarking project, complete with metadata import and full-text indexing. Nuxeo also employed automated testing to address common user activities for content management. This included search (both database queries and full-text searches) and navigation as well as create, read, update, and delete (CRUD) actions. Over three months, as Nuxeo was progressing toward their 11-billion-object benchmark and continuing to load new objects and data into the system, MongoDB Atlas was managing the database. While the test was running, the MongoDB team monitored data access patterns and gave advice on best practices for such a large number of documents. 
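The check described above, whether ingestion keeps pace as the repository grows, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tooling Nuxeo or MongoDB used: the `Sample` type, the `find_tipping_point` helper, the 80% tolerance, and all of the figures are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sample:
    total_objects: int      # repository size when the sample was taken
    objects_per_sec: float  # observed ingestion rate at that size


def find_tipping_point(samples: Sequence[Sample],
                       tolerance: float = 0.8) -> Optional[Sample]:
    """Return the first sample whose ingestion rate drops below
    `tolerance` times the rate of the first sample, or None."""
    if not samples:
        return None
    baseline = samples[0].objects_per_sec
    for sample in samples[1:]:
        if sample.objects_per_sec < tolerance * baseline:
            return sample
    return None


# Made-up measurements, for illustration only.
history = [
    Sample(1_000_000_000, 10_000.0),
    Sample(2_000_000_000, 9_500.0),
    Sample(3_000_000_000, 7_000.0),
]
tipping = find_tipping_point(history)
```

In this made-up series the third sample is flagged, because 7,000 objects per second is below 80% of the 10,000 baseline. A flag like this would be a prompt to investigate capacity before loading further, rather than an automatic scaling decision.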
It was important to identify points where they needed to increase capacity on the platform before moving to higher tiers. Using more than 100 metrics, MongoDB was able to identify tipping points and suggest ways to improve the environment before going on to the next level. “The Nuxeo team was prepared with a solid plan and they let us know what they wanted to achieve,” said Diego Burstyn, Sr. Solutions Architect at MongoDB. “Our role was to correlate that with what we were seeing under the hood.”

Pushing the Limits While Maintaining Response Times and Throughput

As previously mentioned, the key success indicators the Nuxeo team was looking for were response times and throughput. It was important to see that the application was responding as expected, and that enormous amounts of data could be loaded in a reasonable amount of time. A critical outcome for the project was to be able to provide Nuxeo customers with real-world guidance and best practices for scaling up the Nuxeo Platform along with key services, like MongoDB Atlas. The team categorized its learnings in three specific areas: elasticity, steps, and real-life usability.

Bragging Rights and Lessons Learned

The 11-billion-object benchmark test proved Nuxeo’s ability to scale with AWS Cloud elasticity. Flexible scalability allowed the team to expand volume when a capacity limit was reached, or for a temporary need like re-indexing data. By implementing the test with deliberate steps, the team learned about infrastructure and configuration adjustments that needed to be made to maintain optimal performance at extremely high volumes. Testing in an actual environment, with real documents, demonstrated how the Nuxeo Platform can handle enormous document volumes in real-life applications.

“Traditional content management systems can’t scale the way we can,” said Woolston.
“But beyond that, the power of MongoDB Atlas combined with the complete toolset and support we get from AWS allows us to scale in the most efficient and intelligent way possible.”

Hitting the 11-billion-object benchmark was more than a matter of bragging rights. It provided tangible proof of the unique value proposition Nuxeo offers through its partnership with MongoDB Atlas and AWS. This exercise delivered an understanding of how these different components scale and the best practices to do this efficiently. With these learnings and meaningful data, Nuxeo customers are better prepared to scale up, either in Nuxeo Cloud or in their own cloud environments.

Get Started Today

Try MongoDB Atlas on AWS: Redeem promo code NuxeoAtlas100 for $100 in Atlas Credits

Schedule a demo of the Nuxeo Platform