Benjamin Flast


Announcing Atlas Data Federation and Atlas Data Lake

Two years ago, we released the first iteration of Atlas Data Lake. Since then, we’ve helped customers combine data from various storage layers to feed downstream systems. But after years spent studying our customers’ experiences, we realized we hadn’t gone far enough. To truly unleash the genius in all our developers, we needed to add an economical cloud object storage solution with a rich MQL query experience to the world of Atlas. Today, we’re thrilled to announce that our new Atlas Data Federation and Atlas Data Lake offerings do just that.

We now offer two complementary services: Atlas Data Federation (our existing query service, formerly known as Atlas Data Lake) and our new and improved Atlas Data Lake (a fully managed, analytics-oriented storage service). Together, these services (both in preview) provide flexible and versatile options for querying and transforming data across storage services, as well as a MongoDB-native analytic storage solution. With these tools, you can query across multiple clusters, move data into self-managed cloud object storage for consumption by downstream services, query a workload-isolated, inexpensive copy of cluster data, compare your cluster data across different points in time, and much more.

In hearing from our customers about their experiences with Atlas Data Lake, we learned where they have struggled, as well as the features they’ve been looking for us to provide. With this in mind, we decided to rename our current query federation service to Atlas Data Federation to better align with how customers see the service and get value from it. We’ve seen many customers benefit from the flexibility of a federated query engine service, including querying data across multiple clusters, databases, and collections, as well as exporting data to third-party systems. We also saw where our customers were struggling with data lakes.
We heard them ask for a fully managed storage solution so they could achieve all of their analytic goals within Atlas. Specifically, customers wanted scalable storage that would provide high query performance at a low cost. Our new Data Lake provides a high-performance analytic object storage solution, allowing customers to query historical data with no additional formatting or maintenance work needed on their end.

How it works

Atlas Data Federation encompasses our existing Data Lake functionality with several new affordances. It continues to deliver the same power that it always has, with increased performance and efficiency. The new Atlas Data Lake lets you create Data Lake pipelines (based on your Atlas cluster backup schedules) and select the fields on which queries are optimized. The service takes the following steps:

1. On the selected schedule, a copy of your collection is extracted from your Atlas backup with no impact on your cluster.
2. During extraction, we build partition indexes based on the contents of your documents and the fields you’ve selected for optimization. These indexes capture the minimums and maximums (and other statistics) of the records in each partition, so your queries can quickly find the relevant data.
3. Finally, the underlying data lands in an analytics-oriented format inside cloud object storage, which minimizes the data scanned when you execute a query.

Once a pipeline has run and a Data Lake dataset has been created, you can select it as a data source in our new Data Federation query experience. You can either set it as the source for a specific virtual collection in a Federated Database Instance, or you can have your Federated Database Instance generate a collection name for each dataset that your pipeline has created. Amazingly, no part of this process consumes compute resources from your cluster, neither the export nor the querying of datasets.
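To make the role of the partition indexes more concrete, here is a minimal Python sketch of min/max pruning. It is not the service's implementation: the `Partition` class, the `partitions_to_scan` function, and the sample values are all hypothetical, and the sketch assumes a single numeric field and an inclusive range query.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """One chunk of a dataset plus the statistics recorded for it (hypothetical)."""
    name: str
    min_value: int  # smallest value of the optimized field in this partition
    max_value: int  # largest value of the optimized field in this partition


def partitions_to_scan(partitions, low, high):
    """Return the partitions whose [min, max] range overlaps the query range [low, high]."""
    return [p for p in partitions if p.max_value >= low and p.min_value <= high]


# Illustrative partitions keyed on an assumed numeric field (for example, a year).
partitions = [
    Partition("part-0", 2015, 2017),
    Partition("part-1", 2018, 2020),
    Partition("part-2", 2021, 2022),
]

# A query for values between 2019 and 2021 can skip any partition
# whose recorded range cannot contain a match.
selected = partitions_to_scan(partitions, 2019, 2021)
print([p.name for p in selected])
```

Because the decision uses only the recorded statistics, partitions that cannot match are never read, which is the sense in which such indexes reduce the data scanned.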
These datasets provide workload isolation and consistency for long-running analytic queries, as well as a target for ETL jobs using the powerful $out to S3. They also make it easy to compare the state of your data over time.

Advanced though this is, it’s only the beginning of the story. We’re committed to evolving the service, improving performance, adding more sources of data, and building new features. All of this will be based on the feedback that you, the user, give us. We can’t wait to see how you’ll use this powerful new tool and to hear what you’d like to see next.

Try Atlas Data Lake Today

June 7, 2022

Turning MongoDB into a Predictive Database

There’s a growing interest in artificial intelligence (AI) and machine learning (ML) in the business world. The predictive capabilities of ML/AI enable rapid insights from patterns detected at rates faster than manual analysis. Businesses realize that this can lead to increased profits, reduced costs, and accelerated innovation. Although businesses both large and small can benefit from the power of AI, implementing a predictive analytics project can be both complex and time-consuming.

MongoDB, Inc. (NASDAQ: MDB), the leading, modern general purpose database platform, and MindsDB, the open-source machine learning platform that brings automated machine learning to the database, established a technology partnership to advance machine learning innovation. This collaboration aims to streamline predictive capabilities for data science and data engineering teams within organizations so they can solve real-world business challenges.

What is the best approach?

Once you have identified the initial ML projects you’d like to focus on, choosing the right tools and methodologies can help speed up the time it takes to build, train, and optimize models. Model selection and feature engineering can be time-consuming and difficult if you aren’t aware of the specific dimensions the ML model is going to train on. AutoML models excel at testing a wide variety of algorithms to model a hypothesis of interest. Existing state-of-the-art AutoML frameworks provide methods to optimize performance, including adjusting hyperparameters (such as the learning rate or batch size). The MindsDB AutoML framework extends beyond most conventional automated systems of hyperparameter tuning and enables novel upstream automation of data cleaning, data pre-processing, and feature engineering.
To empower users with transparent development, the framework encompasses explainability tools, enables processing for complex data types (NLP, time series, language modeling, and anomaly detection), and gives users customizability by allowing imported models of their choice. MindsDB also generates predictions at the data layer, an additional, significant advancement that accelerates development speed. Generating predictions directly in MongoDB Atlas with MindsDB AI Tables gives you the ability to consume predictions as regular data, query these predictions, and accelerate development speed by simplifying deployment workflows.

Getting started with MindsDB

We suggest starting with MindsDB Cloud, a cloud-managed version of MindsDB. MindsDB is an open source project, so you can alternatively install it on your machine and run it locally. For simplicity, we recommend the Docker installation described below.

Install MindsDB using Docker

First, check that you have Docker installed by running:

    docker run hello-world

To pull the image, run the following command:

    docker pull mindsdb/mindsdb

Then, run the command below to start the container:

    docker run -p 47334:47334 -p 47336:47336 mindsdb/mindsdb

If Docker is not an option, you can follow our docs on how to install MindsDB locally.

Setting up the connection

Connecting MindsDB to MongoDB can be done in two ways: by using MindsDB Studio (the GUI) or by using Mongo clients (the terminal). Currently, the integration works by accessing MongoDB through MindsDB’s MongoDB API as a new data source. More information about connecting to MongoDB can be found here.

Use the Mongo shell to connect to MindsDB’s MongoDB API. Please note that you must have Mongo shell version ≥3.6 to use the MindsDB MongoDB API.

If you are following this tutorial using MindsDB Cloud, you can skip the section about config.json. A default configuration is set up before the MongoDB API starts.
The Mongo host will be the MindsDB Mongo API, which is defined inside the host key. Please find below the config.json example.

    {
      "api": {
        "http": {
          "host": "",
          "port": "47334"
        },
        "mysql": {},
        "mongodb": {
          "host": "",
          "port": "47336",
          "user": "mindsdb",
          "password": "",
          "database": "mindsdb"
        }
      },
      "config_version": "1.4",
      "debug": true,
      "integrations": {},
      "storage_dir": "/mindsdb_storage"
    }

The location of the above config.json file can be found in the first output line of the log when MindsDB Server is started, as the Configuration file value. If you want to change the host or default username, or include a password, you can make the changes there.

To connect to MindsDB via the GUI:

We can use MindsDB Studio to create a connection between MindsDB and MongoDB to access the data we wish to train our model on. Open MindsDB Studio in your favorite web browser. From the menu located on the left, select Database Integration. Then, select ADD DATABASE. In the Connect to Database window:

- Select MongoDB as the Supported Database
- Add the Mongo host, port, username, and password

Now, we have successfully integrated with the MongoDB database. The next step is to use a Mongo client to connect to MindsDB’s Mongo API and train models.

To connect to MindsDB’s Mongo API for a local connection, run:

    mongo --host -u "username" -p "password"

If you are using MindsDB Cloud, you need to use your cloud username and password to connect to the MindsDB Mongo API:

    mongo --host -u "cloud_username" -p "cloud_password"

Then use the mindsdb database and list its collections:

    use mindsdb
    show collections

Training a new Machine Learning Model using MQL

We will leverage the power of MongoDB Query Language (MQL) and MindsDB to train a model. The goal of the model is to predict the strength of a concrete mix, with input columns such as the age, the amount of water used, and the types and quantities of additives used to make the mix stronger.
The dataset can be downloaded from Kaggle and represents a potential business use case in everyday construction projects: optimizing the strength of a mix while minimizing the amount of material used, a goal that saves on costs without neglecting function. You can follow this tutorial with your own data inside MongoDB or simply import the CSV file into a collection called material_strength. You can also get the exported collection for the above data at this URL.

To train a new model, we need to call the insert() function on the mindsdb.predictors collection. Notably, the following information must be included:

    db.predictors.insert({
        'name': 'strength',
        'predict': 'concrete_strength',
        'connection': 'MongoIntegration',
        'select_data_query': {
            'database': 'test_data',
            'collection': 'material_strength',
            'find': {}
        }
    })

The ‘name’ is simply the model name, ‘predict’ is the feature that we aim to predict, and ‘connection’ is the name of the MongoDB connection we created using MindsDB Studio. Inside select_data_query, we provide the name of the database, the collection, and the find() filter used to select the data. Once you enter this information, MindsDB begins the training process. To verify that the training has been completed, you can use the find() command to check the model status inside the mindsdb.predictors collection, e.g.:

    db.predictors.find()

Successful training will return a ‘status’: ‘complete’ notification.

MindsDB Studio provides additional useful information to go beyond predictions and explain the results. The figure below shows feature importances, which are automatically calculated and displayed to reveal which columns of your data likely matter for predictive strength. This information can be obtained from MindsDB Studio by selecting the preview option on your trained model. Moreover, the preview option also provides us with a confusion matrix to help us evaluate the performance of our model by bucketizing true and predicted values.
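The idea of bucketizing true and predicted values can be illustrated with a small Python sketch. This is not MindsDB's implementation: the function names, the equal-width bucketing scheme, the number of buckets, and the sample values are all assumptions made for illustration.

```python
def bucket_index(value, lower, upper, n_buckets):
    """Map a value to one of n_buckets equal-width buckets over [lower, upper]."""
    if value >= upper:
        return n_buckets - 1  # the top edge belongs to the last bucket
    if value <= lower:
        return 0
    width = (upper - lower) / n_buckets
    return int((value - lower) / width)


def bucketized_confusion_matrix(true_values, predicted_values, n_buckets=3):
    """Count (true bucket, predicted bucket) pairs; rows are true, columns are predicted."""
    lower = min(min(true_values), min(predicted_values))
    upper = max(max(true_values), max(predicted_values))
    matrix = [[0] * n_buckets for _ in range(n_buckets)]
    for true_value, predicted_value in zip(true_values, predicted_values):
        row = bucket_index(true_value, lower, upper, n_buckets)
        col = bucket_index(predicted_value, lower, upper, n_buckets)
        matrix[row][col] += 1
    return matrix


# Made-up strength values, used only to exercise the function.
true_strength = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
predicted_strength = [12.0, 18.0, 33.0, 41.0, 35.0, 58.0]

matrix = bucketized_confusion_matrix(true_strength, predicted_strength)
for row in matrix:
    print(row)
```

Counts on the main diagonal are predictions that fall in the same bucket as the true value; counts off the diagonal are predictions that landed in a different bucket.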
As this is a regression task, we stratify the true and predicted values to analyze how effective predictions are at reflecting the underlying data patterns. Strongly performing models have a notable diagonal component: this indicates that a model is successful at detecting the relationship between features and the output distribution. Elements located away from the main diagonal imply less accurate predictions (this could be, for example, due to sparse sampling of data in these output regions).

The next step is to use MQL to get the predictions back from the model collection.

Querying the model

After we have trained a model, we can go ahead and query it. Using MQL, we need to call the find() method on the model collection and provide specific values for which we would like to obtain a prediction. An example would be:

    db.strength.find({'age': 28, 'superPlasticizer': 2.5, 'slag': 1, 'water': 162, 'fineAggregate': 1040})

The model created by MindsDB predicts a value of 17.3, with 90% confidence that the true value lies within the confidence_interval lower and upper bounds. Another important piece of information is the important_missing_information value, where MindsDB suggests that including a value for the cement feature in the find() call will improve the prediction.

This tutorial highlights the steps to create a predictive model inside MongoDB by leveraging MindsDB’s AutoML framework. Using the existing compute configuration, the example above took less than five minutes, without the need for extensive tooling or pipelines in addition to your database. With MindsDB’s predictive capabilities inside MongoDB, developers can now build machine learning models at reduced cost, gain greater insight into model accuracy, and help users make better data-based decisions.

Modernize with MongoDB and MindsDB

MongoDB provides an intuitive process for data management and exploration by simplifying and enriching data.
MindsDB helps turn data into intelligent insights by simplifying modernization into machine learning, AI, and the ongoing spectrum of data science. For a limited time, try MindsDB to connect to MongoDB, train models, and run predictions in the cloud. Simply sign up here. It’s free (final pricing to be announced later this year), and our team is available on Slack and GitHub for feedback and support. Check it out and let us know what predictions you come up with.

November 10, 2021

MongoDB Atlas Online Archive for Data Tiering is now GA

We’re thrilled to announce that MongoDB Atlas Online Archive is now Generally Available. Online Archive allows you to seamlessly tier your data across Atlas clusters and fully managed cloud object stores, while retaining the ability to query it through a single endpoint.

- Reduce storage costs. Set the perfect price-to-performance ratio on your data.
- Automate data tiering. Eliminate the need to manually migrate or delete valuable data.
- Queryable archives. Easily federate queries across live and archival data using a unified connection string.

With Online Archive, you can bring new use cases to MongoDB Atlas that were previously cost-prohibitive, such as high-volume time-series workloads, data archival for auditing purposes, historical log keeping, and more. Manage your entire data lifecycle on MongoDB Atlas without replicating or migrating it across multiple systems.

What is Atlas Online Archive?

Online Archive is a fully managed data tiering solution that allows you to tier data across your "hot" database storage layer and "colder" cloud object storage to maintain queryability while optimizing cost and performance.

Online Archive is a good fit for many different use cases, including:

- Insert-heavy workloads, where data is immutable and has lower performance requirements as it ages
- Historical log keeping and time-series datasets
- Storing valuable data that would otherwise have been deleted using TTL indexes

We’ve received amazing feedback from the community over the past few months while the feature was in beta, and we’re now confident in supporting your production workloads. Our users have put the feature through a variety of use cases in production and development workloads, which has enabled us to make a wide range of improvements.

"Online Archive gives me the flexibility to store all of my data without incurring high costs, and feel safe that I won't lose it. It's the perfect solution."
Ran Landau, CTO, Splitit

Autonomous Archival Management

It's easy to get started with Online Archive, and it requires no ongoing maintenance once it’s been set up. To activate the feature, follow these simple steps:

1. Navigate to the “Online Archive” tab on your cluster card and begin the setup flow.
2. Set an archiving rule by selecting a date field (using dot notation if it’s nested) or by creating a custom filter.
3. Choose commonly queried fields that you want your archival queries to be optimized for, with a few things in mind:
   - Your data will always be “partitioned” by the date field in your archive, but can be partitioned by up to two additional fields as well.
   - The fields that you most commonly query should be towards the top of the list (date can be moved to the top or bottom).
   - Query fields should be chosen carefully, as they cannot be changed after the fact and will have a large impact on query performance.
   - Avoid choosing a field that has unique values, as it will have negative performance impacts for queries that need to scan lots of data.

And you’re done! MongoDB Atlas will automatically move data off of your cluster and into a more cost-effective storage layer that can still be queried with a single connection string that combines cluster and archive data, powered by Atlas Data Lake.

What's Next?

Along with announcing Online Archive as Generally Available, we’re excited to share a few additional product enhancements which should be available in the coming months:

- Private Link support when querying your archive
- Incremental deletes of data from your archive
- Support for BYO key encryption on your archival data
- Improved performance and stability

Try Atlas Online Archive

Online Archive allows you to right-size your Atlas clusters by storing hot data that is regularly accessed in live storage and moving colder data to a cheaper storage tier.
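The split between hot and colder data under a date-based archiving rule can be sketched in a few lines of Python. This only illustrates the idea and is not how Atlas performs archiving: the function names, the `meta.created` field, the 90-day age limit, and the sample documents are all hypothetical.

```python
from datetime import datetime, timedelta


def get_nested(document, dotted_path):
    """Resolve a dot-notation path such as 'meta.created' in a nested dict."""
    value = document
    for key in dotted_path.split("."):
        value = value[key]
    return value


def split_hot_cold(documents, date_field, age_limit_days, now):
    """Split documents into (hot, cold) lists based on the age of their date field."""
    cutoff = now - timedelta(days=age_limit_days)
    hot, cold = [], []
    for doc in documents:
        # Documents older than the cutoff are candidates for the archive tier.
        if get_nested(doc, date_field) < cutoff:
            cold.append(doc)
        else:
            hot.append(doc)
    return hot, cold


# Illustrative documents with a nested date field (assumed schema).
now = datetime(2020, 11, 30)
documents = [
    {"_id": 1, "meta": {"created": datetime(2020, 11, 1)}},
    {"_id": 2, "meta": {"created": datetime(2020, 6, 15)}},
    {"_id": 3, "meta": {"created": datetime(2019, 12, 31)}},
]

hot, cold = split_hot_cold(documents, "meta.created", age_limit_days=90, now=now)
print([d["_id"] for d in hot], [d["_id"] for d in cold])
```

Passing `now` explicitly keeps the sketch deterministic; a real rule would be evaluated by the service on its own schedule.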
Billing for this feature will include the cost to store data in our fully managed cloud object storage and usage-based pricing for querying archive data. We can’t wait to see what new workloads you’ll bring onto MongoDB Atlas with the new flexibility provided by Online Archive!

To get started, sign up for an Atlas account and deploy any dedicated cluster (M10 or higher). Have questions? Check out the documentation or head over to our community forums to get answers from fellow developers. And if we’re missing a feature you’d like to see, please let us know!

Safe Harbor Statement

The development, release, and timing of any features or functionality described for MongoDB products remains at MongoDB's sole discretion. This information is merely intended to outline our general product direction and it should not be relied on in making a purchasing decision nor is this a commitment, promise or legal obligation to deliver any material, code, or functionality. Except as required by law, we undertake no obligation to update any forward-looking statements to reflect events or circumstances after the date of such statements.

November 30, 2020