Last month we announced the Free Tier in MongoDB Atlas and the availability of MongoMirror, a downloadable utility that makes migrating live MongoDB replica sets easy. We’ve since incorporated the same live migration functionality into MongoDB Atlas, our database as a service, and updated the platform with some new features.
Live migration allows us to automate the import of any pre-existing MongoDB 3.0+ replica set into MongoDB Atlas. It works by performing an initial sync between a source database and a target database in Atlas, and then tailing the source oplog to keep the database in Atlas in sync with live updates from your application.

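Conceptually, the process is a snapshot copy followed by ordered replay of oplog entries. The toy Python sketch below illustrates that idea only; it is not the MongoMirror implementation, and the operation shapes (`"i"`, `"u"`, `"d"` tuples) are deliberately simplified:

```python
# Illustrative sketch of initial sync + oplog replay (simplified; not MongoMirror).
# An "oplog entry" here is reduced to an (op, doc_id, fields) tuple.

def initial_sync(source_docs):
    """Copy every document from the source snapshot to the target."""
    return {doc["_id"]: dict(doc) for doc in source_docs}

def apply_oplog(target, oplog):
    """Replay insert/update/delete entries captured after the snapshot."""
    for op, doc_id, fields in oplog:
        if op == "i":                       # insert
            target[doc_id] = {"_id": doc_id, **fields}
        elif op == "u":                     # update ($set-style merge)
            target[doc_id].update(fields)
        elif op == "d":                     # delete
            target.pop(doc_id, None)
    return target

target = initial_sync([{"_id": 1, "name": "ada"}])
apply_oplog(target, [("u", 1, {"name": "ada lovelace"}), ("i", 2, {"name": "alan"})])
```

Because the oplog is replayed continuously, writes committed to the source after the initial sync still reach the Atlas target before cutover.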
To use live migration, we’ll first need to create a dedicated (M10 or above) Atlas cluster.
In the example below, we’ve created a cluster with M30-size instances in US-EAST-1. You can see from the screenshot below that we now have a new option, “Migrate Data to this Cluster”.
We’ll need the following for the migration:
- Hostname:port of the primary of our source replica set
- Username/password (if authentication is enabled)
- CA file (if running MongoDB with SSL enabled and the --sslCAFile option)
Once we have this information, we’re ready to select “I’M READY TO MIGRATE”.
We’ll then be brought to a new window that asks for the location of our source replica set and credentials. The “VALIDATE” button at the bottom of the window checks that MongoDB Atlas can connect to and tail our source oplog, so that new operations are captured.
Clicking the “START MIGRATION” button kicks it off. We’ll get an email notification when it’s time to cut over to MongoDB Atlas; all we have to do to finalize the migration process is modify the database connection string in our app.
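The cutover itself usually amounts to swapping a single configuration value in the application. A minimal sketch of that change (both URIs below are hypothetical placeholders, not real endpoints):

```python
# Cutting over to Atlas typically means changing one connection string.
# Both URIs are hypothetical placeholders; <password> is left unfilled.
OLD_URI = "mongodb://db1.internal.example:27017,db2.internal.example:27017/appdb?replicaSet=rs0"
NEW_URI = "mongodb+srv://appuser:<password>@example-cluster.mongodb.net/appdb"

def get_mongo_uri(use_atlas: bool) -> str:
    """Return the connection string the app should use."""
    return NEW_URI if use_atlas else OLD_URI
```

Keeping both URIs behind a single accessor like this makes the cutover (and any rollback) a one-line flip rather than a hunt through the codebase.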
For a more detailed walkthrough of Live Migration in MongoDB Atlas, click here.
Now in (almost) all AWS regions
MongoDB Atlas is now available in most AWS regions, which means users around the world can leverage the database service with minimal geographical latency. As a reminder, MongoDB Atlas’ support for VPC Peering means teams can easily and securely connect their own VPCs to a new Atlas cluster in one of the new regions.
| AWS Regions (Americas) | AWS Regions (APAC) | AWS Regions (EMEA) |
| --- | --- | --- |
| us-east-1 (N. Virginia) | ap-southeast-2 (Sydney) | eu-west-1 (Dublin) |
Queryable backup snapshots
Also new in MongoDB Atlas is the ability to query backup snapshots and restore data at the document level in minutes. No longer do we need to restore an entire cluster when all we really need is a small subset of our data.
Within the Atlas UI, we can now select a backup snapshot and click on the new “Query” option.
We can opt to use a downloadable Backup Tunnel, which handles the security for connecting to our backup instance, and use the mongo shell or a MongoDB driver to connect to our backup via the tunnel.
For a walkthrough of using queryable backups in Atlas, click here.
Explore your databases from the Atlas UI
Finally, the new Data Explorer in MongoDB Atlas allows us to run queries, view metadata about our databases and collections, and see index usage statistics without resorting to the mongo shell. We can access the Data Explorer for dedicated Atlas clusters by selecting the new “DATA” button associated with our databases.
This will bring us to a new view, shown below.
10-Step Methodology to Creating a Single View of your Business: Part 2
Welcome to Part 2 of our 3-part single view blog series. In Part 1, we reviewed the business drivers behind single view projects, introduced a proven and repeatable 10-step methodology for creating the single view, and discussed the initial "Discovery" stage of the project. In today’s Part 2, we dive deeper into the methodology by looking at the development and deployment phases of the project. In Part 3, we wrap up with the single view maturity model, look at the database capabilities required to support the single view, and present a selection of case studies. If you want to get started right now, download the complete 10-Step Methodology to Creating a Single View whitepaper.

Develop & Deploy Phases of the Single View Methodology

As a reminder, Figure 1 shows the 10-step methodology to creating the single view.

**Figure 1**: Single view methodology

In Part 1 of the blog series, we covered the Discover phase. We’ll now dive into the Develop and Deploy phases, starting with Step 4 of the methodology.

Step 4: Appoint Data Stewards

A data steward is appointed for each data source identified in Step 3 of the methodology. The steward needs to command a deep understanding of the source database, with specific knowledge of:

- The schema that stores the source data, and an understanding of which tables store the required attributes, and in what format;
- The clients and applications that generate the source data;
- The clients and applications that consume the source data.

The data steward should also be able to define how the required data can be extracted from the source database to meet the single view requirements (e.g., frequency of data transfer), without impacting either the current producing or consuming applications.

Step 5: Develop the Single View Data Model

With an understanding of both what data is needed and how it will be queried by the consuming applications, the development team can begin the process of designing the single view schema.
Identify Common Attributes

An important consideration at this stage is to define the common attributes that must appear in every record. Using our customer single view as an example, every customer document should contain a unique customer identifier such as a customer number or email address. This is the field that the consuming applications will use by default to query the single view, and it would be indexed as the record’s primary key.

Analyzing common query access patterns will also identify the secondary indexes that need to be created for each record. For example, we may regularly query customers against location and the products or services they have purchased. Creating secondary indexes on these attributes is necessary to ensure such queries are serviced efficiently.

There may also be many fields that vary from record to record. For example, some customers may have multiple telephone numbers for home, office, and cell phones, while others have only a cell number. Some customers may have social media accounts against which we can track interests and measure sentiment, while other customers have no social presence. MongoDB’s flexible document model with dynamic schema is a huge advantage as we develop our single view. Each record can vary in structure, so we can avoid defining every possible field in the initial schema design, while using MongoDB document validation to enforce specific rules on mandatory fields.

Define Canonical Field Formats

The developers also need to define the canonical format of field names and data attributes. For example, a customer phone number may be stored as a string data type in one source system and an integer in another, so the development team needs to define the standardized format that will be used for the single view schema. We can use approaches such as MongoDB’s native document validation to create and enforce rules governing the presence of mandatory fields and data types.
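One way to enforce the mandatory fields while leaving the rest of each document free to vary is MongoDB’s `$jsonSchema` document validation. The sketch below shows what such a validator might look like; the field names (`customer_id`, `email`, `phones`) are illustrative assumptions, not taken from the whitepaper:

```python
# A $jsonSchema validator enforcing only the mandatory single view fields;
# any other attributes may vary per record. Field names are illustrative.
customer_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["customer_id", "email"],
        "properties": {
            "customer_id": {"bsonType": "string"},
            "email": {"bsonType": "string", "pattern": "^.+@.+$"},
            "phones": {"bsonType": "array", "items": {"bsonType": "string"}},
        },
    }
}

# Against a live deployment this would be applied when creating the
# collection, e.g. db.create_collection("customers", validator=customer_validator)
```

Secondary indexes on the frequently queried attributes (location, purchased products, and so on) would be created separately with `create_index`, driven by the access-pattern analysis above.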
Define MongoDB Schema

With a data model that allows rich data structures, such as arrays and sub-documents, to be embedded within a single localized document, all required data for a business entity can be accessed in a single call to MongoDB. This design results in dramatically improved query latency and throughput compared to JOINing records from multiple relational database tables.

Data modeling is an extensive topic, with design decisions ultimately affecting query performance, access patterns, ACID guarantees, data growth, and lifecycle management. The MongoDB data model design documentation provides a good introduction to the factors that need to be considered. In addition, the MongoDB Development Rapid Start service offers custom consulting and training to assist customers in schema design for their specific projects.

Step 6: Data Loading & Standardization

With our data model defined, we are ready to start loading source data into our single view system. Note that the load step is only concerned with capturing the required data and transforming it into a standardized record format. In Step 7, which follows, we will create the single view data set by merging multiple source records from the load step.

There will be two distinct phases of the data load:

- Initial load. Typically a one-time operation that extracts all required attributes from the source databases, loading them into the single view system for subsequent merging;
- Delta load. An ongoing operation that propagates updates committed to the source databases into the single view.

To maintain synchronization between the source and single view systems, it is important that the delta load starts immediately after the initial load. For all phases of the data load, developers should ensure they capture data in full fidelity, so as not to lose data types. If files are being emitted, write them out in a JSON format, as this will simplify data interchange between different databases.
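Preserving full type fidelity in emitted JSON can be done by wrapping typed values in MongoDB Extended JSON’s representations, such as `$date` for timestamps. The sketch below hand-rolls the conversion for illustration; in practice a driver’s Extended JSON helpers (for example, `bson.json_util` in the Python driver) would be used:

```python
import json
from datetime import datetime, timezone

def to_extended_json(record: dict) -> str:
    """Serialize a record, wrapping datetimes in Extended JSON's $date form
    so the type survives the round trip through a text file."""
    def convert(value):
        if isinstance(value, datetime):
            return {"$date": value.astimezone(timezone.utc).isoformat()}
        return value
    return json.dumps({k: convert(v) for k, v in record.items()})

# "customer_id" and "last_seen" are illustrative field names.
line = to_extended_json({"customer_id": "C42",
                         "last_seen": datetime(2017, 1, 5, tzinfo=timezone.utc)})
```

A plain `str()` of the datetime would still parse as JSON, but the consumer could no longer distinguish it from an ordinary string; the `$date` wrapper keeps that distinction explicit.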
If possible, use MongoDB Extended JSON, as this allows temporal and binary data formats to be preserved.

Initial Load

Several approaches can be used to execute the initial load. An off-the-shelf ETL (Extract, Transform & Load) tool can be used to migrate the required data from the source systems, mapping the attributes and transforming data types into the single view target schema. Alternatively, custom data loaders can be developed; these are typically used when complex merging between multiple records is required. MongoDB consulting engineers can advise on which approach and tools are most suitable in your context. If, after the initial load, the development team discovers that additional refinements are needed to the transformation logic, the single view data should be erased and the initial load repeated.

Delta Load

The appropriate tool for delta loads is governed by the frequency required for propagating updates from source systems into the single view. In some cases, batch loads taken at regular intervals, for example every 24 hours, may suffice. In this scenario, the ETL or custom loaders used for the initial load would generally be suitable. If data volumes are low, it may be practical to reload the entire data set from the source system. A more common approach is to reload data only for those customers where a timestamp recorded in the source system indicates a change. More sophisticated approaches track individual attributes and reload only those changed values, even keeping track of the last-modification time in the single-view schema.

If the single view needs to be maintained in near real time with the source databases, a message queue is more appropriate. An increasingly common design pattern we have observed is using Apache Kafka to stream updates into the single view schema as they are committed to the source system. Download our Data Streaming with Kafka and MongoDB white paper to learn more about this approach.
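The timestamp-based delta reload described above can be sketched in a few lines of Python. The `updated_at` field and checkpoint mechanism are illustrative assumptions about the source system, not part of the methodology itself:

```python
from datetime import datetime

# Records as they might arrive from a source system; "updated_at" is an
# illustrative last-modified field used to drive the delta load.
source_records = [
    {"customer_id": "C1", "updated_at": datetime(2017, 1, 1)},
    {"customer_id": "C2", "updated_at": datetime(2017, 1, 9)},
    {"customer_id": "C3", "updated_at": datetime(2017, 1, 15)},
]

def delta_load(records, last_checkpoint):
    """Select only the records modified since the previous load ran."""
    return [r for r in records if r["updated_at"] > last_checkpoint]

changed = delta_load(source_records, last_checkpoint=datetime(2017, 1, 5))
# Only the records touched after the checkpoint need reloading.
```

After each run, the checkpoint would be advanced to the time the load started, so the next delta picks up exactly where this one left off.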
Note that in this initial phase of the single view project, we are concerned with moving data from the source systems to the single view. Updates to source data will continue to be committed directly to the source systems, and propagated from there to the single view. We have seen customers in more mature phases of single view projects write to the single view and then propagate updates back to the source systems, which serve as systems of record. That process is beyond the scope of this initial phase.

Standardization

In a perfect world, an entity’s data would be consistently represented across multiple systems. In the real world, however, this is rarely the case. Instead, the same attributes are often captured differently in each system, described by different field names and stored as different data types. To better understand the challenges, take the example below. We are attempting to build a single view of our frequent travelers, with data currently strewn across our hotel, flight, and car reservation systems. Each system uses different field names and data types to represent the same customer information.

**Figure 2**: The name’s Bond….oh hang on, it might be Bind

During the load phase, we need to transform the data into the standardized formats defined during the design of the single view data model. This standardized format makes it much simpler to query, compare, and sort our data.

Step 7: Match, Merge, and Reconcile

Even after standardizing divergent field names and data types during the data load, inconsistencies can often exist in the data itself. Accurately merging disparate records is one of the toughest challenges in building a single view. The good news is that MongoDB has developed tools that can assist in this process. Looking again at our frequent traveler example above, we can see that the customer names are slightly different.
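A standardization step of this kind can be expressed as a per-source field mapping plus type coercion. The sketch below is a hedged illustration; the source field names are invented, since the actual reservation system schemas are only shown in the figure:

```python
# Map each source system's field names onto the canonical single view schema,
# coercing types (e.g., phone numbers to strings) along the way.
# All field names below are illustrative.
FIELD_MAPS = {
    "hotel":  {"FirstName": "first_name", "Surname": "last_name", "Tel": "phone"},
    "flight": {"fname": "first_name", "lname": "last_name", "phone_no": "phone"},
}

def standardize(record: dict, source: str) -> dict:
    """Rename a source record's fields to the canonical schema."""
    out = {}
    for src_field, canonical in FIELD_MAPS[source].items():
        if src_field in record:
            value = record[src_field]
            if canonical == "phone":     # canonical phone format: string
                value = str(value)
            out[canonical] = value
    return out

std = standardize({"fname": "James", "lname": "Bond", "phone_no": 2125550007}, "flight")
```

Adding a new source system then only requires registering one more mapping, rather than touching the load logic itself.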
These variances in the first and last names would result in storing three separate customer records, rather than aggregating the data into our desired single view. It is not practical, or necessary, to compare each customer record to every other customer record loaded from the source systems. Instead, we can use a grouping function to cluster records with similar matching attributes. This should be executed as an iterative process:

1. Start by matching records against unique, authoritative attributes, for example by email address or credit card number;
2. Group remaining records by matching combinations of attributes, for example a last_name, date_of_birth, and zip_code triple;
3. Finally, apply fuzzy matching algorithms such as Levenshtein distance, cosine similarity, and locality-sensitive hashing to catch data errors in attributes such as names.

Using the process above, a confidence factor can be applied to each match. For those matches where confidence is high, i.e., 95%+, the records can be automatically merged and written to the authoritative single view. Note that the actual confidence factor can vary by use case, and is often dependent on the data quality of the source systems. For matches below the desired threshold, the merged record with its conflicting attributes can be written to a pending single view record for manual intervention. Inspecting the record to resolve conflicts might be performed by the data steward, or by the application user when they access the record.

**Figure 3**: Using MongoDB tools to move from disparate source data to merged and reconciled single view data sets

To assist customers, MongoDB consulting engineers have developed tools to facilitate the process above:

- A Workers framework that parallelizes document-to-document comparisons. The framework allows long-running jobs to be partitioned and run over collections of records, maintaining progress of grouping and matching.
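The fuzzy matching and confidence-threshold steps can be illustrated with a plain Levenshtein implementation and a simple similarity score. The 0.95 threshold mirrors the 95%+ figure mentioned above; everything else is an illustrative sketch, not MongoDB’s internal tooling:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def name_confidence(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 means identical names."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest

def route_match(a: str, b: str, threshold: float = 0.95) -> str:
    """Auto-merge high-confidence matches; queue the rest for review."""
    return "auto-merge" if name_confidence(a, b) >= threshold else "manual-review"
```

With this scoring, "Bond" vs. "Bind" lands well below the threshold and would be written to the pending record for a data steward to resolve.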
- A Grouping tool that allows records to be clustered based on attribute similarity, using algorithms such as Levenshtein to calculate the distance between different documents, and then single-linkage clustering to create precise matches for merging.

By combining the Workers framework and the Grouping tool, merged master data sets are generated, allowing the project team to begin testing the resulting single view.

Step 8: Architecture Design

While the single view may initially address a subset of users, well-implemented solutions will quickly gain traction across the enterprise. The project team therefore needs a well-designed plan for scaling the service and delivering continuous uptime with robust security controls. MongoDB’s Production Readiness consulting engagement will help you achieve just that. Our consulting engineers will collaborate with your devops team to configure MongoDB to satisfy your application’s availability, performance, and security needs.

Step 9: Modify the Consuming Systems

With the merged data set created and systems provisioned, we can begin modifying the applications that will consume the single view. The first step is to create an API that exposes the single view. This will typically be a RESTful web service that abstracts access to the underlying data set. Any number of consuming applications – whether customer-facing web and mobile services, or backend enterprise and analytics applications – can be repointed to the web service, with no or minimal modification to the application’s underlying logic. Note that write operations will continue to be committed directly to the source systems. It is generally a best practice to modify one consuming application at a time, phasing the development team’s effort to ensure correct operation while minimizing business risk.

Step 10: Implement Maintenance Processes

No organization is static.
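Single-linkage clustering over pairwise distances can be sketched with a union-find structure: any pair of records closer than a cutoff links their two clusters. This is a simplified, self-contained stand-in for the Grouping tool described above, not its actual implementation:

```python
# Single-linkage grouping: names within a Levenshtein cutoff of each other
# end up in the same cluster. A simplified stand-in for the Grouping tool.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def single_linkage(names, cutoff=1):
    """Cluster names whose pairwise edit distance is <= cutoff (union-find)."""
    parent = list(range(len(names)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if levenshtein(names[i].lower(), names[j].lower()) <= cutoff:
                parent[find(i)] = find(j)   # link the two clusters
    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return sorted(clusters.values(), key=len, reverse=True)

groups = single_linkage(["Bond", "Bind", "Bend", "Smith"])
```

Note the "single-linkage" property: "Bond" and "Bend" join the same cluster through "Bind" even if they never matched each other directly, which is why the cutoff must be chosen conservatively.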
Digital transformation initiatives supported by agile development methodologies are enabling enterprises to innovate faster – whether by launching new services or evolving existing applications. Our single view data model needs to keep pace with business change. This change can manifest itself in adding new attributes from existing source systems, onboarding entirely new data sources, or creating new application uses for the single view.

The project team needs to institutionalize governance around these maintenance processes, defining a strategy for how application changes that generate new attributes of value are integrated into the single view schema. The steps defined above – scoping the required changes, identifying the data producers and stewards, updating the schema, and determining the load and merge strategies – are all essential to maintaining the single view. In some more mature single view projects, the application team may decide to write new attributes directly to the single view, avoiding the need to update the legacy relational schemas of the source systems. This is discussed in more detail in the Maturity Model section of the whitepaper.

As we consider the maintenance process, the benefits of a flexible schema – such as that offered by MongoDB’s document data model – cannot be overstated. As we will see in the Case Studies section in Part 3, the rigidity of a traditional relational data model has prevented many single view schemas from evolving as source systems are updated. This inflexibility has scuppered many single view projects in the past.

Wrapping Up Part 2

That wraps up the second part of our 3-part blog series. In the final Part 3, we will discuss the single view maturity model, explore the database capabilities required to host the single view, and present a selection of case studies. Remember, if you want to get started right now, download the complete 10-Step Methodology to Creating a Single View whitepaper.