GIANT Stories at MongoDB

Scaling Your Replica Set: Non-Blocking Secondary Reads in MongoDB 4.0

MongoDB 4.0 adds the ability to read from secondaries while replication is simultaneously processing writes. To see why this is new and important, let's look at secondary read behavior in versions prior to 4.0.

Background

From the outset, MongoDB has been designed so that when you have sequences of writes on the primary, each of the secondary nodes must show the writes in the same order. If you change field "A" in a document and then change field "B", it is not possible to see that document with changed field "B" and not changed field "A". Eventually consistent systems allow you to see it, but MongoDB does not, and never has.

On secondary nodes, we apply writes in batches, because applying them sequentially would likely cause secondaries to fall behind the primary. When writes are applied in batches, we must block reads so that applications cannot see data applied in the "wrong" order. This is why, when reading from secondaries, readers periodically have to wait for replication batches to be applied. The heavier the write load, the more likely that your secondary reads will have these occasional "pauses", impacting your latency metrics. Given that applications frequently use secondary reads to reduce the latency of queries (for example, when they use the "nearest" readPreference), having to wait for replication batches to be applied defeats the goal of getting the lowest latency on your reads.
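For example, an application that wants the lowest possible read latency might route queries to the nearest replica set member. Here is a minimal pymongo sketch; the hosts and namespace are placeholder assumptions for illustration.

from pymongo import MongoClient

# Placeholder hosts for a three-member replica set
client = MongoClient(
    "mongodb://host1,host2,host3/?replicaSet=rs0",
    readPreference="nearest",
)

orders = client.test.orders  # hypothetical namespace

# With "nearest", this read may be served by a secondary; prior to 4.0 it
# could pause while the secondary applied a replication batch
doc = orders.find_one({"status": "pending"})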

In addition to readers having to wait for replication batch writes to finish, the writing of batches needs a lock that requires all reads to complete before it can be taken. That means that in the presence of a high number of reads, the replication writes can start lagging – an issue that is compounded when chain replication is enabled.

What was our goal in MongoDB 4.0?

Our goal was to allow reads during oplog application to decrease read latency and secondary lag, and increase maximum throughput of the replica set. For replica sets with a high write load, not having to wait for readers between applying oplog batches allows for lower lag and quicker confirmation of majority writes, resulting in less cache pressure on the primary and better performance overall.

How did we do it?

Starting with MongoDB 4.0, we took advantage of the timestamp support we implemented in the storage engine, which allows transactions to get a consistent view of data at a specific "cluster time". For more details about this see the video: WiredTiger timestamps.

Secondary reads can now also take advantage of these snapshots, by reading from the latest consistent snapshot prior to the current replication batch that's being applied. Reading from that snapshot guarantees a consistent view of the data, and since applying the current replication batch doesn't change these earlier records, we can now relax the replication lock and allow all these secondary reads at the same time the writes are happening.

How much difference does this make?

A lot! The throughput improvement can range from none (if you were not impacted by the replication lock – that is, your write load is relatively low) to 2x.

Most importantly, this improves latency for secondary reads. For those who use readPreference "nearest" because they want to reduce latency from the application to the database, this feature means their latency in the database will also be as low as possible. We saw significant improvement in 95th and 99th percentile latency in these tests.

95th percentile read latency (ms)

Thread levels    8     16    32    64
Feature off      1      2     3     5
Feature on       0      1     1     0

The best part of this new feature? You don't need to do anything to enable it or opt into it. All secondary reads in 4.0 will read from snapshot without waiting for replication writes.

This is just one of a number of great new features coming in MongoDB 4.0. Take a look at our blog on the 4.0 release candidate to learn more. And don’t forget, you’ve still got time to register for MongoDB World where you can meet with the engineers who are building all of these great new features.

Introducing the MongoDB Masters Program for 2018

My name is Michael Lynn and I’m the Worldwide Director of Developer Advocacy at MongoDB. I’m incredibly proud to be a part of the Developer Relations and Marketing team here at MongoDB.

A majority of what we do in Developer Advocacy is related to increasing awareness of MongoDB within the community of developers and data scientists. We do this through involvement in a variety of user groups, industry conferences, and events as well as through management of the MongoDB Masters Program.

This program was created to recognize leaders within their community, experts in MongoDB, and professionals who freely share their knowledge. This year’s class includes returning Masters, as well as new members who have distinguished themselves in the past year.

MongoDB Masters in years past have provided valuable product feedback and driven thought leadership in their fields. We look forward to deepening this relationship over the coming year. This year’s class of Masters will also be encouraged to participate in beta testing programs, share their experiences with MongoDB, and continue to expand and broaden their own voices as leaders in the technical community.

The Masters program has been incredibly rewarding and valuable for MongoDB, and we greatly appreciate the efforts of our most vocal and most active supporters. This is why we’ve put so much time and effort into creating a program to recognize these individuals and thank them for their contributions.

Master Honorees enjoy benefits ranging from access to the MongoDB Engineering and Product Management teams to discounted MongoDB Atlas Credits.

Preparations are underway for the MongoDB Masters Summit, which will be held on Tuesday, June 26th as part of MongoDB World 2018. We’ll have several speakers and a special Q&A session with Eliot Horowitz, our Co-Founder, and CTO. We encourage all members of our community to register for MongoDB World 2018, meet the Masters in person, and join our Advocacy Hub to start their own path to becoming a MongoDB Master.

So, all this talk of Masters – how, you might be thinking, do I become a Master?

Before I dive into an explanation of the requirements, please take a moment to review the bios of some of the existing Masters. You’ll easily spot some things in common across all of these incredibly talented and accomplished individuals.

Passion

Masters are passionate about technology and about solutions to technical problems. This passion drives these individuals to do things that few technologists will do. While this attribute is common among the existing and past Masters, it’s not easy to measure. You know it when you see it and it’s woven into the careers of many of the people I’ve encountered surrounding this program.

Impact

If passion is fuel, then impact is fire. Impact is the result of the passionate pursuit of worthy causes. Again, this is an attribute easily found in common across our Masters membership. Measuring impact is also difficult because in many cases, especially when dealing with the Masters, the impact of their actions, projects, and even their careers is widespread. Masters are individuals that positively impact their families, teams, companies, and their communities.

Execution

Execution is the spark that ignites fire. Elegant, efficient and effective solutions to technical challenges rarely, if ever, happen by accident. Rather, truly successful solutions require intelligent, deliberate execution – and in most cases, hard work. I strongly encourage you to spend time with any of the Masters and it will become clear that these individuals know how to execute. They know how to get things accomplished.

These are the attributes of a MongoDB Master and to achieve membership, an individual should be passionate about great technology and about solving technical problems. These individuals should have demonstrated, through successful execution, a massively beneficial impact on their company, team and/or community.

Are you interested in becoming a MongoDB Master, or do you think you may already meet the requirements? I would like to invite you to join us at MongoDB World in New York to learn more; consider completing the nomination form below to have yourself or a colleague considered for a MongoDB Masters membership.

MongoDB Masters membership nomination →

Stratifyd & MongoDB: AI-Driven Business Insights to Keep Customers Happy

2017 was a banner year for MongoDB's partner ecosystem. We remain strategic about engaging with our channels, and the results are validating our approach. Our strong network of ISVs, global system integrators, cloud, resellers, and technology partners is a competitive differentiator that helps us scale.

We are especially excited about the innovation and growth in store for our ISV business in 2018. It's already off to a great start. Our newest ISV partner Stratifyd is a fantastic example of how platforms built around MongoDB address serious market needs with the most cutting-edge, innovative technology.

Stratifyd is an end-to-end customer analytics platform powered by AI. The platform provides competitive advantages to some of the most recognized brands in the world. LivePerson, Etsy, MASCO, Kimberly-Clark, and many more rely on Stratifyd for a 360-degree view of their end customers.

Stratifyd analyzes customer interactions such as online reviews, social media posts, phone calls, emails, chats, surveys, CRM data, and more, and turns them into actionable business insights that increase customer acquisition and retention – both critical to the continued success of Stratifyd’s clients. In addition to these benefits, Stratifyd is just a flat-out cool implementation of AI.

I caught up with Stratifyd's CTO, Kevin O'Dell, to discuss the data technology behind the platform, and how MongoDB drives value for their customers.

---

For anyone that isn’t familiar with Stratifyd yet, how do you describe the platform?

Stratifyd uses human-generated data to analyze, categorize, and understand intent with the purpose of changing human behavior. This changes not only the way brands interact with their customers, but also the way customers interact with brands, increasing customer acquisition and raising retention rates.

What was the genesis of the company? Why did you set out to build this?

Stratifyd was a result of postdoctoral research done at the University of North Carolina, Charlotte. Our founders were researching how AI could analyze unstructured data. During their research, they discovered strong business and government use cases. The founding team was working with numerous three letter agencies on predicting terrorist and disease movements globally. They were able to raise millions of dollars in funding from these agencies. The demand for a product organically grew from there, which led to the development of the Stratifyd platform.

I love the insights Stratifyd can provide – how would you describe the unique advantages that Stratifyd gives its customers?

Stratifyd provides near real-time business intelligence for contact center, marketing, product, and customer experience teams, all based on customer interactions. These insights enable businesses to be proactive rather than reactive in regard to business strategy. Stratifyd customers are able to respond to customer requests, complaints, or general feedback in near real time, changing the way companies interact with their end users. For example, we have empowered a customer with the knowledge to launch a new product line. Another customer gained insights that fundamentally changed how they are rolling out a 700+ million-dollar brand in a new continent.

What kind of feedback are you getting from customers that have deployed Stratifyd in their businesses?

Our customers love using our platform. They are surprised at how simple it is to use, and how powerful it is. They really appreciate how Stratifyd is making AI and machine learning meaningful for them in their day-to-day lives. Stratifyd helps ensure measurable results from day one. Speaking of implementation, they REALLY love that we don’t just use the term day one figuratively – customers are up and running in less than a day.

Talk to me about how you landed on MongoDB. What were you looking for in a database, and what problems were you having before moving to MongoDB?

That's an easy one to answer: speed and flexibility. Stratifyd ingests data from hundreds of sources. We needed a database that could keep up with high read and write request rates while handling a flexible schema. The hardest problem we were trying to solve was the lack of secondary indexes; with those, MongoDB accelerated our query response times by at least 100x.

Can you share any best practices for scaling your MongoDB infrastructure? Any impressive metrics around the number of interactions, the volume of reads / writes per second, response times?

As a SaaS-first platform, being always-on is a HUGE best practice for us. MongoDB’s innate replication and failover abilities ensured less than 17 minutes of total downtime last year! Using MongoDB as our backend system, our AI can process a quarter of a million words in less than a minute.

How do you measure the impact of MongoDB on your business?

Our business wouldn’t be able to succeed without MongoDB. The uptime, failover, query response times, secondary indexes, and dynamic schemas have empowered most of Stratifyd’s key differentiators.

What advice would you give someone who is considering using MongoDB for their next project?

With all projects, I recommend truly understanding the requirements for the end results. There is a ton of excellent technology out there, but picking the wrong one can be detrimental to project success. Always run numerous tests, comparing different stacks to make sure you find the right one and fail fast on the wrong technology stack.

Stratifyd has some impressive customers, from the Fortune 500 to some really innovative startups: what’s next for the company?

We have some pretty big things planned for 2018. We're now providing more than just actionable intelligence: we are streamlining and automating workflows. We are closing the customer feedback loop, which enables us to plug Stratifyd into any business process quickly to deliver measurable results.

New to MongoDB Atlas on AWS — AWS Cloud Provider Snapshots, Free Tier Now Available in Singapore & Mumbai


AWS Cloud Provider Snapshots

MongoDB Atlas is an automated cloud database service designed for agile teams who’d rather spend their time building apps than managing databases, backups, and restores. Today, we’re happy to announce that Cloud Provider Snapshots are now available for MongoDB Atlas replica sets on AWS. As the name suggests, Cloud Provider Snapshots provide fully managed backup storage and recovery using the native snapshot capabilities of the underlying cloud service provider.

Choosing a backup method for a database cluster in MongoDB Atlas

When this feature is enabled, MongoDB Atlas will perform snapshots against the primary in the replica set; snapshots are stored in the same cloud region as the primary, granting you control over where all your data lives. Please visit our documentation for more information on snapshot behavior.

Cloud Provider Snapshots on AWS have built-in incremental backup functionality, meaning that a new snapshot only saves the data that has changed since the previous one. This minimizes the time it takes to create a snapshot and lowers costs by reducing the amount of duplicate data. For example, a cluster with 10 GB of data on disk and 3 snapshots may require less than 30 GB of total snapshot storage, depending on how much of the data changed between snapshots.

Cloud Provider Snapshots are available for M10 clusters or higher in all of the 15 AWS regions where you can deploy MongoDB Atlas clusters.

Consider creating a separate Atlas project for database clusters where a different backup method is required. MongoDB Atlas only allows one backup method per project. Once you select a backup method — whether it’s Continuous Backup or Cloud Provider Snapshots — for a cluster in a project, Atlas locks the backup service to the chosen method for all subsequent clusters in that project. To change the backup method for the project, you must disable backups for all clusters in the project, then re-enable backups using your preferred backup methodology. Atlas deletes any stored snapshots when you disable backup for a cluster.


Free, $9, and $25 MongoDB Atlas clusters now available in Singapore & Mumbai

We’re committed to lowering the barrier to entry to MongoDB Atlas and allowing developers to build without worrying about database deployment or management. Last week, we released a 14% price reduction on all MongoDB Atlas clusters deployed in AWS Mumbai. And today, we’re excited to announce the availability of free and affordable database cluster sizes in South and Southeast Asia on AWS.

Free M0 Atlas clusters, which provide 512 MB of storage for experimentation and early development, can now be deployed in AWS Singapore and AWS Mumbai. If more space is required, M2 and M5 Atlas clusters, which provide 2 GB and 5 GB of storage, respectively, are now also available in these regions for just $9 and $25 per month.

Introducing the Aggregation Pipeline Builder in MongoDB Compass

Building MongoDB aggregations has never been so easy.

The most efficient way to analyze your data is where it already lives. That’s why we have MongoDB’s built-in aggregation framework. Have you tried it yet? If so, you know that it’s one of the most powerful MongoDB tools at your disposal. If not, you’re missing out on the ability to query your data in incredibly powerful ways. In fact, we like to say that “aggregate is the new find”. Built on the concept of data processing pipelines (like in Unix or PowerShell), the aggregation framework lets you “funnel” documents through a multi-stage pipeline that filters, transforms, sorts, computes, and aggregates your data, and more. The aggregation framework enables you to perform extensive analytics and statistical analysis in real time and generate pre-aggregated reports for dashboarding.

There are no limits to the number of stages an aggregation pipeline can have – pipelines can be as simple or as complex as you wish. In fact, the only limit is one’s imagination when it comes to deciding how to aggregate data. We’ve seen some very comprehensive pipelines!

With a rich library of over 25 stages and 100 operators (and growing with every release), the aggregation framework is an amazingly versatile tool. To help you be even more successful with it, we decided to build an aggregation construction user interface. The new Aggregation Pipeline Builder is now available with the latest release of Compass for beta testing. It’s available under the Aggregations tab.

The screenshot below depicts a sample pipeline on a movies collection. It produces a listing of the title, year, and rating of all English- and Japanese-language movies rated either PG or G, excluding crime and horror titles, starting with the most recent year and sorted alphabetically by title within each year. Each stage was added gradually, with the ability to preview the result of the aggregation at every step.
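For reference, the same pipeline expressed as code might look like the following pymongo sketch; the connection string and field names (genres, languages, rated) are assumptions based on a typical sample movies collection.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder deployment
db = client.sample_mflix  # hypothetical database holding the movies collection

pipeline = [
    # Keep English- or Japanese-language movies rated PG or G,
    # excluding crime and horror titles
    {"$match": {
        "genres": {"$nin": ["Crime", "Horror"]},
        "languages": {"$in": ["English", "Japanese"]},
        "rated": {"$in": ["PG", "G"]},
    }},
    # List only the title, year, and rating
    {"$project": {"_id": 0, "title": 1, "year": 1, "rated": 1}},
    # Most recent year first, alphabetical by title within each year
    {"$sort": {"year": -1, "title": 1}},
]

for movie in db.movies.aggregate(pipeline):
    print(movie)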

This easy-to-use UI lets you build your aggregation queries faster than ever before. There’s no need to worry about bracket matching, reordering stages, or remembering operator syntax with its intuitive drag-and-drop experience and code skeletons. You also get auto-completion for aggregation operators as well as query operators and even document field names.

If you need help understanding a particular operator, click on the info icon next to it and you’ll be taken directly to the appropriate guidance.

As you are building your pipeline, you can easily preview your results. This, in combination with an ability to rearrange and toggle stages on and off, makes it easy to troubleshoot your pipelines. When you are satisfied with the results, the constructed pipeline can be copied to the clipboard for easy pasting in your code, or simply saved in your favorites list for re-use later!

The aggregation authoring experience just got even more incredible with the new Compass aggregation pipeline builder. Why not check it out today?

  • Download the latest beta version of Compass
  • See the documentation for the aggregation pipeline builder in Compass
  • See the aggregation framework quick reference
  • To learn or brush up on your aggregation framework skills, take M121 from our MongoDB University – it’s well worth it!

Also, please remember to send us your feedback by filing JIRA tickets or emailing it to: compass@mongodb.com.

MongoDB Atlas Price Reduction - AWS Mumbai

Developers use MongoDB Atlas, the fully automated cloud service for MongoDB, to quickly and securely create database clusters that scale effortlessly to meet the needs of a new generation of applications.

We recognize that the developer community in India is an incredibly vibrant one, one that is growing rapidly thanks to startups like Darwinbox. The team there built a full suite of HR services online, going from a standing start to a top-four sector brand in the Indian market in just two years.

As part of our ongoing commitment to support the local developer community and lower the barrier to entry to using a MongoDB service that removes the need for time-consuming administration tasks, we are excited to announce a price reduction for MongoDB Atlas. Prices are being reduced by up to 14% on all MongoDB Atlas clusters deployed in AWS Mumbai. With this, we aim to give more developers access to the best way to work with data, automated with built-in best practices.

MongoDB Atlas is available in India on AWS Mumbai and GCP Mumbai. It is also available on Microsoft Azure in Pune, Mumbai and Chennai. Never tried MongoDB Atlas? Click here to learn more.

MongoDB 4.0 Release Candidate 0 Has Landed

MongoDB enables you to meet the demands of modern apps with a technology foundation built on:

  1. The document data model – presenting you with the best way to work with data.
  2. A distributed systems design – allowing you to intelligently put data where you want it.
  3. A unified experience that gives you the freedom to run anywhere – future-proofing your work and eliminating vendor lock-in.

Building on the foundations above, MongoDB 4.0 is a significant milestone in the evolution of MongoDB, and we’ve just shipped the first Release Candidate (RC), ready for you to test.

Why is it so significant? Let’s take a quick tour of the key new features. And remember, you can learn about all of this and much more at MongoDB World'18 (June 26-27).

Multi-Document ACID Transactions

Previewed back in February, multi-document ACID transactions are part of the 4.0 RC. With snapshot isolation and all-or-nothing execution, transactions extend MongoDB ACID data integrity guarantees to multiple statements and multiple documents across one or many collections. They feel just like the transactions you are familiar with from relational databases, are easy to add to any application that needs them, and don't change the way non-transactional operations are performed. With multi-document transactions it’s easier than ever for all developers to address a complete range of use cases with MongoDB, while for many of them, simply knowing that they are available will provide critical peace of mind that they can meet any requirement in the future. In MongoDB 4.0 transactions work within a replica set, and MongoDB 4.2 will support transactions across a sharded cluster*.

To give you a flavor of what multi-document transactions look like, here is a Python code snippet of the transactions API.

# Start a client session and run both inserts in one transaction
with client.start_session() as s:
    s.start_transaction()
    try:
        collection.insert_one(doc1, session=s)
        collection.insert_one(doc2, session=s)
    except Exception:
        # Roll back both writes if anything fails
        s.abort_transaction()
        raise
    s.commit_transaction()

And now, the transactions API for Java.

try (ClientSession clientSession = client.startSession()) {
    clientSession.startTransaction();
    try {
        collection.insertOne(clientSession, docOne);
        collection.insertOne(clientSession, docTwo);
        clientSession.commitTransaction();
    } catch (Exception e) {
        clientSession.abortTransaction();
    }
}

Our path to transactions represents a multi-year engineering effort, beginning over 3 years ago with the integration of the WiredTiger storage engine. We’ve laid the groundwork in practically every part of the platform – from the storage layer itself to the replication consensus protocol, to the sharding architecture. We’ve built out fine-grained consistency and durability guarantees, introduced a global logical clock, refactored cluster metadata management, and more. And we’ve exposed all of these enhancements through APIs that are fully consumable by our drivers. We are feature complete in bringing multi-document transactions to replica sets, and 90% done on implementing the remaining features needed to deliver transactions across a sharded cluster.

Take a look at our multi-document ACID transactions web page where you can hear directly from the MongoDB engineers who have built transactions, review code snippets, and access key resources to get started.

Aggregation Pipeline Type Conversions

One of the major advantages of MongoDB over rigid tabular databases is its flexible data model. Data can be written to the database without first having to predefine its structure. This helps you to build apps faster and respond easily to rapidly evolving application changes. It is also essential in supporting initiatives such as single customer view or operational data lakes to support real-time analytics where data is ingested from multiple sources. Of course, with MongoDB’s schema validation, this flexibility is fully tunable, enabling you to enforce strict controls on data structure, type, and content when you need more control.

So while MongoDB makes it easy to ingest data without complex cleansing of individual fields, working with that data can be more difficult when a consuming application expects uniform data types for specific fields across all documents. Handling different data types pushes more complexity to the application, and available ETL tools have provided only limited support for transformations. With MongoDB 4.0, you can maintain all of the advantages of a flexible data model, while prepping data within the database itself for downstream processes.

The new $convert operator enables the aggregation pipeline to transform mixed data types into standardized formats natively within the database. Ingested data can be cast into a standardized, cleansed format and exposed to multiple consuming applications – such as the MongoDB BI and Spark connectors for high-performance visualizations, advanced analytics and machine learning algorithms, or directly to a UI. Casting data into cleansed types makes it easier for your apps to process, sort, and compare data. For example, financial data inserted as a long can be converted into a decimal, enabling lossless and high precision processing. Similarly, dates inserted as strings can be transformed into the native date type.
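To make that concrete, here is a minimal pymongo sketch, assuming a hypothetical orders collection whose price field was ingested as a mix of strings and longs; $convert casts every value to a decimal, with onError/onNull fallbacks rather than failing the pipeline.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder deployment
orders = client.test.orders  # hypothetical collection with mixed price types

pipeline = [
    {"$addFields": {
        "price": {
            "$convert": {
                "input": "$price",
                "to": "decimal",
                "onError": None,  # keep going if a value can't be converted
                "onNull": None,
            }
        }
    }}
]

for doc in orders.aggregate(pipeline):
    print(doc)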

When $convert is combined with over 100 different operators available as part of the MongoDB aggregation pipeline, you can reshape, transform, and cleanse your documents without having to incur the complexity, fragility, and latency of running data through external ETL processes.

Non-Blocking Secondary Reads

To ensure that reads can never return data that is not in the same causal order as the primary replica, MongoDB blocks readers while oplog entries are applied in batches to the secondary. This can cause secondary reads to have variable latency, which becomes more pronounced when the cluster is serving write-intensive workloads. Why does MongoDB need to block secondary reads? When you apply a sequence of writes to a document, MongoDB is designed so that each of the nodes must show the writes in the same causal order. So if you change field "A" in a document and then change field "B", it is not possible to see that document with changed field "B" and not changed field "A". Eventually consistent systems allow that anomaly, but MongoDB does not, and never has.

By taking advantage of storage engine timestamps and snapshots implemented for multi-document ACID transactions, secondary reads in MongoDB 4.0 become non-blocking. With non-blocking secondary reads, you now get predictable, low read latencies and increased throughput from the replica set, while maintaining a consistent view of data. Workloads that see the greatest benefits are those where data is batch loaded to the database, and those where distributed clients are accessing low latency local replicas that are geographically remote from the primary replica.

40% Faster Data Migrations

Very few of today’s workloads are static. For example, the launch of a new product or game, or seasonal reporting cycles can drive sudden spikes in load that can bring a database to its knees unless additional capacity can be quickly provisioned. If and when demand subsides, you should be able to scale your cluster back in, rightsizing for capacity and cost.

To respond to these fluctuations in demand, MongoDB enables you to elastically add and remove nodes from a sharded cluster in real time, automatically rebalancing the data across nodes in response. The sharded cluster balancer, responsible for evenly distributing data across the cluster, has been significantly improved in MongoDB 4.0. By concurrently fetching and applying documents, shards can complete chunk migrations up to 40% faster, allowing you to more quickly bring new nodes into service at just the moment they are needed, and scale back down when load returns to normal levels.

Extensions to Change Streams

Change streams, released with MongoDB 3.6, enable developers to build reactive, real-time, web, mobile, and IoT apps that can view, filter, and act on data changes as they occur in the database. Change streams enable seamless data movement across distributed database and application estates, making it simple to stream data changes and trigger actions wherever they are needed, using a fully reactive programming style.

With MongoDB 4.0, Change Streams can now be configured to track changes across an entire database or whole cluster. Additionally, change streams will now return a cluster time associated with an event, which can be used by the application to provide an associated wall clock time for the event.
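As a rough illustration of the new scope, here is a minimal pymongo sketch that watches an entire database; the connection string and database name are placeholder assumptions.

from pymongo import MongoClient

# Placeholder connection string for a MongoDB 4.0 replica set
client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
db = client.inventory  # hypothetical database to watch

# Watch changes across every collection in the database (new in 4.0)
with db.watch() as stream:
    for change in stream:
        # Each event includes the cluster time at which the change occurred
        print(change["operationType"], change["clusterTime"])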

Getting Started with MongoDB 4.0

Hopefully this gives you a taste of what’s coming in 4.0. There’s a stack of other stuff we haven’t covered today, but you can learn about it all in the resources below.

To get started with the RC now:

  1. Head over to the MongoDB download center to pick up the latest development build.
  2. Review the 4.0 release notes.
  3. Sign up for the forthcoming MongoDB University training on 4.0.

And you can meet our engineering team and other MongoDB users at MongoDB World'18 (June 26-27).

---

* Safe Harbor Statement

This blog post contains “forward-looking statements” within the meaning of Section 27A of the Securities Act of 1933, as amended, and Section 21E of the Securities Exchange Act of 1934, as amended. Such forward-looking statements are subject to a number of risks, uncertainties, assumptions and other factors that could cause actual results and the timing of certain events to differ materially from future results expressed or implied by the forward-looking statements. Factors that could cause or contribute to such differences include, but are not limited to, those identified in our filings with the Securities and Exchange Commission. You should not rely upon forward-looking statements as predictions of future events. Furthermore, such forward-looking statements speak only as of the date of this presentation.

In particular, the development, release, and timing of any features or functionality described for MongoDB products remains at MongoDB’s sole discretion. This information is merely intended to outline our general product direction and it should not be relied on in making a purchasing decision nor is this a commitment, promise or legal obligation to deliver any material, code, or functionality. Except as required by law, we undertake no obligation to update any forward-looking statements to reflect events or circumstances after the date of such statements.

MongoDB Celebrates Reaching 1,000 Employees

MongoDB
May 18, 2018
Culture

We’re excited to announce that MongoDB has officially reached 1,000 employees across the globe.

For some of us, it’s a time to reflect on how far we’ve come.

For others, this milestone helps to generate excitement for what is to come in the future.

And for the rest of us, it was another moment to be proud of, and a new reason to celebrate.

We're proud of how far we’ve come, and we’re looking forward to the next 1,000.

How to Integrate MongoDB Atlas and Segment using MongoDB Stitch

Jesse Krasnostein
May 17, 2018

It can be quite difficult tying together multiple systems, APIs, and third-party services. Recently, we faced this exact problem in-house, when we wanted to get data from Segment into MongoDB so we could take advantage of MongoDB’s native analytics capabilities and rich query language. Using some clever tools we were able to make this happen in under an hour – the first time around.

While this post is detailed, the actual implementation should only take around 20 minutes. I’ll start off by introducing our cast of characters (what tools we used to do this) and then we will walk through how we went about it.

The Characters

To collect data from a variety of sources including mobile, web, cloud apps, and servers, developers have been turning to Segment since 2011. Segment consolidates all the events generated by multiple data sources into a single clickstream. You can then route the data to more than 200 integrations, all at the click of a button. Companies like DigitalOcean, New Relic, InVision, and Instacart all rely on Segment for different parts of their growth strategies.

To store the data generated by Segment, we turn to MongoDB Atlas – MongoDB’s database as a service. Atlas offers the best of MongoDB:

  • A straightforward query language that makes it easy to work with your data
  • Native replication and sharding to ensure data can live where it needs to
  • A flexible data model that allows you to easily ingest data from a variety of sources without needing to know precisely how the data will be structured (its shape)

All this is wrapped up in a fully managed service, engineered and run by the same team that builds the database, which means that as a developer you actually can have your cake and eat it too.

The final character is MongoDB Stitch, MongoDB’s serverless platform. Stitch streamlines application development and deployment with simple, secure access to data and services – getting your apps to market faster while reducing operational costs. Stitch allows us to implement server-side logic that connects third-party tools like Segment, with MongoDB, while ensuring everything from security to performance is optimized.

Order of Operations

We are going to go through the following steps. If you have completed any of these already, feel free to just cherry pick the relevant items you need assistance with:

  1. Setting up a Segment workspace
  2. Adding Segment’s JavaScript library to your frontend application – I’ve also built a ridiculously simple HTML page that you can use for testing
  3. Sending an event to Segment when a user clicks a button
  4. Signing up for MongoDB Atlas
  5. Creating a cluster, so your data has somewhere to live
  6. Creating a MongoDB Stitch app that accepts data from Segment and saves it to your MongoDB Atlas cluster

While this blog focuses on integrating Segment with MongoDB, the process we outline below will work with other APIs and web services. Join the community Slack and ask questions if you are trying to follow along with a different service.

Each time Segment sees new data a webhook fires an HTTP Post request to Stitch. A Stitch function then handles the authentication of the request and, without performing any data manipulation, saves the body of the request directly to the database – ready for further analysis.

Setting up a Workspace in Segment

Head over to Segment.com and sign up for an account. Once complete, Segment will automatically create a Workspace for you. Workspaces allow you to collaborate with team members, control permissions, and share data sources across your whole team. Click through to the Workspace that you've just created.

To start collecting data in your Workspace, we need to add a source. In this case, I’m going to collect data from a website, so I’ll select that option, and on the next screen, Segment will have added a JavaScript source to my workspace. Any data that comes from our website will be attributed to this source. There is a blue toggle link I can click within the source that will give me the code I need to add to my website so it can send data to Segment. Take note of this as we will need it shortly.

Adding Segment to your Website

I mentioned a simple sample page I had created in case you want to test this implementation outside of other code you had been working on. You can grab it from this GitHub repo.

In my sample page, you’ll see I’ve copied and pasted the Segment code and dropped it in between my page’s <head> tags. You’ll need to do the equivalent with whatever code or language you are working in.

If you open that page in a browser, it should automatically start sending data to Segment. The easiest way to see this is by opening Segment in another window and clicking through to the debugger.

Clicking on the debugger button in the Segment UI takes you to a live stream of events sent by your application.

Customizing the events you send to Segment

The Segment library enables you to get as granular as you like with the data you send from your application.

As your application grows, you’ll likely want to expand the scope of what you track. Best practice requires you to put some thought into how you name events and what data you send. Otherwise different developers will name events differently and will send them at different times – read this post for more on the topic.

To get us started, I’m going to assume that we want to track every time someone clicks a favorite button on a web page. We are going to use some simple JavaScript to call Segment’s analytics tracking code and send an event called a “track” to the Segment API. That way, each time someone clicks our favorite button, we'll know about it.

You’ll see at the bottom of my web page, that there is a jQuery function attached to the .btn class. Let’s add the following after the alert() function.

analytics.track("Favorited", {
  itemId: this.id,
  itemName: itemName
});

Now, refresh the page in your browser and click on one of the favorite buttons. You should see an alert box come up. If you head over to your debugger window in Segment, you’ll observe the track event streaming in as well. Pretty cool, right?

You probably noticed that the analytics code above is storing the data you want to send in a JSON document. You can add fields with more specific information anytime you like. Traditionally, this data would get sent to some sort of tabular data store, like MySQL or PostgreSQL, but then each time new information was added you would have to perform a migration to add a new column to your table. On top of that, you would likely have to update the object-relational mapping code that's responsible for saving the event in your database. MongoDB is a flexible data store, which means there are no migrations or translations needed; we will store the data in the exact form you send it in.

Getting Started with MongoDB Atlas and Stitch

As mentioned, we’ll be using two different services from MongoDB. The first, MongoDB Atlas, is a database as a service. It’s where all the data generated by Segment will live, long-term. The second, MongoDB Stitch, is going to play the part of our backend. We are going to use Stitch to set up an endpoint where Segment can send data; once the data is received, Stitch validates that the request was sent from Segment and then coordinates all the logic to save it into MongoDB Atlas for later analysis and other activities.

First Time Using MongoDB Atlas?

Click here to set up an account in MongoDB Atlas.

Once you’ve created an account, we are going to use Atlas’s Cluster Builder to set up our first cluster (every MongoDB Atlas deployment is made up of multiple nodes that help with high availability, which is why we call it a cluster). For this demonstration, we can get away with an M0 instance – it's free forever and great for sandboxing. It's not on dedicated infrastructure, so for any production workloads, it's worth investigating other instance sizes.

When the Cluster Builder appears on screen, the default cloud provider is AWS, and the selected region is North Virginia. Leave these as is. Scroll down and click on the Cluster Tier section, and this will expand to show our different sizing options. Select M0 at the top of the list.

You can also customize your cluster’s name, by clicking on the Cluster Name section.

Once complete, click Create Cluster. It takes anywhere from 7 to 10 minutes to set up your cluster, so maybe go grab a drink, stretch your legs and come back… When you’re ready, read on.

Creating a Stitch Application

While the cluster is building, on the left-hand menu, click Stitch Apps. You will be taken to the Stitch applications page, from where you can click Create New Application.

Give your application a name (in this case, I call it “SegmentIntegration”) and link it to the correct cluster. Click Create.

Once the application is ready, you’ll be taken to the Stitch welcome page. In this case, we can leave anonymous authentication off.

We do need to enable access to a MongoDB collection to store our data from Segment. For the database name I use “segment”, and for the collection, I use “events”. Click Add Collection.

Next, we will need to add a service. In this case, we will be manually configuring an HTTP service that can communicate over the web with Segment’s service. Scroll down and click Add Service.

You’ll jump one page and should see a big sign saying, “This application has no services”… not for long. Click Add a Service… again.

From the options now visible, select HTTP and then give the service a name. I’ll use “SegmentHTTP”. Click Add Service.

Next, we need to add an Incoming Webhook. A Webhook is an HTTP endpoint that will continuously listen for incoming calls from Segment, and when called, it will trigger a function in Stitch to run.

Click Add Incoming Webhook

Leave the default name as is and change the following fields:

  • Turn on Respond with Result, as this will return the result of our insert operation
  • Change Request Validation to “Require Secret as Query Param”
  • Add a secret code to the last field on the page. Important Note: We will refer to this as our “public secret” as it is NOT protected from the outside world; it’s more of a simple validation that Stitch can use before running the Function we will create. Shortly, we will also define a “private secret” that will not be visible outside of Stitch and Segment.

Finally, click “Save”.

Define Request Handling Logic with Functions in Stitch

We define custom behavior in Stitch using Functions – simple JavaScript (ES6) that can be used to implement logic and work with all the different services integrated with Stitch.

Thankfully, we don’t need to do too much work here. Stitch already has the basics set up for us. We need to define logic that does the following things:

  1. Grabs the request signature from the HTTP headers
  2. Uses the signature to validate the request's authenticity (i.e., that it came from Segment)
  3. Writes the request to our segment.events collection in MongoDB Atlas

Getting an HTTP Header and Generating an HMAC Signature

Add the following to line 8, after the closing curly brace }.

const signature = payload.headers['X-Signature'];

And then use Stitch’s built-in Crypto library to generate a digest that we will compare with the signature.

const digest = utils.crypto.hmac(payload.body.text(), context.values.get("segment_shared_secret"), "sha1", "hex");

A lot is happening here so I’ll step through each part and explain. Segment signs requests with a signature that is a combination of the HTTP body and a shared secret. We can attempt to generate an identical signature using the utils.crypto.hmac function if we know the body of the request, the shared secret, the hash function Segment uses to create its signatures, and the output format. If we can replicate what is contained within the X-Signature header from Segment, we will consider this to be an authenticated request.

Note: This will be using a private secret, not the public secret we defined in the Settings page when we created the webhook. This secret should never be publicly visible. Stitch allows us to define values that we can use for storing variables like API keys and secrets. We will do this shortly.
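As a quick aside, if you want to sanity check the signature logic outside of Stitch, a minimal Python sketch of the same HMAC-SHA1 computation looks like this; the secret and body values are placeholders for illustration only.

import hashlib
import hmac

# Placeholder values for illustration only
shared_secret = b"my-private-secret"
request_body = b'{"type": "track", "event": "Favorited"}'

# The signature is the hex-encoded HMAC-SHA1 of the raw request body
digest = hmac.new(shared_secret, request_body, hashlib.sha1).hexdigest()
print(digest)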

Validating that the Request is Authentic and Writing to MongoDB Atlas

To validate the request, we simply need to compare the digest and the signature. If they’re equivalent, then we will write to the database. Add the following code directly after we generate the digest.

if (digest == signature) {
    // Request is valid
} else {
    // Request is invalid
    console.log("Request is invalid");
}

Finally, we will augment the if statement with the appropriate behavior needed to save our data. On the first line of the if statement, we will get our “mongodb-atlas” service. Add the following code:

let mongodb = context.services.get("mongodb-atlas");

Next, we will get our database collection so that we can write data to it.

let events = mongodb.db("segment").collection("events");

And finally, we write the data.

events.insertOne(body);

Click the Save button on the top left-hand side of the code editor. At the end of this, our entire function should look something like this:

exports = function(payload) {

  var queryArg = payload.query.arg || '';
  var body = {};
  
  if (payload.body) {
    body = JSON.parse(payload.body.text());
  }
  
  // Get x-signature header and create digest for comparison
  const signature = payload.headers['X-Signature'];
  const digest = utils.crypto.hmac(payload.body.text(), 
    context.values.get("segment_shared_secret"), "sha1", "hex");
  
  //Only write the data if the digest matches Segment's x-signature!
  if (digest == signature) {
    
    let mongodb = context.services.get("mongodb-atlas");
    
    // Set the collection up to write data
    let events = mongodb.db("segment").collection("events");
    
    // Write the data
    events.insertOne(body);
    
  } else  {
    console.log("Digest didn't match");
  }
  
  return queryArg + ' ' + body.msg;
};

Defining Rules for a MongoDB Atlas Collection

Next, we will need to update our rules that allow Stitch to write to our database collection. To do this, in the left-hand menu, click on “mongodb-atlas”.

Select the collection we created earlier, called “segment.events”. This will display the Field Rules for our Top-Level Document. We can use these rules to define what conditions must exist for our Stitch function to be able to Read or Write to the collection.

We will leave the read rules as is for now, as we will not be reading directly from our Stitch application. We will, however, change the write rule to "evaluate" so our function can write to the database.

Change the contents of the “Write” box:

  • Specify an empty JSON document {} as the write rule at the document level.
  • Set Allow All Other Fields to Enabled, if it is not already set.

Click Save at the top of the editor.

Adding a Secret Value in MongoDB Stitch

As is common practice, API keys and passwords are stored as variables, meaning they are never committed to a code repo and their visibility is reduced. Stitch allows us to create private variables (values) that may be accessed only by incoming webhooks, rules, and named functions.

We do this by clicking Values on the Stitch menu, clicking Create New Value, and giving our value a name – in this case segment_shared_secret (we will refer to this as our private secret). We enter the contents in the large text box. Make sure to click Save once you’re done.

Getting Our Webhook URL

To copy the webhook URL across to Segment from Stitch, navigate using the Control menu: Services > SegmentHTTP > webhook0 > Settings (at the top of the page). Now copy the “Webhook URL”.

In our case, the webhook URL looks something like this:

https://webhooks.mongodb-stitch.com/api/client/v2.0/app/segmentintegration/service/SegmentHTTP/incoming_webhook/webhook0

Adding the Webhook URL to Segment

Head over to Segment and log in to your workspace. In destinations, we are going to click Add Destination.

Search for Webhook in the destinations catalog and click Webhooks. Once through to the next page, click Configure Webhooks. Then select any sources from which you want to send data. Once selected, click Confirm Source.

Next, we will find ourselves on the destination settings page. We will need to configure our connection settings. Click the box that says Webhooks (max 5).

Copy your webhook URL from Stitch, and make sure you append your public secret to the end of it using the following syntax:

Initial URL:

https://webhooks.mongodb-stitch.com/api/client/v2.0/app/segmentintegration/service/SegmentHTTP/incoming_webhook/webhook0
Add the following to the end: ?secret=<YOUR_PUBLIC_SECRET_HERE>

Final URL:

https://webhooks.mongodb-stitch.com/api/client/v2.0/app/segmentintegration/service/SegmentHTTP/incoming_webhook/webhook0?secret=PUBLIC_SECRET

Click Save

We also need to tell Segment what our private secret is so it can create a signature that we can verify within Stitch. Do this by clicking on the Shared Secret field and entering the same value you used for the segment_shared_secret. Click Save.

Finally, all we need to do is activate the webhook by clicking the switch at the top of the Destination Settings page:

Generate Events, and See Your Data in MongoDB

Now, all we need to do is use our test HTML page to generate a few events that get sent to Segment – we can use Segment’s debugger to ensure they are coming in. Once we see them flowing, they will also be going across to MongoDB Stitch, which will be writing the events to MongoDB Atlas.

We’ll take a quick look using Compass to ensure our data is visible. Once we connect to our cluster, we should see a database called “segment”. Click on segment and then you’ll see our collection called “events”. If you click into this you’ll see a sample of the data generated by our frontend!
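If you'd rather check from code than from Compass, here is a minimal pymongo sketch that reads back a few stored events; the connection string is a placeholder for your own Atlas URI, and the printed fields assume standard Segment track payloads.

from pymongo import MongoClient

# Placeholder connection string: substitute the SRV URI for your Atlas cluster
client = MongoClient("mongodb+srv://user:password@cluster0.example.mongodb.net")

events = client.segment.events

# Print the five most recently inserted Segment events
for event in events.find().sort("_id", -1).limit(5):
    print(event.get("type"), event.get("event"))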

The End

Thanks for reading through – hopefully you found this helpful. If you’re building new things with MongoDB Stitch we’d love to hear about it. Join the community slack and ask questions in the #stitch channel!

Learn more about transactions at MongoDB World

In February, we announced that MongoDB 4.0 will support multi-document transactions. Curious to know what this will look like? Aly Cabral, Product Manager at MongoDB, is excited to share an early version of the syntax:

Each feature we build is with users like you in mind. When you attend our events, you’re able to connect with the people who work on the database you use every day – like Aly. In sessions, Birds of a Feather meetings, and one-on-one in Ask the Experts, you get to ask questions, share ideas, and be heard.

Additionally, you learn tips and tricks from power users and companies that will allow you to optimize your deployments. To get your hands on new tools to accelerate your development goals, join us at MongoDB World, June 26-27 in NYC.

Early Bird pricing ends Friday, May 11. Tickets are going fast! Sign up now to get your discounted conference pass. Don't forget, groups automatically get 25% off!


Event Details:
Date: June 26-27, 2018
Location: New York Hilton Midtown, 1335 6th Ave, New York, NY 10019
Learn More & Sign Up: mongodbworld.com