All the women come to the front, please. All the women in the front. This is about falling in love. You need to be front and center. Thank you so much for joining us at our third session just before lunch. This is the business track of MongoDB World. We're really excited to have you here today. We've had two great stories already, how the VA is reinventing their shared services with MongoDB. We just listened to how to bring together clinical and research data and cancer, another great topic. But I confess, there are more women in this presentation than any of the others. Thank you for joining, because falling in love and MongoDB don't really go together in my mind. I wish they did. And, in fact, we're going to hear how that works.
Look at this picture. It looks like-- oh, what's the name of the movie all about love? Love Actually, thank you. Love Actually, that's the picture it looks like. We're going to hear about that today. We're really pleased to have the CTO of eHarmony. Mr. Thod Nguyen is with us. He doesn't do very many presentations, so we feel particularly honored.
He is a CTO renowned for bringing about agile development and new technologies. He's been the CTO at several other medium and small startups. He decided to join eHarmony, I'm not exactly sure why, hopefully he'll tell us, two years ago, and in that process, MongoDB has come in. He shared that he's also excited about the vision for MongoDB we've heard this morning. And as more and more people straggle in, I'm going to thank him for joining us today. He's going to share the story about how MongoDB is helping make matches, how more and more people can fall in love more quickly. Imagine, if you could take the speed of development that MongoDB brings and bring it to love, we could cut out all that noise in dating and actually just find our soulmates, right? OK. Mr. Thod Nguyen, thank you so much for joining us. Please share the story for us. [APPLAUSE]
Please turn off your phones. Put them on silent. And at the end, I'd ask you to fill-in the survey about whether or not you fell in love today. Well, thank you, you guys, for coming. It has been an honor to be here. I'm very excited to be part of this first ever MongoDB World.
And like you see in a lot of those pictures-- those are not stock photos, if you have any questions. Those are real, successful couples. We have more than 600,000 successful couples since the founding of the company 14 years ago. 600,000 happily married couples. So just to let you know some of the very interesting statistics.
So, well, I'm Thod Nguyen. I'm the CTO at eHarmony. For those of you who haven't heard about eHarmony before, we are a recommendation service that provides singles with long-term and happy relationships. We currently operate across three different continents. We operate in the United States, Canada, Australia, and UK.
So, on average, we have about 438 people that get married on eHarmony every single day. That translates to about 5% of all the new marriages in the US alone. With 51 million plus registered members that we have, and 25 million plus visits per month, and more than 1.2 million people that got married on eHarmony since the company was founded 14 years ago by Dr. Warren, that's a lot of people. That's a lot of user engagement. And that's a lot of activities. And that translates to a lot of data that we need to store, manage, and process.
So how much data are we talking about? Well, our compatibility matching processor, also known as the CMP application, creates about 3 billion plus potential matches per day, with about 25 plus terabytes of user data in our entire matching system. It supports 60 million plus complex multi-attribute queries daily, looking across 250 plus attributes. Our systems store and manage about 200 million plus photos with 15 plus terabytes of data in our photo storage. They also manage about four billion plus relationship questionnaires, with 25 plus terabytes of data.
So this CMP application was built on top of the relational database. And it started to perform quite slow, way too slow. It was taking us more than two weeks to reprocess everyone in our entire matching system. And that was way, way too long for our customer.
So since we migrated to the MongoDB data storage solution, we achieved amazing results. We were able to reduce the processing time to match by 95% plus, from two plus weeks to less than 12 hours on the 3 billion plus potential matches that we create every single day. In terms of the key performance metrics, compared to a year ago, we are seeing about a 30% increase in two-way communication, a 50% increase in paid subscribers, and a 60% plus increase in traffic growth, in terms of unique visitors and visits.
So today's talk is about our compatibility matching system, how and why we rebuilt it on the MongoDB data storage solution, and the lessons we learned along the way. So for today's agenda, first I will talk about our compatibility matching system, which is the key to generating all those happy couples and satisfied marriages that I was talking about earlier. Then, I will talk about the old system, how it was architected, and where we ran into problems. Then, I will talk about the new system, our requirements, the technology we evaluated, and why we selected the MongoDB solution. And finally, I will discuss some of the lessons we learned during the MongoDB transition and some of the new use cases we plan to use MongoDB for.
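As a rough check on those figures, the arithmetic can be sketched in a few lines of Python. This assumes exactly 14 days for the old run and exactly 12 hours for the new one (the talk says "two plus weeks" and "less than 12 hours"), and treats one run as covering 3 billion potential matches; the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the speed-up quoted in the talk.
# Assumptions (not exact figures from eHarmony):
#   old run = exactly 14 days, new run = exactly 12 hours,
#   one run covers 3 billion potential matches.
OLD_HOURS = 14 * 24
NEW_HOURS = 12
MATCHES_PER_RUN = 3_000_000_000

reduction = 1 - NEW_HOURS / OLD_HOURS          # fraction of time saved
speedup = OLD_HOURS / NEW_HOURS                # how many times faster
matches_per_second = MATCHES_PER_RUN / (NEW_HOURS * 3600)

print(f"time reduction: {reduction:.1%}")
print(f"speed-up: {speedup:.0f}x")
print(f"sustained rate: {matches_per_second:,.0f} matches/s")
```

Under these assumptions the reduction comes out slightly above 96%, which is consistent with the "95% plus" quoted, and the new system would need to sustain on the order of tens of thousands of matches per second.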
So eHarmony's secret sauce is our compatibility matching system. It consists of a very sophisticated three tier process. The compatibility matching models identify potential matches based on your core compatibility, derived from the 29 dimensions of personality and psychology traits and based on your user set of preferences as well.
The affinity matching models predict the probability of communication between two people. That is, will these two people connect, or want to connect? Two people can be very compatible, because they have similar interests, they have similar beliefs, they have similar values. However, they may not want to connect because of other reasons.
For example, they could be completely different age groups. One person could be 30, the other person could be 60. You know, like Donald Sterling, for example. That's a bad example, by the way. I didn't mean to refer to Donald Sterling.
Or they could live about 3,000 miles apart. She lives in Los Angeles, and her soulmate lives in New York. So that's way too far, right? 3,000 miles apart. But also, they may not be attractive to one another. So this leads to the last process, which is our match distribution model. It helps to ensure that we deliver the right matches to the right user at the right time and to deliver as many matches as possible across our entire active network.
So, for the purpose of today's talk, I will stay mostly on the compatibility matching system, allowing us to focus a lot more on the usage of the MongoDB solution. So the compatibility matching system is a two-step process. So traditional search is uni-directional, right? To understand how it works, let's take a look at Nikki as an example.
In this particular scenario, Nikki's in the market looking for a toaster on Amazon, for example. All that really matters in the uni-directional search is to return the toaster that meets the criteria that Nikki had specified. And whichever toaster, she gets to take it home. The poor toasters have no choice in this matter.
What the heck? It sounds like a marriage, right? Just kidding.
So dating is more complex than this, especially when you're trying to create a very meaningful and romantic connection between two people. So dating is bi-directional. Both people need to want to be with one another. At eHarmony, we develop a sophisticated bi-directional system to make sure that the user preferences are met both ways, or bi-directionally.
Let's take Nikki as an example again. This time, she's not looking for a toaster on Amazon. She's tired of Amazon. She wants to go straight to eHarmony. And then, we have a couple more eHarmony users, such as Jeb, Jon, and Nick.
First we need to consider only those that meet Nikki's criteria. In this case, that's only Jeb and Jon. For us to make a match, Nikki also needs to meet the criteria specified by Jon or Jeb. In this particular case, that's only Jon. So what are some of the criteria that we are talking about? These are simple things like age, distance, religion, ethnicity, income, or education. So this completes the first part of our matching system.
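The bi-directional check described here can be sketched as follows. This is a minimal illustration, not eHarmony's implementation: it uses age as the only criterion, and the ages and preference ranges for Nikki, Jeb, Jon, and Nick are invented so that the outcome mirrors the example in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    age: int
    min_age: int  # youngest partner this member accepts (hypothetical preference)
    max_age: int  # oldest partner this member accepts (hypothetical preference)


def accepts(seeker: Member, candidate: Member) -> bool:
    """True if the candidate falls within the seeker's stated preferences."""
    return seeker.min_age <= candidate.age <= seeker.max_age


def bidirectional_matches(seeker: Member, pool: list) -> list:
    """Keep only candidates where the preferences are met in both directions."""
    return [c for c in pool if accepts(seeker, c) and accepts(c, seeker)]


# Invented sample data, chosen to reproduce the example from the talk.
nikki = Member("Nikki", age=30, min_age=28, max_age=40)
jeb = Member("Jeb", age=35, min_age=20, max_age=25)    # fits Nikki, but Nikki is outside his range
jon = Member("Jon", age=32, min_age=27, max_age=35)    # fits both ways
nick = Member("Nick", age=50, min_age=25, max_age=35)  # outside Nikki's range
pool = [jeb, jon, nick]

one_way = [c.name for c in pool if accepts(nikki, c)]
two_way = [c.name for c in bidirectional_matches(nikki, pool)]
print(one_way)  # candidates meeting Nikki's criteria
print(two_way)  # candidates where the criteria are met both ways
```

A uni-directional search would stop at `one_way`; the second `accepts` call with the arguments swapped is what makes the search reciprocal.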
In the second step, we also account for the sophisticated personality traits that a user conveys to us by filling out a long list of questionnaires. Normally, if you meet someone at the bar or at work, or at a conference like MongoDB World, and hopefully you're looking for your soulmate here in addition to Mongo, you probably wouldn't start by asking them, how well do you behave under stress? That's not a very good pickup line, is it? So with eHarmony, however, we account for all these types of factors behind the scenes, seamlessly. Before we accept any user to our matching system, we ask them to complete a long list of questionnaires, about 150 of them, related to their personality, their values, their beliefs, and their attributes. And then, we create a very unique personality profile about you that we're going to use later in our matching system. We also have a dedicated team doing extensive research in data science and clinical psychology to define happy relationships and what type of personality profiles were most compatible in those relationships. And then we model them as mathematical formulas.
So we're, in turn, applying data science to love. There's a lot of mathematics that we use in our models. So we call them our CMS models, and that is our secret sauce. It's a very complicated secret sauce, by the way. So to sum up the entire process, when you're looking for potential compatible partners for Nikki, for example, as the first step, we run reciprocal complex multi-attribute queries to identify potential matches for Nikki. And we only retain the candidates where the criteria are met both ways, or bi-directionally. As a second step, we take the remaining candidates, and we run them through a slew of compatibility models that we have accumulated over the last 14 years. Only those candidates who pass the threshold set by the CMS models are retained and positioned as potential compatible matches for Nikki. So, as you can see, the entire process sounds so romantic, doesn't it? I hope so.
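The second step, scoring the surviving candidates and keeping those above a threshold, might look roughly like the sketch below. The talk does not describe the CMS models themselves, so the scoring function here is a deliberately simple stand-in (one minus the mean absolute difference of trait values), and the trait names, values, and threshold are all hypothetical.

```python
def compatibility_score(a: dict, b: dict) -> float:
    """Stand-in score in [0, 1]: 1 minus the mean absolute trait difference.

    This is NOT eHarmony's model; it only illustrates 'score, then threshold'.
    Trait values are assumed to lie in [0, 1] and both dicts share the same keys.
    """
    diffs = [abs(a[trait] - b[trait]) for trait in a]
    return 1.0 - sum(diffs) / len(diffs)


def apply_threshold(seeker_traits: dict, candidates: dict, threshold: float) -> list:
    """Return names of candidates whose score meets or exceeds the threshold."""
    return [
        name
        for name, traits in candidates.items()
        if compatibility_score(seeker_traits, traits) >= threshold
    ]


# Hypothetical trait profiles for candidates that already passed step one.
nikki_traits = {"openness": 0.8, "agreeableness": 0.6}
survivors = {
    "Jon": {"openness": 0.7, "agreeableness": 0.7},   # close to Nikki on both traits
    "Sam": {"openness": 0.2, "agreeableness": 0.1},   # far from Nikki on both traits
}

compatible = apply_threshold(nikki_traits, survivors, threshold=0.75)
print(compatible)
```

Keeping the cheap reciprocal filter first and the expensive model scoring second means the models only ever run on candidates that could actually be delivered as matches.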
So let's talk about some fun techie stuff. So, here's what our old system looked like, 10 plus years ago, before my time, by the way. So the CMP is the application that performs the job of compatibility matchmaking. And eHarmony is a 14 year-old company at this point. And this was the first pass of how the CMP system was architected.
In this particular architecture, we have a number of different CMP application instances that talk directly to our central, transactional, monolithic Oracle database. Not MySQL, by the way. We perform a lot of complex multi-attribute queries against this central database. Once we make a billion plus of potential matches, we store them back to the same central database that we have.
At that time, eHarmony was quite a small company in terms of the user base. The data size was quite small as well. So we didn't experience any performance or scalability problems. As eHarmony became more and more popular, the traffic started to grow very, very quickly. So that architecture did not scale, as you can see. It's a very simple architecture.
So there were two fundamental problems with this architecture that we needed to solve very quickly. The first problem was related to the ability to perform high volume, bi-directional searches. And the second problem was the ability to persist a billion plus of potential matches at scale.
So here was our v2 architecture of the CMP application. We wanted to scale the high volume, bi-directional searches, so that we could reduce the load on the central database. So we started setting up a bunch of very high-end powerful machines to host the relational Postgres database. Each one of the CMP applications was co-located with a local Postgres database server that stored a complete copy of the searchable data, so that it could perform queries locally, hence reducing the load on the central database.
So the solution worked pretty well for a couple years, but with the rapid growth of eHarmony user base, the data size became bigger, and the data model became more complex. This architecture also became problematic. So we had five different issues as part of this architecture.
So one of the biggest challenges for us was the throughput, obviously, right? It was taking us more than two weeks to reprocess everyone in our entire matching system. More than two weeks. And we had to do this every single day in order to deliver fresh and accurate matches to our customers, especially since one of those new matches that we deliver to you may be the love of your life. We don't want to miss that. So of course, this was not an acceptable solution to our business, but also, more importantly, to our customer. So the second issue was, we were doing massive CRUD operations, 3 billion plus per day, on the primary database to persist a billion plus of matches. And these CRUD operations were killing the central database. And at this point in time, with this current architecture, we only used the Postgres relational database server for bi-directional, multi-attribute queries, but not for storing. So the massive CRUD operations to store the matching data were not only killing our central database, but also creating a lot of excessive locking on some of our data models, because the same database was being shared by multiple downstream systems.
And the fourth issue was the challenge of adding a new attribute to the schema or data model. Every single time we made any schema changes, such as adding a new attribute to the data model, it was a complete nightmare for both our engineering team and our ops team. We would spend several hours first extracting the data dump from Postgres, massaging the data, copying it to multiple servers and multiple machines, and reloading the data back to Postgres, and that translated to a high operational cost to maintain this solution. And it was a lot worse if that particular attribute needed to be part of an index.
So finally, any time we made any schema changes, it required downtime for our CMP application. And it was affecting our client application SLA. So finally, the last issue was related to the fact that, since we were running on Postgres, we started using a lot of advanced indexing techniques with a complicated table structure that was very Postgres-specific in order to optimize our queries for much, much faster output. So the application design became a lot more Postgres-dependent, and that was not an acceptable or maintainable solution for us.
So at this point, the direction was very simple. We had to fix this, and we needed to fix it now. So my entire engineering team started to do a lot of brainstorming about from application architecture to the underlying data store, and we realized that most of the bottlenecks are related to the underlying data store, whether it's related to querying the data, multi-attribute queries, or it's related to storing the data at scale. So we started to define the new data store requirements that we're going to select. And it had to be centralized.
We don't want to repeat the same mistake that we had made before using the decentralized SQL solution based on Postgres. It had to be auto-magical. In other words, it had to support auto-scaling. Even though eHarmony has a very big brand, we still want to operate with a very small team. So we wanted a solution where we don't have to spend a lot of time maintaining that solution, like adding a new shard, a new cluster, a new server to the cluster, and so forth. The bottom line is, we wanted to spend as little time as possible.
Built-in sharding. As our big data grows, we want to be able to split the data into multiple shards, across multiple physical servers, to maintain high throughput performance without any server upgrade. And the third thing related to auto-magical is that auto-balancing of data is required to evenly distribute your data across multiple shards seamlessly. And the fourth one is that it must support fast, complex, multi-attribute queries with high performance throughput. And lastly, it had to be easy to maintain.
So we started looking at a number of different data storage solutions, starting with Solr search. I'm sure a lot of you guys know Solr very well, especially if you're doing a lot of search. We tried to do this as a traditional search, uni-directional. But we realized that our bi-directional searches are driven a lot by business rules, and they have a lot of restrictions. So it was really hard for us to mimic a pure search solution in this model.
We also looked at the Cassandra data store, but we found that its API was really hard to map to a SQL-style framework, because it had to coexist with the old data store during the transition. And I think you guys know this very well. Cassandra seemed to scale and perform a lot better with write-heavy applications and less so with read-heavy applications. And this particular case is read-intensive.
We also looked at pgpool with Postgres, but it failed on aspects of ease of management related to auto-scaling, built-in sharding, and auto-balancing. And lastly, we looked at the project called Voldemort from LinkedIn, which is a distributed key-value pair data store, but it failed to support multi-attribute queries.
So why was MongoDB selected? Well, it's pretty obvious, right? It provided the best of both worlds. It supported fast multi-attribute queries and very powerful indexing features with a dynamic, flexible data model. It supported auto-scaling. Anytime we want to handle more load, we just add an additional shard to the sharded cluster. If a shard's getting hot, we add an additional replica to the replica set, and off we go. It has built-in sharding, so we can scale out our data horizontally, running on top of commodity servers, not the high-end servers, and still maintain very high throughput performance.
Auto-balancing of data within a shard or across multiple shards happens seamlessly, so that the client application doesn't have to worry about the internals of how its data is stored and managed. There were also other benefits, including ease of management. This is a very important feature for us, important from the operations perspective, especially when we have a very small ops team that manages more than 1,000 servers and 2,000 plus additional devices on premise. And also, it's so obvious, it's open source, with great community support from all of you, plus the enterprise support from the MongoDB team.
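To make "reciprocal multi-attribute query" concrete, here is a hedged sketch of how such a filter could be expressed as a MongoDB-style query document. The operators `$gte`, `$lte`, and `$in` and the dotted path syntax for embedded fields are standard MongoDB query syntax; the field names (`age`, `education`, `prefs.min_age`, and so on) and the document layout are assumptions made for illustration, not eHarmony's schema. So that the example runs without a database, a tiny evaluator for just these three operators is included; it is a toy, not MongoDB's query engine.

```python
def reciprocal_filter(seeker: dict) -> dict:
    """Build a MongoDB-style filter: candidates who fit the seeker's
    preferences AND whose own age preferences admit the seeker."""
    prefs = seeker["prefs"]
    return {
        "age": {"$gte": prefs["min_age"], "$lte": prefs["max_age"]},
        "education": {"$in": prefs["education"]},
        "prefs.min_age": {"$lte": seeker["age"]},  # dotted path into embedded doc
        "prefs.max_age": {"$gte": seeker["age"]},
    }


# Toy evaluator supporting only the operators used above.
OPERATORS = {
    "$gte": lambda value, arg: value >= arg,
    "$lte": lambda value, arg: value <= arg,
    "$in": lambda value, arg: value in arg,
}


def get_path(doc: dict, dotted: str):
    """Resolve a dotted field path such as 'prefs.min_age'."""
    for key in dotted.split("."):
        doc = doc[key]
    return doc


def toy_match(doc: dict, flt: dict) -> bool:
    """True if every operator condition on every field holds for the document."""
    return all(
        OPERATORS[op](get_path(doc, field), arg)
        for field, conditions in flt.items()
        for op, arg in conditions.items()
    )


# Hypothetical member documents.
nikki = {"name": "Nikki", "age": 30, "education": "masters",
         "prefs": {"min_age": 28, "max_age": 40, "education": ["bachelors", "masters"]}}
jeb = {"name": "Jeb", "age": 35, "education": "bachelors",
       "prefs": {"min_age": 20, "max_age": 25, "education": ["bachelors"]}}
jon = {"name": "Jon", "age": 32, "education": "masters",
       "prefs": {"min_age": 27, "max_age": 35, "education": ["masters"]}}

flt = reciprocal_filter(nikki)
names = [doc["name"] for doc in (jeb, jon) if toy_match(doc, flt)]
print(names)
```

With a real driver, a filter document of this shape would be passed to a find call on the collection, and the queried fields would typically be covered by a compound index; which fields to index and in what order is a tuning decision the talk does not detail.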
So what are some of the trade-offs when we deploy to the MongoDB data storage solution? Well, obviously, MongoDB's a schema-less data store, right? So the data format, that is, the field names, is repeated in every single document in a collection. So if you have 100 million plus records in your collection, it's going to require a lot of wasted space, and that translates to high throughput or a larger footprint. Aggregation queries in MongoDB are quite different from traditional SQL aggregation queries, such as group by or count, and that also results in a paradigm shift from DBA-focused to engineering-focused.
And lastly, the initial configuration and migration can be a very, very long and manual process due to the lack of automated tooling on the MongoDB side. And we had to create a bunch of scripts to automate the entire process initially. But in today's keynote from Elliott, I was told that, well, they're going to release a new MMS automation dashboard for automated provisioning, configuration management, and software upgrades. This is fantastic news for us, and I'm sure for the entire community as well.
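The space cost of repeating field names can be estimated with simple arithmetic. In BSON, each element stores a one-byte type tag, the field name as a null-terminated string, and then the value, so every document pays for its own field names. The sketch below counts only the name bytes; the document count comes from the talk, while the number of fields per document and the average name length are assumptions chosen for illustration.

```python
# Rough estimate of space spent on repeated field names in a collection.
# Only the name bytes (plus null terminator) are counted; values and
# type tags are ignored.
DOCUMENTS = 100_000_000     # "100 million plus" records, from the talk
FIELDS_PER_DOCUMENT = 250   # assumption: each document carries all 250 attributes
AVG_NAME_LENGTH = 10        # assumption: average field-name length in bytes


def name_overhead_bytes(documents: int, fields: int, avg_name_length: int) -> int:
    """Bytes used by field names alone: (name + null terminator) per field."""
    return documents * fields * (avg_name_length + 1)


overhead = name_overhead_bytes(DOCUMENTS, FIELDS_PER_DOCUMENT, AVG_NAME_LENGTH)
shortened = name_overhead_bytes(DOCUMENTS, FIELDS_PER_DOCUMENT, 2)

print(f"{overhead / 1e9:.0f} GB spent on field names")
print(f"{shortened / 1e9:.0f} GB with 2-character names")
```

Under these assumptions the names alone account for a few hundred gigabytes, which is why shortening field names is a commonly suggested mitigation for very large collections.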
There were a couple of key lessons that we learned during the MongoDB migration. Always turn on the firehose. Whenever you're evaluating a solution or testing a solution, use your production data and production queries so that you can actually make an apples-to-apples comparison, in terms of performance and scalability metrics. Unleash the Chaos Monkey. During your load testing, kill one of your MongoDB servers or instances in your sharded cluster to make sure that your cluster and your application still continue to function normally.
Involve the MongoDB team from the start, from development through to productionalizing your cluster, so that you can actually get the best possible architectural guidance and support and best practices related to data modeling, queries, indexes, and selecting the right shard key, and so that they can also help you to productionalize your MongoDB cluster as well. Select a good shard key from the start, such that all your queries can be isolated to a shard, so that Mongo doesn't have to wait to collect the results across all shards. That would definitely impact your query performance in our particular case.
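The shard key advice can be illustrated with a toy router. This sketch assumes a hypothetical shard key named `user_id` and four shards, and it places documents with a simple checksum modulo the shard count; MongoDB itself routes by ranges of (optionally hashed) shard key values, so this only demonstrates the difference between a targeted query and a scatter-gather query, not the real routing algorithm.

```python
import zlib

NUM_SHARDS = 4          # assumption for illustration
SHARD_KEY = "user_id"   # hypothetical shard key field


def shard_for(key_value: str) -> int:
    """Deterministically map a shard key value to a shard (toy placement)."""
    return zlib.crc32(key_value.encode("utf-8")) % NUM_SHARDS


def shards_to_query(query_filter: dict) -> list:
    """Shards a router must contact for this filter.

    If the filter pins the shard key to one value, a single shard suffices.
    Otherwise every shard must be asked and the results merged.
    """
    if SHARD_KEY in query_filter:
        return [shard_for(query_filter[SHARD_KEY])]
    return list(range(NUM_SHARDS))


targeted = shards_to_query({"user_id": "nikki-123", "status": "active"})
scatter = shards_to_query({"status": "active"})

print(f"targeted query touches {len(targeted)} shard(s)")
print(f"scatter-gather query touches {len(scatter)} shard(s)")
```

The second query has to wait for the slowest of all shards before it can return, which is the query performance impact the talk warns about.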
And this one's very important for us. Run your new cluster in a shadow mode. Our entire matching system was based on an event-driven, service-oriented architecture, or SOA model, so it's very easy for us to deploy two different CMP clusters in active / passive mode. So we have one cluster running on relational. We have one cluster running on Mongo, MongoDB, sharing the same distributed messaging system.
Basically, the messages from real production traffic are replicated to both of the clusters, the one running on top of relational and the one running on Mongo. So we were able to do a lot of tuning in production related to the right shard key, the right capacity in terms of replicas and shards, and also query optimizations, without impacting our production users. So once we certified the solution in production in passive mode, we simply switched to the MongoDB-based cluster. This is very important, because it's very difficult to know what your production traffic is like until you put your new cluster inside production in passive mode, even though you could have the best possible load testing environment. It's very, very difficult to generate production traffic in that particular case with the kind of scale that we have.
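The shadow-mode idea can be sketched as a small router that feeds every message to both clusters but only ever returns the active cluster's answer. In the talk the duplication happens in a shared distributed messaging system, not in application code, so the in-process `ShadowRouter` class and the two handler functions below are hypothetical stand-ins that only illustrate the pattern.

```python
class ShadowRouter:
    """Send each message to the active and the passive handler.

    Only the active result is returned to callers; disagreements and
    passive-side failures are recorded for later analysis.
    """

    def __init__(self, active, passive):
        self.active = active
        self.passive = passive
        self.mismatches = []

    def handle(self, message):
        primary = self.active(message)
        try:
            shadow = self.passive(message)
            if shadow != primary:
                self.mismatches.append((message, primary, shadow))
        except Exception as exc:  # the passive side must never break production
            self.mismatches.append((message, primary, repr(exc)))
        return primary


# Hypothetical handlers standing in for the two CMP clusters.
def relational_cluster(user_id: int) -> int:
    return user_id * 2


def mongo_cluster(user_id: int) -> int:
    # Deliberately wrong for one input to show how a discrepancy is caught.
    return -1 if user_id == 3 else user_id * 2


router = ShadowRouter(active=relational_cluster, passive=mongo_cluster)
results = [router.handle(uid) for uid in (1, 2, 3)]

print(results)
print(router.mismatches)
```

Switching over then amounts to swapping which handler is active once the mismatch log stays empty under real traffic.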
So something interesting that's coming up from eHarmony. Our core mission is to make people's lives better and happier, whether to help you to find a love of your life, across multiple languages, multiple locales, multiple countries, it doesn't matter. Or to help you finding the right job. So our online dating in Australia and UK have been extremely profitable, so we want to expand that success model to 20 other countries in the next couple years.
We are also working on the new job compatibility vertical using our secret sauce, compatibility secret sauce. We call it careers by eHarmony, and we plan to launch this new vertical in December of this year. So as you know very well, we've known this for quite a while, that it's really hard to make any marriage to work if you're not happy at your current job, right? Obviously, believe it or not, 65% of the people in America are not happy with the job they're currently at, and they can be if they get matched with the right job based on the culture of the company, based on the personality of to whom you will report, and in addition to your skills. So we were very, very excited about this new vertical that we're going to be launching in December.
So let's touch on a couple of potential use cases that we may consider using MongoDB for. We're looking at using MongoDB for real-time, geo-based location matching services for our mobile devices, using the MongoDB spatial index and query functionality. And I'm very excited that we're also looking at replacing our Voldemort storage with a MongoDB-based cluster solution to persist our 3 billion plus potential matches per day. And I'm very excited to hear from today's keynote from Elliott also that they will release a new concurrency model and also a new storage engine model that can actually run on top of either Fusion-io SSD or in memory. So, with this new functionality, I'm very excited that we're going to leverage this solution to store the billion plus matches that we have.
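As an illustration of the geo-based use case, the sketch below builds a MongoDB-style proximity filter and, separately, computes a great-circle distance in plain Python. The `$near`, `$geometry`, and `$maxDistance` operators are standard MongoDB geospatial query syntax for fields with a `2dsphere` index, GeoJSON points list longitude before latitude, and `$maxDistance` is in meters. The field name `location`, the 50 km radius, and the city coordinates are assumptions for the example; the talk does not describe how eHarmony would model this.

```python
import math


def near_filter(longitude: float, latitude: float, max_km: float) -> dict:
    """Build a MongoDB-style $near filter on a hypothetical 'location' field."""
    return {
        "location": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [longitude, latitude]},
                "$maxDistance": max_km * 1000,  # meters
            }
        }
    }


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


# Approximate city-center coordinates (latitude, longitude).
LOS_ANGELES = (34.0522, -118.2437)
NEW_YORK = (40.7128, -74.0060)

flt = near_filter(longitude=LOS_ANGELES[1], latitude=LOS_ANGELES[0], max_km=50)
distance = haversine_km(*LOS_ANGELES, *NEW_YORK)

print(flt)
print(f"Los Angeles to New York: about {distance:.0f} km")
```

A member in New York falls far outside a 50 km radius around Los Angeles, which is the kind of distance cut-off the affinity discussion earlier in the talk alludes to.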
So here are some of the interesting technology investments that we made to solve the most complex engineering problems that we have and to provide long-term maintainability, scalability, and innovation at eHarmony. So, for example, we use a lot of Scala, which, as I'm sure a lot of you know, is a functional programming language, to implement our CMS and affinity matching models.
We also use a lot of Hadoop with Hive. We also started exploring Spark for interactive data analytics on top of YARN, for massive data mining and data processing. And we also use a lot of R. I'm not sure if you guys are familiar with R. R is a revolution as the programming language for predictive analytics in our machine learning models. Additionally, we use a lot of Node.js with HTML5 to implement our public-facing eHarmony web applications for both the mobile web and the desktop, and a slew of other technologies that we're using right now.
So last but not least, we have a lot of, I was told from the recruiting team, so bear with me, we have a lot of open positions right now. So if you're interested, please go to eHarmony.com or jobs.eHarmony.com or reach out to me directly on LinkedIn. So this is the pitch I promised my recruiting team. And thank you very much for your time. And now, I would like to open up for any Q&A. [APPLAUSE]
So, what you're saying, Thod, essentially, is if you've already fallen in love, now it's time to get a better job. Is that it?
It's kind of a combination of both.
So just imagine, when you think about transformation, I finally get the courage. I put my profile in online. I think about it, I think about it, I sweat, I agonize, I finally do it. And I have to wait 15 days. But now, I can finally decide to do it, put my profile in online, and tomorrow morning, 12 hours later, I can find out who the love of my life is. Pretty amazing stuff.
Thank you for that. Questions?
You already have two verticals going with this technology. Have you thought of licensing it?
We are thinking about it. Right now the first vertical's about our core product, which is online dating. We're in the process of expanding it to the job compatibility. But also, we plan to expand it to friendship, like compatible friendships. So we are thinking about-- I'm not sure we're going to be licensing the API, especially it's our secret sauce, but we look at potentially partnering with other companies as well within the compatibility space.
Next question. You talked about quite a few benefits of moving to MongoDB. One thing that I did not hear about is you moved from a very relational environment to a NoSQL environment. From the design of your schema, what kind of changes or what kind of change in thinking you had to do for that?
For us, it's pretty seamless to move from the relational to the MongoDB solution. The schema has been-- I mean, obviously, we had to redesign the schema, but the engineering team feel like it's pretty straightforward in terms of migrating from the data model that we have in relational and then map to a document-based data model, it was not a very difficult, very challenging task for us. But it was a lot more challenging if we start moving toward Cassandra, for example, data model solution.
So you did not do-- is your model in Mongo still pretty relational?
Right. Right. Exactly. We try to model such that we can mitigate the risk, because for us, time to market is very important. So we didn't want to completely revamp our entire data model, especially with the massive amount of data that we store. So we try to make it as low-risk as possible, but at the same time without impacting performance throughput. That's really important for us. We cannot compromise on performance and scalability.
OK. Thank you very much again. I think Thod is going to be in the back of the room if you want to ask him questions afterwards. It is now lunch time. Thank you all for being here to hear our stories. This afternoon we have five great more stories about how to transform your business with MongoDB. See you soon.
Don't forget to fill out the surveys. Each and every time you fill out a survey, you get an opportunity for an Xbox. Thanks.