MongoDB in the Private Cloud: Empowering the Business at Goldman Sachs 1 Year Later Transcript

Good afternoon. Welcome back, we've made it through a day and a half. Woo-hoo! Has anyone been here in the business track with me for the last day and a half? All right! Thank you, for those of you who have been here.

We've heard some really great stories about how MongoDB and organizations are creating new ideas, new ways of thinking, new execution. It's been really exciting. We've heard everything from how do we start looking at clinical research together with actual trial data in health care-- looking at solving cancer problems. We've learned how to fall in love. Imagine that, falling in love at MongoDB. I said, yeah, those two things don't really go together. But we heard it just before this. Yesterday morning, we kicked it off with a great presentation from Citigroup-- how they're making quantum leaps forward in the way they manage data across the globe.

I'm happy to say we have five more great tracks of the same kind. Five industries will be represented, and I'm really pleased to have with us Goldman Sachs. They're going to tell us about their journey. A year ago, they came and they shared with us what they were looking at for private cloud using MongoDB. They have a year of experience behind them, and we've got three great gentlemen to share that story.

We have Warren Finnerty, who's going to kick it off, who's the Managing Director of Technology. We have Bryan Doyle, who's the Vice President of Technology, and we have Jim Pastos, who's also a Vice President of Technology. They're going to share their story one year later about what's happening with private cloud at Goldman Sachs.

Let me just remind you as we get started, please put your mobile phones on silent. There is a survey out there available at the bottom of the MongoDB app. Every time you enter the survey and you fill it in-- only five short questions-- you get into a drawing for an Xbox. Thanks, everybody. Gentlemen, thank you for coming to share your story.

OK. We're going to have, as noted, three speakers here. I'm going to do a little bit of the setup-- hopefully be here for the shortest amount of time. And Bryan then is going to talk a lot about the actual experience with the infrastructure and plumbing that we've built out. And Jim will give you a use case that utilizes that. And we'll all reference obviously-- last year, when we spoke, we had some preconceived notions about how things would play out, and 12 months later we have some lessons learned. And we'll talk through this.

And I'll do a little bit of the setup here on why do we want to actually produce a database offering, or a database on a cloud, database as a service, in general, and what are those attributes-- and why Mongo proved to be a very good fit for that. So I've seen some presentations here. It's been a good experience here in listening to some of the other companies talk about how Mongo has allowed them to revolutionize their business. And a lot of them are a company based around an application which uses a database. And for those people, doing some very custom engineering work and building out Mongo to fit their business problem makes a ton of sense.

What we're talking about is the opposite quadrant here. And so for places like Goldman Sachs and Citi as well, we're big, big shops. And we face many of the same challenges that Amazon does, which is we're trying to fulfill the need for software very rapidly and in a very low-touch form. And everybody's had tremendous success with stateless computing in this model, in that it's very easy to provision Windows, Linux, or other stateless compute environments or compute grids on the fly. One is because generally the software stacks are stateless, so it's very easy to conjure them up. And two is, they don't rely on any custom, bespoke hardware.

So a lot of times, if you look at database stacks, they use very exotic equipment. Scale-up, typically, over the years. And they use possibly some sort of disk technology, a SAN, things that replicate, make the database lossless, give it resiliency properties. All those things are done through very bespoke, hand-cranked infrastructures.

So for database in a cloud, we want to be treating our database farm just as we would a compute farm. The thing that we've done is we've separated supply from demand. And where Mongo winds up being a very, very good fit is that it is scale-out-- there's been a lot of sessions talking about how to shard appropriately, and Bryan will talk about some of the decisions we made. It runs on very vanilla hardware. It does not require exotic hardware. It is very well behaved when virtualized. That is one of the properties which is, in point of fact, why people find it so appealing to use it under AWS.

It also disintermediates technology infrastructure. I am in technology infrastructure. My goal is to be able to run 10 times the amount of kit with a smaller team than I have today. So we really need to not have technology infrastructure staff in the loop. So those are the attributes of this. Separated demand and supply, not bespoke equipment, push out as much of the function to the actual application developer, and not require a big technology infrastructure presence.

Some of these points are listed on this slide as well. Internally, within the financial industry certainly, one of the things that maybe is a little more problematic for us when it comes to state is pushing out to a public cloud. It is somewhat tricky probably for everybody to feel comfortable if you push your data out publicly-- I mean why engineer this yourself? Why have an on-premises solution? Why not go to Rackspace? Why not go to AWS? And, over time, I'm sure a lot of that will be done. There's some sensitivities around data and data insurance for us. Our data is absolutely vital. And any tiny exposure of data in a way that we wouldn't have controlled would be really a disastrous event. So that's why we're looking to build all these properties on premises.

And I think what's happened, very often people will say, well, I haven't needed this for 20 years for my Oracle, for my DB2, for my Microsoft SQL Server, for whatever. And I think one of the things we need to think about is that that's because building things manually-- meaning with a human being in the loop running automation-- has been sufficient, because you only fulfilled those old needs. Already we have a lot of needs where people want to spin up environments, persist state, do something with that environment, and instead of doing it for weeks or for months, do it for days or hours, and then return that environment, meaning chip it, and only incur resource costs for the period of time they used it. So I think Mongo is pretty well situated: due to its relatively low-touch topology, and due to the fact it doesn't require a scale-up solution or enterprise storage, it is really, really well set up to benefit. And I think what's going to happen is there's going to be a pretty rapid inflection, where technology stacks that do allow that are going to become dominant, and technology stacks that do not work well on that are really going to take a very serious step back.

One of the things that's very nice about Mongo-- I mentioned that there's no hardware storage, but also just the simplicity of the product. If one looks at a legacy RDBMS, one of the reasons there's a lot of high-skilled labor that gets involved in the management of that ecosystem is the zillions of tunables; it just conceptually was not built to run unattended. And Mongo is starting off in an extremely simple form. And in fact, as some of the sessions yesterday covered, the OS largely constrains it. So if you can adjust the size of your OS container, you can tune your Mongo just as you would any kind of compute that you might be running.

And it also pushes out things the technology infrastructure would get involved in, such as how durable is my data? What window for data loss do I have? It pushes that all the way out to the edge into the software. People can set their write concern and they can set their write concern for this piece of software versus that piece of software, and this time of day versus that time of day, and really disintermediates the stack from it. So all these things make Mongo a really nice fit for an on-premises cloud.
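As a rough illustration of pushing the durability decision out to the application, the sketch below picks a write concern per application and per time of day. The application name, the business-hours window, and the `choose_write_concern` helper are hypothetical and not from the talk; only the `w` and `j` fields mirror MongoDB's write concern options.

```python
def choose_write_concern(app: str, hour: int) -> dict:
    """Return write concern options for one application at one hour (0-23)."""
    critical_apps = {"trade-capture"}      # hypothetical application name
    in_business_hours = 8 <= hour < 18     # hypothetical window
    if app in critical_apps and in_business_hours:
        # Wait for a majority of replica set members and for the journal.
        return {"w": "majority", "j": True}
    # Otherwise accept acknowledgement from the primary alone.
    return {"w": 1}


if __name__ == "__main__":
    print(choose_write_concern("trade-capture", 10))
    print(choose_write_concern("trade-capture", 23))
    print(choose_write_concern("chat-settings", 10))
```

The returned dictionary is the kind of setting an application would hand to its driver, so the window for data loss is chosen in software rather than by the infrastructure team.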

One of the big lessons that we learned here-- we knew it conceptually, but it was another thing to face it-- is that for compute clouds, generally the ratio of cores to memory is one of those important ratios where people have to decide how many different shapes they are going to offer. And largely they sort of look for a price point. Having too much memory can get quite expensive. And then below a certain point, it really doesn't save you any as to the cost, as the percentage of the total diminishes.

When you add storage to that mix, it's quite perplexing to know exactly-- and Bryan, who will be up here talking, has sort of had the heavy weight of deciding when to offer additional shapes, and what should those shapes be. Having a three-dimensional shape that involves storage as well can be somewhat painful, because there's always demand for yet another unique shape, you know, whether it's low cores, high memory, deep storage, fast storage, small memory-- and we go back and forth. And so that is something we still haven't totally come to terms with.

The other one that maybe is one that, in fact, I was just talking with Bryan earlier in the week about when we were talking about our next-- is sharing-- having a separate-- the same physical asset, assuming you don't go to an AWS and get a virtualized asset-- but having the same physical asset support multiple offerings, especially when you're geographically dispersed like Goldman Sachs is. You're all over the globe. Having a unique form factor for Mongo really is an impediment, because then you have to pre-position hardware all around the globe.

And so as we look at various offerings that we can co-locate on the same physical hardware, there's a strong desire to say, even if it's not optimal-- and there were some talks earlier talking about what's the maximum amount of memory per node-- what's the memory to data ratio? What's the maximum amount of memory that would be considered within normal parameters? And spinning disk versus SSD. So those types of decisions. Sharing physical hardware for us really opens a lot of doors. That being said, if the passage of time comes in, and the geometry changes, that could be problematic.

The other thing is for shops-- if people are here, and you're a medium-sized shop, you might have to look carefully. You may not want to build bespoke Mongo in terms of your cloud, but you also might say, how much-- where does the return on investment end in terms of how much end-to-end automation do you put in? Especially as Mongo comes out with more and more function, do you have more and more self-service automation, more and more autonomics.

And since I run the database technology for the firm, the last note there is it's difficult to change the mindset of DBAs wanting to get involved in the process-- to help, to assist, to tune, to insert themselves in the process. And I don't think that's where we want to be going with this. A lot of the advantage is the product should not require that. And I think with that I will open it up to Bryan Doyle to talk about the actual lessons learned from our offering in that time. Thanks. And we'll have-- Jim will then talk, and we'll have some Q and A afterwards.

Thanks. Just out of curiosity, how many of you guys saw our talk, or were at the talk, last year? I'll just let everybody raise their hands. OK. So what I would recommend doing is going back and actually watching that. I'm going to be referencing some of the things that we spoke about. But since that was a 40 minute talk and this is a 40 minute talk, I can't cover the whole thing.

But basically, the summary is that we went through a lot of metaphors around cookie cutters and whatnot, and talking about how to make shapes in general. So about a year later, I'm obviously-- we've been running with this effort, and we're going to go through it with some of the decisions that we made. But if you look back a year ago, you'll see something like how we were thinking about coming to terms with this.

So a year later we're enabling new use cases at the firm using MongoDB. Most of the DevOps nodes that were running prior to the service offering being available have successfully migrated into the service offering utilizing the Enterprise Edition. We had to do a lot of development, developer training, and education around what that meant coming in and also using MongoDB in general. But it really turns out that incremental engineering efforts and education go a long way over time. And looking back a year, it's really actually remarkable how much we've accomplished so far.

We're actually engaged with MongoDB stronger than ever. We're involved in the Financial Service Advisory Group. Obviously, we're speaking here. We've been involved in feedback sessions, going through support issues. And they've been very supportive in terms of things that we run into and taking feedback in terms of making the product better. And in addition, we have an understanding of where the product is going.

MongoDB itself is now integrated into our database platform, and we are doing incremental automation, incremental self service. And over every week there's a possibility that we release new functionality into the offering. And it's really important to do that, because you start getting a very quick feedback cycle in terms of what's available and basically making sure that your priorities can be adjusted as people are running into issues. And also incrementally enabling people to do different things.

Last year we made the meme that MongoDB continues to be a sweet spot between the file system and the relational database. And that continues to be the case today in terms of the use cases that we're seeing. But it is important to note that search, document, relational, and other database platforms are converging into that space as well. So it's important to understand the ecosystem and really understand when Mongo's the right fit for the use case. In order to make the service offering successful, you need to make sure that your developers really understand that as well. They shouldn't be using the tool that's not the right tool, but we want to make the tool available for the use cases that really do benefit from it. And there are a lot of them.

So just in general, it's important to minimize overlapping platforms, but it is important to understand when they actually are overlapping and when they're not. So part of that is really setting developer expectations. So right now there's only one compute shape. If somebody wants to use Mongo in the service offering, they have to use that shape. And it basically becomes a constraint on their design in some cases, and in other cases they will move to a different platform unless we offer a different shape to them. Part of that is basically having fixed data disk size-- to what Warren was talking about, disk size is one of the dimensions that we had to choose. So you actually have a strategy that you have to employ moving into the service offering-- whether you're going to purge documents if you hit that limit, or if you're going to create another shard.

We don't do anything automatic. It's based on basically what the application intends to do. If they are expecting data growth, they need to have sharding in mind, and they need to have a shard key. If they're not looking to shard, they have to have the ability to basically have a purge strategy in place. And as a result, we don't actually automatically provide new shards for them. It has to be something that's actively ordered.
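The purge-or-shard choice described above can be sketched as a small decision function. This is only an illustration: the `growth_plan` name, its parameters, and the returned messages are hypothetical, and it assumes data spreads evenly across shards, which a poor shard key would not guarantee.

```python
import math


def growth_plan(projected_gb: float, disk_gb: float,
                has_shard_key: bool, has_purge_policy: bool) -> str:
    """Suggest what an application team must do when data outgrows a fixed disk."""
    if projected_gb <= disk_gb:
        return "fits in current shape"
    if has_shard_key:
        # Shards are ordered explicitly by the team; nothing here is automatic.
        shards = math.ceil(projected_gb / disk_gb)
        return f"order shards (total {shards})"
    if has_purge_policy:
        return "purge to stay under limit"
    return "redesign: no shard key and no purge policy"


if __name__ == "__main__":
    print(growth_plan(250, 100, has_shard_key=True, has_purge_policy=False))
    print(growth_plan(250, 100, has_shard_key=False, has_purge_policy=True))
```

The last branch is the case the talk warns about: a second shard is of no use to an application that never defined a shard key.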

So another thing that we did talk about last year is that an out of region node is required. So we have two data centers in most of our metro areas-- that includes the Americas, Europe, and the Asia Pacific area. And there aren't three. And that is problematic for us in terms of offering solutions that require majority voting. But it also provides a benefit: our third data center is now out of region, which gives us out of the box, out of region resiliency in terms of having a copy of your most up-to-date data.

Now that does cause you to have essentially a developer that now has to choose between whether their data is going to become unavailable, whether they're going to have slower performance, or whether they're going to turn their durability down if there is an outage that involves one of the nodes in the replica set and causes their London node, for example, to become one of the two that's in the majority when one of the New York servers is down.

So it does require some education. But again, to Warren's point, it puts all of that decision into the hands of the application. Today, if a database is down, it's just down. So a lot of this stuff is incrementally improving on top of that. But we do do three nodes. We don't do five nodes. If we did five nodes, you would actually be able to get rid of some of those trade-offs. But we decided to start with a lower-cost solution in terms of data duplication.
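The trade-off in a New York, New York, London replica set can be worked through with a small model. This is a sketch and not how MongoDB itself is implemented: the `Member` class, the `assess` helper, and the node names are hypothetical, and the only rule it relies on is that a majority is counted against all configured voting members.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    region: str


def majority(member_count: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return member_count // 2 + 1


def assess(members, down, primary_region):
    """Report quorum and whether a majority write can stay inside one region."""
    up = [m for m in members if m.name not in down]
    need = majority(len(members))
    has_quorum = len(up) >= need
    local_up = sum(1 for m in up if m.region == primary_region)
    return {
        "quorum": has_quorum,
        "majority_write_is_local": has_quorum and local_up >= need,
    }


THREE_NODES = [Member("ny1", "NY"), Member("ny2", "NY"), Member("ldn1", "LDN")]

if __name__ == "__main__":
    print(assess(THREE_NODES, down=set(), primary_region="NY"))
    print(assess(THREE_NODES, down={"ny2"}, primary_region="NY"))
    print(assess(THREE_NODES, down={"ny1", "ny2"}, primary_region="NY"))
```

With one New York node down the set still has quorum, but a majority write has to wait for London, which is where the developer chooses between slower writes and a lower write concern.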

So in terms of the service levels, the node, the cluster, and the database have monitoring. We have daily backups to object storage, and the DBA team is involved in node replacement and cluster support. But overall self-service tools and other things like that are really the way that people are maintaining their databases.

So again, there's no customization. There's a couple of tunables that we can set on a config template side. But overall, the developers aren't involved in a lot of those decisions. There's basically some constraints that are given to them in terms of just this is what the service offering is. And again, features are added on a rolling basis, and things that we're working on include non-prod cluster management and failure testing. Some of the earlier presentations referred to doing Chaos Monkey-type stuff. We'll likely eventually get into that space.

Currently we're looking at making RESTful-type API calls to allow the developers to integrate failures into their testing. And if that doesn't prove to be sufficient, then we would likely move into that space as well. Again, data masking, and also production to development syncing. So that way you can end up getting very close to accurate, up-to-date data in the non-prod environments for testing.

So moving along, actually implementing this. We talked last year about late affinity of purpose. This continues to apply, and it goes back to what Warren was talking about in terms of having very vanilla hardware, and then placing something on top of it that makes a Mongo database. And we've taken that pretty much to heart. We have scripts that provision Mongo clusters very quickly, and the only thing that basically slows it down is us sending data over to the regional nodes and getting some of the configuration up. And going from bare VM to a Mongo cluster takes about 20 minutes at this point. And that's a lot better than what a lot of the other offerings are able to provide.

In terms of the virtualized environment that we have, we're running 8 VMs per hypervisor. There's a minimum of 3 VMs per shard, which could be the full cluster itself. They each have 100 gigs of SSD usable data and then there's other data reserved for other partitions, including the dump space, the config log space, the auto partitions, and the operating system. And then again, we co-locate all the processes. So I'll show the topology pretty briefly in the next couple slides. But the processes do get co-located.
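The figures quoted here (8 VMs per hypervisor, at least 3 VMs per shard, 100 GB of usable data per VM) allow a quick footprint calculation. The sketch below is only arithmetic on those numbers: the `footprint` helper is hypothetical, it assumes each shard is a three-member replica set holding full copies of the same data, and the hypervisor count is a lower bound that ignores any rule about spreading replicas across hosts or sites.

```python
import math

VMS_PER_SHARD = 3          # minimum quoted in the talk
DATA_GB_PER_VM = 100       # usable SSD data per VM, as quoted
VMS_PER_HYPERVISOR = 8     # as quoted


def footprint(shards: int) -> dict:
    """Rough resource footprint of a cluster with the given shard count."""
    vms = shards * VMS_PER_SHARD
    return {
        "vms": vms,
        # Replica set members hold copies, so unique data grows per shard.
        "unique_data_gb": shards * DATA_GB_PER_VM,
        "raw_data_gb": vms * DATA_GB_PER_VM,
        # Lower bound only; real placement would spread replicas out.
        "min_hypervisors": math.ceil(vms / VMS_PER_HYPERVISOR),
    }


if __name__ == "__main__":
    print(footprint(1))
    print(footprint(3))
```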

Importantly, the Mongo S's are not available for the application teams to co-locate with their apps. The database team has full control over them for security purposes, and to be able to control them at maintenance times. Again, the regions that we have available are listed here.

The targeted use cases in the service offering are for Mongo clusters over the range of 10s to 100s of gigabytes. If they need much smaller than that, we'll go co-locate some of the application, which doesn't happen too often. If somebody needs something much larger, we would have to create a different shape or do a custom Mongo cluster, or they would have to use a different platform.

But again, the future use cases are largely unknown. We're designing this to basically say, here are your constraints, and then people are using Mongo to develop applications. The model that we are obviously working towards, though, is a strict server client model. So there's no application that's basically running Mongo on their own and also being in the service offering. So it's not like we're deploying some sort of virtualized server and then the application team is running it.

And it actually turns out that the New York, New York, London topology is used by over 92% of our use cases. So it turns out that that actually helps us with dealing with the supply of the hardware.

So looking forward, we're integrating this into something called a database resource container. You'll see some of the input parameters that we have to that. But it basically will have the developers indicating what their intent is for the database, and then being able to manipulate those parameters to create orders in the future. Limiting approvals to things that are just sort of good housekeeping in terms of making sure somebody's not using too much capacity, that the use case is proper, and then automating everything else.

And then provisioning. Making sure that all the automation work that we're doing now fits into the broader database landscape is really important. And Mongo's sort of on the forefront of that.

One thing that I did want to bring up is that cluster name is really important. As you're in a private cloud, and your databases are moving around, name your things. Give them a personality. You need it for tracking backups. You need it to make sure somebody is connecting to the cluster that they think that they're connecting to. You need it as like a key into your inventory systems. And for all of the automation that we're talking about, there needs to be a name associated to the thing that you want to do something against. So in general, make sure that you have cluster names. It would be nice if Mongo had the cluster name in the product itself, but so far we've been able to work with just making sure that the collection of assets have a name, and if we end up being able to plug that into the product in the future, even better. But in general, personality awareness is pretty important.

This is a topology of a two-sharded cluster. Obviously there's no real host names or ports on here, or anything like that. But it does show our parent to child relationships between the different entities, and how they have names. So you'll see here there's a MongoWorld cluster that has two shards, which are called MongoWorld Shard 1 and MongoWorld Shard 2, which all have MongoD's with serial numbers on them. The config servers are co-located in the first shard, and then the Mongo S process runs on each of the different ones. And you can see a graphical representation on the right and the tabular one on the left.
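The parent to child naming described for this slide can be modeled as a small inventory structure. The classes, the `build_cluster` helper, and the exact name format below are assumptions made for illustration; the talk only establishes that a named cluster owns named shards, that each shard owns serial-numbered mongod processes, that config servers sit in the first shard, and that a mongos runs on every node.

```python
from dataclasses import dataclass


@dataclass
class Mongod:
    name: str
    runs_config_server: bool
    runs_mongos: bool = True   # a mongos is co-located on every node


@dataclass
class Shard:
    name: str
    members: list


@dataclass
class Cluster:
    name: str
    shards: list


def build_cluster(name: str, shard_count: int, members_per_shard: int = 3) -> Cluster:
    """Build a named inventory tree: cluster -> shards -> mongod processes."""
    shards = []
    serial = 1
    for shard_no in range(1, shard_count + 1):
        shard_name = f"{name}-shard{shard_no}"      # hypothetical name format
        members = []
        for _ in range(members_per_shard):
            members.append(Mongod(
                name=f"{shard_name}-mongod{serial:03d}",
                runs_config_server=(shard_no == 1),  # config servers in shard 1
            ))
            serial += 1
        shards.append(Shard(shard_name, members))
    return Cluster(name, shards)


if __name__ == "__main__":
    cluster = build_cluster("MongoWorld", shard_count=2)
    for shard in cluster.shards:
        for member in shard.members:
            print(cluster.name, shard.name, member.name, member.runs_config_server)
```

Because every process name embeds its shard and cluster, the name can serve as the key into backups, connection checks, and inventory that the talk asks for.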

In general, managing capacity is not a science. It's something that really requires us to figure out our run rate. It's something that requires us to know what our initial operating parameters are. There might be some backlog in terms of the service offering not being available. But as the quarters go by, you can basically use a concept borrowed from the military called an OODA loop. If you haven't heard of it, it's worth looking at. But it really lets you figure out what your bandwidth is in terms of how often you can react to changes and decisions.

So if you just take a look at how that mindset works, it really goes down to how often are you able to make the decision. How often are you able to observe? How many orders are coming through? And then you can try to make a proper sort of decision in terms of making it, so that you can have the ability to order things just in time, or close to just in time, to sort of be comfortable with underutilization or overutilization, and what your levels are-- what your comfort level is either from a management perspective, cost perspective, or et cetera.

And again, scaling out existing resource containers is basically what comes into the input parameters for this discussion. So if you aren't able to order more servers, and you're stuck with a certain capacity, you're either going to not have new use cases coming online, you're not going to be able to allow people to grow if they expect to be able to grow, and you could potentially get yourself into a situation where you're constraining the organization. So you have to just be very careful in that respect.
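One way to make the run-rate reasoning concrete is a simple reorder check: compare how many quarters of free capacity remain against the hardware lead time plus a comfort margin. The `should_order` function, its parameters, and the default margin are hypothetical; the talk gives no numbers for run rate or lead time.

```python
def should_order(free_slots: int, run_rate_per_quarter: float,
                 lead_time_quarters: float, margin_quarters: float = 0.5) -> bool:
    """Return True when remaining runway is shorter than lead time plus margin."""
    if run_rate_per_quarter <= 0:
        return False                       # no demand observed, nothing to order
    runway_quarters = free_slots / run_rate_per_quarter
    return runway_quarters < lead_time_quarters + margin_quarters


if __name__ == "__main__":
    # 40 free VM slots, 16 ordered per quarter, one quarter of lead time.
    print(should_order(40, 16, 1))
    # Only 16 free slots left at the same run rate.
    print(should_order(16, 16, 1))
```

Re-running a check like this every time new order data is observed is the loop being described: the shorter the interval between observing and deciding, the closer to just in time the hardware orders can be.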

So some lessons shared, pretty quickly. Fixed disk size is important. You have to obviously either be able to purge or shard. And if you shard, you obviously have thought about your shard key, right? So there's nothing that I can do as part of the service offering by creating a second shard if the shard key's not defined. There just won't be anything in it. So it's important to design applications that are going to need that ahead of time.

And again, cluster life cycle management is important. You want to minimize Cartesian products between versions of your OSes, your internal automation software, and Mongo. And being able to support versions centrally is important, but when you're actually on the local nodes, you don't have to do that so much. Keeping things sort of together is important.
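The cost of that Cartesian product is easy to see by enumerating it. The version labels below are made up for illustration, and `support_matrix` is a hypothetical helper built on the standard library's `itertools.product`.

```python
from itertools import product


def support_matrix(*axes):
    """Every combination of versions that would have to be supported."""
    return list(product(*axes))


# Hypothetical version labels, not from the talk.
OS_VERSIONS = ["os-A", "os-B"]
AUTOMATION_VERSIONS = ["auto-1", "auto-2", "auto-3"]
MONGO_VERSIONS = ["mongo-X", "mongo-Y"]

if __name__ == "__main__":
    everything = support_matrix(OS_VERSIONS, AUTOMATION_VERSIONS, MONGO_VERSIONS)
    pinned = support_matrix(OS_VERSIONS[:1], AUTOMATION_VERSIONS[:1], MONGO_VERSIONS)
    print(len(everything), "combinations if every axis floats")
    print(len(pinned), "combinations if OS and automation are pinned")
```

Pinning two of the three axes shrinks the matrix to the number of Mongo versions alone, which is the argument for keeping versions together.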

So the platform engineering aspect of all this is very expensive. Every time you onboard a new product it's expensive. And it's really important to get these concepts right from the beginning and take incremental approaches, so that way you can revisit.

And then very quickly, some feature sets-- feature requests that we have in the pipeline. We're looking to do point in time recovery via mongodump, have multi-master, compression, schema validation, and metadata, Mongo S enhancements, and architecture. These slides will be online so you can see sort of the details behind those. And then, from security, having auditing with DML and role-based auditing, having user and role expiration times, having cross-realm support in the user document itself for Kerberos, and then having granular access controls by doing field-level security, other than just having the [INAUDIBLE] framework at the aggregation framework. And with that, I'm going to pass to Jim, who will go through, briefly, some use cases that we've worked with. So, thank you.

Hi. I'm Jim Pastos. I work in applied platforms in the technology division of Goldman Sachs. I'd just like to show you some of our products we're working on, and how they came to be with the power of Mongo pretty much driving them.

So we have this product called Symphony, which is a collaboration platform built in-house, which allows employees of the firm to chat between themselves, or have like two persistent chat rooms. But also, as a social platform, you can post status updates and have your hashtags, and tag different articles, links, and also share within the firm. So additionally to that, we actually pull in different social feeds from outside. So, stuff like Twitter. And for us to be able to do that, we have to have some kind of a rule engine keeping track of all the settings a user has, which kind of keywords are interesting, what's trending, what the company thinks is important for certain groups of people. And all this data is actually stored in Mongo. So our algorithms that go in and pull down the filters per user based on all these feeds, all processed, and Mongo's giving us a very straightforward API to go in and manage and process all this data. It's actually also used for settings, rule memberships, and descriptions, and pretty much even simple key value stores when we want to do something very basic to add that to the client.

We use, obviously, the data replication features of Mongo between our data centers regionally as well. So that's all taken care of. We pretty much as developers don't have to worry about it, because just out of the box it's a very, very powerful platform to work on. But we've also had some issues while developing. So the lack of transactions was quite big, because we have to update several things at once, I guess. And if there's some kind of a network glitch in the middle, we have to, at the application level, figure out what went wrong. Probably manually we revert changes so we don't have an inconsistent state. And also, because of, again, the atomicity of changes that we have in Mongo, we pretty much have to rely on single document changes. And that way we have to sometimes duplicate data between collections. So, again, that's something that we have to worry about at the application level, but there are ways around it once you're aware of these issues that Mongo has.
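The application-level revert described here is often written as a list of steps paired with compensating actions. The sketch below uses plain dictionaries in place of collections, and every name in it (`apply_with_compensation`, `add_file`, `failing_share`, and so on) is hypothetical; it is not the team's implementation. The compensation is best effort: unlike a real transaction, the undo steps can themselves fail.

```python
def apply_with_compensation(store: dict, steps) -> bool:
    """Run (do, undo) pairs in order; on failure, undo completed steps in reverse."""
    undo_log = []
    try:
        for do, undo in steps:
            do(store)
            undo_log.append(undo)
    except Exception:
        for undo in reversed(undo_log):
            undo(store)
        return False
    return True


def add_file(store):
    store["files"]["f1"] = {"owner": "alice"}


def remove_file(store):
    store["files"].pop("f1", None)


def add_share(store):
    store["shares"]["f1"] = ["bob"]


def remove_share(store):
    store["shares"].pop("f1", None)


def failing_share(store):
    raise ConnectionError("simulated network glitch")


if __name__ == "__main__":
    store = {"files": {}, "shares": {}}
    ok = apply_with_compensation(
        store, [(add_file, remove_file), (failing_share, remove_share)]
    )
    print(ok, store)
```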

So moving on. For those of you who were here last year, we had this product called Secure Locks. So Orbidrive is like an evolution of that product, which is basically a drive. You can think of it as Dropbox, Google Drive for our users. So initially it came out as an iPhone/iPad application. But now it's more of a drive that we can access from different platforms-- be it web interfaces or REST APIs-- where users can upload files and we store these files in our back-end storage. But we use Mongo to store descriptions of these files, be it tags and we version all these documents, metadata. And all these are indexed, so people can search on specific fields that they might have already tagged after the fact, or while they uploaded the file.

But also, as the product evolves, the power of Mongo-- that we don't have to worry about schemas and stuff, or adding and removing fields and stuff-- we can add features without worrying about decisions that we made when we started the product. So we now have like sharing documents to thousands of users. And that's just literally just writing one entry in Mongo without worrying about our back-end stores or keeping references of who has files and stuff, because the APIs are so straightforward, it's very, very simple to implement all these features. Again we use Mongo's replication to have all these-- the file metadata, I would say-- distributed around the globe.

The other thing I would like to say is, again transactions, again-- our indexes. So we had-- when we index these things, I guess, it's very simple to index tags and strings. But as we added extra requirements, or people added encrypted strings in Mongo, we had to figure out ways that we had to balance between, OK, should we have indexes when the keys are encrypted per different users, where-- that didn't work. So we had to fix problems at the application level that you'd assume you'd get for free in Mongo. But several requirements broke that feature, if you know what I mean. Any questions you can ask later to get into more implementation [INAUDIBLE].

Again, transactional performance. Again, when we upload a document, we need to keep track who has what. And when we update multiple collections, we don't have that safety of a transaction if something goes wrong. So we had to work around the lack of transactions in Mongo. I think I'll leave it at that. If we should go to some questions, if you have questions.

OK! Thank you for that, Warren, Bryan, Jim. We appreciate those updates. One of the things I heard that sounded great was it's better to work towards incremental ROI. Do it slowly and intently. As well as weekly releases. Some really good things. A year later those are some really good lessons learned. Does anyone have any questions? We probably have time for two-- maybe three.

Do you have any customer-facing applications in the MongoDB?

So I would say that there are customer-facing applications, in the sense that people are using some data that's back-end stored. But there's nothing that somebody would be physically logging into a website at this point and using directly.

Any more questions? Go ahead.

How about asset management, or trading models, investment data? You running any money, essentially?

I can't go into specific use cases, but there are representations of financial instruments being put into Mongo.

OK. I think we're really close to time. Thank you for that, gentlemen. It's great to hear one year later. Good to know you're still using us. And if you could take the time to please fill out your surveys. Coming next, we actually have a great case by Forbes-- how they're becoming a software engineering company. So, hope to see you again in 10 minutes. Thanks for your time.