Replacing Traditional Technologies with MongoDB: A Single Platform for All Financial Data at AHL Transcript

The Man Group is one of the largest hedge fund investors in the world. AHL is one of their subsidiaries, focused on systematic trading. They've done some really great things in the last few years, and Gary joins us today after 10 years there, focused on moving all of their financial data from transactional systems into MongoDB. He also has a Master's degree from Cambridge in natural science.

That's correct.

Excellent. How did I know that? So as we get going, he's going to join us. We're going to do a few questions at the end. And you're just about ready to go. Thanks for joining us this morning.

Thank you. We're up there on that screen now? Fantastic. So my name's Gary Collier. I'm going to tell you a bit about the story of MongoDB at AHL, and really about the journey we took to build a single platform for all of our financial market data.

There's one boring legal slide I've got to subject you to before the rest of the talk. And because I work for a fund manager, a hedge fund, the compliance rules are all quite complex. People get quite paranoid about what I might say. So just to make it clear, the contents of this presentation are just about the technology that we've built, not about inviting you to make any kind of investments in our funds, that kind of thing.

So some introductions. I'm technology manager for AHL. As was just said previously, AHL is a systematic hedge fund manager. It's based in London. It's part of the broader Man Group. At AHL, I lead the core technology platform team, so responsible for things like market data, and data engineering, and the core systems-- the core research and production trading systems that our funds use.

And with me on the front row down there is James Blackburn. He's a senior technologist in my group. In the past 18 to 24 months or so, James has been responsible for building out much of the data platform I'm going to talk about today. Some of you may already have seen it: he did a great presentation at PyData in London a few months ago, which I think got quite a lot of good press, about the interplay between Python and MongoDB.

Just to give you a brief flavor of the organization, Man Group and AHL: Man's around 1,000 people strong. AHL's roughly 100 in number. And as an organization, we're really, really heavily reliant upon technology to try and give us the edge in the systematic trading space. And I think fortunately, because of the size and nature of the business, we're usually quite agile. We're usually able to take on board and employ new technologies where we think they can give us this edge.

And again, another facet of the size: James and I are really intimately connected and embedded in the business. We're just a couple of desks away from our quant researchers, our end users. And they'll tell us pretty quickly if something we build isn't working well enough for them.

So, as I said, I'm going to tell you the story of MongoDB at AHL. I'm going to start by giving you a bit of a flavor of the problems faced in systematic fund management, and then talk about how we started off small with Mongo. But really, as we got to grips with the product, we gained experience and confidence with it, and we started to use it for some increasingly large and sophisticated data sets. And I'll wrap up at the end with where we are now, lessons learned, and maybe a taste of what the future holds.

So first of all, what's AHL? What does a systematic fund manager like this do? Well, essentially it means that our style of trading the markets is non-discretionary, OK? So when most people think of trading, they picture some guy sitting in front of a Bloomberg terminal, prices flashing red and green, and buy, sell, that sort of thing, being yelled down the phone.

AHL's not like that at all, OK? We employ round about 80 quantitative researchers, or quants, and a bunch of technologists. And their job is to really understand financial markets, understand market data, understand statistics, understand how to develop hopefully good software, and to bring all these skills together and construct back-testable, systematic models which can run automatically and trade the financial markets.

The commonly used term here, which you might have heard before, is black box trading. But at AHL, we're really more concerned with longer-term investment strategies. We're not part of the very high frequency trading arms race, which has received some questionable press, perhaps, of late.

In terms of how we develop these models, it would typically start with a researcher, a quant, having some hypothesis, some intuition perhaps, as to how financial markets behave, and then trying to capture that in the form of a mathematical computer model that can then be simulated and back-tested.

And once we start to get happy with the characteristics of that model-- the operational characteristics, the behavior, does it make money-- we can start to invest, first of all, the firm's own capital in the model. And then when we're completely happy that it's going to be a good thing for investors, we can put client money into the model.

It's really at all stages of that process, right from the initial interactive research, the idea formation, through to the mission-critical production trading side, that the need for fast, parallel access to data sets comes in.

So a little bit of history as to how we got to the point where we could even consider using a technology like MongoDB for this sort of business and this sort of data. Back pre-2011 or so, like a lot of systematic fund managers were, and I think a lot still are, we used different languages for the research side of the business and the production trading side of the business. And I guess this is the first mention in this talk of the impedance mismatch that we read so much about.

And what this meant in practice was that a bunch of guys or girls would design the model, code it up in a language like R or MATLAB or S+, and when they were happy with it, it would be thrown over the fence, essentially, to a bunch of techies who would recode it in something like Java or C++.

So back in 2011, I ran a project to build, from the ground up, a brand new research and trading environment in Python. And why Python? Python allowed us to build a really fantastic interactive development environment, a research environment, for our quants. But at the same time, it's a language which our developers could use to build some really great, properly engineered software.

Now at the time, that felt a bit like a leap of faith. But the payback of removing this impedance mismatch has been truly enormous. We've now got quants and technologists sitting side by side, talking and working in the same language. And the productivity improvement has been really fantastic as a result of that.

But what did this mean for our data? Like a lot of organizations, over the years we'd grown organically with a massive number of different data sources: lots of different flat files, RDBMSs. There was even some NoSQL in there. And typically our users had direct access to the data.

So as part of the rewrite in Python, we built an API to abstract all of the different end users, both the interactive users and the automated processes, away from the underlying source of the data. So, great. All our data's now behind an API, and we're starting to get into some semblance of good shape, but there's still a whole mix of different technologies living back there, lots of different moving parts.
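A minimal sketch of what such an abstraction layer can look like. The class and method names here are hypothetical, not AHL's actual API: the point is simply that end users call one read/write facade and never see which backing store holds each data set.

```python
# Hypothetical sketch of a data-access facade: callers use read()/write()
# and never see which backing store (flat file, RDBMS, MongoDB...) is behind
# each named library of data.

class DataAPI:
    def __init__(self):
        self._backends = {}  # library name -> backend object

    def register(self, library, backend):
        """Attach a backend (anything with get/put) under a library name."""
        self._backends[library] = backend

    def read(self, library, symbol):
        return self._backends[library].get(symbol)

    def write(self, library, symbol, data):
        self._backends[library].put(symbol, data)


class InMemoryBackend:
    """Stand-in for a real store, purely for illustration."""
    def __init__(self):
        self._data = {}

    def get(self, symbol):
        return self._data[symbol]

    def put(self, symbol, data):
        self._data[symbol] = data


api = DataAPI()
api.register("futures_daily", InMemoryBackend())
api.write("futures_daily", "CL", [95.1, 95.4, 94.8])
print(api.read("futures_daily", "CL"))  # [95.1, 95.4, 94.8]
```

Swapping the storage technology behind a library then becomes a change to `register`, invisible to every interactive user and automated process calling `read`.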

And onboarding new data into this rather heterogeneous system was typically never a particularly easy task. And performance as well: both the interactive user experience and also the needs of cluster computations for running the simulations I alluded to earlier. Neither of those things was in a position where I would have liked them to be, either. So was there one technology which could address all of those different concerns? Well, perhaps there was.

So we started small with some low-frequency, but nevertheless important, data for our end users. So what does that look like? Well, the input to systematic trading models is often a time series of data, reflecting the evolution of some attribute. It could be something as simple as price over time.

So we started with a couple of small but key data sets. The first data set here represents futures, financial futures market data, sampled once per day and spanning a number of years. A couple of hundred futures markets, maybe 100 megabytes of data in total. The second, similar but slightly larger, data set represents the minute-by-minute evolution of securities, again over a period of about 20 years. So a slightly larger data set here.

But ultimately data's the thing that our quantitative researchers really live and breathe. If there's a new source of data that they can analyze, and from which they can extract information and predictive power about how a financial market might behave in the future, that's the sort of thing that gives us an edge.

So at the back of my mind, looking at these data sets, I'm also thinking, what's next? What are our quants going to look at next? It could be something as simple as Twitter data. It could be something like analyzed satellite imagery data. Look at patches of green and brown, and think how's the crops? How are wheat futures going to be affected this year? That kind of thing.

So I'm really thinking of the what next, and making sure whatever platform we can come up with will handle that what next as well.

So a quick recap of the previous solution and how that handled the data. A mix of different files, [INAUDIBLE] format, some proprietary in-house format, on a rather expensive parallel file system. We also had a number of different database instances, as well as the performance problems of this particular estate. Onboarding new data here, as I said, typically involved specialized development effort: new UIs, new bespoke tables to handle the data.

So to sum up the challenge we were facing at AHL: we've got a whole bunch of different, fairly large data sets, and a whole bunch of different end users, some of them interactive users, some of them systematic, cluster-compute users. And we want to ship this data to them as quickly and efficiently as we can. But if we can move the game on at the same time, then that would be great, too.

Now one area, I guess, where we were somewhat deficient in the past was how we handled changes to the data. Financial data can and does change. Vendors issue corrections to data. Operations people scrub the data. So wouldn't it be fantastic if we could record all of these changes, version-control style, so we can look back in time and ask, what did the data look like? And as I said, we need to make sure that whatever we build is easy to extend to new data sources.

So what does our new solution look like? Well, at AHL we've got a compute cluster. Depending on what you're used to, it's either relatively modest or relatively lavish, with 96 nodes, a whole bunch of CPU cores, all running Linux. So we took a dozen of these nodes, fitted some commodity SSDs into them, and used that to build ourselves a MongoDB cluster to house the data.

Now I don't have too much time to talk about the schema design here. If you want to know more of the details, talk to my colleague James here, if you can grab him over lunch. Or look up his PyData conference talk from earlier in the year.

Essentially the schema is similar in concept to MongoDB's GridFS, but adds version store capability on top, so that each data library is essentially composed of a chunk collection containing the data, and a metadata collection containing the master data and version information about the data.
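To make that chunk-plus-version idea concrete, here is a toy illustration. Plain Python lists stand in for the two MongoDB collections, and the field names are mine, not AHL's: each write appends chunk documents plus one version document, and a read resolves a version (latest or historical) back to its chunks.

```python
# Toy illustration of a GridFS-like version store: chunk documents hold the
# data, version documents record which chunks make up each version.
# Lists stand in for MongoDB collections; field names are illustrative.

chunks = []    # {"symbol", "version", "seq", "data"}
versions = []  # {"symbol", "version", "n_chunks"}

def write(symbol, data, chunk_size=2):
    """Store data as a new immutable version, split into fixed-size chunks."""
    version = 1 + max((v["version"] for v in versions if v["symbol"] == symbol),
                      default=0)
    parts = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for seq, part in enumerate(parts):
        chunks.append({"symbol": symbol, "version": version, "seq": seq, "data": part})
    versions.append({"symbol": symbol, "version": version, "n_chunks": len(parts)})
    return version

def read(symbol, version=None):
    """Read the latest version, or any historical one ('what did the data look like?')."""
    if version is None:
        version = max(v["version"] for v in versions if v["symbol"] == symbol)
    parts = sorted((c for c in chunks if c["symbol"] == symbol and c["version"] == version),
                   key=lambda c: c["seq"])
    return [x for c in parts for x in c["data"]]

write("VOD.L", [101, 102, 103, 104])
write("VOD.L", [101, 102, 103.5, 104])  # a vendor correction creates version 2
print(read("VOD.L"))             # latest: [101, 102, 103.5, 104]
print(read("VOD.L", version=1))  # as-of the original load: [101, 102, 103, 104]
```

Because old chunks are never overwritten, every correction or scrub leaves the prior state queryable, which is the "look back in time" capability described above.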

In terms of sharding and replicas, each primary had a couple of replicas, with secondary reads allowed, and we sharded all the data by symbol, following MongoDB's best practice of making sure that, wherever possible, any query returns its data from just one shard.
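A toy routing sketch of why a symbol-based shard key helps. This is an illustration of the principle, not MongoDB's actual router code: when the shard key is the symbol, every document for one symbol maps to the same shard, so a per-symbol query touches a single node instead of scatter-gathering across the cluster.

```python
# Why shard by symbol: a query for one symbol routes to exactly one shard.
# Toy model of hash-based shard routing, not Mongo's implementation.
import hashlib

N_SHARDS = 12  # e.g. the dozen nodes mentioned above

def shard_for(symbol):
    """Map a symbol deterministically to a shard number."""
    digest = hashlib.md5(symbol.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

# 1,000 daily rows for one symbol...
docs = [("VOD.L", day) for day in range(1000)]
touched = {shard_for(sym) for sym, _ in docs}
print(touched)  # ...all land on a single shard, so one node answers the query
```

Had the shard key been, say, the date instead, a 20-year query for one symbol would have to fan out to every shard and merge the results.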

So this cluster was built to house the existing data. We also built a number of different adapters to third-party external data sources and vendors, the idea being that, where possible, these adapters could query the size and shape of the external data and then structure it in the appropriate format inside the cluster. The aim here was to give our researchers, our end users, as great and as flexible a degree of self-service as possible, so they could ask for the data without the [INAUDIBLE] having to build these new tables to house it.

In terms of what the performance looked like, here are a couple of hopefully interesting graphs, before and after. This first graph here shows the performance of our previous solution, so if you remember, this RDBMS- and file-based solution. Across the x-axis you've got the number of rows' worth of data, the number of days' worth of data being returned. On the y-axis you've got the response time in seconds. And then the color coding shows the number of different parallel users trying to access that data at the same time. So the paler colors are 25, 50 users, up to 200-plus in red.

The Mongo solution, in comparison, was much, much, much better. Really a couple of orders of magnitude faster to retrieve the data, but, just as important, with consistency of retrieval times.

In terms of what this meant for our end users, the online experience for them is much more fluid, so the interactive research that the guys are doing. Now, as soon as they've asked for a time series of data, it's there. It's in their Python environment. It's ready for them to interact with. It's ready for them to plot. It's ready for them to run analytics on. And obviously this is much more efficient for cluster computation as well, where we want to perhaps run a large number of models in parallel and perturb some of them in different ways, and see what the end results are.

Moving on to the slightly larger data set, again a marked improvement in performance and consistency of retrieval time under MongoDB.

So, conclusions from phase 1, if you like. Our quants, our researchers: a bunch of very, very happy quants. We've built a solution which is way faster than what they were used to, for all of the different data sizes, and for all of the different load levels we threw at it.

At the same time, we built these game-changing new features. We're starting to remove this impedance mismatch between the data and their usage of the data. We're getting new data on board for them within a matter of minutes, in many cases. We've also built this powerful version store capability for them. They can really ask what did the data look like at any point back in time.

At the same time, we've done this and really started to achieve some quite significant cost savings for the organization. We're moving off an expensive proprietary piece of hardware and onto commodity SSDs.

Before moving on, it's probably also worth saying that as a technology, it's a technology that our developers really enjoyed working with. The integration between MongoDB and Python is really quite seamless. And I also think it's a sign that something's good with a technology when usage of it starts to spread organically rather than by some architectural team dictate. So developers from other parts of AHL started looking to use MongoDB to build small solutions for their particular problems.

So, success, I think, so far. But were there any bigger challenges we could throw at Mongo? What about single stock equities? Now as a problem set, these are quite different to our futures in a number of ways. First of all, there are many, many thousands more stocks in the world than there are financial futures markets.

Again, we wanted to deal with many, many years of time series data. But rather than just attributes such as price or traded volume, there are often many, many more different data items associated with a single stock. Broker data. Accounting data. Fundamental data. You name it. So this leads to the ability to build increasingly complex and sophisticated trading models.

If you take a closer look at what we built here, some of the technical details: the source of most, not all, but most of the data in this instance was a third-party managed RDBMS solution. Roughly a terabyte of data; 10,000, 20,000, 30,000 or so stocks globally.

And building an equity-based model from this data would typically involve extracting a number of different data items for the sorts of information we think would contain some predictive power-- so your price history, earnings per share, that kind of thing-- and then performing a number of different computations on that data before coming up with a bunch of trading signals, buy/sell signals, for a potentially big basket of different equities.

And in terms of the size of the data that we're dealing with here, perhaps a trading model might consist of 250 separate data items. For each of those data items, there might be a 600-megabyte matrix. If you consider that we're essentially dealing with a universe of 10,000 stocks across the x-axis of this matrix, with 20 years' worth of data, the scale of the problem suddenly becomes quite large.
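As a rough back-of-the-envelope check on those numbers (the assumptions here, float64 values and roughly 260 trading days a year, are mine, not from the talk):

```python
# Back-of-the-envelope size of one data item's matrix.
# Assumptions (mine): float64 values, ~260 trading days per year.
n_stocks = 10_000
n_days = 20 * 260    # 20 years of daily rows
bytes_per_value = 8  # float64

raw_bytes = n_stocks * n_days * bytes_per_value
print(raw_bytes)               # 416000000
print(round(raw_bytes / 1e6))  # ~416 MB of raw values alone
```

So the raw values alone are around 400 MB per item; with any index, version, or document overhead on top, the roughly 600-megabyte figure quoted above, multiplied by 250 items per model, makes the scale of the problem clear.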

And quants, researchers, as part of their day-to-day job, often want to perturb this data or add in new sources of data, and see the ripple-through effects of that onto the trading signals, onto the buy and sell signals that we generate.

And to make things even more challenging, a whole team of quants might want to work on a particular model at the same time, but without treading on each other's toes. So here we've got, I think, an interesting challenge which can leverage some of the work we've already done with fast, scalable I/O using Mongo, with our version store capabilities using a GridFS-like system.

So what we did, we extended what we'd done previously and built our Mongo cluster to also act as-- I put the title in the top there-- a multi-user versioned, interactive graph-based computation platform for the research and trading of single stock equity data.

I don't have any fancy charts unfortunately, not like in the first part of the presentation. But using Mongo in this way really did reduce the time it took to recompute some of these trading models from hours down to minutes.

So to conclude that, as well as our futures and F/X quants, we've got some very happy equities trading quants as well. We also built those guys a faster solution than they had previously used.

But what next? We've looked at futures and F/X. We've looked at single stocks. But did we have any bigger problems in AHL that we could use Mongo to help? Well what about tick data? So for those of you not familiar, tick data is every change in price of a security, so changes to things like bid, and ask, and traded price, as a result of every trade, every piece of market activity.

And at AHL, that translated into roughly 400 million or so messages per day. Now I said at the start that we weren't in the high frequency trading space, and we're not, particularly. But we do use tick data. It's important to us both as the golden source of data from which we downsample some of the low frequency data sets, that daily data, that once-per-minute data that I showed at the start. And we also use tick data to make sure that we get best execution of those trades that we do want to make in the market.

So 400 million messages a day: is this big data? Well, it's an often bandied-about term, and in this case I don't think so. In volume, we've amassed roughly 30 terabytes or so of tick data over a number of years. And peak ingestion rates, at periods of high market activity, might tip up to 120,000, 150,000 or so ticks per second.

In terms of variety of the data, we're typically dealing here with very sparsely populated matrices, tracking the evolution of bid, ask, and traded price over time. Now if you want, you can go to the market and buy a third-party tick store. There are a number of vendors that will sell you one. And back in 2008 or so, before technologies like MongoDB were around, we did just that. But you'll find they're expensive. They often come with some proprietary query language which isn't necessarily the easiest thing to learn or to program. The architecture's often database-centric and not ideal for this new world of cheap cluster computation. Unless, of course, you want to pay the vendor for lots of cores; they'll sell you lots of licenses, but that's even more expensive. So there was a real opportunity here for AHL to save some money by moving to Mongo.

So in terms of the architecture that we built, the source of all this data is a bunch of real-time vendor data feeds, all distributed on a third-party message bus. So what we did, we built a number of different collectors that subscribe to data on this bus and store all of the ticks. Remember, 100,000 or so, 150,000 per second. They store all these ticks in Kafka. Now Kafka provides us with real-time persistent messaging. It's implemented as a distributed commit log.

It gives us the great ability to buffer ticks, batch ticks, and replay ticks in the event of any sort of system failure. It gives automated failover. There's a fantastic background paper on Kafka by a chap at LinkedIn called Jay Kreps. It's really, really well worth a read.
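The collector pattern described above can be sketched roughly like this. The batching logic is shown as a pure function so it stands on its own; the Kafka and MongoDB wiring in the comments (topic names, connection strings, the `decode` helper) is hypothetical, not AHL's actual configuration.

```python
# Sketch of a tick collector: drain messages from a stream in fixed-size
# batches, then bulk-insert each batch. Batching is a pure generator so the
# real transports (Kafka in, MongoDB out) can be swapped in around it.

def batches(stream, batch_size):
    """Yield lists of up to batch_size items from any iterable of ticks."""
    batch = []
    for tick in stream:
        batch.append(tick)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# In production this might look something like (names hypothetical):
#   from kafka import KafkaConsumer
#   from pymongo import MongoClient
#   consumer = KafkaConsumer("ticks.reuters", bootstrap_servers="kafka:9092")
#   coll = MongoClient("mongodb://tickstore")["reuters_2013"]["ticks"]
#   for batch in batches(consumer, 10_000):
#       coll.insert_many([decode(msg.value) for msg in batch])

ticks = ({"sym": "VOD.L", "price": 100 + i} for i in range(25))
sizes = [len(b) for b in batches(ticks, 10)]
print(sizes)  # [10, 10, 5]
```

Bulk inserts of large batches, rather than one write per tick, are what make a 100,000-plus ticks-per-second ingest rate feasible, and Kafka's commit log lets a crashed collector replay from its last committed offset.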

So we've batched up all our ticks in Kafka, and put them into our MongoDB cluster which, in this case, really was just a couple of commodity, off-the-shelf, Supermicro white boxes. They cost about $25,000 each, I think, with a bunch of disks and RAM in each one.

The database structure here is probably slightly unorthodox. We adopted a sharded cluster approach, but within it, a single Mongo database held a year's worth of ticks for a particular source. So Reuters data for 2013 would be one database. And this was really driven by backup and restore granularity concerns. If anything did go wrong, and fortunately it hasn't, we're in a position to restore just a single database as a unit, and not have to restore an entire 30-terabyte cluster.
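A small sketch of that database-per-source-per-year layout. The naming scheme here is my guess at the shape, not the exact production names: the point is that "Reuters 2013" resolves to one database, which can then be backed up or restored as a unit.

```python
# One MongoDB database per (source, year): keeps backup/restore granular.
# Naming scheme is illustrative, not the actual production convention.
from datetime import datetime

def tick_db_name(source, year):
    return f"{source.lower()}_{year}"

def db_for_tick(tick):
    """Route a tick to its database by source and timestamp year."""
    return tick_db_name(tick["source"], tick["timestamp"].year)

t = {"source": "Reuters", "timestamp": datetime(2013, 6, 3, 14, 30), "price": 101.2}
print(db_for_tick(t))  # reuters_2013
```

Restoring after a fault then means restoring, say, `reuters_2013` alone, rather than the whole 30-terabyte estate.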

Again not much time to talk about the schema here. Grab James or myself over lunch or look at James's PyData presentation if you want some more information.

But one thing is worth pointing out: we're using some clever compression here with Mongo. We got down to a storage cost of around 10 to 12 bytes per tick of data, which was around 40 percent of the disk storage requirements of our previous solution. And this data was distributed to the end users, running on the compute cluster I spoke about earlier, over an InfiniBand network.

How did that perform? Here are the results of some tests, presenting, first of all, our previous proprietary solution with a number of queries over a period of time, and then comparing that with Mongo. Our previous solution maxes out at roughly 8 million ticks or so per second. Improving that was just not possible without spending lots more money on lots more licenses and lots more cores.

MongoDB, on the other hand, quickly and easily scaled up to an aggregate of 250 million ticks per second, distributed across our cluster. And even then, Mongo's not the limiting factor. We could really easily saturate our IP-over-InfiniBand network with compressed tick data to our clients. So 25 times greater tick throughput, with really just two pieces of commodity white box hardware.

On this theme of getting more from less, these graphs just show a little bit more detail about how much more efficient Mongo was in terms of usage of our network and usage of our CPU than our previous solution.

So I've got some very happy execution quants as well, and also some very happy accountants at this time. Saved them quite a lot of money. So our quants are happy. We're delivering ticks to our compute cluster way faster than before. They can fit their execution models much, much quicker. And I'm not sure I can say this, because Jim might raise our prices, but we've achieved something like a 40 times reduction in licensing costs, if you look at the cost of support for the Mongo service here compared to the cost of licensing for the previous tick store.

To wrap up, where are we right now? Well, a few key facts. Performance: everything in terms of data access that I showed is running way, way faster than it was. And we've achieved this whilst getting some really significant cost savings for the business, in terms of not just software licensing, but hardware as well.

Everything's much more efficient now. Not just the process, it's also the people side of things. We're spending less time doing support because we've simplified things, moving to one technology rather than a whole bunch of different tech.

At the same time, we've delivered some real game changers for our end users. Onboarding new data, the impedance mismatch: going, gone. And onboarding data now often takes minutes, rather than days.

And last but not least, the people side of things. It's often overlooked, I think, but in knowledge-based industries it's often the most important aspect: that ability to attract and retain great staff. And a quote from one of the quants sticks in my mind. He said, the reason I love working here so much is because the technology is just so good. And Mongo certainly played no small part in that for us.

What next? Well, our market data infrastructure is now in a pretty mature state, but there are some opportunities to extend things further across AHL. The black box strategies which I told you about earlier typically generate some quite large volumes of diagnostic data; aircraft flight recorder style, I guess, is one analogy. So there are certainly opportunities to put the many terabytes of data we generate there on a monthly basis into Mongo for easier and faster analysis.

As I said, AHL's just one facet, one fund manager, within the broader Man Group. There's a lot of interest, based on what we've done, in starting to use this technology across the group as a whole.

And there's a chance, but probably only a small chance, that we'll look to open source this stuff, because as I said at the start of the talk, technology is one of the things which gives AHL its edge in the systematic trading business.

So thanks very much for your time and attention. Sorry to occupy so much of the pre-lunch slot. If there's time, I can take a couple of questions. And if it's something particularly techie, then I'll drag James on stage. Or you can grab us over lunch.

So thank you so much for that. You talked not only about the incredible improvements AHL got as a technology-driven company with new technology, but you went so far as to say that not only are your quants happy, but your accountants are happy.

Everyone's happy. Our quants--

Accountants, happy? Come on. Oxymoron? Oxymoron? That's game-changing in and of itself. So we have time for maybe two questions, and then we'll move on to lunch.

Hi. You mentioned moving away from custom or proprietary tick databases to Mongo. What did you do to migrate both your developers and the system from all those one-letter languages, that we all love so much, to Mongo? Because those systems tend to have a lot of statistics packages built in, a lot of homegrown add-ons developed with that.

OK, in terms of the migration, starting with the stats side of things: it was a pretty monumental effort, probably 20 to 30 man-years' worth of development, building a Python system that was rich enough and complete enough that we could persuade our end users to adopt it. They can be quite a-- not a stroppy bunch, but they're certainly-- quants are quite an opinionated bunch of people.

So it was a labor of love, if you like. It involved not just the technical side of things, but addressing the people concerns. We ended up doing funky videos and all sorts of things to bring people on board and get them to adopt Python, and adopt the brand.

In terms of migrating the data, and the tick data in particular, there's one quite gnarly thing that sticks in my mind. I was quite aware that if I migrated all the tick data and then we discovered, post turn-off of the old system, that there was a problem with it, I was probably going to be-- bang, I'm gone, because I'd have lost all of our data.

So I had quite a rigorous process there of not only checking all the data history, but checking day by day. What we put into Mongo, did it match what went into the tick store? Does that answer the question?

Last question.

Yes, did you do a STAC benchmark on your new system? I know you're a STAC man. I wasn't familiar with the tick data that you put up there. I was just wondering if you did a STAC benchmark.

A STAC benchmark in what way?

Like a STAC M3 benchmark on your tick system, to see how it compared with other tick-analytics-type systems.

No. Other than the tick throughput benchmarks-- so, how many million ticks per second can we deliver to our cluster-- that was the only benchmark that we ran. In terms of tick ingestion rates, we're not even close to touching the sides at 100,000 or so per second.

Free White Paper

Read our free white paper, RDBMS to MongoDB Migration Guide, and learn about:

  • Schema design, moving from relational tables to flexible and dynamic BSON documents
  • Data integrity, and how MongoDB ensures atomicity, data durability and strong consistency
  • Query model & data analysis in MongoDB
  • Data Migration, including scripts and parallel operations
  • Operational Considerations, like scaling, HA, monitoring, security and backup

Download Your Copy