Case Study: Chinese search company Sogou runs MongoDB
China-based Sogou is one of China’s top search engines and one of the world’s 100 most heavily trafficked websites. MongoDB helps keep Sogou fast by handling its log and report data. Macro Huang, a Sogou engineer with years of experience deploying MongoDB at Sogou and at his former employer, filled us in on the details.

Nature of the Applications

Sogou uses MongoDB primarily for storing log and report data.

The first MongoDB application stores advertising customers’ report data, including page views, cost, clicks, and click-through rates. Sogou runs 3 separate clusters to store 3 different types of reports, with each cluster containing a 3-node replica set, together holding over 1 billion documents. For query purposes, Sogou built a customer index and queries up to 6 of the 17 fields included in each document, using the Java driver via a custom ODM. Sogou separates its data into several MongoDB instances based on time range, with one year per database. Because Sogou’s data always has “hot” parts (the most recent data being the hottest), the company moves the oldest databases to older machines as necessary.

Sogou’s second application is a logging system, which the search company uses for storing key operation logs. This application runs on a 2-node replica set holding roughly 2 billion documents in total. Sogou keeps a maximum of 120 days’ worth of data in its cluster, dumping older data into BSON format and holding it in a Hadoop cluster. This cluster has only one customer index and up to 10 query fields. Each document has 15 fields, excluding the _id. Queries on the application can span a maximum of 92 days’ worth of data, or roughly 1.5 billion documents, with response times between 500 milliseconds and 1 second depending on data size.

Why MongoDB?

Sogou was looking for something much faster than a relational database that could also be scaled quickly. Sogou looked at a range of NoSQL alternatives but determined that MongoDB’s data model and indexing capabilities made it a good fit.
It also helped that MongoDB was so easy to learn. While Huang had previous experience with MongoDB, most of his team did not and came from an RDBMS background. As Huang tells his colleagues, new MongoDB users can be up and running in 10 minutes and doing real work in 30 minutes. That’s simply not possible with traditional RDBMS options.

Sogou Results

The experience so far? Very positive. Performance has been fast, which was Sogou’s primary consideration, and the learning curve for new users is exceptionally short. The search company plans to expand the size of these existing deployments and broaden its use of MongoDB to other applications.

Tagged with: Sogou, China, reporting, logging, search engine
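As a rough illustration of the data layout described in this case study, the sketch below shows one way the "one year per database" routing, the 120-day retention window, and the 92-day query limit could be expressed. The `reports_` database prefix and all function names are hypothetical; the case study does not describe Sogou's actual naming or tooling. The document estimate assumes a uniform daily volume, which is also an assumption.

```python
from datetime import datetime, timedelta

RETENTION_DAYS = 120  # case study: the logging cluster keeps at most 120 days
MAX_QUERY_DAYS = 92   # case study: a query may span at most 92 days


def database_for(ts: datetime, prefix: str = "reports") -> str:
    """Map a timestamp to a per-year database name (naming scheme is hypothetical)."""
    return f"{prefix}_{ts.year}"


def archive_cutoff(now: datetime) -> datetime:
    """Documents older than this would be dumped to BSON and moved to Hadoop."""
    return now - timedelta(days=RETENTION_DAYS)


def validate_query_range(start: datetime, end: datetime) -> None:
    """Reject ranges that are inverted or wider than the 92-day query window."""
    if end < start:
        raise ValueError("end precedes start")
    if end - start > timedelta(days=MAX_QUERY_DAYS):
        raise ValueError(f"query spans more than {MAX_QUERY_DAYS} days")


def estimated_documents(total_docs: int, days: int) -> int:
    """Scale a document count from the retention window to a shorter window,
    assuming the same number of documents arrives every day."""
    return total_docs * days // RETENTION_DAYS
```

With roughly 2 billion documents spread over 120 days, `estimated_documents(2_000_000_000, 92)` gives about 1.53 billion, which is consistent with the 1.5 billion figure quoted for a 92-day query.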
Fluentd + MongoDB: The Easiest Way to Log Your Data Effectively.
Log everything! But how?

All of you must know by now how valuable your data is to your product and business: KPI calculation, funnel analysis, A/B testing, cohort analysis, cluster analysis, logistic regression. None of this is possible without a lot of data, and the most obvious way to get more data is logging. But how?

As we started talking to our customers at Treasure Data, we realized that there was no effective tool to log data in a flexible yet disciplined way. So, we rolled up our sleeves, authored our own log collector, and open-sourced it as Fluentd under the Apache 2.0 license.

Fluentd is a lightweight, extensible logging daemon that processes logs as a JSON stream. It's designed so that the user can write custom plugins to configure their own sources and sinks (input and output plugins in Fluentd parlance). In just six months, Fluentd users have contributed almost 50 plugins. These plugins, combined with the loggers written in several programming languages (Ruby, Python, PHP, Perl, Java and more), allow Fluentd to be a great polyglot service. Apache, TSV or CSV. TCP or UDP. MongoDB or MySQL. S3, HDFS or flat files. Chances are good Fluentd can talk to your existing system fluently (Okay, this pun was intended).

fluent-plugin-mongo, the most popular Fluentd plugin

Yes, that's right. fluent-plugin-mongo, the output plugin that lets Fluentd write data to MongoDB directly, is by far the most downloaded plugin! Its popularity should come as little surprise: MongoDB is based on schema-free, JSON-based documents, and that's exactly how Fluentd handles events. In other words, there is a one-to-one correspondence between Fluentd events and Mongo documents. Also, MongoDB and Fluentd both aim to be easy to install and get up and running. If you love the agility and flexibility of MongoDB, chances are good you will also like Fluentd.

How to send data into MongoDB from Fluentd

I assume the reader already has MongoDB up and running.
There are a couple of ways to install Fluentd:

Ruby gem

Fluentd and its plugins are available as Ruby gems. It's as easy as

    $ gem install fluentd
    $ gem install fluent-plugin-mongo

Debian/RPM packages

We have also packaged Fluentd and some of its plugins as td-agent ("td" stands for Treasure Data). Of course, fluent-plugin-mongo is pre-packaged with td-agent for you :-p Here are the links to the packages.

Debian package
RPM package

Now that we have everything, let's configure Fluentd to send data into MongoDB! In this example, we will import Apache logs into MongoDB. The location of your configuration file depends on how you installed Fluentd. If you went the Ruby gem route, it should be /etc/fluentd/fluentd.conf, and if you downloaded td-agent, it should be /etc/td-agent/td-agent.conf. Open your config file and add

    <source>
      type tail
      format apache
      path /var/log/apache2/access_log
      tag mongo.apache
    </source>

These lines tell Fluentd to tail the Apache log at /var/log/apache2/access_log. The tailed lines are parsed into JSON and given the tag "mongo.apache". The tag decides how these events will be routed later. In the same config file, add

    <match mongo.**>
      # plugin type
      type mongo

      # mongodb db + collection
      database apache
      collection access

      # mongodb host + port
      host localhost
      port 27017

      # interval
      flush_interval 10s
    </match>

If your MongoDB instance is not running locally with the default port of 27017, you should change the host and port parameters. Otherwise, this is it. All of your Apache logs will be imported to MongoDB immediately.

Fluentd + MongoDB = Awesome Sauce

The popularity of MongoDB suggests a paradigm shift in data storage. Traditional RDBMSs have their time and place, but sometimes you want more relaxed semantics and adaptability. MongoDB's schema-less document is a good example: it's flexible enough to store ever-changing log data but structured enough to query the data later. In contrast, logging is moving in the opposite direction.
Logging used to be structure-free and ad hoc, with bash-based poor man's data analysis tools running everywhere. However, such quick and dirty solutions are fragile and unmaintainable, and Fluentd tries to fix these problems.

It's exciting to see this synergy between Fluentd and MongoDB. We are confident that more and more people will see the value of combining a flexible database (like MongoDB) with a semi-structured log collection mechanism (like Fluentd) to address today's complex data needs.

Acknowledgement

Many thanks to 10gen for inviting us to give a talk on Fluentd and letting us write this guest post. Also, we thank Masahiro Nakagawa for authoring and maintaining fluent-plugin-mongo.

Tagged with: fluentd, logs, log, logging, apache, open source, treasure data, MongoDB, Mongo, NoSQL, Polyglot persistence, 10gen
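To make the two ideas in the configuration walkthrough more concrete (an Apache log line being parsed into a JSON-like record, and a tag such as mongo.apache deciding where the event is routed), here is a minimal Python sketch. It is not Fluentd's implementation: the regular expression is a simplified approximation of the Apache access log format, the function names are hypothetical, and the pattern `mongo.**` is used only as an illustrative match pattern. The wildcard rules assumed here are that `*` stands for exactly one tag part and `**` for zero or more.

```python
import re

# Simplified approximation of an Apache access log line (an assumption,
# not the parser Fluentd ships with).
APACHE_LINE = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?$'
)


def parse_apache_line(line: str):
    """Turn one access log line into a dict, or None if it does not match."""
    m = APACHE_LINE.match(line.strip())
    if m is None:
        return None
    record = m.groupdict()
    record["code"] = int(record["code"])
    # Apache writes "-" when no bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def tag_matches(pattern: str, tag: str) -> bool:
    """Check a dot-separated tag against a pattern with * and ** wildcards."""
    def walk(p, t):
        if not p:
            return not t
        head, rest = p[0], p[1:]
        if head == "**":
            # "**" absorbs zero or more tag parts.
            return any(walk(rest, t[i:]) for i in range(len(t) + 1))
        if not t:
            return False
        return (head == "*" or head == t[0]) and walk(rest, t[1:])

    return walk(pattern.split("."), tag.split("."))
```

A record produced by `parse_apache_line` is a flat dictionary of fields, which is the kind of structure that maps directly onto a MongoDB document; `tag_matches("mongo.**", "mongo.apache")` returns `True`, so an event carrying that tag would be handed to the matching output.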