This is a guest blog by Subha Ramachandran, Vice President, Product Management at SAP Labs. It represents her personal views, thoughts, and opinions. It is not endorsed by SAP nor does it constitute any official communication from SAP.
The bulk of work in any Big Data initiative is in preparation of the data – specifically data integration and ensuring data quality. With native support of MongoDB in the new release of SAP Data Services 4.2 SP04, those tasks just got easier.
Bring in Data that Lives in Various Source Systems
SAP Data Services for MongoDB simplifies the extract, transform, and load (ETL) of data from the database by preserving the fidelity of JSON structures instead of artificially flattening them, which can introduce redundant, repeated data. Within SAP Data Services, ETL developers can operate on hierarchical structures, perform required transformations, and flatten as needed to fuel analytics use cases. For example, users can load multi-level machine/equipment data stored in MongoDB into SAP HANA, Hadoop, or any other data warehouse on a regular basis (e.g., daily or weekly) for analytics.
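To make the redundancy point concrete, here is a minimal Python sketch of what flattening does to a nested document. It is not SAP Data Services code: the `machine` document, its field names, and the `flatten` helper are all hypothetical, chosen only to show how parent fields get copied onto every row once the hierarchy is removed.

```python
from typing import Any

# Hypothetical equipment document; the field names are illustrative only.
machine = {
    "machine_id": "M-100",
    "site": "Plant A",
    "sensors": [
        {"name": "temp", "readings": [70, 71]},
        {"name": "pressure", "readings": [30]},
    ],
}


def flatten(doc: dict[str, Any]) -> list[dict[str, Any]]:
    """Produce one flat row per reading, copying parent fields onto each row."""
    rows = []
    for sensor in doc["sensors"]:
        for value in sensor["readings"]:
            rows.append({
                "machine_id": doc["machine_id"],  # repeated on every row
                "site": doc["site"],              # repeated on every row
                "sensor": sensor["name"],         # repeated per reading
                "reading": value,
            })
    return rows


rows = flatten(machine)
for row in rows:
    print(row)
```

The single nested document becomes three rows, each repeating `machine_id` and `site`. Keeping the hierarchy intact until the final load step avoids carrying that repetition through every intermediate transformation.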
Simplify & Maximize Performance with SAP Data Services
SAP Data Services provides a rich set of native out-of-the-box transformations, with over 80 built-in functions in its library, including native text data processing, data masking, and data quality transformations to standardize, validate, cleanse and enrich data.
MongoDB's dynamic schema allows SAP Data Services to automatically scan collections in parallel to quickly infer the metadata. The software also supports the pushdown of allowed operations to MongoDB and thus maximizes performance. The SAP Data Services 4.2 SP04 release supports both single node and replica set deployments of MongoDB.
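As a rough illustration of what inferring metadata from a schemaless collection involves, the sketch below walks a sample of documents and records which value types appear at each field path. This is an assumption-laden toy, not how SAP Data Services implements its scan: the `infer_schema` function, the `[]` path notation for arrays, and the sample documents are all invented for the example, and it runs sequentially rather than in parallel.

```python
from collections import defaultdict


def infer_schema(documents):
    """Map each dotted field path to the sorted type names observed there."""
    schema = defaultdict(set)

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into embedded documents, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            # Mark array elements with a "[]" suffix (a made-up convention).
            for item in value:
                walk(item, path + "[]")
        else:
            schema[path].add(type(value).__name__)

    for doc in documents:
        walk(doc, "")
    return {path: sorted(types) for path, types in schema.items()}


# Hypothetical sample drawn from a collection.
sample = [
    {"_id": 1, "name": "pump", "specs": {"rpm": 1200}},
    {"_id": 2, "name": "valve", "specs": {"rpm": "n/a"}, "tags": ["a", "b"]},
]
schema = infer_schema(sample)
for path, types in sorted(schema.items()):
    print(path, types)
```

A path that reports more than one type, such as `specs.rpm` here, is the kind of variance an ETL developer would need to resolve before loading the data into a typed target.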
In summary, the SAP Data Services 4.2 SP04 release allows you to combine MongoDB’s schema flexibility with a market-leading set of data integration and data quality capabilities. This gives developers the power to easily extract, transform, and load MongoDB data as part of any Big Data initiative. SAP Data Services and MongoDB help you deliver a complete and accurate view of your data, allowing you to identify new insights and convert them into business value.
Stay tuned for further details on what’s to come in 2015, including plans for support of MongoDB sharded clusters.
Find out more about SAP Data Services, Data Services 4.2 SP04 native support of MongoDB, and why SAP Data Services is recognized as a Leader in the Magic Quadrants for both Data Integration and Data Quality tools.
Building your business is hard. Scaling your business data should not be.
Building your business is hard. Scaling your business data should not be. That's the message Sailthru CTO and co-founder Ian White relayed recently in New York. Over the course of a half-hour, White explained how Sailthru first did application-level sharding of its data out of necessity, but later moved to MongoDB's auto-sharding to massively simplify development.
Success in the Billions
Sailthru makes it easy for ecommerce and media brands to personalize content across a variety of channels, including email, onsite, mobile, social and more. As the company's customer base has swelled to over a billion users, 125 million content documents (e.g., URLs and products relevant to particular users) and 5 billion messages per month, Sailthru has come to store over 40 terabytes of data in MongoDB across 120 nodes on mostly physical infrastructure. As White suggests, "You can’t store this volume of data on just one node. We had to shard."
Application-level Sharding at Sailthru
When Sailthru first started, it didn't need sharding. But within two years Sailthru's customer count and data volumes were high enough that the company needed to partition its data. The question was: How?
While some applications are either read heavy (online media site) or write heavy (logging and clickstream), Sailthru is both. As White explains, "We have to be able to read data and write personalized recommendations in real-time. MongoDB is a great database for this."
Sailthru adopted MongoDB in the early days, over four years ago. Prior to MongoDB 1.6, Sailthru partitioned much of its infrastructure using in-app sharding logic, as MongoDB didn't yet support auto-sharding. Sailthru partitioned data by client. Their application would examine each query and dispatch it to the appropriate replica set and collections based on a mapping configuration. This approach worked fine for a time at Sailthru.
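The dispatch logic described above can be sketched in a few lines of Python. This is not Sailthru's code: the `SHARD_MAP` contents, the replica set and collection names, and the `route` function are hypothetical stand-ins for the kind of mapping configuration an application-level sharding layer consults on every query.

```python
# Hypothetical mapping configuration: client ID -> (replica set, collection).
SHARD_MAP = {
    "client_a": ("rs0", "profiles_client_a"),
    "client_b": ("rs0", "profiles_client_b"),
    "client_c": ("rs1", "profiles_client_c"),
}


def route(query):
    """Return the (replica set, collection) pair that should serve a query."""
    client_id = query["client_id"]
    try:
        return SHARD_MAP[client_id]
    except KeyError:
        # A client missing from the config cannot be served until the
        # mapping is updated and the application is redeployed.
        raise LookupError(f"no shard mapping for client {client_id!r}") from None


target = route({"client_id": "client_c", "email": "user@example.com"})
print(target)
```

In a design like this the application, not the database, owns data placement: every query path has to pass through `route`, and the mapping has to be kept in step with where each client's data actually lives.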
However, as Sailthru’s data grew, application-level sharding introduced significant code complexity and administration overhead. Application-level sharding also contributed to uneven load distribution, something Sailthru was able to Band-Aid by scaling up with more expensive servers. But the database team still had to manually rebalance and reallocate resources – every time Sailthru onboarded a sizable client that required a new shard, the database team would have to go in and add another line to the config file and redeploy. It was painful and demanding.
Enter Auto-Sharding
With the introduction of automatic sharding in 2010’s 1.6 release, the database itself manages the effort of distributing and balancing data across shards. Sharding is transparent to applications – for 1 or 100 shards, the application code is the same.
Setting up a sharded cluster involves making a critical decision: choosing a shard key. The shard key is the value the database uses to determine placement of the document within shards. The Sailthru team considered several options, including sharding on client ID, MongoDB ID, or email. MongoDB supports multiple sharding strategies, and each is appropriate for different use cases. Ultimately, they opted to use hash-based sharding and MongoDB’s ObjectId as the shard key. With this approach, MongoDB does the work of ensuring a uniform distribution of reads and writes by randomizing the placement of documents across shards.
To make the actual migration from application-level sharding to auto-sharding, the team used an open source tool created by MongoDB called MongoConnector. In the process of the migration, Sailthru forked the project, making significant contributions specific to their use case. With this change, it’s now possible for Sailthru to add shards without making any change to the code base.
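To see why hashing a monotonically increasing key such as an ObjectId spreads load evenly, consider the simplified Python sketch below. It is an illustration only and makes several assumptions: the keys are zero-padded hex counters standing in for ObjectIds, MD5 stands in for the database's internal hash function, and a modulo picks the shard, whereas MongoDB assigns ranges of hashed values to chunks and balances those chunks across shards.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # illustrative cluster size


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from a hash of the key (simplified modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Stand-ins for ObjectIds: 24-character hex strings that only ever increase.
keys = [format(i, "024x") for i in range(10_000)]

counts = Counter(shard_for(k) for k in keys)
for shard, count in sorted(counts.items()):
    print(f"shard {shard}: {count} documents")
```

Although the keys arrive in strictly increasing order, consecutive keys hash to unrelated values, so inserts scatter across all shards instead of piling onto whichever shard owns the highest key range.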
This meant that during a critical ramp-up time of tight resources and tight cash, Sailthru was able to focus their engineering efforts on improving their service and building new features, ensuring their phenomenal success.
Build the Next Big Thing on MongoDB
Thousands of organizations use MongoDB to build high-performance systems at scale. If you're interested in reading up on your own, download our Operations Best Practices white paper for additional information on operating and deploying a MongoDB system: Ops Best Practices
About Kelly Stirman
Kelly Stirman is Director of Products at MongoDB. Kelly works closely with customers, partners and the open-source community to articulate how MongoDB is quickly becoming the world's most popular database. For over 15 years he has worked at the forefront of database technologies. Prior to MongoDB, Kelly served in executive and leadership roles at Hadapt, MarkLogic, Oracle, GE, and PricewaterhouseCoopers.
Leaf in the Wild: Qihoo Scales with MongoDB
Leaf in the Wild posts highlight real world MongoDB deployments. Read other stories about how companies are using MongoDB for their mission-critical projects.
100+ apps, 1,500+ Instances, 20B Queries per Day
Qihoo is China’s number 1 Android mobile distribution platform. Qihoo is also China’s top malware protection company, providing products for both web and mobile platforms. A MongoDB user since 2011, Qihoo has built over 100 different applications on MongoDB – including new services and migrations from MySQL and Redis – running on 1,500+ instances and supporting 20 billion queries per day.
I had the chance to sit down with Yang Yan Jie, the Senior DBA at Qihoo, to learn more about how and why they use MongoDB, his scaling best practices, and recommendations for those getting started with the database.
Can you start by telling us about Qihoo?
Qihoo 360 Technology Co. Ltd. is a leading Chinese Internet company. At the end of June 2014, we had around 500 million monthly active PC Internet users and over 640 million mobile users. Recognizing malware protection as a fundamental need of all Internet and mobile users, we built our large user base by offering comprehensive, effective and user-friendly Internet and mobile security products and services to protect users' computers and mobile devices against malware and malicious websites. Our products and services are supported by our cloud-based security technology, which we believe is one of the most advanced and robust technologies in the malware protection industry. We monetize our user base primarily through online advertising and Internet value-added services.
In terms of our market position, we are:
A top three Internet company as measured by user base in China
No. 1 Android-based mobile distribution platform in China
No. 1 provider of Internet and mobile malware protection products and services in China
No. 2 PC search engine in China
When did Qihoo start using MongoDB?
We were a very early adopter of MongoDB, building our first applications on the database back in 2011. I think we were using version 1.8 then!
How is Qihoo using MongoDB today?
MongoDB has become our standard modern database platform. We now have over 100 applications powered by MongoDB – both external customer-facing services and internal business applications. In total we have more than 1,500 MongoDB instances running on our in-house built “HULK” cloud platform, collectively serving 20 billion queries per day. Three particularly critical applications for our business are:
Location-based mobile search application. We use MongoDB with its geospatial indexes and queries to deliver geo-aware search results to mobile users. The user can be searching for anything, from a local restaurant, to a shop, to a car dealership. The app will detect their location and serve search results based on proximity. MongoDB handles 1.2 billion queries per day from this application.
Caching layer for user authentication data. Qihoo is a central portal for many Chinese Internet users. We have many partners that our users can connect to directly after logging into our site. We provide Single Sign On (SSO) to multiple services so users don’t need to keep providing their security credentials as they navigate around the web. The user’s SSO session is cached in MongoDB for ultra-fast access. MongoDB supports millions of concurrent users, handling 30,000 operations per second and 1.8 billion queries daily.
Log analytics platform. We need to know our infrastructure is running well. Our internal business users also want to measure user engagement with new promotions and campaigns. To accomplish this, we collect log data from all of our Linux, Apache web server and Tomcat servers, and stream it directly into MongoDB. From there, our internal business users can generate real time analytics and reports using our PHP-based Business Intelligence (BI) platform. MongoDB stores 2.5 billion documents at any one time across 18 shards configured with 3-node replica sets for always-on availability. MongoDB serves nearly 3 billion queries per day, including 1 billion writes.
What other databases do you use?
MongoDB is one of the three database technologies used in our company. It isn’t necessarily suitable for all applications, so we also use MySQL for relational data problems and Redis for certain caching use-cases. Over time, we have migrated more than a dozen projects from MySQL and Redis to MongoDB.
What factors drove this migration?
Our goal is to use the best technology where it best fits. In the case of MySQL, migration was driven by scalability and developer productivity. As a relational database, MySQL does not scale out, so as our user base grew above 100 million active users, we hit the limits of how far we could push MySQL. MongoDB auto-sharding allows us to scale on-demand using commodity hardware. The MongoDB data model is also far more flexible. Our developers can get more done and iterate faster with MongoDB than they can with the relational model.
In the case of Redis, the migrations were driven by cost and flexibility. We found that MongoDB meets our low latency caching requirements for many applications, while its on-disk persistence reduces the need to provision costly systems configured with high-memory footprints. In addition, there is much more you can do with MongoDB’s document data model than you can with Redis’ Key-Value model. This translates directly to richer application functionality. For applications where data volumes are expected to grow rapidly, we choose MongoDB over Redis.
Tell us about the platforms you are running MongoDB on.
Most of our applications are PHP based. We run CentOS on x86 hardware. We have standardized on local SSD storage as this gives us the best performance. We are running MongoDB 2.4 and the latest 2.6 releases. We are also looking forward to MongoDB 3.0!
How is MongoDB configured?
We run both single replica sets and sharded clusters, depending on the application. We have data centres across the country, with the main ones located in Beijing. We deploy MongoDB on our private cloud across multiple data centers, both for disaster recovery and for low latency local reads and writes. We don’t control our own fiber, so network quality is out of our control. For the most critical apps, we spin up identical MongoDB clusters in multiple data centers and use our own message queue to replicate between them – this gives us assurance of maintaining availability in the face of network partitions.
How do you manage your MongoDB deployment?
We have developed a centralized orchestration web platform, which we call the HULK cloud platform. It is used by nearly all of our technical engineers to control our mission critical infrastructure and services. It is a complex piece of engineering which we are very proud of. When we originally started the cloud platform project, we hoped it would allow our engineers to stand on the shoulders of giants, relying on the platform to speed up the time to market for their applications. Hence we named it “HULK”.
HULK currently provides elastic services such as Web, relational database, NoSQL and distributed storage. At the same time, the open platform concept attracted various internal teams to move their applications onto the platform. The re-platforming of these applications provided immediate access to other LoBs internally, and in the process of doing that we helped the business groups to attain higher efficiency and greater technology expertise.
MongoDB is one of the most critical services on HULK and it is fully integrated into the platform with a high degree of automation, allowing us to operate more than 1,500 MongoDB instances with just one and a half DBAs. The DBAs can perform “one click deployment” and “one click upgrade” tasks via the HULK management interface.
All backup and monitoring is fully automated. For instance, if you add a new MongoDB node or cluster, HULK automatically configures the monitoring and backup strategy and deploys the necessary agents. Developers can monitor a multitude of MongoDB metrics and status. In addition, they can open a ticket right on the management portal itself, instead of using email or IM, all with a few mouse clicks.
How do you back up MongoDB?
We use a combination of approaches, governed by the application’s RPO and RTO:
Filesystem backups. This is the default approach. We shut down a secondary replica set member and snapshot the filesystem image.
Incremental replication. For continuous backup, we have built a tool that tails the MongoDB oplog. We use this approach for more critical apps where we need faster restoration of service.
Delayed replicas. We use this approach for additional assurances, again governed by how quickly we need to bring the data back.
Can you share any best practices on scaling your MongoDB infrastructure?
There are three tips I would like to share:
From a DBA perspective, invest time to understand application usage. The developers will give their guidance, but we generally take any number they give us and add 50%!
If you encounter performance issues, start with your hardware. We found upgrading from hard disks to SSDs gave us an instant performance boost without any other optimizations.
For highly dynamic, write-intensive workloads, make sure you monitor storage fragmentation and compact regularly if needed.
Are you measuring the impact of MongoDB on your business?
Yes – in terms of time to market. An example of the impact this makes is our reaction to the 2014 earthquake in Yunnan province. Everyone in China wanted to have access to the latest updates and to be able to check in on friends and family in the region.
The business felt the best way to do this was to build an app that verified and then consolidated newsfeeds from multiple sources. We designed the app in the morning after the earthquake, coded it in the afternoon and launched it in the evening. One business day from concept to production. Only MongoDB could support that velocity of development.
Are you looking forward to MongoDB 3.0?
We started testing MongoDB 3.0 and filing bugs as soon as we could get our hands on the first Release Candidate. We are especially excited about document level concurrency control. This will further improve write scaling and fully saturate the latest generation of dense multi-core systems we are using now. Compression is also a huge benefit for us. We have standardized on SSDs, so compression means we can pack more onto each drive, which will bring costs down. It will also give us another performance boost as fewer bits are read from disk, making better use of disk I/O cycles.
What advice would you give to those considering using MongoDB for their next project?
MongoDB’s document data model and dynamic schema bring great flexibility and power. But they also bring great responsibility! I’d recommend not storing multitudes of different document types and formats within a single collection as it makes ongoing application maintenance complex. Split out documents of different types and structures into their own collections. We have implemented tools that scan and sample documents from each collection. If variances in structure exceed our best practices, we alert the devs so they can go and address the issue. So that is where I’d start.
Mr. Yang – I’d like to thank you for taking the time to share your insights with the MongoDB community.
Struggling to scale your relational database?
Download our Migration White Paper: Migration White Paper
About the Author - Mat Keep
Mat is part of the MongoDB product marketing team, responsible for building the vision, positioning and content for MongoDB’s products and services, including the analysis of market trends and customer requirements. Prior to MongoDB, Mat was director of product management at Oracle Corp. with responsibility for the MySQL database in web, telecoms, cloud and big data workloads. This followed a series of sales, business development and analyst / programmer positions with both technology vendors and end-user companies.