Retail Architecture Best Practices Part 1: Building a MongoDB Product Catalog

MongoDB
May 4, 2015 | Updated: December 12, 2022
#Technical

In today’s digital-first economy, flawless e-commerce experiences are integral to securing and maintaining customer loyalty.

We’re here to help. In this guide, we’ll discuss:

Why retailers prefer MongoDB
Guidance on data modeling for product catalogs
How to make products discoverable
Driving conversions with fast and relevance-based search

In order to deliver the AI/ML-enriched and hyper-personalized retail experiences that are table stakes in the digital ecosystem, retailers are moving away from monolithic architectures. The first focus for this transformation is the product catalog, which is the window into modern e-commerce platforms.

As consumers’ preferred shopping methods have shifted to e-commerce, their expectation of real-time updates has increased dramatically. A retailer’s ability to adapt to this change has a direct influence on customer loyalty and business growth. The lack of a real-time, consolidated view of customer history, orders, inventory, and supply chain network updates will hurt the bottom line.

Product catalog data management is a complex problem for retailers. After years of relying on multiple monolithic, vendor-provided systems, retailers have learned that product catalogs built on legacy databases are unsuitable for modern e-commerce experiences.

In today’s vendor-provided systems, product data must frequently be moved back and forth using ETL processes to ensure all systems (in-store, mobile shopping, and e-commerce) are operating on the same data set. This approach is slow, error prone, and expensive in terms of development and management. In response, retailers are now making data services available individually as part of a distributed microservices architecture.

Data models built on relational databases are hard to realign when new attributes are discovered or other modifications are required. Inconsistent and fragmented data structures (imagine a supplier introducing attributes that were previously unknown, such as size or availability season) can surface to end customers as inconsistent product information across channels. Many retailers using relational databases must introduce new tables to track newer attributes, and those changes take time and effort to propagate across channels that each maintain their own copy of the product data.

Single view, scalability, flexibility, and search

Modern retailers understand the need for improved web, mobile, and social media commerce experiences. Today’s customers expect the digital-first experience to follow them to brick and mortar stores, too. The next frontier of retail personalization is the seamless connection, and blurring, of the digital and physical retail experience.

How will retailers get to the hybrid future of shopping? Many are approaching the challenge by investing in cloud migration, typically with a “lift-and-shift” approach that replicates their on-premises infrastructure in the cloud. For a more transformational approach to retail modernization, a developer data platform that provides a single, unified view across all your data assets, along with scale, flexibility, and integrated search functionality, can thoroughly reshape floundering modernization strategies.

Single view, scalability, flexibility, and search are foundational for product catalog modernization. Here’s why:

Single view: A single view of customer data breaks down data silos, allowing retailers to utilize data to offer a consistent, integrated service. Data duplication is reduced, data consistency is increased, and a holistic view of the customer is made available to the application stack, so retailers can focus on offering stronger personalized client experiences.
Scalability: As retailers add more products to meet market demand and growth acceleration goals, including moving to markets in new regions, product catalog access slows. Retailers need application data platforms that can scale without sacrificing performance.
Flexibility: A product catalog contains the details of product attributes that vary widely and continue to evolve over the product life cycle. A flexible document model lets retailers evolve the diverse data sets found across retail data silos and use them to their advantage, without worrying about data duplication, stale data, or data silos. For example, a retailer may want to carry a specific product line, such as ice fishing gear, only in the northern regions of Canada where it is needed.
Search: Fast, sophisticated, intuitive, relevance-based search is key to providing a superior customer experience across channels, including recommendations, geospatial merchandising, and systems availability. A typical retailer sets up a separate search infrastructure, which becomes an additional piece of the puzzle that must be managed and maintained. MongoDB Atlas solves this problem with built-in search capabilities comparable to those of any Lucene-based search engine.

MongoDB helps developers build retail product catalogs with seven key solutions:

Omnichannel catalog
Single view
Real-time analytics
Payments
Fast & relevant full-text search
Real-time inventory and supply chain
Retail logistics: personalization

Why MongoDB works for mission critical applications

The MongoDB application data platform radically simplifies data architecture. While it provides the strengths commonly associated with traditional relational systems, such as ACID transactions, secondary indexes, unions, joins, security, and enterprise management, the real value comes from the application data platform approach.

MongoDB’s document model supports intuitive data modeling by mapping real-world objects to native data structures in a flexible manner. Idiomatic drivers enable developers to work with data as code: the MongoDB Query API and drivers are idiomatic to your programming language, whether it’s C#, Swift, Java, JavaScript, etc. Ad hoc queries, indexing, full-text search, and real-time aggregations provide powerful ways of accessing, grouping, transforming, searching, and analyzing data to support any class of workload.

With built-in scalability and distribution, you can build a global database that spans multiple cloud platforms.

In retailers’ product catalogs, a single item can have many variants, in some cases thousands. A single shoe style that comes in six different colors and 12 sizes already yields 72 variants. Product prices vary widely based on store or geography, and each product comes with reviews, promotions, data sheets, images, videos and more. Altogether, a single SKU for a product can represent tens or hundreds of different attributes, each of which can be unique to that SKU.

MongoDB’s document data model helps contain and organize the chaos of retailers’ data silos into a consolidated rich JSON document. This model enables retailers to dramatically reduce hours spent on database administration. Development teams are freed up to focus on launching new features instead of grappling with rigid, tabular data models that bear no relation to the product being modeled, and that need to be constantly changed as new products are added to the catalog.

The specifics: data models and per store pricing

Now that you’ve learned the key reasons why retailers choose MongoDB for e-commerce solutions, we’ll take a look at the specifics of how we put some of these to use in our retail reference architecture to support a number of features, including:

Searching for products and product variants
Retrieving per store pricing for items
Enabling catalog browsing with faceted search

Product Data Model

The first thing we need to consider is the data model for our items. In the following examples we are showing only the most important information about each item, such as category, brand and description:

{
	"_id": "30671", // main item ID
	"department": "Shoes",
	"category": "Shoes/Women/Pumps",
	"brand": "Calvin Klein",
	"thumbnail": "http://cdn.../pump.jpg",
	"title": "Evening Platform Pumps",
	"description": "Perfect for a casual night out or a formal event.",
	"style": "Designer",
	…
}

This type of simple data model allows us to easily query for items based on the most common predicates. For example, db.collection.findOne returns a single document that satisfies a query, while db.collection.find returns every matching document:

Get item by ID: db.definition.findOne({_id: "30671"})
Get items for a set of product IDs: db.definition.find({_id: {$in: ["30671", "452318"]}})
Get items by category prefix: db.definition.find({category: /^Shoes\/Women/})

Notice how the second and third queries used the $in operator and a regular expression, respectively. When the queried fields are indexed, MongoDB serves these common queries with high throughput and low latency.
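The category-prefix query relies on a regular expression anchored at the start of the string. As a minimal sketch, the helper below builds such a filter from a category path and escapes regex metacharacters so that the path is matched literally. The function name is hypothetical and not part of any MongoDB API. The sketch assumes the category field is indexed, in which case a case-sensitive, anchored prefix expression can make use of that index.

```javascript
// Hypothetical helper: build a filter matching every item whose category
// starts with the given path, e.g. "Shoes/Women".
function categoryPrefixFilter(categoryPath) {
  // Escape regex metacharacters so the path is matched literally.
  const escaped = categoryPath.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Anchor at the start of the string so the expression is a prefix match.
  return { category: new RegExp("^" + escaped) };
}

const prefixFilter = categoryPrefixFilter("Shoes/Women");
console.log(prefixFilter.category.test("Shoes/Women/Pumps")); // true
console.log(prefixFilter.category.test("Bags/Women"));        // false

// In the shell, the filter would be passed as db.definition.find(prefixFilter).
```

Escaping matters once category names contain characters such as "." or "+", which would otherwise be interpreted as regex syntax.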

Variant Data Model

Another important consideration for our product catalog is item variants, such as available sizes, colors, and styles. Our item data model above only captures a small amount of the data about each catalog item. So what about all of the available item variations we may need to retrieve, such as size and color?

One option is to store an item and all its variants together in a single document. This approach has the advantage of being able to retrieve an item and all variants in a single query, and if the number of variants and their associated data is small, it may make sense to store them in the item document. However, it is not the best approach in all cases, because it is an important best practice to avoid unbounded document growth.

Another option is to create a separate variant data model that can be referenced relative to the primary item:

{
	"_id": "93284847362823", // variant SKU
	"itemId": "30671", // references the main item
	"size": 6.0,
	"color": "red",
	…
}

This data model allows us to do fast lookups of specific item variants by their SKU number:

db.variation.find({_id: "93284847362823"})

As well as all variants for a specific item by querying on the itemId attribute:

db.variation.find({itemId: "30671"}).sort({_id: 1})

In this way, we maintain fast queries on both our primary item for displaying in our catalog, as well as every variant for when the user requests a more specific product view. We also ensure a predictable size for the item and variant documents. MongoDB’s document model not only gives the flexibility of a dynamic schema but also enforces governance through schema validation.
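To make the schema validation point concrete, here is a minimal sketch of a validator document for the variant collection. The field names follow the variant example above. The specific rules (which fields are required, which types are accepted) are assumptions for illustration, not rules taken from the reference architecture.

```javascript
// Validator document for the variant collection. Fields not listed under
// "properties" remain allowed, so the schema stays flexible.
const variantValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["_id", "itemId"],
    properties: {
      _id: { bsonType: "string", description: "variant SKU" },
      itemId: { bsonType: "string", description: "references the main item" },
      size: { bsonType: ["int", "long", "double", "decimal"] },
      color: { bsonType: "string" }
    }
  }
};

// In the shell, the validator would be attached when creating the collection:
//   db.createCollection("variation", { validator: variantValidator })
console.log(JSON.stringify(variantValidator.$jsonSchema.required));
```

With a validator like this in place, a variant that lacks an itemId would be rejected on insert, while suppliers remain free to add attributes the schema does not mention.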

Per store pricing

Another consideration when defining the reference architecture for our product catalog is pricing. We’ve now seen a few ways that the data model for our items can be structured to quickly retrieve items directly or based on specific attributes. Prices can vary by many factors, like store location, geographies, local currencies, and taxation laws. We need a way to quickly retrieve the specific price and other relevant factors of any given item or item variant. This can be very problematic for large retailers, since a catalog with a million items and one thousand stores means we must query across a collection of a billion documents to find the price of any given item.

We could, of course, store the price for each variant as a nested document within the item document, but a better solution is to again take advantage of how quickly MongoDB is able to query on _id. For example, if each item in our catalog is referenced by an itemId, while each variant is referenced by a SKU number, we can set the _id of each document to be a concatenation of the itemId or SKU and the storeId associated with that price variant. Using this model, the _id for the pair of pumps and its red variant described above would look something like this:

Item: 30671_store23
Variant: 93284847362823_store23

This approach also provides a lot of flexibility for handling pricing, as it allows us to price items at the item or the variant level. We can then query for all prices or just the price for a particular location:

All prices: db.prices.find({_id: /^30671_/})

Store price: db.prices.find({_id: "30671_store23"})

We could even add other combinations, such as pricing per store group, and get all possible prices for an item with a single query by using the $in operator:

db.prices.find({_id: {$in: [
	"30671_store23",
	"30671_sgroup12",
	"93284847362823_store23",
	"93284847362823_sgroup12"
]}})
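The composite _id convention can be sketched in a few lines of application code. The helper names below are hypothetical. The sketch only assumes the convention described above: an item ID or SKU, an underscore, and a price scope such as a store or store group.

```javascript
// Hypothetical helper for the composite _id convention:
// "<itemId or SKU>_<price scope>".
function priceId(itemOrSku, scope) {
  return `${itemOrSku}_${scope}`;
}

// Build a filter that fetches every candidate price for one variant:
// item-level and variant-level prices, for each scope that applies.
function priceFilter(itemId, sku, scopes) {
  const ids = [];
  for (const key of [itemId, sku]) {
    for (const scope of scopes) {
      ids.push(priceId(key, scope));
    }
  }
  return { _id: { $in: ids } };
}

const pricesFilter = priceFilter("30671", "93284847362823", ["store23", "sgroup12"]);
console.log(pricesFilter._id.$in);
// In the shell, the filter would be passed as db.prices.find(pricesFilter).
```

How the application ranks the returned prices, for example whether a variant-level store price wins over an item-level group price, is a policy decision that this sketch leaves open.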

Product catalog search: How hard is it to build full-text search?

Product catalogs contain highly variable information per product type or category. Complex product structures such as serialized items, bundles, and kits require an exhaustive process to update the catalog as well as to search it. The product search bar on an e-commerce website needs to be powered by search technology that infers the customer’s intent and surfaces the right products. A product catalog is only as good as the results its search engine provides. Customers expect to find what they are looking for based on a few keystrokes and intelligent guesses by the commerce platform. They depend on the search bar, which is essentially a window into your e-commerce store. The faster and more relevant search results must be to satisfy customer expectations, the more complex and difficult search is to build.

A common misperception is that a fast database can support the search requirements for e-commerce interfaces. But database queries are much less dynamic than search queries. Database queries are ideal for situations when developers know what queries to expect, based on how the application works. They can index fields based on common query patterns to improve performance, and queries are optimized for correctness and integrity. For example: “show me all navy blue Calvin Klein evening shoes in size 8”—it’s straightforward, there’s an exact answer, and I can create an index on the relevant brand, title, size, and color fields.

On the other hand, search queries are designed for situations where developers don’t know what user queries will look like in advance. They are often expressed as natural language questions and may require searching across many different fields. These queries are optimized for speed and relevance; they return results immediately and sort them by how closely they match the user’s search terms. For example: “show me all women’s pumps for evening and casual wear.” In this case, we have to consider the color, brand, image alt text, meta tags, etc. And the next query might be entirely different; it could ask, “show me all evening shoes priced between $50 and $100.”

Legacy product search architectures use several different systems to deliver this capability: data in an RDBMS, a separate search engine, and a caching layer that speeds up page rendering. The result is a complex architecture with a tremendous need for synchronization across systems.

Modern retail search requirements for e-commerce sites

Today’s online shoppers are savvy internet users. We’re all accustomed to the quick, robust search results we get from Google. Once again, there are 7 key solutions that MongoDB offers to help retailers build search experiences that run like Google and give consumers the relevant results they need to make quick and easy purchases.

Here are the key search requirements for e-commerce sites:

Fuzzy search, autocomplete, synonyms, and analyzers help users get the right search results quickly and easily.
Faceted search and counts help users efficiently navigate categorized search results on several attributes.
Highlighted extract snippets help users understand a product’s relevance to their search term.
Geospatial search allows users to filter and return results by location.
Response within milliseconds for hundreds or thousands of items.
Relevance tuning returns sponsored or preferred products higher up in the results set.
Page rendering/pagination, which requires deterministic ordering.

How retailers build search for product catalogs today

The way that developers in the retail industry build search today can be grouped into three buckets: simple database search, bolt-on search engine, and customized search.

Simple database search: examples of this are the $text and $regex operators in MongoDB. They are easy to use, and since they are ordinary database operators, no data sync is required. On the flip side, they are often slow, they are limited to text data, and they provide essentially no way to tune the relevance of results.
Bolt-on search engine: examples of this are Elasticsearch, Algolia, Amazon OpenSearch Service, Azure Cognitive Search, and in some cases, Solr. These are popular and well-known search engines available on the market and generally provide fast, relevant results. However, they are yet another system to develop against, pay for, and manage, and they require constantly syncing data to and from a database. We’ll dig more into the pains shortly.
Customized search: examples of this are Solr and Lucene. Lucene is the open source technology that powers many of these solutions, including Elasticsearch, Solr, and our very own Atlas Search. While it’s great for search and gives you complete control over which features to use, it’s incredibly expensive to manage and maintain yourself, and it requires seasoned experts on the team to run successfully.

Think twice before bolting on a search engine to your database

Option two for building out search, the bolt-on method, can lead to major complexity and architectural sprawl in your ecommerce platform.

First, bolting on a search engine to your database results in lower developer productivity. When it comes to building search functionality, developers have to use different drivers and query languages to access their database and search clusters. The learning curve and need to switch context can lead to wasted developer hours.

Second, architectural complexity. Since the data that needs to be searched is stored in the database cluster, there must be mechanisms in place to sync that data to the separate search cluster. This involves scripts and processes to transform and structure the data into a format that the search cluster can index and query. And when the underlying schema changes in the database, developers have to spend time coordinating those changes between the two systems.

Finally, there’s operational overhead. Every component in this architecture requires its own infrastructure and support to ensure that it’s highly available, secure, backed up, using the latest software, scaled to meet changing demands, etc.

All of this costs time, people and money. Think about it. You have three separate systems for database, search, and sync — even in fairly small companies, there could be 3-4 employees just maintaining this small part of the application.

How MongoDB helps: MongoDB Atlas Search

The biggest challenge for our product catalog is enabling users to browse and discover products using natural language queries and giving results quickly, even if the user doesn’t know the name of the product, or spells it incorrectly. While many users will want to search our product catalog for a specific item or criteria they are looking for, many others will want to browse, then narrow the returned results by any number of attributes.

Building a page that supports this kind of browsing raises many of the challenges mentioned above: response time, multiple attributes, variant-level attributes, multiple variants, page sorting, pagination, and typos. Another challenge for retailers is controlling which products are surfaced first in search results. For example, a company may be running a promotional campaign, prioritizing its own brand or excess stock on clearance. All of these complexities fall under the umbrella of relevance tuning.
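As a rough illustration of relevance tuning, the sketch below builds an Atlas Search aggregation pipeline that requires a match on the user’s terms and additionally boosts products flagged for promotion. The index name "default", the searched fields, the boolean field "promoted", and the boost value are all assumptions for illustration. The source does not describe how promotions are stored.

```javascript
// Build a $search stage that must match the user's terms and boosts
// documents where the (hypothetical) "promoted" flag is true.
function promotedSearchStage(query, boost) {
  return {
    $search: {
      index: "default",
      compound: {
        must: [{ text: { query: query, path: ["title", "description"] } }],
        should: [
          {
            equals: {
              path: "promoted",
              value: true,
              score: { boost: { value: boost } }
            }
          }
        ]
      }
    }
  };
}

const promotedPipeline = [
  promotedSearchStage("evening pumps", 3),
  { $limit: 20 },
  { $project: { title: 1, brand: 1, score: { $meta: "searchScore" } } }
];
console.log(JSON.stringify(promotedPipeline, null, 2));
// In the shell: db.definition.aggregate(promotedPipeline)
```

Because the promotion clause sits under should rather than must, unpromoted products still appear in the results; they simply rank lower than comparable promoted ones.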

For catalog and content search, MongoDB’s document model handles a massive variability of catalog and content data. MongoDB Atlas provides real-time analytics capabilities with time series collections to capture clickstreams and conversions.

Atlas Search provides all of the features you need to deliver rich and personalized experiences to your users, including fuzzy matching; autocomplete; lightning fast facets and counts; highlighting; relevance scoring; geospatial queries; and synonyms, all backed by support for multiple analyzers and languages. These capabilities come together to help you boost user engagement and improve customer satisfaction with your applications – from product catalog and content search to powering complex, ad-hoc queries in your line-of-business applications.
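Two of these features, fuzzy matching and facet counts, can be sketched as aggregation pipelines. The example below is a minimal sketch under stated assumptions: a search index named "default" exists on the item collection, it covers the title, description, and brand fields, and brand is mapped as a facetable string field. None of these index details come from the source.

```javascript
// Text operator that tolerates one edit (insertion, deletion, or
// substitution) per query term.
function fuzzyText(query) {
  return {
    text: {
      query: query,
      path: ["title", "description", "brand"],
      fuzzy: { maxEdits: 1 }
    }
  };
}

// Pipeline returning the first page of matching items.
function resultsPipeline(query, pageSize) {
  return [
    { $search: { index: "default", ...fuzzyText(query) } },
    { $limit: pageSize }
  ];
}

// Pipeline returning counts per brand for the same query.
function brandFacetPipeline(query) {
  return [
    {
      $searchMeta: {
        index: "default",
        facet: {
          operator: fuzzyText(query),
          facets: {
            brandFacet: { type: "string", path: "brand", numBuckets: 10 }
          }
        }
      }
    }
  ];
}

const userQuery = "evning pumps"; // note the misspelling
console.log(JSON.stringify(resultsPipeline(userQuery, 24)));
console.log(JSON.stringify(brandFacetPipeline(userQuery)));
// In the shell, each pipeline would be passed to db.definition.aggregate(...).
```

Running both pipelines against the database that already holds the catalog is what removes the separate search cluster and sync process discussed earlier.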

Atlas Search is part of MongoDB Atlas, the multi-cloud application data platform that combines transactional processing, relevance-based search, real-time analytics, mobile edge computing with cloud sync, and cloud data lake in an elegant and integrated data architecture. Through a flexible document data model and unified query interface, Atlas provides a superior developer experience to power almost any class of application. At the same time it meets the most demanding requirements for resilience, scale, and data privacy.

Most importantly, its geo-distributed clusters provide resilience and low latency, while auto-scaling and sharding handle promotional traffic spikes.

The benefits are immense: users can quickly find the most relevant matches using flexible search terms in any language, and can research and compare product and content categories. Your platform will be able to summarize product or content directly within the search results, and can boost preferred search results for promotions.

Learn more

Now that you have explored our e-commerce solutions for product catalogs and search, dive into our next post in the series: Approaches to inventory optimization.

← Previous

New Compression Options in MongoDB 3.0

MongoDB 3.0 introduces compression with the WiredTiger storage engine. In this post we will take a look at the different options, and show some examples of how the feature works. As always, YMMV, so we encourage you to test your own data and your own application. Why compression? Everyone knows storage is cheap, right? But chances are you’re adding data faster than storage prices are declining, so your net spend on storage is rising. Your internal costs might also incorporate management and other factors, so they may be significantly higher than commodity market prices. In other words, it still pays to look for ways to reduce your storage needs. Size is one factor, and there are others. Disk I/O latency is dominated by seek time on rotational storage. By decreasing the size of the data, fewer disk seeks will be necessary to retrieve a given quantity of data, and disk I/O throughput will improve. In terms of RAM, some compressed formats can be used without decompressing the data in memory. In these cases more data can fit in RAM, which improves performance. Storage properties of MongoDB There are two important features related to storage that affect how space is used in MongoDB: BSON and dynamic schema. MongoDB stores data in BSON, a binary encoding of JSON-like documents (BSON supports additional data types, such as dates, different types of numbers, binary). BSON is efficient to encode and decode, and it is easily traversable. However, BSON does not compress data, and it is possible its representation of data is actually larger than the JSON equivalent. One of the things users love about MongoDB’s document data model is dynamic schema. In most databases, the schema is described and maintained centrally in a catalog or system tables. Column names are stored once for all rows. This approach is efficient in terms of space, but it requires all data to be structured according to the schema. 
In MongoDB there is currently no central catalog: each document is self-describing. New fields can be added to a document without affecting other documents, and without registering the fields in a central catalog. The tradeoff is that with greater flexibility comes greater use of space. Field names are defined in every document. It is a best practice to use shorter field names when possible. However, it is also important not to take this too far – single letter field names or codes can obscure the field names, making it more difficult to use the data. Fortunately, traditional schema is not the only way to be space efficient. Compression is very effective for repeating values like field names, as well as much of the data stored in documents. There is no Universal Compression Compression is all around us: images (JPEG, GIF), audio (mp3), video (MPEG), and most web servers compress web pages before sending to your browser using gzip. Compression algorithms have been around for decades, and there are competitions that award innovation . Compression libraries rely on CPU and RAM to compress and decompress data, and each makes different tradeoffs in terms of compression rate, speed, and resource utilization. For example, one measure of today’s best compression library for text can compress 1GB of Wikipedia data to 124MB compared to 323MB for gzip, but it takes about almost 3,000 times longer and 30,000 times more memory to do so. Depending on your data and your application, one library may be much more effective for your needs than others. MongoDB 3.0 introduces WiredTiger, a new storage engine that supports compression. WiredTiger manages disk I/O using pages. Each page contains many BSON documents. As pages are written to disk they are compressed by default, and when they are read into the cache from disk they are decompressed. 
One of the basic concepts of compression is that repeating values – exact values as well as patterns – can be stored once in compressed form, reducing the total amount of space. Larger units of data tend to compress more effectively as there tend to be more repeating values. By compressing at the page level – commonly called block compression – WiredTiger can more efficiently compress data. WiredTiger supports multiple compression libraries. You can decide which option is best for you at the collection level. This is an important option – your access patterns and your data could be quite different across collections. For example, if you’re using GridFS to store large documents such as images and videos, MongoDB automatically breaks the large files into many smaller “chunks” and reassembles them when requested. The implementation of GridFS maintains two collections: fs.files, which contains the metadata for the large files and their associated chunks, and fs.chunks, which contains the large data broken into 255KB chunks. With images and videos, compression will probably be beneficial for the fs.files collection, but the data contained in fs.chunks is probably already compressed, and so it may make sense to disable compression for this collection. Compression options in MongoDB 3.0 In MongoDB 3.0, WiredTiger provides three compression options for collections: No compression Snappy (enabled by default) – very good compression, efficient use of resources zlib (similar to gzip) – excellent compression, but more resource intensive There are two compression options for indexes: No compression Prefix (enabled by default) – good compression, efficient use of resources You may wonder why the compression options for indexes are different than those for collections. Prefix compression is fairly simple – the “prefix” of values is deduplicated from the data set. 
This is especially effective for some data sets, like those with low cardinality (eg, country), or those with repeating values, like phone numbers, social security codes, and geo-coordinates. It is especially effective for compound indexes, where the first field is repeated with all the unique values of second field. Prefix indexes also provide one very important advantage over Snappy or zlib – queries operate directly on the compressed indexes, including covering queries. When compressed collection data is accessed from disk, it is decompressed in cache. With prefix compression, indexes can remain compressed in RAM. We tend to see very good compression with indexes using prefix compression, which means that in most cases you can fit more of your indexes in RAM without sacrificing performance for reads, and with very modest impact to writes. The compression rate will vary significantly depending on the cardinality of your data and whether you use compound indexes. Some things to keep in mind that apply to all the compression options in MongoDB 3.0: Random data does not compress well Binary data does not compress well (it may already be compressed) Text compresses especially well Field names compress well in documents (the additional benefits of short field names are modest) Compression is enabled by default for collections and indexes in the WiredTiger storage engine. To explicitly set the compression for the replica at startup, specify the appropriate options in the YAML config file . use the command line option -- wiredTigerCollectionBlockCompressor . Because WiredTiger is not the default storage engine in MongoDB 3.0, you’ll also need to specify the -- storageEngine option to use WiredTiger and take advantage of these compression features. To specify compression for specific collections, you’ll need to override the defaults by passing the appropriate options in the db.createCollection() command. 
For example, to create a collection called email using the zlib compression library: db.createCollection( "email", { storageEngine: { wiredTiger: { configString: "blockCompressor=zlib" }}}) How to measure compression The best way to measure compression is to separately load the data with and without compression enabled, then compare the two sizes. The db.stats() command returns many different storage statistics, but the two that matter for this comparison are storageSize and indexSize. Values are returned in bytes, but you can convert to MB by passing in 1024*1024: > db.stats(1024*1024).dataSize + db.stats(1024*1024).indexSize 1406.9201011657715 This is the method we used for the comparisons provided below. Testing compression on different data sets Let’s take a look at some different data sets to see how some of the compression options perform. We have four databases: Enron This is the infamous Enron email corpus . It includes about a half million emails. There’s a great deal of text in the email body fields, and some of the metadata has low cardinality, which means that they’re both likely to compress well. 
Here’s an example (the email body is truncated):

    {
      "_id" : ObjectId("4f16fc97d1e2d32371003e27"),
      "body" : "",
      "subFolder" : "notes_inbox",
      "mailbox" : "bass-e",
      "filename" : "450.",
      "headers" : {
        "X-cc" : "",
        "From" : "michael.simmons@enron.com",
        "Subject" : "Re: Plays and other information",
        "X-Folder" : "\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox",
        "Content-Transfer-Encoding" : "7bit",
        "X-bcc" : "",
        "To" : "eric.bass@enron.com",
        "X-Origin" : "Bass-E",
        "X-FileName" : "ebass.nsf",
        "X-From" : "Michael Simmons",
        "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)",
        "X-To" : "Eric Bass",
        "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>",
        "Content-Type" : "text/plain; charset=us-ascii",
        "Mime-Version" : "1.0"
      }
    }

Here’s how the different options performed with the Enron database:

Flights

The US Federal Aviation Administration (FAA) provides data about on-time performance of airlines. Each flight is represented as a document. Many of the fields have low cardinality, so we expect this data set to compress well:

    {
      "_id" : ObjectId("53d81b734aaa3856391da1fb"),
      "origin" : {
        "airport_seq_id" : 1247802,
        "name" : "JFK",
        "wac" : 22,
        "state_fips" : 36,
        "airport_id" : 12478,
        "state_abr" : "NY",
        "city_name" : "New York, NY",
        "city_market_id" : 31703,
        "state_nm" : "New York"
      },
      "arr" : {
        "delay_group" : 0,
        "time" : ISODate("2014-01-01T12:38:00Z"),
        "del15" : 0,
        "delay" : 13,
        "delay_new" : 13,
        "time_blk" : "1200-1259"
      },
      "crs_arr_time" : ISODate("2014-01-01T12:25:00Z"),
      "delays" : { "dep" : 14, "arr" : 13 },
      "taxi_in" : 5,
      "distance_group" : 10,
      "fl_date" : ISODate("2014-01-01T00:00:00Z"),
      "actual_elapsed_time" : 384,
      "wheels_off" : ISODate("2014-01-01T09:34:00Z"),
      "fl_num" : 1,
      "div_airport_landings" : 0,
      "diverted" : 0,
      "wheels_on" : ISODate("2014-01-01T12:33:00Z"),
      "crs_elapsed_time" : 385,
      "dest" : {
        "airport_seq_id" : 1289203,
        "state_nm" : "California",
        "wac" : 91,
        "state_fips" : 6,
        "airport_id" : 12892,
        "state_abr" : "CA",
        "city_name" : "Los Angeles, CA",
        "city_market_id" : 32575
      },
      "crs_dep_time" : ISODate("2014-01-01T09:00:00Z"),
      "cancelled" : 0,
      "unique_carrier" : "AA",
      "taxi_out" : 20,
      "tail_num" : "N338AA",
      "air_time" : 359,
      "carrier" : "AA",
      "airline_id" : 19805,
      "dep" : {
        "delay_group" : 0,
        "time" : ISODate("2014-01-01T09:14:00Z"),
        "del15" : 0,
        "delay" : 14,
        "delay_new" : 14,
        "time_blk" : "0900-0959"
      },
      "distance" : 2475
    }

Here’s how the different options performed with the Flights database:

MongoDB Config Database

This is the metadata MongoDB stores in the config database for sharded clusters. The manual describes the various collections in that database. Here’s an example from the chunks collection, which stores a document for each chunk in the cluster:

    {
      "_id" : "mydb.foo-a_\"cat\"",
      "lastmod" : Timestamp(1000, 3),
      "lastmodEpoch" : ObjectId("5078407bd58b175c5c225fdc"),
      "ns" : "mydb.foo",
      "min" : { "animal" : "cat" },
      "max" : { "animal" : "dog" },
      "shard" : "shard0004"
    }

Here’s how the different options performed with the config database:

TPC-H

TPC-H is a classic benchmark used for testing relational analytical DBMSs. The schema has been modified to use MongoDB’s document model.
Here’s an example from the orders table with only the first of many line items displayed for this order:

    {
      "_id" : 1,
      "cname" : "Customer#000036901",
      "status" : "O",
      "totalprice" : 173665.47,
      "orderdate" : ISODate("1996-01-02T00:00:00Z"),
      "comment" : "instructions sleep furiously among ",
      "lineitems" : [
        {
          "lineitem" : 1,
          "mfgr" : "Manufacturer#4",
          "brand" : "Brand#44",
          "type" : "PROMO BRUSHED NICKEL",
          "container" : "JUMBO JAR",
          "quantity" : 17,
          "returnflag" : "N",
          "linestatus" : "O",
          "extprice" : 21168.23,
          "discount" : 0.04,
          "shipinstr" : "DELIVER IN PERSON",
          "realPrice" : 20321.5008,
          "shipmode" : "TRUCK",
          "commitDate" : ISODate("1996-02-12T00:00:00Z"),
          "shipDate" : ISODate("1996-03-13T00:00:00Z"),
          "receiptDate" : ISODate("1996-03-22T00:00:00Z"),
          "tax" : 0.02,
          "size" : 9,
          "nation" : "UNITED KINGDOM",
          "region" : "EUROPE"
        }
      ]
    }

Here’s how the different options performed with the TPC-H database:

Twitter

This is a database of 200K tweets. Here’s a simulated tweet introducing our Java 3.0 driver:

    {
      "coordinates": null,
      "created_at": "Fri April 25 16:02:46 +0000 2010",
      "favorited": false,
      "truncated": false,
      "id_str": "",
      "entities": {
        "urls": [ { "expanded_url": null, "url": "http://mongodb.com", "indices": [ 69, 100 ] } ],
        "hashtags": [ ],
        "user_mentions": [ { "name": "MongoDB", "id_str": "", "id": null, "indices": [ 25, 30 ], "screen_name": "MongoDB" } ]
      },
      "in_reply_to_user_id_str": null,
      "text": "Introducing the #Java 3.0 driver for #MongoDB http://buff.ly/1DmMTKu",
      "contributors": null,
      "id": null,
      "retweet_count": 12,
      "in_reply_to_status_id_str": null,
      "geo": null,
      "retweeted": true,
      "in_reply_to_user_id": null,
      "user": {
        "profile_sidebar_border_color": "C0DEED",
        "name": "MongoDB",
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_background_tile": false,
        "profile_image_url": "",
        "location": "New York, NY",
        "created_at": "Fri April 25 23:22:09 +0000 2008",
        "id_str": "",
        "follow_request_sent": false,
        "profile_link_color": "",
        "favourites_count": 1,
        "url": "http://mongodb.com",
        "contributors_enabled": false,
        "utc_offset": -25200,
        "id": null,
        "profile_use_background_image": true,
        "listed_count": null,
        "protected": false,
        "lang": "en",
        "profile_text_color": "",
        "followers_count": 159678,
        "time_zone": "Eastern Time (US & Canada)",
        "verified": false,
        "geo_enabled": true,
        "profile_background_color": "",
        "notifications": false,
        "description": "Community conversation around the MongoDB software. For official company news, follow @mongodbinc.",
        "friends_count": ,
        "profile_background_image_url": "",
        "statuses_count": 7311,
        "screen_name": "MongoDB",
        "following": false,
        "show_all_inline_media": false
      },
      "in_reply_to_screen_name": null,
      "source": "web",
      "place": null,
      "in_reply_to_status_id": null
    }

Here’s how the different options performed with the Twitter database:

Comparing compression rates

The varying sizes of these databases make them difficult to compare side by side in terms of absolute size. We can take a closer look at the benefits by comparing the storage savings provided by each option. To do this, we compare the size of each database using Snappy and zlib to the uncompressed size in WiredTiger. As above, we’re adding the value of storageSize and indexSize.

Another way some people describe the benefits of compression is in terms of the ratio of the uncompressed size to the compressed size. Here’s how Snappy and zlib perform across the five databases.

How to test your own data

There are two simple ways for you to test compression with your data in MongoDB 3.0.

If you’ve already upgraded to MongoDB 3.0, you can simply add a new secondary to your replica set with the option to use the WiredTiger storage engine specified at startup. While you’re at it, make this replica hidden with 0 votes so that it won’t affect your deployment. This new replica set member will perform an initial sync with one of your existing secondaries.
After the initial sync is complete, you can remove the WiredTiger replica from your replica set, then connect to it as a standalone to compare the size of your databases as described above. You can repeat this process for each compression option you want to test.

Another option is to take a mongodump of your data and restore it into a standalone MongoDB 3.0 instance. By default your collections will use the Snappy compression option, but you can specify different options by first creating the collections with the appropriate setting before running mongorestore, or by starting mongod with different compression options. This approach has the advantage that you can dump and restore only specific databases, collections, or subsets of collections for your testing.

For examples of syntax for setting compression options, see the section “How to use compression.”

A note on capped collections

Capped collections are implemented very differently in the MMAP storage engines as compared to WiredTiger (and RocksDB). In MMAP, space is allocated for the capped collection at creation time, whereas for WiredTiger and RocksDB space is only allocated as data is added to the capped collection. If you have many empty or mostly empty capped collections, comparisons between the different storage engines may be somewhat misleading for this reason.

If you’re considering updating your version of MongoDB, take a look at our Major Version Upgrade consulting services.

About the Author - Asya

Asya is Lead Product Manager at MongoDB. She joined MongoDB as one of the company's first Solutions Architects. Prior to MongoDB, Asya spent seven years in similar positions at Coverity, a leading development testing company. Before that she spent twelve years working with databases as a developer, DBA, data architect, and data warehousing specialist.

April 30, 2015

Five Languages, One Goal: A Developer's Path to Certification Mastery

MongoDB Community Creator Markandey Pathak has become a certified developer in five different programming languages: C#, Java, Node.js, PHP, and Python. Pursuing multiple certifications equips developers with a diverse skill set, making them invaluable team members. Fluency across different programming languages enables them to foster platform-agnostic solutions and promote adaptability, collaboration, and informed decision-making, which are crucial for success in the global tech landscape.

To understand what led Markandey to take on so many certifications while managing a busy and successful career, we spoke with him to gain insights into the challenges and triumphs he faced.

What motivated you to pursue certification in multiple programming languages, and how has achieving such a diverse set of skills impacted your career?

C was the first programming language I learned, followed by C# and the .NET ecosystem a few years later. Transitioning to a new language like C# after knowing one was straightforward. I then delved into ASP.NET, Java, and subsequently PHP. Despite the differing syntax of these languages, I found that fundamental programming concepts remained consistent. This enlightening realization led me to explore JavaScript and, later, Python. Such a diverse skill set made me a go-to resource for many senior leaders seeking insights. This versatility allowed me to transcend categorization based on programming ecosystems in the workplace, evolving my mindset to develop platform-agnostic solutions. I believe in the adage of being a jack of all trades while still mastering one or more. I took on the challenge of discovering MongoDB drivers available for various platforms. I created sample applications to practice basic MongoDB concepts using specific drivers, and soon, everything fell into place effortlessly.

What tips or advice would you share with someone who looks up to your achievement and aspires to become a certified developer in multiple languages like C#, Java, Node.js, PHP, and Python? How can they effectively approach learning and mastering these languages?

Before attempting proficiency in MongoDB across multiple languages, it's crucial to prioritize understanding fundamental concepts such as data modeling practices, CRUD operations, and indexes. Mastering MongoDB's shell, mongosh, is essential to grasp the workings of MongoDB's read and write operations. Following this, individuals should select the programming environment they're most adept in and practice executing MongoDB operations within that ecosystem. Constructing a personal project can help them observe various MongoDB concepts in action. Utilizing resources such as MongoDB Certification Learning Paths, practice tests, and MongoDB Documentation is vital for excelling in certification exams. Additionally, it's advisable to undertake the initial certification in the programming language one feels most comfortable with. Reflection is key; saving or emailing exam scores enables individuals to identify areas needing improvement for future attempts.

With proficiency in C#, Java, Node.js, PHP, and Python, how do you perceive the role of versatility in today's tech industry, especially regarding job opportunities and project flexibility?

Programming languages, very much like spoken languages, are merely a medium. The most important thing is knowing what to say. The tech industry depends on problems, and developers seek solutions to them. Once they have a solution, programming languages help make those solutions a reality. It’s not hard to learn different programming languages or even to master them. Knowing the basics of different programming ecosystems can give developers an edge regarding job opportunities. It makes them flexible and enables them to make crucial and informed decisions in choosing the correct tech stack or defining good architecture for solutions.

In your experience, how does fluency in multiple languages enhance collaboration and innovation within development teams, particularly in today's globalized tech landscape?

Fluency or even practical awareness about programming languages or ecosystems promotes versatility in problem-solving, facilitates cross-functional collaboration, supports agile development, enables integration with legacy systems, fosters global collaboration, reduces dependency, and empowers informed decision-making, all of which are crucial for staying competitive in today's globalized tech landscape.

As a MongoDB Community Creator, how do you leverage your expertise in these five languages to contribute to and engage with the broader tech community? What advice would you offer aspiring developers seeking to expand their skill set?

I aim to open-source my MongoDB-focused projects across various ecosystems, accompanied by detailed articles outlining their construction. Since these projects were designed with exams in mind, they serve as skill-testing tools for developers and comprehensive guides to the various components comprising certification exams. I advocate for developers to choose a favorite language and compare others to it, as this approach facilitates a quicker and more efficient understanding of concepts. Relating new information to familiar concepts makes learning easier and more effective.

The MongoDB Community Advocacy Program is a vibrant global community designed for MongoDB enthusiasts who are passionate about advocating for the platform. Our Community Creators Program welcomes members of all skill levels eager to deepen their involvement in advancing MongoDB's community and technology. We empower our members to expand their expertise, visibility, and leadership by actively engaging with and advocating for MongoDB technologies among users worldwide. Join us and amplify your impact within the MongoDB community!

Elevate your career with MongoDB University's 1,000+ learning assets. Access free courses and hands-on labs, and earn certifications to boost your skills and stand out in tech.

April 24, 2024