Last month, 10gen announced its sponsorship for hackNY, the non-profit aiming to federate the next generation of hackers for New York City. We’ve been longtime supporters of hackNY and were excited to present the founders, Evan Korth and Chris Wiggins, with a donation of $75,000 at MongoNYC.
At the conference, Chris and Evan sat for a brief interview, where they talked about how technology is transforming the key industries in New York City, from media, advertising, publishing, finance and beyond. They explained how hackNY’s model, which organizes student hackathons and summer fellowship programs, give students practical, hands-on experience with programming that they don’t receive at university. Open source technologies like MongoDB are a great fit for hackathons since they enable developers to rapidly prototype, with the knowledge that they can scale their applications.
Libbson is a new shared library written in C for developers wanting to work with the BSON serialization format. Its API will feel natural to C programmers but can also be used as the base of a C extension in higher-level MongoDB drivers. The library contains everything you would expect from a BSON implementation. It has the ability to work with documents in their serialized form, iterating elements within a document, overwriting fields in place, Object Id generation, JSON conversion, data validation, and more. Some lessons were learned along the way that are beneficial for those choosing to implement BSON themselves. Improving small document performance A common use case of BSON is for relatively small documents. This has a profound impact on the memory allocator in userspace, causing what is commonly known as “memory fragmentation". Memory fragmentation can make it more difficult for your allocator to locate a contiguous region of memory. In addition to increasing allocation latency, it increases the memory requirements of your application to overcome that fragmentation. To help with this issue, the bson_t structure contains 120 bytes of inline space that allows BSON documents to be built directly on the stack as opposed to the heap. When the document size grows past 120 bytes it will automatically migrate to a heap allocation. Additionally, bson_t will grow it’s buffers in powers of two. This is standard when working with buffers and arrays as it amortizes the overhead of growing the buffer versus calling realloc() every time data is appended. 120 bytes was chosen to align bson_t to the size of two sequential cachelines on x86_64 (each 64 bytes). This may change based on future research, but not before a stable ABI has been reached. Single allocation for nested documents One strength of BSON is it’s ability to nest objects and arrays. Often times when serializing these nested documents, each sub-document is serialized independently and then appended to the parents buffer. As you might imagine, this takes quite the toll on the allocator. It can generate many small allocations which were only created to have been immediately discarded after appending to the parents buffer. Libbson allows for building sub-documents directly into the parent documents buffer. Doing so helps avoid this costly fragmentation. The topmost document will grow its underlying buffers in powers of two each time the allocation would overflow. Parsing BSON documents from network buffers Another common area for allocator fragmentation is during BSON document parsing. Libbson allows parsing and iteration of BSON documents directly from your incoming network buffer. This means the only allocations created are those needed for your higher level language such as a PyDict if writing a Python extension. Developers writing C extensions for their driver may choose to implement a “generator" style parsing of documents to help keep memory fragmentation low. A technique we’re yet to explore is implementing a hashtable-esque structure backed by BSON, only deserializing the entire buffer after a threshold of keys have been accessed. Generating BSON documents into network buffers Much like parsing BSON documents, generating documents and placing them into your network buffers can be hard on your memory allocator. To help keep this fragmentation down, Libbson provides support for serializing your document to BSON directly within a buffer of your choosing. This is ideal for situations such as writing a sequence of BSON documents into a MongoDB message. Generating Object Ids without Synchronization Applications are often doing ObjectId generation, especially in high insert environments. The uniqueness of generated ObjectIds is critical to avoiding duplicate key errors across multiple nodes. Highly threaded environments create a local contention point slowing the rate of generation. This is because the threads must synchronize on the increment counter of each sequential ObjectId. Failure to do so could cause collisions that would not be detected until after a network round-trip. Most drivers implement the synchronization with an atomic increment or a mutex if atomics are not available. Libbson will use atomic increments and in some cases avoid synchronization altogether if possible. One such case is a non-threaded environment. Another is when running on Linux as both threads and processes are in the same namespace. This allows the use of the thread identifier as the pid within the ObjectId. You can find Libbson at https://github.com/mongodb/libbson and discuss design choices with its author, Christian Hergert, who can be found on twitter as @hergertme .
MongoDB Query API Webinar: FAQ
Last week we held a live webinar on the MongoDB Query API and our lineup of idiomatic programming language drivers. There were many great questions during the session, and in this post, what I want to do is share the most frequently asked ones with you. But first - here is a quick summary of what MongoDB Query API is all about if you are unfamiliar with it. What is MongoDB Query API? MongoDB is built upon the document data model . The document model is designed to be intuitive, flexible, universal, and powerful. You can easily work with a variety of data, and because documents map directly to the objects in your code, it fits naturally in your app development experience. MongoDB Query API lets you work with data as code and build any class of application faster by giving you extensive query capabilities natively in any modern programming language. Whether you’re working with transactional data, looking for search capabilities, or trying to run sophisticated real-time analytics, MongoDB Query API can meet your needs. MongoDB Query API has some unique features like its expressive query, primary and secondary indexes, powerful aggregations and transformations, on-demand materialized views, and more — enabling you to work with data of any structure, at any scale. Some key features to highlight: Indexes To optimize any workload and query pattern you can take advantage of a large set of index types like multi-key (for arrays), wildcard, geospatial, and more and index any field no matter how deeply nested it is within your documents. Fully featured secondary indexes are document-optimized and include partial, unique, case insensitive, and sparse. Aggregation Pipeline Aggregation pipeline lets you group, transform, and analyze your data to support any class of workload. You can choose from dozens of aggregation stages and over 200 operators to build modular and expressive pipelines. You can also use low-code tools like MongoDB Compass to drag and drop stages, examine intermediate output, and export to your programming language of choice. On-Demand Materialized Views The powerful $merge aggregation stage allows you to combine the results of your aggregation pipeline with existing collections to update and enrich data without having to recompute your entire data set. You can output results to sharded and unsharded collections while simultaneously defining indexes on each view Geospatial and Graph Utilize MongoDB’s built-in natively ability to store and run queries against geospatial data Use operators like $graphLookup to quickly traverse connected data sets These are just a few of the features we highlighted in the MongoDB Query API webinar. No matter what type of application you are thinking of building or managing, MongoDB Query API can meet your needs as the needs of your users and application change. FAQs for MongoDB Query API Here are the most common questions asked during the webinar: Do we have access to the data sets presented in this webinar? Yes, you can easily create a cluster and load the sample data sets into Atlas. Instructions on how to get started are here . How can I access full-text search capabilities? Text search is a standard feature of MongoDB Atlas. You can go to cloud.mongodb.com to try it out using sample data sets. Does VS code plugin support Aggregation? Yes, it does. You can learn more about the VS code plugin on our docs page. If you need to pass variable values in the aggregation, say the price range from the app as an input, how would you do that? This is no different than sending a query - since you construct your aggregation in your application you just fill in the field you want with value/variable in your code. Is there any best practice document on MongoDB query API to have stable performance and utilize minimum resources? Yes, we have tips and tricks on optimizing performance by utilizing indexes, filters, and tools here . Does MongoDB support the use of multiple different indexes to meet the needs of a single query? Yes, this can be accomplished by the use of compound indexes. You can learn more about it in our docs here . If you work with big data and create a collection, is it smarter to create indexes first or after the collection is filled (regarding the time to create a collection)? It is better to create the indexes first as they will take less time to create if the collection is empty, but you still have an option to create the index once the data is there in the collection. There are multiple great benefits of MongoDB’s indexing capabilities: When building indexes, there is no impact on your app’s availability since the index operation is online. Flexibility to add and remove indexes at any time. Ability to hide indexes to evaluate the impact of removing them before officially dropping them. Where do I go to learn more? Here are some resources to help you get started: MongoDB Query API page MongoDB University MongoDB Docs You can also check out the webinar replay here .