This is a guest post by Yash Badiani, Practice Head - Big Data, CIGNEX Datamatics.
Record keeping and document archiving are such common practices within enterprises that their importance often goes unrecognized. An efficient archivist was someone who preserved records so systematically that archives filed decades ago could be retrieved in a matter of minutes. But once computers took on a central role in enterprise operations, the volume of data to be managed went beyond what an archivist could handle. The digital data explosion has paved the way for content archival applications that can seamlessly manage operational data.
The Evolution of Data Archival Solutions
As a first step towards data archival solutions, enterprises used turnkey applications to archive emails, legal documents, invoices, and other important documents, leveraging the raw disk capacity and the security footprint of those applications. But the world of data archiving was set for another leap: the data to be archived grew not only in VOLUME but also in VARIETY, as enterprises recognized that their data archiving policies weren't limited to storing emails and invoices but also included log management, enterprise videos, audio, images on the web, social media feeds, audit trails, data from online transactions, etc.
In addition, the immense metadata associated with all this content wasn't fully leveraged, which made content retrieval difficult. Suddenly, the resident enterprise application was under scrutiny, much like our pedantic data archivist, as the incoming data exceeded what it was designed to handle. But the challenge was not limited to massive volume and variety:
- Compliance with regulatory document retention practices
- Ability to retrieve required information instantaneously through complex search queries
- Enterprise applications focused on Information Lifecycle Management often carry heavy licensing and acquisition costs. At times these costs are tied to scalability requirements, making it difficult to tailor the solution to end-user needs
Not Yet Enough for Big Data
The advent of Big Data has given us a new way to address these challenges. Big Data technologies can store large volumes of structured and unstructured data, arriving at high rates, all at low cost, which makes them well suited to serve as a data archival solution.
Here are the key requirements of a data archival solution:
- Scalable to large volumes and variety of data
- Tiered storage: high availability and wide accessibility (Web, mobile)
- Support for analytical and content applications
- Supports workflow automation
- Integrates legacy applications
- Runs on public, private and hybrid cloud environments
- Ability to self-heal without customer intervention
MongoDB as the Foundation for a Scalable Data Archival Solution
Going through the wish list above, it doesn't take long to recognize that MongoDB passes the litmus test. Below is one design we architected that leverages MongoDB as a scalable back end for an enterprise-class data archival solution:
- Scalable service layer - RESTful web service API layer enabling enterprises to integrate with the front-end application of their choice, scalable to handle high throughput and request volumes (ingestion of petabytes of data/day)
- Data persistence layer - Leverages GridFS to store large binary files and MongoDB collections for the associated metadata. We can also use sharding for better write distribution, thereby improving the solution's performance
- Indexing/Searching layer - While MongoDB offers secondary indexing, we can integrate Solr to leverage features like quick response time, full text search, faceted and range search, hit highlighting, etc.
- Synchronization layer - A controller that synchronizes persistence of files with indexing of metadata while queuing incoming requests
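To make the interplay between the persistence, indexing, and synchronization layers more concrete, here is a minimal Python sketch. It is not our production implementation: the class names (`ArchiveRequest`, `InMemoryFileStore`, `InMemoryIndex`, `ArchiveController`) are hypothetical, and both back ends are in-memory stand-ins so the example runs without a database. In a real deployment the store would be backed by GridFS and the index by Solr. The chunk size is an illustrative value only.

```python
import hashlib
import queue
from dataclasses import dataclass, field


@dataclass
class ArchiveRequest:
    """One incoming document to archive (hypothetical shape)."""
    filename: str
    payload: bytes
    metadata: dict = field(default_factory=dict)


class InMemoryFileStore:
    """Stand-in for the persistence layer: stores a file as fixed-size chunks,
    which is the same basic idea GridFS uses for large binary files."""

    def __init__(self, chunk_size=256 * 1024):  # illustrative chunk size
        self.chunk_size = chunk_size
        self.chunks = {}  # file_id -> list of byte chunks

    def put(self, payload: bytes) -> str:
        # Use a content hash as the file id (an assumption of this sketch).
        file_id = hashlib.sha256(payload).hexdigest()
        self.chunks[file_id] = [
            payload[i:i + self.chunk_size]
            for i in range(0, len(payload), self.chunk_size)
        ]
        return file_id

    def get(self, file_id: str) -> bytes:
        return b"".join(self.chunks[file_id])


class InMemoryIndex:
    """Stand-in for the search layer: a tiny inverted index over metadata values."""

    def __init__(self):
        self.postings = {}  # term -> set of file ids

    def add(self, file_id: str, metadata: dict) -> None:
        for value in metadata.values():
            for term in str(value).lower().split():
                self.postings.setdefault(term, set()).add(file_id)

    def search(self, term: str) -> set:
        return self.postings.get(term.lower(), set())


class ArchiveController:
    """Synchronization layer: queues requests, then stores each file
    before indexing its metadata, so the index never points at a missing file."""

    def __init__(self, store, index):
        self.store = store
        self.index = index
        self.pending = queue.Queue()

    def submit(self, request: ArchiveRequest) -> None:
        self.pending.put(request)

    def drain(self) -> list:
        archived = []
        while not self.pending.empty():
            request = self.pending.get()
            file_id = self.store.put(request.payload)
            self.index.add(file_id, {"filename": request.filename, **request.metadata})
            archived.append(file_id)
        return archived


if __name__ == "__main__":
    controller = ArchiveController(InMemoryFileStore(), InMemoryIndex())
    controller.submit(ArchiveRequest("invoice-001.pdf", b"%PDF...", {"department": "Finance"}))
    for file_id in controller.drain():
        print("archived", file_id)
```

Storing the file before indexing it is a deliberate ordering choice in this sketch: if indexing fails, the worst case is an archived file that is not yet searchable, rather than a search hit that points to nothing.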
Beyond these design features, MongoDB's rich driver support offers numerous advantages in designing applications and integrating them with front-end technologies. Its replicated setup allows us to keep systems up to date with no downtime. The application is deployable in the cloud as SaaS and allows analytics on stored objects.
Benefits of a MongoDB-based Archival Solution
Among other things, a MongoDB-based data archival solution offers the following benefits:
- Extendable solution designed to accomplish long-term storage needs
- Fast and effective search of content by name, keywords or even complete text
- Cost effective, runs on commodity hardware
A data archival solution leveraging MongoDB would offer tremendous value for various enterprise use cases. For example, consider the Media and Publishing market. A news website might produce a huge amount of content each day, including news articles, feeds for readers, related videos and audio content, images, logs, user comments and chat transcripts. Not only would such an organization produce such varied content, but it would also need to archive the content for long-term retention and future reference. In addition, archival of articles is becoming standard procedure for compliance, auditability, and litigation support purposes.
By designing a data archival solution leveraging MongoDB, the data archivist not only gains business agility but also benefits from a broader scope for analysis and less dependence on IT for organizing her files.
The data archival space has come a long way. With an enterprise data archival solution leveraging MongoDB, we can be assured that the challenges around VOLUME, VARIETY & VELOCITY of data can be handled in an agile and elegant way.
CIGNEX Datamatics Inc. (a subsidiary of Datamatics Global Services Ltd.) is the global leader in Commercial Open Source Enterprise solutions and a global partner of 10gen (MongoDB) offering advisory consulting, implementation, and support services around the MongoDB application. Since 2000, CIGNEX Datamatics has implemented over 400 Open Source enterprise solutions addressing enterprise requirements ranging from Portals and Content to Big Data solutions.
For more details, contact: Yash Badiani at yash dot badiani at cignex dot com.