When you need to ingest, process, and analyze data sets that are too large or too complex for conventional relational databases, the answer is a set of technologies organized into a structure called a Big Data architecture. Use cases include:
- Storage and processing of data in very large volumes: generally, anything over 100 GB in size
- Aggregation and transformation of large sets of unstructured data for analysis and reporting
- The capture, processing, and analysis of streaming data in real-time or near-real-time
Table of Contents:
- Components of Big Data Architecture
- Benefits of Big Data Architecture
- Big Data Architecture Challenges
Components of Big Data Architecture
Big Data architectures have a number of layers or components. These are the most common:
1. Data sources
Data is sourced from multiple inputs in a variety of formats, both structured and unstructured. Sources include the relational databases behind applications such as ERP or CRM, data warehouses, mobile devices, social media, email, and real-time streaming inputs such as IoT devices. Data can be ingested in batch mode or in real time.
2. Data storage
This is the data receiving layer, which ingests data, stores it, and converts unstructured data into a format analytic tools can work with. Structured data is often stored in a relational database, while unstructured data can be housed in a NoSQL database such as MongoDB Atlas. A specialized distributed system like Hadoop Distributed File System (HDFS) is a good option for high-volume, batch-processed data in various formats.
3. Batch processing
With very large data sets, long-running batch jobs are required to filter, combine, and generally render the data usable for analysis. Source files are typically read and processed, with the output written to new files. Hadoop is a common solution for this.
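The read, filter, and write-to-new-files pattern described above can be sketched on a single machine. This is only an illustration, not how Hadoop itself is programmed: the function name run_batch_job, the CSV layout, and the order_id and amount columns are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def run_batch_job(source_dir: Path, output_file: Path, min_amount: float) -> int:
    """Read every CSV file in source_dir, keep rows whose amount is at
    least min_amount, and write the combined result to output_file.

    Returns the number of rows written. output_file should live outside
    source_dir so the job never re-reads its own output.
    """
    written = 0
    with output_file.open("w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["order_id", "amount"])
        writer.writeheader()
        # Sort the inputs so repeated runs produce identical output.
        for source in sorted(source_dir.glob("*.csv")):
            with source.open(newline="") as src:
                for row in csv.DictReader(src):
                    if float(row["amount"]) >= min_amount:  # filter step
                        writer.writerow(
                            {"order_id": row["order_id"], "amount": row["amount"]}
                        )
                        written += 1
    return written


if __name__ == "__main__":
    # Hypothetical sample data, created in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        source_dir = Path(tmp) / "incoming"
        source_dir.mkdir()
        (source_dir / "part-1.csv").write_text("order_id,amount\n1,5.00\n2,50.00\n")
        (source_dir / "part-2.csv").write_text("order_id,amount\n3,75.50\n")
        output_file = Path(tmp) / "filtered.csv"
        print(run_batch_job(source_dir, output_file, min_amount=10.0))
```

A distributed batch engine applies the same idea, but splits the input files across many machines and merges the partial outputs.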
4. Real-time message ingestion
This component focuses on categorizing the data for a smooth transition into the deeper layers of the environment. An architecture designed for real-time sources needs a mechanism to ingest and store real-time messages for stream processing. Messages can sometimes just be dropped into a folder, but in other cases, a message capture store is necessary for buffering and to enable scale-out processing, reliable delivery, and other queuing requirements.
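To make the buffering role of a message capture store concrete, here is a minimal in-process sketch that uses a bounded queue between a producer and a single consumer. It stands in for a real message broker only conceptually; the names ingest and consume, and the capacity parameter, are assumptions for the example.

```python
import queue
import threading


def consume(buffer: queue.Queue, handled: list) -> None:
    """Pull messages off the buffer until the None sentinel arrives."""
    while True:
        message = buffer.get()
        if message is None:  # sentinel: the producer has finished
            buffer.task_done()
            break
        handled.append(message)  # stand-in for handing off to stream processing
        buffer.task_done()


def ingest(messages, capacity: int = 100) -> list:
    """Push messages through a bounded buffer and return them in arrival order."""
    buffer = queue.Queue(maxsize=capacity)
    handled = []
    worker = threading.Thread(target=consume, args=(buffer, handled))
    worker.start()
    for message in messages:
        # put() blocks while the buffer is full, so a slow consumer
        # slows the producer down instead of losing messages.
        buffer.put(message)
    buffer.put(None)
    buffer.join()   # wait until every queued item has been processed
    worker.join()
    return handled
```

The bounded queue is the design point: it absorbs bursts up to its capacity and applies back-pressure beyond that, which is the buffering behavior the text describes.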
5. Stream processing
Once captured, the real-time messages have to be filtered, aggregated, and otherwise prepared for analysis, after which they are written to an output sink. Options for this phase include Azure Stream Analytics, Apache Storm, and Apache Spark Streaming.
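The filter-and-aggregate step can be illustrated with a fixed-size (tumbling) time window, a common building block in stream processing. The sketch below is plain Python rather than any of the engines named above, and the Reading type, its fields, and the function name are assumptions for the example. Real engines also deal with late or out-of-order events, which this sketch ignores.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Reading(NamedTuple):
    timestamp: float          # seconds since some epoch
    sensor_id: str
    value: Optional[float]    # None marks a malformed reading


def tumbling_window_avg(
    readings: Iterable[Reading], window_seconds: int
) -> Dict[Tuple[int, str], float]:
    """Average each sensor's values per fixed, non-overlapping time window.

    Keys of the result are (window_start, sensor_id).
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for reading in readings:
        if reading.value is None:  # filter step: drop malformed readings
            continue
        window_start = int(reading.timestamp // window_seconds) * window_seconds
        bucket = totals[(window_start, reading.sensor_id)]
        bucket[0] += reading.value
        bucket[1] += 1
    # The returned dictionary plays the role of the output sink.
    return {key: total / count for key, (total, count) in totals.items()}
```

For example, with a 60-second window, readings at 0 s and 30 s from the same sensor fall into the window starting at 0, while a reading at 61 s falls into the window starting at 60.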
6. Analytical data store
The processed data can now be presented in a structured format – such as a relational data warehouse – for querying by analytical tools, as is the case with traditional business intelligence (BI) platforms. Other alternatives for serving the data are low-latency NoSQL technologies or an interactive Hive database.
7. Analysis and reporting
Most Big Data platforms are geared to extracting business insights from the stored data via analysis and reporting. This requires multiple tools. Structured data is relatively easy to handle, while unstructured data calls for more advanced and specialized techniques. Data scientists may undertake interactive data exploration using various notebooks and toolsets. The architecture might also include a data modeling layer, which can enable self-service BI using popular visualization and modeling techniques.
Analytics results are sent to the reporting component, which replicates them to various output systems for human viewers, business processes, and applications. After visualization into reports or dashboards, the analytic results are used for data-driven business decision making.
Big Data analysis typically consists of repeated data processing operations: transforming data, moving it among sources and sinks, and loading the prepared data into an analytical data store. These workflows can be automated with orchestration tools such as Apache Oozie or Azure Data Factory, with Apache Sqoop often handling bulk data transfer between Hadoop and relational databases.
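At its core, an orchestration tool runs tasks in dependency order. The sketch below shows that idea with Python's standard-library graphlib module (Python 3.9 or later). It is not the API of Oozie or Azure Data Factory, and the task names and the run_workflow function are assumptions for the example.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(
    tasks: Dict[str, Callable[[], None]],
    dependencies: Dict[str, Set[str]],
) -> List[str]:
    """Run each task after all the tasks it depends on.

    dependencies maps a task name to the set of tasks that must finish first.
    Returns the order in which tasks were executed.
    """
    executed = []
    # static_order() yields every task only after its prerequisites.
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed


if __name__ == "__main__":
    log: List[str] = []
    # Hypothetical pipeline steps; each one just records that it ran.
    tasks = {
        step: (lambda step=step: log.append(step))
        for step in ("extract", "transform", "load", "report")
    }
    dependencies = {
        "transform": {"extract"},
        "load": {"transform"},
        "report": {"load"},
    }
    print(run_workflow(tasks, dependencies))
```

Production orchestrators add scheduling, retries, and monitoring on top of this ordering logic.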
Benefits of Big Data Architecture
1. Parallel computing for high performance
To process large data sets quickly, Big Data architectures use parallel computing, in which multiprocessor servers perform numerous calculations at the same time. Large problems are broken up into smaller units that can be solved simultaneously.
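The split-and-combine pattern behind parallel computing can be sketched with a worker pool from Python's standard library. The function names and the sum-of-squares workload are assumptions chosen for the example. A thread pool is used here to keep the sketch portable; for CPU-bound work in CPython, a process pool or a cluster engine is what actually spreads the load across cores.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def sum_of_squares(chunk: Sequence[int]) -> int:
    """Solve one small unit of the overall problem."""
    return sum(n * n for n in chunk)


def parallel_sum_of_squares(
    numbers: Sequence[int], chunk_size: int, workers: int = 4
) -> int:
    """Split numbers into chunks, process the chunks concurrently, and combine."""
    chunks: List[Sequence[int]] = [
        numbers[i : i + chunk_size] for i in range(0, len(numbers), chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)  # one task per chunk
        return sum(partial_sums)  # combine the partial results
```

Calling parallel_sum_of_squares(list(range(10)), chunk_size=3) splits the input into four chunks and returns 285, the same result a single sequential pass would give.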
2. Elastic scalability
Big Data architectures can be scaled horizontally, enabling the environment to be adjusted to the size of each workload. Big Data solutions are usually run in the cloud, where you only pay for the storage and computing resources you actually use.
3. Freedom of choice
The marketplace offers many solutions and platforms for use in Big Data architectures, such as Azure managed services, MongoDB Atlas, and Apache technologies. You can combine solutions to get the best fit for your various workloads, existing systems, and IT skill sets.
4. Interoperability with related systems
You can create integrated platforms across different types of workloads, leveraging Big Data architecture components for IoT processing and BI as well as analytics workflows.
Big Data Architecture Challenges
1. Security
Static Big Data is usually stored in a centralized data lake. Robust security is required to ensure your data stays protected from intrusion and theft. But secure access can be difficult to set up, as other applications need to consume the data as well.
2. Complexity
A Big Data architecture typically contains many interlocking moving parts. These include multiple data sources with separate data-ingestion components and numerous cross-component configuration settings to optimize performance. Building, testing, and troubleshooting Big Data processes are challenges that take high levels of knowledge and skill.
3. Evolving technologies
It’s important to choose the right solutions and components to meet the business objectives of your Big Data initiatives. This can be daunting, as many Big Data technologies, practices, and standards are relatively new and still evolving. Established Hadoop ecosystem projects such as Hive and Pig have attained a level of stability, but other technologies and services remain immature and are likely to change over time.
4. Specialized skill sets
Big Data APIs built on mainstream languages are gradually coming into use. Nevertheless, Big Data architectures and solutions still generally rely on highly specialized languages and frameworks that impose a considerable learning curve on developers and data analysts alike.