August 30, 2023 | Updated: October 16, 2023
Let’s consider a fictitious bank that has a credit card offering for its customers. Transactional data might land in their database from various sources such as a REST API call from a web application or from a serverless function call made by a cash machine. Regardless of how the data was written to the database, the database performed its job and made the data available for querying by the end-user or application. The mechanics are database-specific but the end goal of all databases is the same. Once data is in a database the bank can query and obtain business value from this data. In the beginning, their architecture worked well, but over time customer usage grew and the bank found it difficult to manage the volume of transactions. The company decides to do what many customers in this scenario do and adopts an event-streaming platform like Apache Kafka to queue these event data. Kafka provides a highly scalable event streaming platform capable of managing large data volumes without putting debilitating pressure on traditional databases. With this new design, the bank could now scale supporting more customers and product offerings.
Life was great until some customers started complaining about unrecognized transactions occurring on their cards. Customers were refusing to pay for these and the bank was starting to spend lots of resources figuring out how to manage these fraudulent charges. After all, by the time the data gets written into the database, and the data is batch loaded into the systems that can process the data, the user's credit card was already charged perhaps a few times over.
However, hope is not lost. The bank realized that if they could query the transactional event data as it's flowing into the database they might be able to compare it with historical spending data from the user, as well as geolocation information, to make a real-time determination if the transaction was suspicious and warranted further confirmation by the customer.
This ability to continuously query the stream of data is what stream processing is all about.
From a developer's perspective, building applications that work with streaming data is challenging. They need to consider the following:
Different serialization formats: The data that arrives in the stream may contain different serialization formats such as JSON, AVRO, Protobuf or even binary.
Different schemas: Data originating from a variety of sources may contain slightly different schemas. Fields like CustomerID could be customerId from one source or CustID in another and a third could not even use the field.
Late arriving data: The data itself could arrive late due to network latency issues or being completely out of order.
Operational complexity: Developers need to be concerned with reacting to application state changes like failed connections to data sources and how to efficiently scale the application to meet the demands of the business.
Security: In larger enterprises, the developer usually doesn’t have access to production data. This makes troubleshooting and building queries from this data difficult.
Stream processing can help address these challenges and enable real-time use cases, such as fraud detection, hyper-personalization, and predictive maintenance, that are otherwise difficult or extremely costly to overcome. While many stream processing solutions exist, the flexibility of the document model and the power of the aggregation framework are naturally well suited to help developers with the challenges found with complex event data.