Designing a Production-Ready Kappa Architecture for Timely Data Stream Processing
At Uber, we use robust data processing systems such as Apache Flink and Apache Spark to power the streaming applications that helps us calculate up-to-date pricing, enhance driver dispatching, and fight fraud on our platform. Such solutions can process data at a massive scale in real time with exactly-once semantics, and the emergence of these systems over the past several years has unlocked an industry-wide ability to write streaming data processing applications at low latencies, a functionality previously impossible to achieve at scale. However, since streaming systems are inherently unable to guarantee event order, they must make trade-offs in how they handle late data.
Typically, streaming systems mitigate this out-of-order problem by using event-time windows and watermarking. While efficient, this strategy can cause inaccuracies by dropping any events that arrive after watermarking. To support systems that require both the low latency of a streaming pipeline and the correctness of a batch pipeline, many organizations utilize Lambda architectures, a concept first proposed by Nathan Marz.