Articles | Open Access | DOI: https://doi.org/10.37547/tajet/Volume07Issue06-14

Data Consistency in Distributed Multi-Stage Event Processing Pipelines

Khrystyna Terletska , Senior Software Engineer at Datadog New York, USA

Abstract

The article examines the problem of ensuring end-to-end data consistency in distributed multi-stage event processing pipelines, which are actively used in modern real-time systems. The relevance of the study is determined by the rapid growth of streaming analytics needs and the widespread use of Apache Kafka, making message latency, duplication, and disorder critical factors for industries ranging from fintech to IoT. The goal of this work is to propose a formal model that unifies an extended event representation and a set of invariants that guarantee correct processing even in the presence of component failures. The novelty of the approach lies in the formalization of an event as a tuple ⟨id, tsₛᵣ????, p, v, σ⟩, where id is responsible for deduplication, tsₛᵣ???? records the time of occurrence, p specifies the partition, v is the payload, and σ is the schema version, which enables ordering recovery and supports format evolution. The pipeline is modeled as a directed acyclic graph (DAG) of operators having the properties of determinism, idempotence, and monotonicity. CRDT aggregates are used for convergence in duplication; SLA alerts from watermark mechanisms are used to minimize data loss. The main findings indicate that, under specified conditions, the system can tolerate delays, failures, and redeliveries without compromising consistency. Extended events and formal operators enable state recovery; stream semantics are ensured by four invariants. This research is particularly relevant for professionals designing and operating real-time event-driven systems, stream processing applications, microservices architectures, and high-integrity data integration pipelines.

Keywords

streaming event processing, distributed systems, data consistency, logical clocks, vector clocks, CRDT, schema evolution, checkpoints, Apache Kafka, Exactly-Once Semantics.

References

“Streaming Analytics Market Size, Share, Growth Drivers & Opportunities,” Markets and Markets, Sep. 2024. https://www.marketsandmarkets.com/Market-Reports/streaming-analytics-market-64196229.html (accessed Apr. 30, 2025).

“Apache Kafka,” Apache. https://kafka.apache.org/ (accessed May 01, 2025).

G. Manepalli, “Clocks and Causality - Ordering Events in Distributed Systems,” Exhypothesi, Nov. 16, 2022. https://www.exhypothesi.com/clocks-and-causality/ (accessed May 02, 2025).

“Kafka 4.0 Documentation,” Apache. https://kafka.apache.org/documentation/ (accessed May 03, 2025).

J. Rao, “Configuring Durability, Availability, and Ordering Guarantees,” Confluent. https://developer.confluent.io/courses/architecture/guarantees/ (accessed May 05, 2025).

“Lamport’s logical clock,” Geeks for Geeks, Oct. 02, 2023. https://www.geeksforgeeks.org/lamports-logical-clock/ (accessed May 07, 2025).

“Vector Clocks in Distributed Systems,” Geeks for Geeks, Oct. 14, 2024. https://www.geeksforgeeks.org/vector-clocks-in-distributed-systems/ (accessed May 08, 2025).

T. Yasser, T. Arafa, M. ElHelw, and A. Awad, “Keyed watermarks: A fine-grained watermark generation for Apache Flink,” Future Generation Computer Systems, vol. 169, p. 107796, Aug. 2025, doi: https://doi.org/10.1016/j.future.2025.107796.

“Schema Evolution and Compatibility for Schema Registry on Confluent Platform | Confluent Documentation,” Confluent. https://docs.confluent.io/platform/current/schema-registry/fundamentals/schema-evolution.html (accessed May 10, 2025).

“Schema Registry API Reference,” Confluent. https://docs.confluent.io/platform/current/schema-registry/develop/api.html (accessed May 13, 2025).

Valeriu Crudu, “Exactly-Once Semantics in Kafka - Advanced Topics Explained,” MoldStud. https://moldstud.com/articles/p-exactly-once-semantics-in-kafka-advanced-topics-explained (accessed May 15, 2025).

S. Laddad, C. Power, M. Milano, A. Cheung, and J. M. Hellerstein, “Katara: synthesizing CRDTs with verified lifting,” Proceedings of the ACM on Programming Languages, vol. 6, no. OOPSLA2, pp. 1349–1377, Oct. 2022, doi: https://doi.org/10.1145/3563336.

“SAGA Design Pattern,” Geeks for Geeks, Nov. 08, 2024. https://www.geeksforgeeks.org/saga-design-pattern/ (accessed May 20, 2025).

Article Statistics

Copyright License

Download Citations

How to Cite

Khrystyna Terletska. (2025). Data Consistency in Distributed Multi-Stage Event Processing Pipelines. The American Journal of Engineering and Technology, 7(06), 127–134. https://doi.org/10.37547/tajet/Volume07Issue06-14