Watermarks in stream processing systems: Semantics and comparative analysis of Apache Flink and Google Cloud Dataflow

Tyler Akidau, Edmon Begoli, Slava Chernyak, Fabian Hueske, Kathryn Knight, Kenneth Knowles, Daniel Mills, Dan Sotolongo

Research output: Contribution to journal › Conference article › peer-review

17 Scopus citations

Abstract

Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks, a key tool for reasoning about temporal completeness in infinite streams.

First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches:
• Computing a single correct answer, as in notifications.
• Reasoning about a lack of data, as in dip detection.
• Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models.
• Safely and punctually garbage collecting obsolete inputs and intermediate state.
• Surfacing a reliable signal of overall pipeline health.

Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.

Original language: English
Pages (from-to): 3135-3147
Number of pages: 13
Journal: Proceedings of the VLDB Endowment
Volume: 14
Issue number: 12
DOIs
State: Published - 2021
Event: 47th International Conference on Very Large Data Bases, VLDB 2021 - Virtual, Online
Duration: Aug 16 2021 – Aug 20 2021

Funding

We would like to acknowledge the Google for Education program for supporting parts of this research with their grant of Google Cloud resources to Dr. Begoli. We would also like to thank the Apache Software Foundation (ASF) for their support. This manuscript has been co-authored in part by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy.
