Streaming and OpenLineage

Ross Black

May 3, 2021, 2:01:54 AM
to OpenLineage


I am only just starting to look at various metadata tools and standards for data governance, lineage, etc.  (so please forgive my lack of understanding).

A significant amount of our data and processing is currently in streaming platforms (Kafka streaming, Apache Flink, Spark, etc).

Since the Core Model of OpenLineage includes "Run" and "Job", which are more batch-related concepts, I am confused as to how OpenLineage might be applied to streaming data systems.
Is OpenLineage only meant for batch data systems?
Are there ideas/plans on how it would apply to streaming?


Julien Le Dem

May 4, 2021, 11:32:04 PM
to Ross Black, OpenLineage
Yes, it is planned to cover streaming as well.
Even if we think of a streaming job as running continuously it still has a lifecycle.
A streaming job will still have runs as it gets stopped, upgraded and started again.
The job version will track whether the code has been updated.
A dataset could be a Kafka topic. In that sense, you might want to capture metadata such as the offsets at which you started or stopped the job.
Facets are meant to allow capturing metadata specific to certain types of jobs or datasets.

The difference is that a batch job usually consumes and produces a predefined amount of data, whereas a streaming job does not.
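
To make the idea above concrete, here is a minimal sketch of what the events for one run of a streaming job might look like, with a Kafka topic as the input dataset and the offsets carried in a facet. It builds plain dictionaries rather than using a client library. The facet name `kafkaOffsets`, its fields, the producer and schema URLs, the namespaces, and the job and topic names are all hypothetical and chosen only for illustration; the events are also simplified and are not claimed to be complete, spec-valid OpenLineage events.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical identifier for whatever integration emits these events.
PRODUCER = "https://example.com/my-streaming-integration"


def kafka_offsets_facet(offsets):
    """Build a hypothetical custom facet holding per-partition offsets."""
    return {
        "_producer": PRODUCER,
        # Hypothetical schema location for this made-up facet.
        "_schemaURL": "https://example.com/schemas/KafkaOffsetsFacet.json",
        "offsets": [
            {"partition": p, "offset": o} for p, o in sorted(offsets.items())
        ],
    }


def run_event(event_type, run_id, job_name, topic, offsets):
    """Build a simplified run event for a streaming job reading one topic."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "run": {"runId": run_id},
        "job": {"namespace": "streaming", "name": job_name},
        "inputs": [
            {
                # Hypothetical naming: broker address as namespace, topic as name.
                "namespace": "kafka://broker.example.com:9092",
                "name": topic,
                "facets": {"kafkaOffsets": kafka_offsets_facet(offsets)},
            }
        ],
        "outputs": [],
    }


# One run: the job starts at some offsets and is later stopped at others.
run_id = str(uuid.uuid4())
start = run_event("START", run_id, "orders_enrichment", "orders", {0: 1200, 1: 980})
stop = run_event("COMPLETE", run_id, "orders_enrichment", "orders", {0: 5400, 1: 5125})

print(json.dumps(stop, indent=2))
```

Restarting or upgrading the job would produce a new `runId` for the same job name, so each stop/start cycle shows up as a separate run with its own start and stop offsets.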


Yaniv Ben-Hamo

Jun 19, 2022, 3:08:00 PM
to OpenLineage
Love OpenLineage.
Same question here. At the moment we have built our own mechanism on top of our message broker, but we would love to integrate OpenLineage in the future instead.
