Open Lineage Questions


Andrzej Wrobel

Mar 24, 2021, 6:17:50 AM
to OpenLineage
Hi,
I work in the IBM IGC lineage area, and my colleagues and I are reviewing the opportunity to adopt the OpenLineage format in our next solutions. Making it a standard is a very interesting idea, and I have also seen that you have touched on the Egeria side with Mandy, so that's great!
We have come up with some questions to understand where this format is heading. Please let me know whether we can discuss them here in this thread or whether I should join your OpenLineage live meetings.

- Has OpenLineage considered whether these events should contribute to auditing compliance? I.e., would an auditor need this level of detail to reconstruct/track how a customer's data was consumed, updated, and moved through the system, and who or what process did the consuming, updating, and moving? Our assertion is that they should. The job and run elements do not appear to be sufficient to provide that information:
  • Who/What process/component initiated the job/action?
  • What process/component was targeted by the action (this might be the producer)?
- What about the case where you add an asset manually, not through a job? Is that covered by the OpenLineage spec?

- Are there any plans to support general events that can be captured as part of the Open Provenance Model (e.g., asset downloaded)? This might require adding the concept of an "event". We would also need to make fields like transition and run optional.

- What about more detailed cases, like a dataset column connected to another column in the output? Is this something that would be covered?

- Are you thinking about granularity at the component level within a job? For example, input column A connected to component 1 of Job 1, then component 1 connected to component 2 of Job 2, and component 2 connected to output column B?

- Is there any place in the current job facet to put transformation information related to the job that moved the data?

- It looks like the current design requires sending at least two events, with types START and COMPLETE. Have you thought about changing it so that an event provider could send only COMPLETE? I believe that is all lineage tracking would require.

Thanks!
Andrzej

Julien Le Dem

Mar 24, 2021, 1:04:49 PM
to Andrzej Wrobel, OpenLineage
Hello Andrzej,
Thanks for reaching out. You are welcome to join the Slack as well if you haven't done so already: https://github.com/OpenLineage/OpenLineage/blob/main/README.md#community

Please find my answers to your questions inline below.
You are welcome to open proposal tickets on GitHub for some of those points if you wish.
Julien

On Wed, Mar 24, 2021 at 3:17 AM Andrzej Wrobel <andrzej...@gmail.com> wrote:

- Has OpenLineage considered whether these events should contribute to auditing compliance? I.e., would an auditor need this level of detail to reconstruct/track how a customer's data was consumed, updated, and moved through the system, and who or what process did the consuming, updating, and moving? Our assertion is that they should. The job and run elements do not appear to be sufficient to provide that information:
  • Who/What process/component initiated the job/action?
  • What process/component was targeted by the action (this might be the producer)?
Yes, definitely. Lineage events defined in OpenLineage are meant to model and describe what has happened: which job consumed which data and produced which data, and when.
The notion of a facet helps enrich that core metadata with other information like that described above.
I would expect to add something along those lines:
 - a Job or Run facet to describe who/what initiated the job/action
 - a Job facet to describe the job itself.
The producer field is meant to identify the code producing the metadata, so that potential bugs or differences in behavior between metadata producers can be dealt with.
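
As an illustration of the first bullet, here is a minimal sketch of a run event carrying a custom Run facet that records who or what started the job. The facet name `initiator`, its fields, the namespace, and the example.com URLs are all assumptions made up for this sketch; no such facet is defined in the spec discussed here.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical URI identifying the code that emits the metadata.
PRODUCER = "https://example.com/my-lineage-producer"


def run_event_with_initiator(job_name, initiator_type, initiator_name):
    """Build a run event dict with a hypothetical 'initiator' run facet."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "run": {
            "runId": str(uuid.uuid4()),
            "facets": {
                # 'initiator' is NOT a standard facet; it is an assumption
                # used to show where audit information could live.
                "initiator": {
                    "_producer": PRODUCER,
                    "_schemaURL": "https://example.com/schemas/InitiatorRunFacet.json",
                    "type": initiator_type,
                    "name": initiator_name,
                }
            },
        },
        "job": {"namespace": "example-namespace", "name": job_name},
        "inputs": [],
        "outputs": [],
    }


event = run_event_with_initiator("nightly_load", "user", "alice")
payload = json.dumps(event, indent=2)
```

Keeping the audit detail in a facet leaves the core run/job model untouched, which is the extension route suggested above.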


 
- What about the case where you add an asset manually, not through a job? Is that covered by the OpenLineage spec?

I would cover it in the spec, yes. "Job" is meant to be a generic notion: it is something that takes datasets (0-n) as input and outputs to datasets (0-n). My first impression is that we can extend this notion to cover manual modification.
 
- Are there any plans to support general events that can be captured as part of the Open Provenance Model (e.g., asset downloaded)? This might require adding the concept of an "event". We would also need to make fields like transition and run optional.

Yes, the spec is still evolving a bit. Yes, transition is optional and you can send other events.
Although in this case, an "asset downloaded" is still a job with an input (where we download from) and an output (where we download to).
 
- What about more detailed cases, like a dataset column connected to another column in the output? Is this something that would be covered?
Yes, I would think you'd want to add an output facet that covers the column-level lineage for each column in the output.
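
One possible shape for such a facet, sketched in Python. The facet name `columnLineage`, the `fields`/`inputFields` layout, and the dataset names are assumptions for illustration only; the thread does not define this facet.

```python
def column_lineage_facet(mapping):
    """Build a hypothetical column-level lineage facet.

    mapping: {output_column: [(input_dataset_name, input_column), ...]}
    """
    return {
        "columnLineage": {  # hypothetical facet name
            "fields": {
                out_col: {
                    "inputFields": [
                        {"dataset": ds, "field": col} for ds, col in sources
                    ]
                }
                for out_col, sources in mapping.items()
            }
        }
    }


# An output dataset whose columns each list the input columns they derive from.
output_dataset = {
    "namespace": "example-warehouse",
    "name": "public.orders_summary",
    "facets": column_lineage_facet(
        {
            "total": [("public.orders", "amount")],
            "customer": [
                ("public.orders", "customer_id"),
                ("public.customers", "id"),
            ],
        }
    ),
}
```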
 
- Are you thinking about granularity at the component level within a job? For example, input column A connected to component 1 of Job 1, then component 1 connected to component 2 of Job 2, and component 2 connected to output column B?

Could you clarify what you mean by component in this context?
Here's an answer, but I'm not sure I fully get your use case for this one.
Jobs can be nested through the notion of a parent job. For example, in Airflow you would have a job for the DAG and a job for each task (whose parent is the DAG). Similarly, a Spark job has actions. You want to capture the more granular lineage in that case.
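
A minimal sketch of the DAG/task nesting described above, assuming the parent relationship is carried in a `parent` run facet that points at the parent's run ID and job. The namespace and job names are hypothetical, and the facet's bookkeeping fields (such as its producer and schema URL) are left out for brevity.

```python
import uuid

NAMESPACE = "example-airflow"  # hypothetical namespace


def new_run(job_name, parent=None):
    """Create the run/job pair for one run, optionally linked to a parent run."""
    run = {"runId": str(uuid.uuid4()), "facets": {}}
    if parent is not None:
        # Link the child run back to the parent's run and job.
        run["facets"]["parent"] = {
            "run": {"runId": parent["run"]["runId"]},
            "job": dict(parent["job"]),
        }
    return {"run": run, "job": {"namespace": NAMESPACE, "name": job_name}}


dag_run = new_run("daily_etl")                              # job for the DAG
task_run = new_run("daily_etl.load_orders", parent=dag_run)  # job for one task
```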
 
- Is there any place in the current job facet to put transformation information related to the job that moved the data?

You would want to add a JobFacet for that in the spec. There is one already, for example, for the special case of SQL. More can be added.
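
For the SQL case, a job description might look like the sketch below. It assumes the SQL facet is keyed `sql` and carries the statement in a `query` field; the namespace, job name, and query are hypothetical.

```python
def job_with_sql(namespace, name, query):
    """Describe a job and attach the SQL text that performs the transformation."""
    return {
        "namespace": namespace,
        "name": name,
        "facets": {"sql": {"query": query}},
    }


job = job_with_sql(
    "example-warehouse",
    "build_orders_summary",
    "INSERT INTO orders_summary SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id",
)
```

Other kinds of transformation logic would need their own job facet, defined along the same lines.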
 
- It looks like the current design requires sending at least two events, with types START and COMPLETE. Have you thought about changing it so that an event provider could send only COMPLETE? I believe that is all lineage tracking would require.

That makes sense; this came up before. We need to tweak the root schema for this. You are welcome to offer a suggestion.

 


Himanshu Gupta

Mar 31, 2021, 2:40:19 AM
to OpenLineage
Hi Julien,

A few further questions here.

1. I wanted to confirm that we can leave the transition field unspecified. Consider events like "owner updated" for an asset. We would not want to specify a start and end for such an event, since they are not meaningful here. We would like to raise only a single event, after the owner has been updated.

2. Is it fine to have the same asset in both inputs and outputs (in the OpenLineage format)? In the case of an owner_updated event, we would need the same asset in both. The event takes an existing asset as input, so the asset is part of the inputs. The event also updates the asset's metadata, so in that sense it outputs a new state of the asset, which makes the asset part of the outputs as well.

3. We can specify the details of the previous owner and the new owner as facets. We can define relevant facets for the asset in both the input and output sections.

4. We will consider the owner_updating_process as a job here.
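
Put together, points 1-4 might produce an event like the sketch below. This only illustrates the proposal as stated, not an agreed pattern; the facet name `exampleOwner`, the namespaces, and the producer URI are assumptions.

```python
import uuid
from datetime import datetime, timezone


def owner_updated_event(ds_namespace, ds_name, old_owner, new_owner):
    """Single event for an owner change: same asset in inputs and outputs."""

    def dataset(owner):
        return {
            "namespace": ds_namespace,
            "name": ds_name,
            # 'exampleOwner' is a hypothetical custom facet.
            "facets": {"exampleOwner": {"owner": owner}},
        }

    return {
        "eventType": "OTHER",  # one event, no START/COMPLETE pair
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/catalog",  # hypothetical
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example-catalog", "name": "owner_updating_process"},
        "inputs": [dataset(old_owner)],    # state before the update
        "outputs": [dataset(new_owner)],   # state after the update
    }


event = owner_updated_event("example-warehouse", "public.orders", "alice", "bob")
```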

Thanks
Himanshu

Himanshu Gupta

Mar 31, 2021, 5:04:37 AM
to OpenLineage
Hi Julien,

Please have a look at the following description on https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md
-----
Run Event: and event describing an observed state of a job run. It is required to at least send one event for a START transition and a COMPLETE/FAIL/ABORT transition. Aditional events are optional.
-----
This states that at least one event is required for a START transition and one for a COMPLETE/FAIL/ABORT transition. That's why I wanted to confirm: is the transition field optional? From this description, it looks mandatory.

Thanks
Himanshu

Himanshu Gupta

Mar 31, 2021, 5:27:06 AM
to OpenLineage
Hi Julien,

One more question. In the format described in https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md, the inputs and outputs sections can contain any class of asset, not necessarily datasets/tables.
For example, consider that an ML model M1 is retrained to output a second ML model M2. In this case, we would think of model_retraining as a job, and record model M1 in inputs and model M2 in outputs. These semantics are consistent with the OpenLineage format, right?

Thanks
Himanshu 

Julien Le Dem

Mar 31, 2021, 9:38:46 PM
to Himanshu Gupta, OpenLineage

Thank you Himanshu,
those are great questions, and I think the point you're raising (your owner_updated example) deserves a spec for adding Dataset facets outside of a run. That would best be done by opening an issue on the OpenLineage GitHub.

My answers inline below:

On Wed, Mar 31, 2021 at 2:27 AM Himanshu Gupta <him....@gmail.com> wrote:
Hi Julien,

One more question. In the format described in https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md, the inputs and outputs sections can contain any class of asset, not necessarily datasets/tables.
For example, consider that an ML model M1 is retrained to output a second ML model M2. In this case, we would think of model_retraining as a job, and record model M1 in inputs and model M2 in outputs. These semantics are consistent with the OpenLineage format, right?


Yes. Definitely model the output model as an Output Dataset. An interesting facet to add here, as a Dataset facet, would be the model metrics (precision, recall, ...) if available.

Is the training job reading model M1 in this case? I would expect it to read from a training-set dataset.
Input datasets are meant to be the assets read by the job.
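
A minimal sketch of attaching model metrics to the output model as a Dataset facet. The facet name `modelMetrics`, its fields, and the namespace are assumptions for illustration; the range check is an extra safeguard added here, not something the spec asks for.

```python
def model_output_dataset(namespace, name, precision, recall):
    """Describe a trained model as an output dataset with a metrics facet."""
    for label, value in (("precision", precision), ("recall", recall)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must be between 0 and 1, got {value}")
    return {
        "namespace": namespace,
        "name": name,
        # 'modelMetrics' is a hypothetical custom Dataset facet.
        "facets": {"modelMetrics": {"precision": precision, "recall": recall}},
    }


model_m2 = model_output_dataset("example-models", "churn_model_M2", 0.91, 0.84)
```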


Please have a look at the following description on https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md
-----
Run Event: and event describing an observed state of a job run. It is required to at least send one event for a START transition and a COMPLETE/FAIL/ABORT transition. Aditional events are optional.
-----
This states that at least one event is required for a START transition and one for a COMPLETE/FAIL/ABORT transition. That's why I wanted to confirm: is the transition field optional? From this description, it looks mandatory.


It is mandatory, but it accepts an "OTHER" value.
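
In other words, a producer always sets the field but may choose OTHER for events that are not part of a START/COMPLETE lifecycle. A small check along these lines, assuming the field is named `eventType` and that the allowed values are exactly the five mentioned in this thread:

```python
# Transition values named in this thread; treat the exact set as an assumption.
ALLOWED_EVENT_TYPES = frozenset({"START", "COMPLETE", "ABORT", "FAIL", "OTHER"})


def check_event_type(event):
    """Return the event's transition, raising if it is missing or unknown."""
    event_type = event.get("eventType")
    if event_type is None:
        raise ValueError("eventType is mandatory; use OTHER for non-lifecycle events")
    if event_type not in ALLOWED_EVENT_TYPES:
        raise ValueError(f"unknown eventType: {event_type!r}")
    return event_type


checked = check_event_type({"eventType": "OTHER"})
```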


1. I wanted to confirm that we can leave the transition field unspecified. Consider events like "owner updated" for an asset. We would not want to specify a start and end for such an event, since they are not meaningful here. We would like to raise only a single event, after the owner has been updated.



 
2. Is it fine to have the same asset in both inputs and outputs (in the OpenLineage format)? In the case of an owner_updated event, we would need the same asset in both. The event takes an existing asset as input, so the asset is part of the inputs. The event also updates the asset's metadata, so in that sense it outputs a new state of the asset, which makes the asset part of the outputs as well.


A Dataset can be both an input and an output, yes. We don't require the graph to be a DAG.
However, in your case you are not modifying the data itself, which is what an output implies.
This is a very good question and actually requires more discussion. Do you mind opening an issue on the OpenLineage GitHub to discuss the best way to add a facet to a dataset outside of a run?

 
3. We can specify the details of the previous owner and the new owner as facets. We can define relevant facets for the asset in both the input and output sections.


That makes sense. Let's discuss this in the GitHub issue above.

4. We will consider the owner_updating_process as a job here.

That will become a job that reads and writes to all the datasets. I'll start a Google doc to help explore the options we have here.
 

 

Himanshu Gupta

Apr 1, 2021, 12:21:42 AM
to OpenLineage
Hi Julien,
Thanks for the answers. Yes, I will open the relevant GitHub issues.

A couple of further points:

1. So it is allowed to raise only a single OpenLineage event, with transition = OTHER, for events like owner_updated.

I wanted to confirm this, as it is not clear from the discussion above. The documentation says "It is required to at least send one event for a START transition". So this START event is not really needed.

2. We can define our own custom facets depending on the use case. These facets do not need to be defined in the OpenLineage spec first, right? For example, if we want to specify precision and recall for a model, we can define our own custom facet providing this information and add it to the model asset.

Thank you,
Himanshu

Himanshu Gupta

Apr 5, 2021, 2:02:29 PM
to OpenLineage
Hi Julien,

I have opened the GitHub issue on representing metadata update events in the OpenLineage format.

Another question to confirm my understanding: we still need to design and implement our own persistent store. We will need to parse the OpenLineage events and persist the lineage information there. OpenLineage is just a format and, by itself, does not provide a persistent store.

I was wondering if it would be possible to have a 30-minute call sometime to discuss the various questions we have about extensions to the OpenLineage format.

Thanks
Himanshu

Himanshu Gupta

Apr 12, 2021, 6:38:34 AM
to OpenLineage
Hi Julien,
I would like to request your comments here. This will help with the various discussions we are having internally around OpenLineage.
Thanks
Himanshu

Julien Le Dem

Apr 21, 2021, 9:38:18 PM
to Himanshu Gupta, OpenLineage
Hello Himanshu,
Sorry, I missed this email. OpenLineage is just the spec; however, the Marquez project is a lineage and metadata repository that implements it and stores the lineage information.
I'll send you an email privately to schedule some time.
Best,
Julien
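
For readers who want to try storing events in Marquez, the sketch below prepares an HTTP request for a single event. It assumes a Marquez instance running locally on port 5000 and accepting events at `/api/v1/lineage`; the event contents are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

# Assumed address of a locally running Marquez instance.
MARQUEZ_URL = "http://localhost:5000/api/v1/lineage"


def build_lineage_request(event, url=MARQUEZ_URL):
    """Prepare (but do not send) a POST request carrying one lineage event."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_lineage_request(
    {
        "eventType": "COMPLETE",
        "eventTime": "2021-04-21T21:38:18Z",
        "producer": "https://example.com/my-lineage-producer",  # hypothetical
        "run": {"runId": "3f2b8c1e-5d4a-4c6b-9e7f-1a2b3c4d5e6f"},
        "job": {"namespace": "example-namespace", "name": "nightly_load"},
        "inputs": [],
        "outputs": [],
    }
)
# To actually send it: urllib.request.urlopen(request)
```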
