Hi,

I am working in the IBM IGC Lineage area, and my colleagues and I are reviewing the opportunity to adopt the OpenLineage format in our next solutions. Making it a standard is a very interesting idea, and I have also seen that you have touched on the Egeria part with Mandy, so that's great!

We have come up with some questions to understand where this format is heading. Please let me know whether we can discuss them here in the threads or whether I should join your OpenLineage live meetings.

- Has OpenLineage considered whether these events should contribute to auditing compliance? I.e., would an auditor need this level of detail to reconstruct/track how a customer's data was consumed, updated and moved through the system, and who or what process consumed, updated and moved it? My/our assertion is that they should. The job and run elements do not appear to be sufficient to provide that information:
  - Who or what process/component initiated the job/action?
  - What process/component was targeted by the action (this might be the producer)?
- What about the case where an asset is added manually rather than through a job? Is that covered by the OpenLineage spec?
- Are there any plans to support general events that can be captured as part of the Open Provenance Model (e.g., asset downloaded)? This might require adding the concept of an "event". We would also need fields like transition and run to be optional.
- What about more detailed cases, such as a dataset column connected to another column in the output? Is this something that would be covered?
- Are you thinking about granularity at the component level within a job? For example: input column A is connected to component 1 of Job 1, component 1 is connected to component 2 of Job 2, and component 2 is connected to output column B.
- Is there any place in the current job facet to put transformation information related to the job that moved the data?
- With the current design, it looks like at least two events are required, with types START and COMPLETE. Are you considering changing this so that an event provider could send only COMPLETE? I believe that is all that lineage tracking would require.
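To make the last question concrete, here is a minimal sketch of the two-event pattern versus a single COMPLETE event, using plain Python dicts. The field names (`eventType`, `eventTime`, `run.runId`, `job.namespace`, `job.name`) are my reading of the OpenLineage JSON schema, where the thread's "transition" appears as `eventType`; the job name, namespace, producer URL and the helper functions are hypothetical and not part of any OpenLineage client.

```python
import uuid
from datetime import datetime, timezone

# Transitions that end a run, per the spec text quoted in this thread.
TERMINAL = {"COMPLETE", "FAIL", "ABORT"}


def run_event(event_type, run_id, job_name, namespace="example-namespace"):
    """Build a minimal run event as a plain dict (field names are assumptions)."""
    return {
        "eventType": event_type,  # called "transition" in the thread
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [],
        "outputs": [],
        "producer": "https://example.com/hypothetical-producer",
    }


def satisfies_start_and_terminal(events):
    """True if the events contain a START and at least one terminal transition."""
    types = {e["eventType"] for e in events}
    return "START" in types and bool(types & TERMINAL)


run_id = str(uuid.uuid4())  # both events of one run share the same runId
two_events = [
    run_event("START", run_id, "copy_customers"),
    run_event("COMPLETE", run_id, "copy_customers"),
]
complete_only = [run_event("COMPLETE", run_id, "copy_customers")]

print(satisfies_start_and_terminal(two_events))
print(satisfies_start_and_terminal(complete_only))
```

Under the current wording only the first list passes the check; the question is whether the second could be made acceptable too.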
Thanks!
Andrzej
--
You received this message because you are subscribed to the Google Groups "OpenLineage" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openlineage...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openlineage/cc1c5ef9-50e1-44d0-9ea2-cef12ca2baf1n%40googlegroups.com.
Hi Julien,

One more question. As I understand the format described in https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md, the inputs and outputs sections can contain any class of asset, not necessarily datasets/tables. For example, consider an ML model M1 that is retrained to produce a second ML model M2. In this case we would treat model_retraining as a job, record model M1 in inputs, and record model M2 in outputs. These semantics are consistent with the OpenLineage format, right?
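A minimal sketch of that retraining event as a plain Python dict, to show the shape being proposed. The field names follow my reading of the OpenLineage JSON schema; the namespace `example-model-registry`, the timestamp, the run ID and the helper `asset_ref` are placeholders made up for illustration, and whether non-dataset assets are valid in `inputs`/`outputs` is exactly the open question.

```python
import json


def asset_ref(namespace, name):
    """Reference an asset by namespace and name (hypothetical helper)."""
    return {"namespace": namespace, "name": name}


retraining_event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-01-01T00:00:00Z",  # placeholder timestamp
    "run": {"runId": "00000000-0000-0000-0000-000000000000"},  # placeholder
    "job": {"namespace": "example-namespace", "name": "model_retraining"},
    # M1 is consumed by the job, M2 is produced by it.
    "inputs": [asset_ref("example-model-registry", "M1")],
    "outputs": [asset_ref("example-model-registry", "M2")],
}

print(json.dumps(retraining_event, indent=2))
```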
Please have a look at the following description in https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.md:
-----
Run Event: an event describing an observed state of a job run. It is required to at least send one event for a START transition and a COMPLETE/FAIL/ABORT transition. Additional events are optional.
-----
This states that at least one event is required for a START transition and one for a COMPLETE/FAIL/ABORT transition. That's why I wanted to confirm: is the transition field optional? From this description, it looks mandatory.
1. I wanted to confirm that we can leave the transition field unspecified. Consider events like "owner updated" for an asset. The start and end of such an event are not important, so we would like to raise only a single event, after the owner has been updated.
2. Is it fine to have the same asset in both inputs and outputs (in the OpenLineage format)? For the owner_updated event we would need this. The event takes an existing asset as input, so the asset is part of the input assets. The event also updates the asset's metadata, so in that sense it outputs a new state of the asset, which makes the asset part of the output assets as well.
3. We can specify the details of the previous owner and the new owner as facets, defining the relevant facets for the asset in both the input and output sections.
4. We would consider the owner_updating_process to be the job here.
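Points 1 to 4 can be sketched as a single event built from plain Python dicts. Everything here is illustrative: the facet name `hypotheticalOwnership`, the asset and owner names, the placeholder timestamp and run ID, and the function `owner_updated_event` are made up, and the event deliberately omits the transition field as point 1 proposes, which the spec may or may not accept.

```python
def owner_updated_event(asset_namespace, asset_name, previous_owner, new_owner):
    """Build one event with the same asset in inputs and outputs."""

    def asset_with_owner(owner):
        # Same asset identity each time; only the owner facet differs.
        return {
            "namespace": asset_namespace,
            "name": asset_name,
            "facets": {"hypotheticalOwnership": {"owner": owner}},
        }

    return {
        # No transition field: point 1 proposes leaving it unspecified.
        "eventTime": "2021-01-01T00:00:00Z",  # placeholder timestamp
        "run": {"runId": "00000000-0000-0000-0000-000000000000"},  # placeholder
        "job": {"namespace": "example-namespace", "name": "owner_updating_process"},
        "inputs": [asset_with_owner(previous_owner)],  # state before the update
        "outputs": [asset_with_owner(new_owner)],  # state after the update
    }


event = owner_updated_event("example-catalog", "customers_table", "alice", "bob")
print(event["inputs"][0]["facets"], event["outputs"][0]["facets"])
```

Keeping the previous owner on the input side and the new owner on the output side means a consumer can read the change from one event without looking up earlier state.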