Data transformation template - How it works


D Anil Reddy

Mar 23, 2017, 7:50:59 AM
to Kylo Community
Hi,

I have a query regarding feeds created with the Data Transformation template. Will the feed be able to store the state of the data that has already been transformed?

I mean, I have a feed created to join two tables and apply transformations on top of the join to produce some calculated fields. If the feed has run once and computed the transformations, will it, on the second run, remember the data processed in the previous run and apply the transformations only to the delta since then?

Or, whenever the data transformation runs, will it process all the data in the tables again?

I could not see any state information stored in the NiFi processors of the transformation feed. Please let me know how it works.

Thanks,
Anil.

Greg Hart

Mar 23, 2017, 12:59:24 PM
to Kylo Community
Hi Anil,

The Data Transformation template currently re-processes all the data for every run. There are certain functions such as aggregation or window functions that do not support deltas. If this is a feature you're interested in, please open a JIRA and we can consider how to support this.

Thanks!

Satish Abburi

Jul 25, 2017, 2:32:01 PM
to Kylo Community

Hi Greg, the use case we are referring to here is:


With every new feed/load, we trigger the transformation feed for the new data. Are you saying the transformation is applied to all the data in the table, including older loads, every time we run it?


Thanks,

Satish

Greg Hart

Jul 25, 2017, 2:36:58 PM
to Kylo Community
Hi Satish,

The Data Transformation template that comes with Kylo reads the entire contents of the source Hive or JDBC tables on every run. If you're appending data to the source tables then you're correct that the older data will be reprocessed every time.

Satish Abburi

Jul 25, 2017, 2:42:43 PM
to Kylo Community
Hmm... are there any best practices here? I see this as a common use case: data gets loaded every day, and the new data goes through the transformation process.
We don't plan to create a new table with every load.

Greg Hart

Jul 25, 2017, 3:18:32 PM
to Kylo Community
Hi Satish,

One possibility is to modify the template to take a parameter that contains the partition value of the source table to process. There's already a JIRA open to integrate this into the sample template:

Basically, in NiFi you'd get the partition value to be processed (if the source table comes from a Data Ingest feed, this would be the processing_dttm value, for example), then do a search-and-replace on the transform script wherever you want that value to appear. Say you're using <<PROCESSING_DTTM>> as the placeholder in your transform script; you'd then replace <<PROCESSING_DTTM>> with the actual partition value. This lets you parameterize the transform script, and if you use the filter() function in the script, you can read in a different partition on every run.
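The substitution step Greg describes can be sketched in plain Python. The <<PROCESSING_DTTM>> placeholder and the idea of filtering on the partition column come from the thread; the function name and the sample script text are illustrative, not part of Kylo itself.

```python
# Minimal sketch of parameterizing a transform script, assuming a
# <<PROCESSING_DTTM>> placeholder as discussed in the thread.
# The script text below is a hypothetical Spark-style snippet.

TRANSFORM_SCRIPT_TEMPLATE = (
    'df = sqlContext.table("category.feed_table")\n'
    'df = df.filter(df.processing_dttm == "<<PROCESSING_DTTM>>")\n'
)

def parameterize_script(template: str, partition_value: str) -> str:
    """Replace the placeholder with the actual partition value
    obtained in NiFi (e.g. the processing_dttm of the current load)."""
    return template.replace("<<PROCESSING_DTTM>>", partition_value)

# Each run substitutes a different partition value, so the filter()
# call in the script reads only that run's delta.
script = parameterize_script(TRANSFORM_SCRIPT_TEMPLATE, "1500998400000")
print(script)
```

In NiFi this search-and-replace would typically be done with a ReplaceText processor (or an equivalent step) before the script is submitted for execution.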

If you get this working then this is something we would accept as a pull request:

Satish Abburi

Jul 25, 2017, 3:25:07 PM
to Kylo Community

Sure, we can take this up. Thanks.

Kanika Batra

Oct 14, 2018, 5:14:17 AM
to Kylo Community
Hi Satish,

Were you able to get this working?

D Anil Reddy

Oct 16, 2018, 10:24:36 AM
to Kylo Community
Hi Kanika,

I am replying on behalf of Satish.

We had it working with a few tweaks:
1) We modified the transformation template to include a placeholder for processing_dttm in the transformation query that reads from Hive.
2) We modified the JMS notification messages to include the processing_dttm field, which is then used to get the delta from Hive.
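The two tweaks above can be sketched together: the notification carries the partition value, and the consumer turns it into a delta filter for the Hive query. The field names and function names here are hypothetical; the thread does not specify Kylo's actual message schema.

```python
import json

# Hypothetical sketch: a JMS-style notification payload that carries
# processing_dttm (tweak 2), and a consumer that builds the Hive delta
# predicate from it (tweak 1). Field names are illustrative only.

def build_notification(feed_name: str, processing_dttm: str) -> str:
    """Serialize a notification message including the partition value."""
    return json.dumps({"feed": feed_name, "processing_dttm": processing_dttm})

def delta_predicate(message: str) -> str:
    """Turn a received notification into a WHERE clause that selects
    only the new partition in Hive."""
    payload = json.loads(message)
    return "processing_dttm = '{}'".format(payload["processing_dttm"])

msg = build_notification("category.feed_table", "1500998400000")
print(delta_predicate(msg))  # processing_dttm = '1500998400000'
```

The predicate would then be substituted into the transformation query's placeholder, so each run reads only the delta since the previous load.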

Hope it helps.

Thanks,
Anil.