Loops in jobs


Matt Casters

May 3, 2011, 10:41:36 AM
to Kettle Developers mailing list
Hi Kettle devs,

It has occurred to me earlier, and more recently to others, that creating loops in jobs is a somewhat cumbersome process.
So perhaps we can line up the top 5 most common use-cases and find ease-of-use solutions for those?

One use-case is where we loop over a DB result set (query), copy the rows to result, set variables, and use those for each row in the result set.
In that specific case I imagine we could wrap the "Table Input" step in a job entry, execute it, and copy the rows to result, all in one job entry.
Setting the variables is something we could cram into the "Transformation" or "Job" job entries without too much of a problem.
That would mean we could eliminate 2 transformations: one to get the result set and one to set the variables inside the loop.  All that remains are 2 job entries: "Table Input" and "Transformation/Job".
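
For reference, the pattern we are trying to simplify looks roughly like the sketch below when driven from the Java API. This is a minimal sketch only: the job file, parameter names and query are hypothetical, and the exact 4.x signatures should be treated as approximate.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;

    public class LoopOverResultSet {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();  // boot the Kettle runtime

            // "process_row.kjb" is a hypothetical job that declares named
            // parameters CUST_ID and CUST_NAME and does the per-row work.
            JobMeta jobMeta = new JobMeta("process_row.kjb", null, null);

            Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:demo");
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery("SELECT id, name FROM customers");
            while (rs.next()) {
                // Per-row values go in as named parameters: this is the part
                // that currently needs a separate "set variables" transformation.
                jobMeta.setParameterValue("CUST_ID", rs.getString("id"));
                jobMeta.setParameterValue("CUST_NAME", rs.getString("name"));

                Job job = new Job(null, jobMeta);  // one execution per row
                job.start();
                job.waitUntilFinished();
                if (job.getResult().getNrErrors() > 0) {
                    throw new RuntimeException("Job failed for id=" + rs.getString("id"));
                }
            }
            rs.close();
            st.close();
            con.close();
        }
    }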

So do me (and our users:-) a favor and let us know your most common use-case for loops in a job.
If there is a pattern we could perhaps come up with a more clever way of doing this compared to writing N new job entries for "Table Input", "Text File Input" and so on.

Thanks in advance!

Matt
--
Matt Casters <mcas...@pentaho.org>
Chief Data Integration, Kettle founder, Author of Pentaho Kettle Solutions (Wiley)
Pentaho : The Commercial Open Source Alternative for Business Intelligence


Roland Bouman

May 6, 2011, 7:16:10 AM
to kettle-d...@googlegroups.com
There is one particular case I use in production, which is best
described as a "crawler". I'm not sure it could be cleanly abstracted
but I'll just throw it in the pot.

I have a website and I want to take it "offline" - get all pages, and
the pages they link to, and so on - either until I reach a predefined
depth, or until I have found all links.

To control the loop, I have a configuration (.properties) file
containing some specifics, such as the root URL (and credentials) where
I can get the pages from, as well as the max_depth.
In addition to that, I have a separate "initial-links" file that lists
all URLs that serve as starting points (so I can pull several pages in
the same job run).

The looping is controlled by two things:
1) I keep track of the current depth in a root-level variable.

2) There are three additional "scratchpad" files: a "current-links"
file to keep track of the links I need to examine during the current
iteration, an "all-links" file to keep track of the links I have
already tracked down in any previous iteration, and a "new-links" file
to store any links I discover during the current iteration.

The loop itself is then implemented as follows:

* Before the loop, I set the depth variable to 0, overwrite the
"current-links" file with the "initial-links" file, and empty the
"all-links" file.
* For each iteration, I do:
1) Check if the depth variable is less than or equal to the max_depth
configuration value. If not, we're done looping.
2) Process all pages pointed to by the links in the "current-links"
file. If I discover any links in those pages, I store them in the
"new-links" file.
3) Do a diff between the "all-links" and the "new-links" files to
discard any "new-links" that were already processed. The genuinely new
links are then dumped to the "current-links" file.
4) Check whether the last iteration yielded any new links (in other
words, check whether "current-links" is non-empty). If it is empty,
we're done looping. If not, I increase the depth variable by 1 and
re-enter at step 1).

Maybe there are simpler ways to do it, but this is what I have now and
it works quite well.
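
Reduced to plain Java, the control flow is roughly the following. This is a sketch of the bookkeeping only (the real implementation lives in job entries and transformations), and fetchAndExtractLinks() is a stub:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;
    import java.util.Collections;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class CrawlLoop {
        public static void main(String[] args) throws IOException {
            int maxDepth = 5;  // would come from the .properties file
            Path current = Paths.get("current-links");  // links to examine this iteration
            Path all = Paths.get("all-links");          // links handled in any iteration
            Path fresh = Paths.get("new-links");        // links discovered this iteration

            // Before the loop: seed "current-links" from "initial-links".
            Files.copy(Paths.get("initial-links"), current, StandardCopyOption.REPLACE_EXISTING);
            Set<String> seen = new LinkedHashSet<String>();  // contents of "all-links"

            for (int depth = 0; depth <= maxDepth; depth++) {
                List<String> todo = Files.readAllLines(current);
                if (todo.isEmpty()) {
                    break;  // the last iteration yielded no new links: done looping
                }
                Set<String> found = new LinkedHashSet<String>();
                for (String url : todo) {
                    seen.add(url);
                    found.addAll(fetchAndExtractLinks(url));  // crawl one page
                }
                found.removeAll(seen);        // the diff against "all-links"
                Files.write(fresh, found);    // what this pass discovered
                Files.write(all, seen);
                Files.write(current, found);  // next iteration's work list
            }
        }

        static Set<String> fetchAndExtractLinks(String url) {
            return Collections.emptySet();  // stub: fetch the page, parse out links
        }
    }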

I don't know if anyone thinks this pattern is useful, but if so, the
challenge will be to generalize it. I mean, I can see how this would
already be very useful for the typical HTML website case with <a>,
<img>, <script> and <link> tags (and friends). But in my use case,
the pages are not really HTML pages but XML, and the links are not
really links but identifying numbers from which my transformation
logic can derive URLs for new XML files.

Anyway - just an example where I rely on loops.

kind regards,

Roland


--
Roland Bouman
blog: http://rpbouman.blogspot.com/
twitter: @rolandbouman

Author of "Pentaho Solutions: Business Intelligence and Data
Warehousing with Pentaho and MySQL",
http://tinyurl.com/lvxa88 (Wiley, ISBN: 978-0-470-48432-6)

Author of "Pentaho Kettle Solutions: Building Open Source ETL
Solutions with Pentaho Data Integration",
http://tinyurl.com/33r7a8m (Wiley, ISBN: 978-0-470-63517-9)

[Attachment: Spoon - main (changed)_2011-05-06_12-41-26.png]

Matt Casters

May 6, 2011, 7:44:23 AM
to kettle-d...@googlegroups.com
Thanks Roland!
When I read the description I was actually expecting a job that looked way more complex, so it's not all that bad.

Also thanks to the other folks who sent in use-cases privately, sometimes with complete jobs attached.  It's highly appreciated.

From all the use-cases I think I saw a few things that always came back with respect to loops:
1) Get a bunch of rows (from a table or a file) and copy those rows to result
2) Loop over the rows with a job. Setting a bunch of variables is the first thing you do in the job

For 1) I'm thinking of something dramatic to help out, in the form of a dynamic xform builder.
For 2) it would be much easier if you could set the variables in the "Transformation" or "Job" job entries themselves.  I think it makes a lot of sense to do it there anyway.

I'll wrap these up in 2 JIRA cases and I'll try to get something going for them as soon as possible so you can all give more feedback.

Best regards,

Matt


Roland Bouman

May 6, 2011, 8:05:05 AM
to kettle-d...@googlegroups.com
Hi Matt, all!

On Fri, May 6, 2011 at 1:44 PM, Matt Casters <mcas...@pentaho.org> wrote:
> When I read the description I was actually expecting a job that looked way
> more complex, so it's not all that bad.

True. I got the entire thing done in a day (including the logic inside
the transformations); I expected it would take me longer.

> From all the use-cases I think I saw a few things that always came back with
> respect to loops:
> 1) Get a bunch of rows (from a table or a file) and copy those rows to
> result
> 2) Loop over the rows with a job. Setting a bunch of variables is the first
> thing you do in the job

Not sure if that was clear from my example, but I designed mine
explicitly to avoid a "copy rows to result" step. I do everything with
my scratchpad files inside transformations.
I did this on purpose - I wasn't sure how large the result sets could
be - the number of links has a tendency to explode quite rapidly, and I
assumed it would give me better performance and scalability to keep
all the row processing inside the transformations (probably at the
expense of more I/O, but in my case the crawling is the limiting bit,
not reading links from the files).

kind regards,

Roland.

Brandon Jackson

May 6, 2011, 10:36:52 AM
to kettle-d...@googlegroups.com
We have two different scenarios where we get 'layered' ETL:

1. We perfect an idea and it becomes a module; then it is wrapped up in a bigger process which loops over it.
2. We gather some inputs, and loop jobs containing transforms over the initial input. (The most mentioned case here.)

We could chalk this up to the development lifecycle, where we should revisit our design and try to make it more sequential than layered.

One area where I struggle is when something has been in production for a while and we try moving to a later version of PDI and encounter problems.  Due to the layering, it takes time and effort to prepare a transformation to be retested as an individual unit again.  I have to add variables and set them, and I must make sure to delete them when reinserting the piece into the production ETL.  I also have to gather up all the inputs, which could be determined in the layers above, in order to diagnose, reproduce and correct the issue.

Example: we have PDI 4.0.1-CE.  It works fantastically for a transform that uses text files and DB lookups to process therapy transactions from a sister company, completing in 1.5 minutes.  I try it in any version after 4.0.1 and it takes so long I've never seen it complete.  The logs don't show errors, because there are no errors.  From Spoon, I cannot dive down into the running job, because an iteration at a lower level would already have finished.  I've enabled logging in the transform and nothing gets generated.  So unless I either do not understand the logs or tear the transform completely apart and build it up again, there is no hope of quickly zeroing in on the issue.

Possible solutions come to mind:

Basically, a way to switch into and out of a testing mode.  This implies more than 'row output' or 'debug'.  I want variable setting, activation of steps that feed dummy data into the transform, and special notes.  When one exits 'testing mode', those dummy-data steps, variables and notes are all disabled and disappear neatly away.

Just an idea.  I have little pride.  Please feel free to point me to specific educational resources if my suggestions reveal a lack of understanding of functionality that already exists in PDI to help me over the hurdle.

PS: Although it has improved greatly, visually and functionally, I still find logging and performance monitoring one of the most confusing aspects of PDI, especially when the ETL layers up and is hard to execute on a transform-by-transform basis.

Thanks for the great product and the commitment to constant and never-ending improvement.

Brandon




Jens Bleuel

May 6, 2011, 3:50:12 PM
to kettle-d...@googlegroups.com
A little bit unusual, but it happened:

A) Have a looping option in a transformation.
Yes, I know this is normally out of scope for transformations, but the
following use case exists with regard to continuous data loading /
real-time processing:
- Query JMS or other queues with some sort of Input step.
- This will stop when all the available data is processed (or, for JMS,
when a timeout is reached; it can also just continue infinitely - this
is the only step I know of where this "infinite continue" is
implemented).
- Now imagine that after a specific amount of time new data arrives and
needs to be processed ASAP.

Today you may need to restart the transformation, which costs
performance and requires looping logic outside of the transformation;
all of this leads to a small delay.

We have, for example, the JMS consumer step, which can read continuously.
A problem here is that when you want to stop the transformation in a
controlled way, this is not possible at the moment, since all steps get
the signal to stop and rows may silently disappear. A feature request
for this: react to a signal that is sent only to this step, and stop
processing in a controlled way. A JIRA case still needs to be created,
but in the project where this was found, a loss of some rows (up to the
buffer size) is not critical, really...

From my point of view this request leads to an interesting loop design
proposal for transformations (see the sketch below):
1) Have an option for some input steps to simply restart after they are
finished.
2) The restart may be delayed by a specific amount of time.
3) The step needs to listen for a specific signal to stop. This is
different from stopping the transformation.
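
Sketched in plain Java, the behavior of such a step would be something like this; all names here are invented for illustration:

    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical semantics of proposal A: an input "step" that restarts
    // itself after finishing, with an optional delay, until it receives a
    // stop signal that is distinct from stopping the whole transformation.
    class RestartingInput implements Runnable {
        private final AtomicBoolean stopSignal = new AtomicBoolean(false);
        private final long restartDelayMillis = 1000L;  // option 2: delayed restart

        public void run() {
            while (!stopSignal.get()) {        // option 3: step-level stop signal
                readUntilSourceExhausted();    // e.g. drain a JMS queue
                try {
                    Thread.sleep(restartDelayMillis);  // wait, then restart (option 1)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        void requestControlledStop() {
            stopSignal.set(true);  // rows already in flight still drain normally
        }

        private void readUntilSourceExhausted() { /* stub: the actual input logic */ }
    }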

B) This type of looping is actually possible within jobs via the Start
job entry (its repeat functionality), although product management
deprecated that feature a while ago. The recommendation was to restart
and loop via the scheduler or an external process. But for the reasons
given above (overhead, delays, and even avoiding overlapping job runs),
I still think the features of the Start job entry are valid, especially
since the link between the scheduler and monitoring does not exist yet.
Adding a listener for a specific stop signal, so a started job can be
stopped in a controlled way while keeping the repeat option, would be
very nice to have.

If real looping logic within a job were realized, the delay of
restarting a transformation might be acceptable in the above scenario.

For looping we may think of a "for/next" job entry implementation with
some options like the following (sketched in code below):
- maximum number of iterations
- idle time before the next cycle
- some conditions to check whether it should continue or not (I know
the phrase "some conditions" can cover a wide range, e.g. variables to
check, or checking a date/time range)
- a break option to end the "for/next" loop prematurely
- nested "for/next" loops should be allowed, so we may need an ID to reference
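
A minimal sketch of those "for/next" semantics in plain Java, with all option names and helper methods invented for illustration:

    // Hypothetical semantics of the proposed "for/next" job entry; the
    // option fields and helper methods are invented for illustration.
    public class ForNextEntry {
        int maxIterations = 100;  // option: maximum number of iterations
        long idleMillis = 5000L;  // option: idle time before the next cycle

        void execute() throws InterruptedException {
            for (int i = 0; i < maxIterations; i++) {
                boolean ok = executeLoopBody();  // run the job entries inside the loop
                if (!ok || breakRequested()) {
                    break;  // option: end the "for/next" loop prematurely
                }
                if (!continueCondition()) {
                    break;  // option: variables or a date/time range say "stop"
                }
                Thread.sleep(idleMillis);
            }
        }

        boolean executeLoopBody() { return true; }     // stub
        boolean breakRequested() { return false; }     // stub
        boolean continueCondition() { return false; }  // stub
    }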

Those are my thoughts for now...

Cheers,
Jens

Matt Casters

May 6, 2011, 4:29:54 PM
to kettle-d...@googlegroups.com
Namaste Jens,

I didn't know that anyone "deprecated" looping under the "Start" job entry.  There must have been a good reason to do so.  However, I hereby officially "un-deprecate" it for 4.2.0, after doing extensive testing to make sure no memory leaks remain.

http://jira.pentaho.com/browse/PDI-5502

The first job entry to use the logic you described is the "HL7 MLLP Input" job entry.  That one gets a single record from an HL7 queue and passes the message on to the other job entries.  It's very fast too ;-)

The "re-start without stop" logic of steps will actually be possible with the "Single Threaded Transformation" execution engine.  My plan was to drop that one behind a mapping-like step but it will become a separate step.
The way that it will work is that you have say 1000 rows entering the new engine.  That will process the rows in batches of 1-N.
Certain steps will be restarted every time, like file reading steps or steps that read a single element off a queue somewhere (your JMS sample).  Most steps will simply keep running.  In a single call of the SingleThreadedTransExecutor.oneIteration(), all N rows in the batch are pushed through the transformation.

Up until now I hadn't considered this to be a loop, but you are right, this would work for your situation.
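
Roughly, using the new engine looks like this.  This is only a sketch: the transformation file is hypothetical, and the init()/dispose() calls are how I expect the API to land, so treat them as approximate:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.SingleThreadedTransExecutor;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class SingleThreadedLoop {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();
            TransMeta transMeta = new TransMeta("consume_queue.ktr");  // hypothetical file
            Trans trans = new Trans(transMeta);
            trans.prepareExecution(null);  // allocate the steps, but start no step threads

            SingleThreadedTransExecutor executor = new SingleThreadedTransExecutor(trans);
            executor.init();
            while (moreBatchesExpected()) {  // e.g. poll a queue, sleep between cycles
                executor.oneIteration();     // push one batch of rows through all steps
            }
            executor.dispose();
        }

        static boolean moreBatchesExpected() { return false; }  // stub loop condition
    }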

In the meantime I created http://jira.pentaho.com/browse/PDI-6157 and implemented the work.  Actually, Sven Boden did it years ago when he implemented parameter support in the Job and Transformation entries.

Cheers,
Matt



--
Matt Casters <mcas...@pentaho.org>
Chief Data Integration, Kettle founder, Author of Pentaho Kettle Solutions (Wiley)

Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37

Jens Bleuel

May 7, 2011, 9:51:55 AM
to kettle-d...@googlegroups.com
Wow, amazingly good news!

Matt Casters

May 7, 2011, 12:30:28 PM
to kettle-d...@googlegroups.com
I finished the single-threading engine last night with the encouragement of the gang on ##pentaho (thanks for that):

http://www.ibridge.be/?p=200

Take care,
Matt


--
Matt Casters <mcas...@pentaho.org>
Chief Data Integration, Kettle founder, Author of Pentaho Kettle Solutions (Wiley)

Matt Casters

May 7, 2011, 12:38:43 PM
to kettle-d...@googlegroups.com
To come back to Jens' question about loops inside a transformation...

A step can choose to implement the "batchComplete()" method.
For example, the Sort rows step implements it.
If you send rows into the "single threader" step in batches of 500, they will be sorted in blocks of 500.

    /**
     * Calling this method will alert the step that we finished passing a batch of records to the step.
     * Specifically for steps like "Sort Rows" it means that the buffered rows can be sorted and passed on.
     * @throws KettleException In case an error occurs during the processing of the batch of rows.
     */
    public void batchComplete() throws KettleException;

The step needs this signal to know that no more rows are coming in the batch; otherwise it would block, since in this example only 500 rows will ever arrive at a time and the transformation is never considered finished until the parent transformation is finished.
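
As a plain-Java analogy of that contract (not the actual Sort Rows code), a buffering step can emit nothing until it is told the batch is complete:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Plain-Java analogy of the batchComplete() contract; not the actual
    // Sort Rows implementation, just the shape of the buffering behavior.
    class SortingBuffer {
        private final List<String> buffer = new ArrayList<String>();

        void processRow(String row) {  // analogous to processRow(): only buffer
            buffer.add(row);           // nothing can be emitted until the batch ends
        }

        List<String> batchComplete() {  // analogous to batchComplete()
            Collections.sort(buffer);   // the buffered rows can now be sorted...
            List<String> out = new ArrayList<String>(buffer);
            buffer.clear();             // ...passed on, and the buffer reset
            return out;                 // ready for the next batch of rows
        }
    }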

Anyway, have fun with it!

Matt


