I have a website and I want to take it "offline": get all pages, and
the pages they link to, and so on, either until I reach a predefined
depth or until I have found all links.
To control the loop, I have a configuration (.properties) file
containing some specifics such as the root URL (and credentials) from
which I can get the pages, as well as the max_depth.
In addition to that, I have a separate "initial-links" file that lists
all URLs that serve as starting points (so I can lift several pages in
the same job run).
The looping is controlled by two things:
1) I keep track of the current depth in a root-level variable.
2) There are three additional "scratchpad" files: a "current-links"
file to keep track of the links I need to examine during the current
iteration, an "all-links" file to keep track of the links I have
already tracked down in previous iterations, and a "new-links" file to
store any links I discover during the current iteration.
The loop itself is then implemented as follows:
* before the loop, I set the depth variable to 0, I overwrite the
"current-links" file with the "initial-links" file, and I empty the
"all-links" file.
* for each iteration, I do:
1) check whether the depth variable is less than or equal to the
max_depth configuration value. If not, we're done looping.
2) process all pages pointed to by the links in the "current-links"
file. If I discover any links in those pages, I store them in the
"new-links" file.
3) do a diff between the "all-links" and the "new-links" files to
discard any "new-links" that were already processed. The genuinely new
links are then dumped to the "current-links" file (and recorded in
"all-links", so they are not processed again later).
4) check whether the last iteration yielded any new links (in other
words, check whether "current-links" is non-empty). If it is empty,
we're done looping. If not, I increase the depth variable by 1 and
re-enter at step 1).
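To make the loop structure explicit, here is a rough sketch of the
same algorithm in plain Python. In the real job the three sets live in
the scratchpad files and the fetching happens inside transformations;
fetch_links is a stand-in for that logic, not an actual function name.

```python
def crawl(initial_links, max_depth, fetch_links):
    """Breadth-first link harvest; fetch_links(url) returns the links on a page."""
    current_links = set(initial_links)  # the "current-links" scratchpad
    all_links = set()                   # the "all-links" scratchpad
    depth = 0
    # step 1) stop once max_depth is exceeded; step 4) stop when nothing is new
    while depth <= max_depth and current_links:
        new_links = set()               # the "new-links" scratchpad
        for url in current_links:       # step 2) process the current pages
            new_links.update(fetch_links(url))
        all_links.update(current_links)
        # step 3) diff: discard links already handled in earlier iterations
        current_links = new_links - all_links
        depth += 1
    return all_links
```

The set difference in step 3 is what guarantees termination even when
pages link back to each other in cycles.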
Maybe there are simpler ways to do it, but this is what I have now,
and it works quite well.
I don't know if anyone thinks this pattern is useful, but if so, the
challenge will be to generalize it. I can see how this would already
be very useful for the typical HTML website case with <a>, <img>,
<script> and <link> tags (and friends). But in my use case, the pages
are not really HTML pages but XML, and the links are not really links
but identifying numbers from which my transformation logic can derive
URLs for new XML files.
Anyway - just an example where I rely on loops.
kind regards,
Roland
> --
> You received this message because you are subscribed to the Google Groups
> "kettle-developers" group.
> To post to this group, send email to kettle-d...@googlegroups.com.
> To unsubscribe from this group, send email to
> kettle-develop...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/kettle-developers?hl=en.
>
--
Roland Bouman
blog: http://rpbouman.blogspot.com/
twitter: @rolandbouman
Author of "Pentaho Solutions: Business Intelligence and Data
Warehousing with Pentaho and MySQL",
http://tinyurl.com/lvxa88 (Wiley, ISBN: 978-0-470-48432-6)
Author of "Pentaho Kettle Solutions: Building Open Source ETL
Solutions with Pentaho Data Integration",
http://tinyurl.com/33r7a8m (Wiley, ISBN: 978-0-470-63517-9)
On Fri, May 6, 2011 at 1:44 PM, Matt Casters <mcas...@pentaho.org> wrote:
> When I read the description I was actually expecting a job that looked way
> more complex, so it's not all that bad.
True. I got the entire thing done in a day (including the logic inside
the transformations); I expected it would take me longer.
> From all the use-cases I think I saw a few things that always came back with
> respect to loops:
> 1) Get a bunch of rows (from a table or a file) and copy those rows to
> result
> 2) Loop over the rows with a job. Setting a bunch of variables is the first
> thing you do in the job
Not sure if that was clear from my example, but I designed mine
explicitly to avoid a "Copy rows to result" step. I do everything with
my scratchpad files inside transformations.
I did this on purpose: I wasn't sure how large the result sets could
be. The number of links has a tendency to explode quite rapidly, and I
assumed that keeping all the row processing inside the transformations
would give me better performance and scalability (probably at the
expense of more I/O, but in my case the crawling is the limiting
factor, not reading links from the files).
kind regards,
Roland.
A) Have a looping option in a transformation.
Yes, I know this is normally out of scope for transformations, but the
following use case exists with regard to continuous data loading /
real-time processing:
- Query JMS or other queues with some sort of Input step.
- This will stop when all the available data is processed (or, for
JMS, until a timeout is reached, or it just continues infinitely -
this is the only step I know of where this "infinite continue" is
implemented).
- Now imagine that after a specific amount of time new data arrives
and needs to be processed ASAP.
In that case you may need to restart the transformation, which costs
performance, requires looping logic outside of the transformation, and
leads to a small delay.
We have, for example, the JMS consumer step, which can read
continuously. A problem here is: when you want to stop the
transformation in a controlled way, this is not possible at this time,
since all steps get the signal to stop and rows may silently
disappear. A feature request for this: react to a signal that is sent
only to this step and stop processing in a controlled way. A JIRA case
needs to be created, but in the project where this was found, a loss
of some rows (up to the buffer size) is not critical, really...
From my point of view this request leads to an interesting loop design
proposal for transformations:
1) Have an option for some input steps to simply restart after they
are finished.
2) The restart may be delayed by a specific amount of time.
3) The step needs to listen for a specific signal to stop. This is
different from stopping the transformation.
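The three points above could be sketched roughly like this in Python
(run_restartable and read_batch are illustrative names, not Kettle
APIs; the step-specific stop signal is modeled as a threading.Event):

```python
import threading

def run_restartable(read_batch, idle_seconds, stop_event):
    """Re-run an input routine until a dedicated stop signal arrives.

    read_batch() processes all currently available data and returns;
    stop_event is the step-specific signal, distinct from killing the
    whole transformation, so no in-flight rows are silently dropped.
    """
    batches = 0
    while not stop_event.is_set():
        read_batch()                 # 1) drain whatever is available, then restart
        batches += 1
        # 2) delayed restart: wait, but wake up early if 3) the stop signal fires
        stop_event.wait(idle_seconds)
    return batches
```

The point of wait() rather than sleep() is that a controlled stop
takes effect immediately instead of after the idle period elapses.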
B) This type of looping option is actually possible within jobs with
the Start job entry (repeat functionality), though product management
deprecated this feature a while ago. The recommendation was to restart
and loop via the scheduler or an external process. But for the reasons
given above (overhead, delays, and even avoiding overlapping job
runs), I still think the features of the Start job entry are valid,
especially since the link between the scheduler and monitoring does
not exist yet.
Adding a listener for a specific signal to stop a repeating job in a
controlled way, while keeping the repeat option, would be very nice to
have. If real looping logic within a job were realized, the delay or
restart of a transformation might be acceptable in the above scenario.
For looping we may think of a "for/next" job entry implementation with
some options like:
- maximum number of iterations
- idle time before the next cycle
- some conditions to check whether it should continue or not (I know
the phrase "some conditions" may cover a wide range, e.g. variables to
check, or checking a date/time range)
- a break option to end the "for/next" loop prematurely
- nested "for/next" loops should be allowed, so we may need an ID to
reference them
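The proposed options could behave roughly like this sketch (for_next
and its parameter names are made up here to illustrate the semantics,
not an actual job entry API):

```python
import time

def for_next(body, max_iterations, idle_seconds=0.0, condition=None):
    """Hypothetical "for/next" job entry semantics.

    body(i) runs one iteration and may return False to break early;
    condition() is checked before each cycle (e.g. a variable test or
    a date/time range); idle_seconds is the idle time before the next
    cycle. Returns the number of iterations that actually ran.
    """
    done = 0
    for i in range(max_iterations):      # maximum number of iterations
        if condition is not None and not condition():
            break                        # continue-condition no longer holds
        result = body(i)
        done += 1
        if result is False:
            break                        # explicit break option
        if idle_seconds:
            time.sleep(idle_seconds)     # idle time before the next cycle
    return done
```

Nesting comes for free in code; as a job entry, nested loops would
need the ID mentioned above so that a break can name which loop it
exits.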
Those are my thoughts for now...
Cheers,
Jens
Am 03.05.2011 16:41, schrieb Matt Casters:
> Hi Kettle devs,
>
> It has occurred to me earlier and more recently to others that creating
> loops in jobs is somewhat a cumbersome process.
> So perhaps we can line up the top 5 of most common use-cases and find
> ease-of-use solutions to those?
>
> One use-case is where we loop over a DB result set (query), copy the
> rows to result, set variables and use those for each row in the result set.
> In that specific case I imagine we could wrap the "Table Input" step
> around a job entry, execute that, copy the rows to result, all in one
> job entry.
> Setting the variables is something we could cram into the
> "Transformation" or "Job" job entries without too much of a problem.
> That would mean we could eliminate 2 transformations: one to get the
> result set and one to set the variables inside the loop. All that
> remains are 2 job entries: "Table Input" and "Transformation/Job".
>
> So do me (and our users:-) a favor and let us know your most common
> use-case for loops in a job.
> If there is a pattern we could perhaps come up with a more clever way of
> doing this compared to writing N new job entries for "Table Input",
> "Text File Input" and so on.
>
> Thanks in advance!
>
> Matt
> --
> Matt Casters <mcas...@pentaho.org <mailto:mcas...@pentaho.org>>
> Chief Data Integration, Kettle founder, Author of Pentaho Kettle
> Solutions
> <http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177> (Wiley
> <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
> Pentaho : The Commercial Open Source Alternative for Business Intelligence
>
>
> 2011/5/6 Jens Bleuel <jbl...@pentaho.com <mailto:jbl...@pentaho.com>>
> --
> Matt Casters <mcas...@pentaho.org <mailto:mcas...@pentaho.org>>
> Chief Data Integration, Kettle founder, Author of Pentaho Kettle
> Solutions
> <http://www.amazon.com/Pentaho-Kettle-Solutions-Building-Integration/dp/0470635177> (Wiley
> <http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470635177.html>)
> Fonteinstraat 70, 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
> Pentaho : The Commercial Open Source Alternative for Business Intelligence
>
>