Developer documentation

Cesar Martinez

Jun 10, 2009, 9:47:29 AM

to kettle-developers

Hello,

I'm evaluating the possibility of writing some custom transformations or
jobs for Kettle.

I've been able to download the source code from SVN and I've
successfully compiled it.

Now, I'd like to get some developer documentation, but I found very
little information ([1], [2] and [3]). It would be very helpful if you
could provide some links on the following topics:

- General Kettle architecture
- How to write your own jobs and transformations (relevant interfaces
and methods), and how data is exchanged between them at the API
level.

If such documentation does not exist, it would be enough to know the
relevant interfaces and the name of some simple classes implementing
them.

Thanks in advance,

César Martínez Izquierdo

[1] http://wiki.pentaho.com/display/EAI/Writing+your+own+Pentaho+Data+Integration+Plug-In
[2] http://wiki.pentaho.com/display/EAI/PDI+Developer+information
[3] http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+-+Java+API+Examples#PentahoDataIntegration-JavaAPIExamples-ProgramyourownKettletransformation

Vijay A

Jun 15, 2009, 11:05:47 PM

to kettle-d...@googlegroups.com

Hi Cesar,

Looking at the code and debugging the existing transformations is a good way to gain an understanding of the control flow.

You can start by exploring a few transformations in <Kettle Workspace>\src\org\pentaho\di\trans\steps

You need to implement four classes, as follows.

Taking the JoinRows (Cartesian product) step as an example:

public class JoinRows extends BaseStep implements StepInterface

public class JoinRowsData extends BaseStepData implements StepDataInterface

public class JoinRowsMeta extends BaseStepMeta implements StepMetaInterface

public class JoinRowsDialog extends BaseStepDialog implements StepDialogInterface
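
As a rough illustration of why the work is split across several classes, here is a minimal self-contained sketch. It does not use the Kettle API: every class and method name in it (StepSplitSketch, CartesianMeta, CartesianData, CartesianStep, cartesian) is hypothetical. It only mirrors the usual division of labour, with design-time settings in a Meta class, runtime state in a Data class, and the row logic in the step class. The Dialog class is the settings UI and is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only: these are NOT the Kettle classes. */
final class StepSplitSketch {

    /** Plays the role of a *Meta class: settings chosen at design time. */
    static final class CartesianMeta {
        final String separator;

        CartesianMeta(String separator) {
            this.separator = separator;
        }
    }

    /** Plays the role of a *Data class: state that exists only while running. */
    static final class CartesianData {
        final List<String> cachedRightRows = new ArrayList<>();
        int rowsWritten = 0;
    }

    /** Plays the role of the step class: the row-processing logic. */
    static final class CartesianStep {
        private final CartesianMeta meta;
        private final CartesianData data;

        CartesianStep(CartesianMeta meta, CartesianData data) {
            this.meta = meta;
            this.data = data;
        }

        /** Caches the right-hand rows, then pairs every left row with each of them. */
        List<String> run(List<String> left, List<String> right) {
            data.cachedRightRows.addAll(right);
            List<String> out = new ArrayList<>();
            for (String l : left) {
                for (String r : data.cachedRightRows) {
                    out.add(l + meta.separator + r);
                    data.rowsWritten++;
                }
            }
            return out;
        }
    }

    /** Convenience wrapper that wires the three pieces together. */
    static List<String> cartesian(List<String> left, List<String> right, String separator) {
        CartesianData data = new CartesianData();
        return new CartesianStep(new CartesianMeta(separator), data).run(left, right);
    }

    public static void main(String[] args) {
        // prints [a-1, a-2, b-1, b-2]
        System.out.println(cartesian(List.of("a", "b"), List.of("1", "2"), "-"));
    }
}
```

The real classes carry much more (serialization of the settings, row metadata, error handling), so the JoinRows sources remain the reference.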

You can start a wiki page on this, and community members will add to and refine it.

Thanks,

Vijay

César Martínez Izquierdo

Jun 17, 2009, 10:12:44 AM

to kettle-d...@googlegroups.com

Thanks for the info Vijay.

I have been reading the existing development documentation, and I also
had a look at the implementation of some Steps.

I wrote down some conclusions, and I also have some concrete questions,
which I paste below. I'd be happy if someone could confirm (or
correct) my conclusions and answer my questions:

"Kettle offers two different ways of processing information:
Transformations and Jobs.
Transformations consist of one or more Transformation Steps and
only apply to tabular data, which can be processed row by row. In
Transformations, several Steps can run at the same time
(concurrent execution), because each row is processed independently of
the others, and the result for each row flows from one step to the
next.
Therefore, if we examine the state of a transformation at a given
moment, the first step may be processing the 3rd row, the second
step may be processing the 2nd row, the third step may be processing
the 1st row, and the fourth step may be waiting for a result to arrive
from the third step.
As each row is processed atomically, rows may be processed on
different computers, which is known as clustered execution.
By contrast, Jobs consist of one or more Job Entries, which can
perform any kind of task (they can apply to non-tabular data). Job
Entries are executed in sequential order, so a Job Entry is only
executed when the previous one has completely finished. Therefore, Jobs
don’t allow clustered execution."
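
The row-by-row pipelining described above can be sketched in plain Java. This is only a model under stated assumptions, not Kettle code: the names PipelineSketch, run, demo and END are hypothetical, each step gets its own thread, and small bounded queues stand in for the row buffers between steps. With a buffer of two rows, an upstream step can be a couple of rows ahead of the step after it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.UnaryOperator;

/** Illustrative model only: one thread per step, bounded queues in between. */
final class PipelineSketch {

    // End-of-stream marker, compared by identity so no data row can equal it.
    private static final String END = new String("<end-of-rows>");

    static List<String> run(List<String> input, List<UnaryOperator<String>> steps) {
        try {
            BlockingQueue<String> first = new ArrayBlockingQueue<>(2);
            BlockingQueue<String> current = first;
            for (UnaryOperator<String> step : steps) {
                BlockingQueue<String> source = current;
                BlockingQueue<String> target = new ArrayBlockingQueue<>(2);
                new Thread(() -> {
                    try {
                        // Take a row, transform it, hand it to the next step.
                        for (String row = source.take(); row != END; row = source.take()) {
                            target.put(step.apply(row));
                        }
                        target.put(END); // propagate end-of-stream downstream
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }).start();
                current = target;
            }
            BlockingQueue<String> last = current;

            // Feed rows from a separate thread so the caller can drain the output.
            new Thread(() -> {
                try {
                    for (String row : input) {
                        first.put(row);
                    }
                    first.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();

            List<String> out = new ArrayList<>();
            for (String row = last.take(); row != END; row = last.take()) {
                out.add(row);
            }
            return out;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Two hypothetical steps: upper-case the row, then append a marker. */
    static List<String> demo() {
        List<UnaryOperator<String>> steps = List.of(s -> s.toUpperCase(), s -> s + "!");
        return run(List.of("a", "b", "c"), steps);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [A!, B!, C!]
    }
}
```

Row order is preserved here because every stage is a single thread reading from a FIFO queue. The sketch says nothing about how clustered execution distributes rows.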

Now, my questions:
- If I need to transform some data which is not made up of rows (for
example, an image)... is the Transformation Step architecture flexible
enough to process this kind of data, or would I need to program my
transformation as Job Entries?
- I have seen that in Transformation Steps, rows flow from one step to
the following one. How does it work for Job Entries? Is there a
similar data flow between Job Entries?

Thank you very much,

César Martínez Izquierdo


Jens Bleuel

Jun 17, 2009, 5:45:04 PM

to kettle-d...@googlegroups.com

Hi César,

Nice summary. You could also refer to software pipelines:
http://en.wikipedia.org/wiki/Pipeline_(software)

A small addendum:
Job Entries can be run in parallel (see
http://wiki.pentaho.com/display/EAI/Launching+job+entries+in+parallel )
and you can define a different remote server for each job entry.

> - If I need to transform some data which is not made up of rows (for
> example, an image)... is the Transformation Step architecture flexible
> enough to process this kind of data, or would I need to program my
> transformation as Job Entries?

An image is a BLOB field: it is carried in a field of the Binary data
type and transported transparently, so steps that do not use it simply pass it along.
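
A small sketch of what "transparent" transport means in practice. It assumes a row is modelled as an Object[] and a binary field as a byte[]. The names BinaryFieldSketch, renameStep and demo are hypothetical, and nothing here calls the Kettle API.

```java
import java.util.Arrays;

/** Illustrative model only: a binary field rides along in the row untouched. */
final class BinaryFieldSketch {

    /** A hypothetical step: upper-cases field 0 and forwards every other field as is. */
    static Object[] renameStep(Object[] row) {
        Object[] out = Arrays.copyOf(row, row.length);
        out[0] = ((String) row[0]).toUpperCase();
        return out;
    }

    /** Returns true when the name changed and the image bytes were passed through. */
    static boolean demo() {
        byte[] image = {(byte) 0x89, 'P', 'N', 'G'}; // first bytes of a PNG header
        Object[] row = {"logo.png", image};
        Object[] result = renameStep(row);
        return result[0].equals("LOGO.PNG") && result[1] == image;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints true
    }
}
```

A step that actually processes the image would read the byte[] from its field, transform it, and write a new byte[] into the output row.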

> - I have seen that in Transformation Steps, rows flow from one step to
> the following one. How does it work for Job Entries? Is there a
> similar data flow between Job Entries?

Yes, via the result set (see the steps Copy rows to result and Get rows from
result). To transfer larger data sets from one job to another, use the
(De)Serialize to/from file steps.
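
The idea of a result handed from one job entry to the next can be modelled as follows. This is a simplified sketch with hypothetical names (JobResultSketch, Result, runJob, demo), not the Kettle classes. It assumes entries run strictly one after another and that the job stops at the first failure, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Illustrative model only: each entry receives the previous entry's result. */
final class JobResultSketch {

    /** Hypothetical result object: a success flag plus the rows handed over. */
    static final class Result {
        final boolean success;
        final List<String> rows;

        Result(boolean success, List<String> rows) {
            this.success = success;
            this.rows = rows;
        }
    }

    /** Runs entries in order; each one starts only after the previous one returned. */
    static Result runJob(List<UnaryOperator<Result>> entries) {
        Result previous = new Result(true, List.of());
        for (UnaryOperator<Result> entry : entries) {
            previous = entry.apply(previous);
            if (!previous.success) {
                break; // simplification: stop at the first failing entry
            }
        }
        return previous;
    }

    /** One entry produces rows, the next entry consumes them. */
    static Result demo() {
        UnaryOperator<Result> produce = prev -> new Result(true, List.of("a.csv", "b.csv"));
        UnaryOperator<Result> consume = prev -> {
            List<String> done = new ArrayList<>();
            for (String row : prev.rows) {
                done.add("loaded " + row);
            }
            return new Result(true, done);
        };
        return runJob(List.of(produce, consume));
    }

    public static void main(String[] args) {
        System.out.println(demo().rows); // prints [loaded a.csv, loaded b.csv]
    }
}
```

Because the whole row list is held in memory between entries, this pattern suits small hand-offs. That is consistent with the advice above to serialize larger data sets to a file instead.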

Cheers,
Jens
