Sort in PDI

96 views
Skip to first unread message

Dan

unread,
Jun 9, 2014, 4:56:49 AM6/9/14
to pentaho-...@googlegroups.com
Hi,

Quick question; How often do you build a transformation where you work with pre-sorted data?

Obviously in PDI if you know the sort that's extremely handy and can potentially save time.  However it can be a double edged sword if you have to make sure you maintain that sort - i.e. don't break it. Do others come across that issue?

Dan

Matt Casters

unread,
Jun 9, 2014, 6:22:28 AM6/9/14
to pentaho-...@googlegroups.com
You mean like a "verify sort order" option?

--
Matt Casters <mcas...@pentaho.org>
Chief Data Integration, Kettle founder, Author of Pentaho Kettle Solutions (Wiley)
Fonteinstraat 70 - 9400 OKEGEM - Belgium - Cell : +32 486 97 29 37
Pentaho  -  Powerful Analytics Made Easy


--
You received this message because you are subscribed to the Google Groups "Pentaho Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pentaho-commun...@googlegroups.com.
To post to this group, send email to pentaho-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pentaho-community/CAGujiv22bki6Hh8nuuCn9k9dPhFRKNtiDHVKQ7LxTRPKfpDJhA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Dan

unread,
Jun 9, 2014, 7:00:22 AM6/9/14
to pentaho-...@googlegroups.com
No; I'm wondering how common it is that people work like this. Has anyone else seen such a situation?


Pedro Alves

unread,
Jun 9, 2014, 7:07:34 AM6/9/14
to pentaho-...@googlegroups.com

For me, it always depended on the scenario. Either I know the input source comes sorted (eg: a reducer) or doesn't, and I have to do it myself.

I do see matt's suggestion come in hand!


-pedro

Nelson Sousa

unread,
Jun 9, 2014, 7:21:58 AM6/9/14
to pentaho-...@googlegroups.com

Hi all,

It depends a lot, really. I've encountered different situations in which different approaches were required.

The typical ETL I build usually splits entirely the E from the T and the L. The first set of transformations reads from the sources, wherever they are, and loads them into staging tables. The incremental ETL is usually controlled here, either using db timestamps or, if it's filebased, using the natural alphabetical order of files.

Whenever possible, I prefer to work with pre-sorted data. With files it's pretty much the order in which data arrives, with databases I order by the timestamp or unique incremental id or any other feature of the source data that allows me to distinguish between past and future data.

If you're reading from webservices, data may be sorted in ascending order across chunks of data, but sorted in descending order within each chunk. (e.g, you read rows from most recent to most ancient because that's how the data arrives, then on the next run you read the next chunk of data, getting an "order" like 5,4,3,2,1,10,9,8,7,6, etc.).


I try to avoid the requirement that data be sorted. It's useful for performance reasons, making the indexing of the staging database faster, but I think an ETL should not require data to be sorted and should not assume data is sorted.

When loading from source to staging, I always add a batch ID and the timestamp of the ETL run. From the moment I load from staging to fact I don't care anymore about the order in which data in each batch is read, all I care about, if at all, is the order of the batches. Fact table data should not require sorting from a purely algorithmic point of view, and sorting is added solely to aid performance.


If all else fails, then I sort in PDI. Sorting breaks the streamlined data flow, forcing steps after the sort to be executed separately from the steps before the sort, so it will always have a performance impact. If you have a 200 step transformation and the sort is required more or less half way through, I'd rather have transformation 1 load, process everything up to the sort, then sort and materialize to a file, then transformation 2 reads the sorted data from the materialized file and processing from that point on.


Bottomline:

- Do you really need to sort? If No, then don't worry about it.
- If yes, can you sort? If yes, sort as early as possible, before injecting the data into PDI if possible (or feasible)
- If not, then sort it in PDI, trying to limit the amount of data on each run especially when running on low disk space servers.
- But, in this scenario, try to split the transformation in two, one before sorting and one after sorting.




Nelson Sousa
Business Intelligence Consultant



Diethard Steiner

unread,
Jun 10, 2014, 6:06:08 AM6/10/14
to pentaho-...@googlegroups.com
Hi,

To add one more scenario which hasn't been mentioned yet:
If you have to combine datasets from different DBs within a single transformation and you require both datasets to be sorted, in some configurations the way these DBs sort the data might not be exactly the same. In this scenario, using the sort step in Kettle ensures that both datasets are sorted exactly the same way. If I remember correctly, I had this scenario once when combining data from DB2 and PostgreSQL, but again, this really depends as well on how they are configured.

Cheers,
Diddy 


Dan

unread,
Jun 10, 2014, 6:08:33 AM6/10/14
to pentaho-...@googlegroups.com
thats just your charset and collation causing that. Match those up and you can sort happily in different DBs.


Diethard Steiner

unread,
Jun 10, 2014, 7:55:31 AM6/10/14
to pentaho-...@googlegroups.com
Yes, but that's something that has to be configured on the DB side, no? And you might not have the rights to do this. Or is this possible via the JDBC settings as well?


Dan

unread,
Jun 10, 2014, 8:05:20 AM6/10/14
to pentaho-...@googlegroups.com
depends on the db IIRC.  Nevertheless the decision on sorting in db vs pdi is far from straightforward for this and other reasons too!


Nelson Sousa

unread,
Jun 10, 2014, 9:04:56 AM6/10/14
to pentaho-...@googlegroups.com

Not so long ago, I had an ETL job where the 

Select <columns> from <table> order by <a column>

would blow up with the following error message: "Cannot write to /tmp/mysql/some_temp_file: not enough space on disk".

Needless to say, I had to do the sort in PDI and tell the admin of that machine that 2Gb of RAM and 4Gb of space is not enough for a DB server. 



Nelson Sousa
Business Intelligence Consultant



Reply all
Reply to author
Forward
0 new messages