Hi all,
It depends a lot, really. I've encountered different situations in which different approaches were required.
The typical ETL I build usually splits entirely the E from the T and the L. The first set of transformations reads from the sources, wherever they are, and loads them into staging tables. The incremental ETL is usually controlled here, either using db timestamps or, if it's filebased, using the natural alphabetical order of files.
Whenever possible, I prefer to work with pre-sorted data. With files it's pretty much the order in which data arrives, with databases I order by the timestamp or unique incremental id or any other feature of the source data that allows me to distinguish between past and future data.
If you're reading from webservices, data may be sorted in ascending order across chunks of data, but sorted in descending order within each chunk. (e.g, you read rows from most recent to most ancient because that's how the data arrives, then on the next run you read the next chunk of data, getting an "order" like 5,4,3,2,1,10,9,8,7,6, etc.).
I try to avoid the requirement that data be sorted. It's useful for performance reasons, making the indexing of the staging database faster, but I think an ETL should not require data to be sorted and should not assume data is sorted.
When loading from source to staging, I always add a batch ID and the timestamp of the ETL run. From the moment I load from staging to fact I don't care anymore about the order in which data in each batch is read, all I care about, if at all, is the order of the batches. Fact table data should not require sorting from a purely algorithmic point of view, and sorting is added solely to aid performance.
If all else fails, then I sort in PDI. Sorting breaks the streamlined data flow, forcing steps after the sort to be executed separately from the steps before the sort, so it will always have a performance impact. If you have a 200 step transformation and the sort is required more or less half way through, I'd rather have transformation 1 load, process everything up to the sort, then sort and materialize to a file, then transformation 2 reads the sorted data from the materialized file and processing from that point on.
Bottomline:
- Do you really need to sort? If No, then don't worry about it.
- If yes, can you sort? If yes, sort as early as possible, before injecting the data into PDI if possible (or feasible)
- If not, then sort it in PDI, trying to limit the amount of data on each run especially when running on low disk space servers.
- But, in this scenario, try to split the transformation in two, one before sorting and one after sorting.