Hi,
I'm wondering whether or not I should raise this:
The row normaliser doesn't scale linearly.
Now, this step is all about blowing up the number of rows, so I had thought it would make sense that if it can produce 100k rows per second, it would ALWAYS produce that rate, regardless of the level of normalisation. However, it doesn't. (Assuming you have adequate memory etc. - this step does benefit from a lot of RAM.)
Example:
900 fields being normalised - 86k records per second
2500 fields being normalised (i.e. ~2.8x) - 26k records per second.
I feel it should still be able to produce 86k records per second even when normalising 2500 fields. Do you agree? (Or at least, it would be nicer if it got closer.)
In other words, double the number of attributes, double the time it takes, but no more. At the moment, if you use ~2.8x the attributes, it takes 7.7x as long.
(And of course, you can then scale it beyond that by partitioning etc.)
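A rough way to put a number on the scaling, as a sketch only: assuming the run time grows as some power k of the number of fields being normalised (a power law is an assumption here, and two data points cannot confirm it), k can be estimated from the figures quoted above. The function name scaling_exponent is made up for illustration.

```python
import math

def scaling_exponent(fields_a: int, fields_b: int, time_ratio: float) -> float:
    """Return k such that time_ratio == (fields_b / fields_a) ** k.

    Assumes a simple power-law relationship between field count and run time.
    """
    return math.log(time_ratio) / math.log(fields_b / fields_a)

# Figures quoted above: 900 vs 2500 fields, 7.7x the run time.
k = scaling_exponent(900, 2500, 7.7)
print(f"scaling exponent: {k:.2f}")  # 1.0 would be linear, 2.0 quadratic
```

With these figures the exponent comes out very close to 2, i.e. the time looks roughly quadratic in the number of fields rather than linear.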
Thanks,
Dan