I'm not sure that storing intermediate results in memory is a good
solution even in terms of performance. It really depends on the
actual scenario, but I would consider the following options:
1) The most important thing is to understand where the bottleneck
is: is the datastore you are querying slow, or do you have to query
it too many times? A scalable ETL procedure should not store
intermediate data at all; it should connect the datasources directly,
querying from one and inserting into the other, thus avoiding
intermediate storage.
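A minimal sketch of that pattern in Python, using in-memory SQLite databases as stand-ins for the real source and target datastores (the `users` table and column names are just placeholders): rows stream from source to target in fixed-size batches, so the full result set is never materialized in memory.

```python
import sqlite3

def stream_copy(source_conn, target_conn, batch_size=1000):
    """Copy rows from source to target in batches, never
    holding the full result set in memory at once."""
    src = source_conn.execute("SELECT id, name FROM users")
    while True:
        batch = src.fetchmany(batch_size)
        if not batch:
            break
        target_conn.executemany(
            "INSERT INTO users (id, name) VALUES (?, ?)", batch)
    target_conn.commit()

# Demo with in-memory databases standing in for the real datastores.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (id INTEGER, name TEXT)")
source.executemany("INSERT INTO users VALUES (?, ?)",
                   [(i, f"user{i}") for i in range(5000)])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE users (id INTEGER, name TEXT)")
stream_copy(source, target, batch_size=500)
copied = target.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(copied)  # → 5000
```

The batch size trades memory for round trips; with real databases you would tune it against network latency.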
2) If the query is expensive to execute but returns a reasonable
number of records (fewer than a million, for example), I would
create a temp table in the target database and store the results in
it. Most commercial databases handle caching of table data far more
efficiently than a memory store you implement yourself. I would
strongly recommend using the facilities provided by the database as
much as possible, i.e. native SQL joins wherever possible, native
conversion functions for transformations, etc. By moving the majority
of the processing logic into the database you can significantly
increase performance, because no ETL engine transfers or converts
data faster than the datastore itself.
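To illustrate, here is the temp-table approach with SQLite (the `orders`/`customers` schema is invented for the example): the expensive aggregation runs once into a TEMP table, and the follow-up work is a native SQL join rather than a loop in application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10, 99.5), (2, 11, 15.0), (3, 10, 42.0);
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (10, 'Alice'), (11, 'Bob');
""")

# Stage the expensive query's result in a temp table once...
conn.execute("""
    CREATE TEMP TABLE big_spenders AS
    SELECT customer_id, SUM(amount) AS total
    FROM orders GROUP BY customer_id HAVING total > 50
""")

# ...then let the database join it natively instead of looping in code.
rows = conn.execute("""
    SELECT c.name, b.total
    FROM big_spenders b JOIN customers c ON c.id = b.customer_id
""").fetchall()
print(rows)  # → [('Alice', 141.5)]
```

In a commercial database the same idea applies, and the optimizer can also use statistics and indexes on the temp table that an in-process cache never gets.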
3) If the input data is really huge, you may need to organize your
ETL process differently, so that you query the input only once and
run the other processing steps inside that single pass.
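One way to sketch that single-pass structure (the `events` table and the handler callbacks are hypothetical): iterate over the large input exactly once, dispatching each row to every transformation step, instead of re-querying the source for each step.

```python
import sqlite3

def single_pass_etl(conn, row_handlers):
    """Scan the large input table exactly once, feeding every row
    to each transformation handler, instead of re-querying per step."""
    for row in conn.execute("SELECT id, value FROM events"):
        for handler in row_handlers:
            handler(row)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, value INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, i * 2) for i in range(100)])

# Two example "steps" sharing the one scan: a running sum and a count.
totals = {"sum": 0, "count": 0}
single_pass_etl(conn, [
    lambda r: totals.__setitem__("sum", totals["sum"] + r[1]),
    lambda r: totals.__setitem__("count", totals["count"] + 1),
])
print(totals)  # → {'sum': 9900, 'count': 100}
```

The point is that the cost of scanning the huge input is paid once, however many downstream steps consume it.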