petl performance and scalability

Richard Parker

Dec 3, 2012, 6:49:44 AM
to pytho...@googlegroups.com
Hi,

I've looked through the petl package and it looks promising. I'm planning to use it as the framework for my current data migration project; the largest ETL job will potentially involve a table as big as 30 million rows by 30 columns. Will petl be able to support this, and will there be performance issues?

Btw, we are weighing custom C++ code (I used this in my previous data migration project; the downside is that it's very time consuming to develop, since you have to build everything from scratch) against petl (less development effort, with a performance trade-off).

Thanks.

Regards,
RP

Alistair Miles

Dec 3, 2012, 11:19:19 AM
to pytho...@googlegroups.com
Hi Richard,

On Mon, Dec 03, 2012 at 03:49:44AM -0800, Richard Parker wrote:
> Hi,
>
> I've looked through the petl package and it looks promising. I'm planning
> to use it as the framework for my current data migration project; the
> largest ETL job will potentially involve a table as big as 30 million
> rows by 30 columns. Will petl be able to support this, and will there be
> performance issues?

I think it will depend very much on what you want to do to the
data. My advice would be to try out some of the transformations you
need in petl with sample data and get a sense of speed. The functions
progress() and clock() in petl can be used to find out how fast
(rows/s) a transformation pipeline is, as well as which steps are fast
and which are slow.
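For example, a minimal sketch of timing a pipeline this way (the file names and the convert() step are made up; this assumes the petl 0.11-style functional API):

    from petl import fromcsv, convert, progress, clock, tocsv

    # clock() times how long is spent retrieving rows from the source;
    # progress() reports rows/s to stderr every 100,000 rows
    source = clock(fromcsv('input.csv'))
    pipeline = progress(convert(source, 'foo', int), 100000)
    tocsv(pipeline, 'output.csv')

    # after the pipeline has run, the clock() view reports time spent
    print(source.time)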

Note that for the sort() function, and all functions that depend on it
(e.g., join()), performance will depend on how much data you are
willing to buffer in memory. You will also want to avoid repeating
sort operations unnecessarily: writing pipelines so that a sort is
done once, with all further steps told they are consuming presorted
data (via the 'presorted' keyword argument), can make a big
difference.
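A rough sketch (file names and the 'id' key are made up): sort each input once, cap the in-memory buffer, and tell join() the inputs are already sorted:

    from petl import fromcsv, sort, join, tocsv

    # sort each input exactly once; buffersize caps rows held in memory,
    # with the rest spilled to temporary files on disk
    left = sort(fromcsv('left.csv'), 'id', buffersize=500000)
    right = sort(fromcsv('right.csv'), 'id', buffersize=500000)

    # presorted=True tells join() to skip its own sort and go straight
    # to the interleaved scan
    tocsv(join(left, right, key='id', presorted=True), 'joined.csv')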

Please note also that I have done almost no optimisation work as yet
on petl. If you do decide to evaluate petl and find certain functions
to be critically slow, please let me know; I'll certainly take a look
and see if anything can be optimised.

Finally, petl is pure Python, so it will never be anywhere near as
fast as custom C code. I am starting to get to grips with Cython,
however, so again I would be very glad of feedback about critical
sections in petl; there may be some obvious targets for implementation
as Cython extensions.

Cheers,

Alistair

--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Web: http://purl.org/net/aliman
Email: alim...@gmail.com
Tel: +44 (0)1865 287721 ***new number***

Alistair Miles

Dec 3, 2012, 11:49:22 AM
to pytho...@googlegroups.com
I should've also mentioned that petl is implemented using a streaming
architecture. I.e., everything is done with iterators, row by row,
without reading whole tables into memory. This means memory is not a
limiting factor. The exception is functions that depend on sort(),
where some data is buffered in memory, but you can control how much
via the buffersize argument; anything over that is cached on disk.
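To illustrate the streaming behaviour (file and field names are made up): constructing a pipeline does no work up front; rows only flow when the result is iterated or written out.

    from petl import fromcsv, convert, selectgt

    # nothing is read from disk here; each step just wraps the one before
    table = fromcsv('big.csv')
    table = convert(table, 'count', int)
    table = selectgt(table, 'count', 10)

    # rows stream through one at a time only when the table is iterated
    for row in table:
        pass  # e.g., load into the target system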

So the bottom line is: you can definitely process 30 million rows
using petl on modest hardware; the question is how long it will take.

Hth,

Alistair

Richard Parker

Dec 5, 2012, 9:27:12 AM
to pytho...@googlegroups.com, alim...@googlemail.com
Hi Alistair,

How's the performance of the join() function?
Which will be faster: executing the join in the SQL statement as part of the extract, or extracting two separate tables and performing the join() in Python?

Btw, do you have detailed documentation on the petl framework/architecture? Thanks.

Regards,
RP

Alistair Miles

Dec 5, 2012, 6:42:24 PM
to Richard Parker, pytho...@googlegroups.com
Hi Richard,

On Wed, Dec 05, 2012 at 06:27:12AM -0800, Richard Parker wrote:
> Hi Alistair,
>
> How's the performance of the join() function?
> Which will be faster: executing the join in the SQL statement as part of
> the extract, or extracting two separate tables and performing the join()
> in Python?

In petl there are two functions that implement an inner join: join()
and hashjoin(). The join() function works by sorting the lefthand and
righthand input tables then doing an interleaved scan. The hashjoin()
function works by building a Python dict mapping key values to rows
from the righthand table, then scanning the lefthand table and looking
up rows to join with from the dict.
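Sketched side by side (table and key names are made up):

    from petl import fromcsv, join, hashjoin

    orders = fromcsv('orders.csv')        # large; streamed from disk
    customers = fromcsv('customers.csv')  # small enough to fit in memory

    # sort-merge join: sorts both inputs, then does an interleaved scan
    merged = join(orders, customers, key='customer_id')

    # hash join: builds a dict from the righthand table, then streams
    # the lefthand table, looking up matching rows in the dict
    merged = hashjoin(orders, customers, key='customer_id')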

If the two tables you want to join are both too big to fit in memory,
and they are *not* already sorted by the join key, then you can use
the petl join() function but it will likely be slow, because petl will
sort both input tables, and it's the sorting (compute + writing temp
files to disk) that takes time. I would guess that a properly indexed
database would be significantly faster, although it's worth actually
checking with your data.

If the two tables you want to join are both big but are *already
sorted* by the join key, then you can use petl join(presorted=True)
and it will skip the sorting and just do the interleaved scan. That
will probably be reasonably fast.

If your tables are *not* presorted but one of them is small enough to
fit into memory, then you could use petl hashjoin() with the smaller
one as the righthand table, and that will probably be reasonably
fast.

In these cases a properly indexed database will still probably be
faster, although again it's worth checking with your data.
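If the database does turn out to be faster, the join can be pushed down into the extract query; a sketch using fromdb() (the connection setup and table names are made up):

    import sqlite3
    from petl import fromdb, tocsv

    connection = sqlite3.connect('example.db')

    # do the join in SQL as part of the extract, then stream the result
    table = fromdb(connection,
                   'SELECT o.*, c.name FROM orders o '
                   'JOIN customers c ON o.customer_id = c.id')
    tocsv(table, 'joined.csv')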

> Btw, do you have detailed documentation on the petl
> framework/architecture? Thanks.

See http://packages.python.org/petl/0.11.1/index.html, especially the
sections "Conventions - row containers and row iterators" and
"Transformation pipelines". Everything in petl is built on the row
container/row iterator convention; there really is nothing more to it.
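As a rough sketch of what the convention means: any object that can be iterated more than once, yielding a header row followed by data rows, can be used as a petl table (the class below is a made-up example):

    from petl import look

    class MinimalTable(object):
        # a row container: each call to __iter__ starts a fresh scan
        def __iter__(self):
            yield ('foo', 'bar')  # header row first
            yield (1, 'a')        # then data rows
            yield (2, 'b')

    print(look(MinimalTable()))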

If you're still curious, check out the source code, e.g.:

https://github.com/alimanfoo/petl/blob/master/src/petl/transform.py

...is where all the transformation functions are implemented.

Hth,

Alistair