Hi Richard,
On Wed, Dec 05, 2012 at 06:27:12AM -0800, Richard Parker wrote:
> Hi Alistair,
>
> How's the performance of the join() function?
> Which one will be faster, execute the join in the SQL statement as part of
> extract or extract 2 separate tables and perform join() in python?
In petl there are two functions that implement an inner join: join()
and hashjoin(). The join() function works by sorting the lefthand and
righthand input tables then doing an interleaved scan. The hashjoin()
function works by building a Python dict mapping key values to rows
from the righthand table, then scanning the lefthand table and looking
up rows to join with from the dict.
If the two tables you want to join are both too big to fit in memory,
and they are *not* already sorted by the join key, then you can use
the petl join() function but it will likely be slow, because petl will
sort both input tables, and it's the sorting (compute + writing temp
files to disk) that takes time. I would guess that a properly indexed
database would be significantly faster, although it's worth actually
checking with your data.
If the two tables you want to join are both big but are *already
sorted* by the join key, then you can use petl join(presorted=True)
and it will skip the sorting and just do the interleaved scan. That
will probably be reasonably fast.
If your tables are *not* presorted but one of your tables is small
enough to fit into memory then you could use petl hashjoin() with the
smaller as the righthand table, and that will probably be reasonably
fast.
In these cases a properly indexed database will still probably be
faster, although again worth checking for your data.
> Btw, do you have a detailed documentation on the petl
> framework/architecture? Thanks.
See
http://packages.python.org/petl/0.11.1/index.html especially the
sections "Conventions - row containers and row iterators" and
"Transformation pipelines". Everything in petl is built on the row
container/row iterator convention, there really is nothing more to it.
If you're still curious, check out the source code, e.g.:
https://github.com/alimanfoo/petl/blob/master/src/petl/transform.py
...is where all the transformation functions are implemented.
Hth,
Alistair
> > > Email:
alim...@gmail.com <javascript:>
> > > Tel:
+44 (0)1865 287721 ***new number***
> >
> > --
> > Alistair Miles
> > Head of Epidemiological Informatics
> > Centre for Genomics and Global Health <
http://cggh.org>
> > The Wellcome Trust Centre for Human Genetics
> > Roosevelt Drive
> > Oxford
> > OX3 7BN
> > United Kingdom
> > Web:
http://purl.org/net/aliman
> > Email:
alim...@gmail.com <javascript:>