Arrays vs. Tables for slice reads


Henrik Skifjeld

Jun 6, 2014, 10:41:00 AM
to pytable...@googlegroups.com
We are reworking the data model of a library used for genomic analyses, and want to replace the current numpy memmap solution with PyTables.

Our current implementation uses tables.Table to persist data. We insert data into the table only once, in sorted order. Retrieval is done by reading column slices (table.Column[start_pos:end_pos]) and by iteration.

So, data is retrieved either through:
    1. Iteration
    2. Read slice from a table.Column, either:
        a. One chunk with many elements (~250 M)
        b. Many small chunks (~10 K) with few elements each (10-15 K)
       
Performance tests have shown that the memmap solution is 2x faster than PyTables in situation 2a, and PyTables is 1.4x faster than memmap in situation 2b.

What could be the reasons why tables.Table is slower in 2a? And are there any optimizations one can do to improve performance in this situation? We have read about tables.Array, and suspect that it might be faster in situation 2a, as arrays are column-based. Is this perception correct? And how large a speed improvement could we expect?
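
For reference, the slice reads boil down to something like the following (a simplified sketch; the file and node names and the regions list are placeholders for our actual code):

import tables

h5file = tables.open_file('genome.h5', mode='r')  # placeholder filename
start_col = h5file.root.data.cols.start           # a tables.Column

# Situation 2a: one big slice (~250 M elements)
big_read = start_col[0:250000000]

# Situation 2b: many (~10 K) small slices of 10-15 K elements each
for start_pos, end_pos in regions:  # regions: list of (start, end) pairs
    small_read = start_col[start_pos:end_pos]

h5file.close()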

Francesc Alted

Jun 6, 2014, 11:14:34 AM
to pytable...@googlegroups.com
Hi Henrik,

On 6/6/14, 4:41 PM, Henrik Skifjeld wrote:
> We are reworking the data model of a library used for
> genomic analyses, and want to replace the current numpy memmap
> solution with PyTables.
>
> Our current implementation uses tables.Table to persist data. We
> insert data into the table only once, in sorted order. Retrieval is
> done by reading column slices (table.Column[start_pos:end_pos]) and by
> iteration.
>
> So, data is retrieved either through:
> 1. Iteration
> 2. Read slice from a table.Column, either:
> a. One chunk with many elements (~250 M)
> b. Many small chunks (~10 K) with few elements each (10-15 K)
>
> Performance tests have shown that the memmap solution is 2x faster
> than PyTables in situation 2a, and PyTables is 1.4x faster than memmap
> in situation 2b.
>
> What could be the reasons why tables.Table is slower in 2a? And
> are there any optimizations one can do to improve performance in this
> situation?

Hmm, we would need more context so as to suggest possible improvements.
Could you point us to the relevant code that is doing the benchmarking?
The iterator in PyTables is pretty general, so it may be the reason for
the slowness, but on the other hand, there are different code paths for
most of the cases. Probably what is happening here is that the memmap
approach is just more I/O efficient than HDF5, but having some profiles
would be nice. Do you have some that you can show?


> We have read about tables.Array, and suspect that it might be faster
> in situation 2a, as arrays are column-based. Is this perception correct?

Uh, no. Arrays are stored in C-order, which is row-wise. But on the
other hand, if your datasets are homogeneous, an Array (or better, an
EArray, which allows appends like the Table object) could certainly be a
faster way to iterate. But you should do your own benchmarks.
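
For example, something along these lines (an untested sketch; the node
name and the input array are made up):

earr = h5file.create_earray('/', 'start', atom=tables.Int32Atom(),
                            shape=(0,), expectedrows=250000000)
earr.append(start_positions)      # a 1-d int32 NumPy array
data = earr[start_pos:end_pos]    # slice reads work as with a Column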

> And how large a speed improvement could we expect?

It is difficult to assess. On one hand, the Array containers do not
have the overhead of having to deal with heterogeneous data (Table). On
the other hand, the Table object has an extremely fine-tuned iterator.
As always, there is no replacement for experimentation.

Also, is your data compressible? If it is, you can experiment with that
too, especially with the integrated Blosc compressor.
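
For example (a sketch; complevel=5 is just a reasonable starting point,
and 'description' stands for your record description):

filters = tables.Filters(complevel=5, complib='blosc')
table = h5file.create_table('/', 'table', description, filters=filters)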

--
Francesc Alted

Henrik Skifjeld

Jun 7, 2014, 9:22:50 AM
to pytable...@googlegroups.com
Thanks for answering so fast.

There are no problems with the performance of the iterator. It is actually faster than iterating over the memmaps by index.
 
> Probably what is happening here is that the memmap
> approach is just more I/O efficient than HDF5, but having some profiles
> would be nice. Do you have some that you can show?
 
Here are two detailed profiles of the system, one for PyTables and another for memmap:
https://hyperbrowser.uio.no/gtrackcore/u/henrik/h/profiling-memmap-vs-pytables
To view them: Click the 'eye'-icon in each of the history items on the right hand side.
 


>> We have read about tables.Array, and suspect that it might be faster
>> in situation 2a, as arrays are column-based. Is this perception correct?
>
> Uh, no. Arrays are stored in C-order, which is row-wise.

What do you mean by C-order? I thought that Arrays were homogeneous, and thus had no 'rows' in the same way that a table has.

I did an h5dump of the following test:

import numpy
import tables

h5file = tables.open_file('test.h5', mode='w')
table = h5file.create_table('/', 'table', {'chr': tables.StringCol(5), 'start': tables.Int32Col()})
row = table.row
row['chr'] = 'chr21'; row['start'] = 12345; row.append()
row['chr'] = 'chr21'; row['start'] = 67890; row.append()
table.flush()
array_start = h5file.create_earray('/', 'array_start', atom=tables.Int32Atom(), shape=(0,))
array_start.append(numpy.array([12345, 67890], dtype=numpy.int32))
h5file.close()

The dump outputs this:
...
DATASET "array_start" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 2 ) / ( H5S_UNLIMITED ) }
      DATA {
      (0): 12345, 67890
      }
...
DATASET "table" {
      DATATYPE  H5T_COMPOUND {
         H5T_STRING {
            STRSIZE 5;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         } "chr";
         H5T_STD_I32LE "start";
      }
      DATASPACE  SIMPLE { ( 2 ) / ( H5S_UNLIMITED ) }
      DATA {
      (0): {
            "chr21",
            12345
         },
      (1): {
            "chr21",
            67890
         }
      }
...

To me, the 'array_start' appears to be contiguously stored, while the 'table' is stored row-by-row. Is this wrong?
 
> But on the other hand, if your datasets are homogeneous, an Array (or
> better, an EArray, which allows appends like the Table object) could
> certainly be a faster way to iterate. But you should do your own
> benchmarks.

>> And how large a speed improvement could we expect?
>
> It is difficult to assess. On one hand, the Array containers do not
> have the overhead of having to deal with heterogeneous data (Table). On
> the other hand, the Table object has an extremely fine-tuned iterator.
> As always, there is no replacement for experimentation.

I will probably experiment more with Arrays to see if they improve upon tables.
 

> Also, is your data compressible? If it is, you can experiment with that
> too, especially with the integrated Blosc compressor.

I tried turning on compression, but it didn't improve performance.
 

--
Francesc Alted


--
Henrik

Francesc Alted

Jun 8, 2014, 5:01:51 AM
to pytable...@googlegroups.com
Oh, I read incorrectly, sorry. So, while it is difficult to
understand what's going on behind the scenes, I would say that the
problem with handling big chunks (scenario 2a) is that HDF5 normally
needs a copy to read the data from disk or the OS page cache into the
actual data buffer. memmap, on the other hand, does not require that, so
that *may* account for the 2x slowdown. For small chunks (2b), the copy
may happen at the CPU cache level, which may be why the difference in
speed is smaller (and I must say that I am a bit surprised that HDF5
performance is better here).
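
If you want to measure this effect in isolation, a rough sketch could
look like the following (untested; 'data.bin', 'data.h5', the node names
and N are placeholders for your setup):

import time
import numpy
import tables

N = 250000000  # placeholder slice length

# memmap: slicing only creates a view, so force the actual read with a copy
mm = numpy.memmap('data.bin', dtype=numpy.int32, mode='r')
t0 = time.time()
a = numpy.array(mm[:N])
print('memmap read: %.2f s' % (time.time() - t0))

# PyTables: a column slice always materializes a fresh buffer
h5f = tables.open_file('data.h5', mode='r')
t0 = time.time()
b = h5f.root.table.cols.start[:N]
print('pytables read: %.2f s' % (time.time() - t0))
h5f.close()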

> Probably what is happening here is that the memmap
> approach is just more I/O efficient than HDF5, but having some
> profiles
> would be nice. Do you have some that you can show?
>
> Here are two detailed profiles of the system, one for PyTables and
> another for memmap:
> https://hyperbrowser.uio.no/gtrackcore/u/henrik/h/profiling-memmap-vs-pytables
> To view them: Click the 'eye'-icon in each of the history items on the
> right hand side.

Yeah, after having a look at the profiles I would say that the reading
is the bottleneck, and that how memmap works vs. a regular file read is
the explanation for the difference in performance.

Hmm, perhaps we are not speaking the same language. When I say row
order, I mean something like what is shown in slide 24 of
http://www.pytables.org/docs/StarvingCPUs.pdf. In your example above you
are storing just a 1-dimensional array, so in this case there is no
concept of 'rows' and 'columns'.

Besides, there is the additional 'complication' that EArrays are stored
in chunks, as shown in slide 14 of
http://www.pytables.org/docs/PUG-Austin-2012-v3.pdf. Typically the
chunks chosen by PyTables are such that complete rows fit in a chunk, so
it is usually better to read your EArray row-wise. In addition, and as
I suggested before, if the chunks fit comfortably in the CPU cache (a
higher-level cache is enough), you don't incur the additional copy (for
practical purposes).
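
If you want to experiment with this, you can set the chunk size
explicitly when creating the EArray (a sketch; 16384 is just an
arbitrary example chunk length):

earr = h5file.create_earray('/', 'start', atom=tables.Int32Atom(),
                            shape=(0,), chunkshape=(16384,))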

> But on the other hand, if your datasets are homogeneous, an Array (or
> better, an EArray, which allows appends like the Table object) could
> certainly be a faster way to iterate. But you should do your own
> benchmarks.
>
> > And how large a speed improvement could we expect?
>
> It is difficult to assess. On one hand, the Array containers do not
> have the overhead of having to deal with heterogeneous data (Table). On
> the other hand, the Table object has an extremely fine-tuned iterator.
> As always, there is no replacement for experimentation.
>
>
> I will probably experiment more with Arrays to see if they improve
> upon tables.
>
>
> Also, is your data compressible? If it is, you can experiment with that
> too, especially with the integrated Blosc compressor.
>
>
> I tried turning on compression, but it didn't improve performance.

Hmm, what is the compression ratio of your datasets? How large are
they? In theory compression helps with I/O because you are
reading/writing less. In some cases, especially when doing benchmarks,
your files end up living in the OS page cache, so the performance
numbers are somewhat skewed. It is usually good practice to flush the OS
page cache in order to reproduce actual disk I/O. For example, on Linux
you can do that as described here:
http://tecadmin.net/flush-memory-cache-on-linux-server/

--
Francesc Alted