padding field data (or not)

Ken Walker

Oct 29, 2018, 8:28:59 PM
to pytables-users
This is related to my previous question about H5Tpack().
I am working through a problem reading/writing data with PyTables. I read some data rows from one HDF5 file/dataset into a NumPy record array, then write that array to a dataset in a different HDF5 file (no change to the data). The data in the new file looks fine when interrogated with PyTables or viewed with HDFView. However, a downstream C++ app can't read the PyTables data.
I am told (by the developers) that the compiler for the upstream program is set to pad the data when it writes the original file (the one I am reading), and that the pad is expected by the downstream reader (which reads the file I created). Padding adds 4 pad characters to a 4-byte S4 field so the next field starts on an 8-byte memory boundary. Based on observed behavior, they have inferred that PyTables removes the pad characters when reading the dataset and does not add a pad when writing the new dataset (all perfectly legal in HDF5, and it does not affect data integrity). However, the downstream reader expects the pad, and its absence causes an error (I know, bad code design).
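
For readers following along, here is a minimal sketch of the layout difference described above, using only Python's standard struct module. The two-field record (a 4-byte string followed by an 8-byte double) is a made-up example, not the actual dataset layout from this thread:

```python
import struct

# Hypothetical record: a 4-byte string followed by an 8-byte double.
# "<" disables automatic alignment, so any padding must be spelled out with "x".
PACKED = struct.Struct("<4sd")    # fields stored back to back
PADDED = struct.Struct("<4s4xd")  # 4 pad bytes so the double starts at offset 8

packed_record = PACKED.pack(b"ABCD", 1.5)
padded_record = PADDED.pack(b"ABCD", 1.5)

print(PACKED.size, PADDED.size)  # 12 16

# A reader built for the padded layout decodes padded records correctly...
tag, value = PADDED.unpack(padded_record)

# ...but rejects a packed record, because the 4 pad bytes are missing.
try:
    PADDED.unpack(packed_record)
    reader_accepts_packed = True
except struct.error:
    reader_accepts_packed = False
```

Both records carry the same field values; only the byte layout differs, which is presumably why the rewritten file looks fine in generic tools but trips up a reader that assumes fixed offsets.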

So....I'm wondering...is there something in PyTables that controls padding when reading/writing datasets like this?

FYI, I recreated this read/write process with h5py, and the output file is compatible with my downstream app. Apparently h5py retains the pad characters. This is confirmed when I print the dataset.dtype: h5py reports itemsize:384, vs itemsize:380 when PyTables reads the dataset.
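
One way to see where such a size difference comes from is to look for bytes that no field covers in the dtype each library reports. A minimal sketch, assuming NumPy is installed; padding_gaps is a hypothetical helper (not part of PyTables or h5py), and the two dtypes are made-up stand-ins rather than the real 380/384-byte record:

```python
import numpy as np

def padding_gaps(dtype):
    """Return (offset, nbytes) for each run of bytes not covered by any field.

    Hypothetical helper for illustration; expects a compound (structured) dtype.
    """
    gaps = []
    cursor = 0
    # Walk the fields in offset order and note any bytes that no field covers.
    for name in sorted(dtype.names, key=lambda n: dtype.fields[n][1]):
        field_dtype, offset = dtype.fields[name][:2]
        if offset > cursor:
            gaps.append((cursor, offset - cursor))
        cursor = max(cursor, offset + field_dtype.itemsize)
    # Trailing padding after the last field also counts.
    if dtype.itemsize > cursor:
        gaps.append((cursor, dtype.itemsize - cursor))
    return gaps

# Made-up stand-ins: the same two fields without and with a 4-byte pad.
packed = np.dtype([("tag", "S4"), ("value", "f8")])
padded = np.dtype({
    "names": ["tag", "value"],
    "formats": ["S4", "f8"],
    "offsets": [0, 8],   # "value" pushed to an 8-byte boundary
    "itemsize": 16,
})

print(packed.itemsize, padding_gaps(packed))  # 12 []
print(padded.itemsize, padding_gaps(padded))  # 16 [(4, 4)]
```

The same helper can be applied to the dtype attribute of an h5py dataset or of a PyTables table to compare what each library sees.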
I could rewrite my utility with h5py...but hope to avoid that (if possible) because I rely on a lot of functionality that is unique to PyTables.
Thanks in advance for any insights into this quirky padding behavior.
-Ken

Francesc Alted

Oct 30, 2018, 4:10:32 AM
to Ken Walker, pytable...@googlegroups.com
Yes, I remember that PyTables does not implement padding (I do not remember the reason, but it was probably just a matter of simplicity).  Having said this, introducing padding should not be that difficult, so pull requests are welcome.

Francesc


Ken Walker

Oct 30, 2018, 11:11:56 AM
to pytables-users
Hi Francesc,
I thought you might say that. :-) Thanks for confirming. At least I'm not assuming anymore. 
I have never submitted a pull request. Do I need to join github to do that?

I can use h5py...wish it had some of pytables handy functions: walk_nodes(), read_where(), etc. :-)
I recall an announcement that h5py and pytables had plans to work together. 
What's the status on that effort?
-Ken

Francesc Alted

Nov 1, 2018, 4:57:24 AM
to Ken Walker, pytable...@googlegroups.com
On Tue, Oct 30, 2018 at 4:11 PM Ken Walker <ken.c....@gmail.com> wrote:
> I have never submitted a pull request. Do I need to join github to do that?

Yes, joining GitHub is the simplest approach.

> I can use h5py...wish it had some of pytables handy functions: walk_nodes(), read_where(), etc. :-)
> I recall an announcement that h5py and pytables had plans to work together.
> What's the status on that effort?

Well, there have been two major attempts to make PyTables work on top of h5py.  The first one took place during a hackfest in Perth back in 2016 (thanks to Curtin University funds and especially to Andrea Bedini's enthusiasm), where different maintainers gathered to start the porting process.  We made quite a bit of progress, but there was still a long way to go; you can read the final report here: https://github.com/PyTables/PyTables/blob/pt4/doc/New-Backend-Interface.rst.  The other important push happened last year (2017), using a small grant from NumFOCUS.  Alberto Sabater, the recipient of the grant, also made a lot of progress on top of the existing 2016 work, especially on the *Array (EArray, CArray, VLArray) front; you can find his contribution in this pull request: https://github.com/PyTables/PyTables/pull/634.

Both attempts make me think that the amount of work remaining to complete the port is still very significant, and that small grants (like the NumFOCUS ones, which are 3000 USD max) are not really suitable for getting the job done.  So for this year's small grant from NumFOCUS I suggested to Javier Sancho (the recipient of the grant) that he concentrate on fixing bugs, applying pending pull requests and doing a new release of PyTables, and, with the remaining time, implement a web interface for visualizing Table objects remotely; you can see the outcome of this effort here: https://github.com/PyTables/datasette-pytables.  I have to say that I am really happy about the outcome of this latest grant.

From all of this experience, and frankly speaking, I am unsure about the feasibility of the PyTables/h5py merge, because it would require quite a bit more than a small grant, and I am not sure the users/foundations would ever really pay for this cost.  So, what I'd like to do instead is to continue applying for small NumFOCUS grants in order to do maintenance work on PyTables, and perhaps some small improvements; for example, I recently applied for a NumFOCUS grant to extend the support of advanced indexing and sorting to general compound datatypes in generic HDF5 files (does that ring a bell?).  I do think this approach would make better use of the (scarce) resources that we currently have for PyTables maintenance, and that the users will benefit the most from it (but in case we get bigger funds, the PyTables/h5py merge would still be an option, of course).

Hope this clarifies the current status of PyTables a bit more.

Francesc



Ken Walker

Nov 12, 2018, 3:38:07 PM
to pytables-users
Hi Francesc,
Thanks for the update on the PyTables/h5py merge. As we would say in the US: "Don't hold your breath".
I will continue to use h5py when I have to deal with padded data fields (or until my downstream application no longer expects padded data -- maybe in 2019?).
Thanks!
-Ken

Francesc Alted

Feb 20, 2019, 8:52:27 AM
to Ken Walker, pytables-users
Hi,

I thought this issue deserved some love from us, so after asking for (and getting approval of) a small grant from NumFOCUS, I tackled this in:


Ken (or others), it would be great if you could give this a go and tell us whether it solves your issue.

Best


Ken Walker

Mar 25, 2019, 7:56:44 PM
to pytables-users
Hi Francesc,

Apologies for the delayed response. I don't check gmail that often.
Thanks for working on this. 
I would love to test (and have some candidate test problems), but need guidance to work with a github pull.
I've never done it before -- I always install Python modules with the conda package manager.
If you can give me some direction I'll give it a try.

-Ken

Francesc Alted

Mar 27, 2019, 4:58:04 AM
to Ken Walker, pytables-users
Hi Ken,

No need to use git pull; the modifications have been included in PyTables 3.5.1.  Give it a spin.

Francesc
