Re: Producing tile compressed FITS files

185 views
Skip to first unread message

John K. Parejko

unread,
Jun 17, 2016, 7:41:19 PM6/17/16
to William Pence, erik....@gmail.com, astropy-dev
Moving discussion to astropy-dev, as this is definitely a bug in astropy. I'm working on constructing a minimal example now, but it's not as straightforward as I'd hoped.

Basic summary: when you create a pyfits CompImageHDU from an existing data&header, pyfits doesn't always correctly turn it into a binary table with all the necessary keywords. I tried looking into the io.fits module, but the way it wraps cfitsio is complicated, with fun statements like:

# TODO: The difficulty of implementing this screams a need to rewrite this
# module

;-)

However, I'm betting this particular problem is related to compressed.py:950, where TFIELDS doesn't have an "after" keyword meaning it may not land where it is required to land (8th keyword). I can't prove it without that worked example, though.

My obvious choice for a worked example *is* behaving correctly, so it may take some time to pin down exactly what's going on (besides just using the files I linked to earlier). I'll file a ticket when I have proof, and update this thread, too.

All that said, has responsibility for pyfits development been "officially" passed on to anyone yet, or has it just been tossed in the "community" pot (and thus become officially unfunded)?

Cheers,
John

On 17Jun 2016, at 09:44, John Parejko <pare...@uw.edu> wrote:

> Bill Pence gave this reply, but he's not on the astropy list.
>
> By the looks of it, this is an astropy bug when it writes the CompImageHDUs.
>
> John
> On Jun 17, 2016 8:21 AM, "William Pence" <Willia...@nasa.gov> wrote:
> The error message produced by fits_verify when testing the compressed file indicates that the header keywords in the 3rd HDU are not present in the required order. The 8th keyword of the binary table extension (the compressed image) must be TFIELDS, but in this file the 8th keyword is BSCALE. CFITSIO strictly enforces this requirement, so won't read that HDU (or any following HDUs). Here is the error message:
>
> *** Error: Bad HDU? (from CFITSIO error stack:)
> ffgtkn found unexpected keyword or value for keyword no. 8.
> Expected positive integer keyword TFIELDS, but instead
> found keyword BSCALE with value 1
> Failed to move to HDU number 3 (ffmahd).
>
> -Bill
>
>
> On 6/16/2016 11:48 PM, John K. Parejko wrote:
>
> cc:ing William Pence, as this may be an odd interaction between astropy.io.fits and cfitsio.
>
> In an effort to reduce the size of some LSST test data, I converted the three image HDUs in a file to compressed image HDUs like so:
>
> from astropy.io import fits
> data = fits.open('somefile.fits')
> for i in [1,2,3]:
> data[i] = fits.CompImageHDU(data=data[i].data,header=data[i].header)
> data.writeto(f, clobber=True)
>
> If I open the new file with pyfits, all of the HDUs seem fine, but now cfitsio doesn't appear to like the files. Erin Sheldon's fitsio only sees the first 2 HDUs. So does the online FITS tester webpage:
>
> http://fits.gsfc.nasa.gov/fits_verify.html
>
> DS9 seems to read it ok: I can specify "-mecube" and see all three image planes.
>
> Is what I did above not allowed for some reason, or did I do something wrong?
>
> I've put up the original file, the file with 3 HDUs tile-compressed, and (the original goal of this) a file with the 3 image HDUs zeroed out and then compressed (I wanted a file with the structure of the original, but taking minimal space):
>
> http://staff.washington.edu/parejkoj/lsst/
>
> Ideas and suggestions welcome!
>
> Thanks in advance,
> John
>
> --
> *************************
> John Parejko
> pare...@uw.edu
> http://staff.washington.edu/parejkoj/
> Department of Physics and Astronomy
> University of Washington
> Seattle, WA
> **************************
>
>
>
>
>
>
>
>
>
>
>
>
>

--
*************************
John Parejko
pare...@uw.edu
http://staff.washington.edu/parejkoj/
Department of Physics and Astronomy
University of Washington
Seattle, WA
**************************











John K. Parejko

unread,
Jun 19, 2016, 3:30:22 AM6/19/16
to astropy-dev, William Pence, erik....@gmail.com
Some progress on this problem!

The astropy.io bug is due to the interaction between CompImageHDU._update_header_data() and _BaseHDUMeta._update_uint_scale_keywords(). The former was not guaranteed to set TFIELDS in its required place (it must be the 8th keyword in a BinTable, immediately following GCOUNT), while the latter always puts BSCALE immediately after GCOUNT when the file is written (which is fine in ImageHDUs but breaks BinTables, which CompImageHDUs actually are!). I discovered this after setting TFIELDS with "after='GCOUNT'" and finding that BSCALE/BZERO still showed up ahead of it in the header.

I think I've come up with a safe work-around, and I'll submit a pull request with it shortly.

Sadly, I still haven't come up with a working minimal example (besides my already-existing large files), since I don't really understand exactly how BSCALE/BZERO are managed. Hopefully my above description can help someone who understands FITS headers better to craft an appropriate integration test for an ImageHDU->CompImageHDU conversion.

This whole process has underscored how unnecessarily complicated the compressed image FITS convention is. None of this would be necessary if FITS actually supported a compressed HDU type natively. I'll just add it to my growing list of FITS problems.

Cheers,
John

Matt Craig

unread,
Jun 19, 2016, 5:36:35 PM6/19/16
to astropy-dev, William Pence, erik....@gmail.com
Thanks for submitting this, John! I'm hoping to do some work on refactoring BZERO/BSCALE stuff (at least as it relates to unsigned integer images) later this week.

Matt


--
You received this message because you are subscribed to the Google Groups "astropy-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to astropy-dev...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

John K. Parejko

unread,
Jun 19, 2016, 10:57:09 PM6/19/16
to astro...@googlegroups.com, William Pence, erik....@gmail.com
I did make a pull request:

https://github.com/astropy/astropy/pull/5118

which I'd like to get merged before this release (then I can send around instructions for LSST to compress some of our testing data).

Can we get it merged before the release, or do you think, Matt, that the refactoring you were going to do would also go in before the release?

Thanks,
John

Matt Craig

unread,
Jun 19, 2016, 11:17:05 PM6/19/16
to astro...@googlegroups.com, William Pence, erik....@gmail.com
Hi,

The 1.2 release is already frozen and almost out the door, but I imagine 1.2.1 will come pretty quickly after that.

Thanks,
Matt Craig

Erik Bray

unread,
Jun 20, 2016, 3:39:00 AM6/20/16
to John K. Parejko, William Pence, astropy-dev

On Jun 19, 2016 9:30 AM, "John K. Parejko" <pare...@uw.edu> wrote:
>
> Some progress on this problem!
>
> The astropy.io bug is due to the interaction between CompImageHDU._update_header_data() and _BaseHDUMeta._update_uint_scale_keywords(). The former was not guaranteed to set TFIELDS in its required place (it must be the 8th keyword in a BinTable, immediately following GCOUNT), while the latter always puts BSCALE immediately after GCOUNT when the file is written (which is fine in ImageHDUs but breaks BinTables, which CompImageHDUs actually are!). I discovered this after setting TFIELDS with "after='GCOUNT'" and finding that BSCALE/BZERO still showed up ahead of it in the header.
>
> I think I've come up with a safe work-around, and I'll submit a pull request with it shortly.
>
> Sadly, I still haven't come up with a working minimal example (besides my already-existing large files), since I don't really understand exactly how BSCALE/BZERO are managed. Hopefully my above description can help someone who understands FITS headers better to craft an appropriate integration test for an ImageHDU->CompImageHDU conversion.
>
> This whole process has underscored how unnecessarily complicated the compressed image FITS convention is. None of this would be necessary if FITS actually supported a compressed HDU type natively. I'll just add it to my growing list of FITS problems.

FYI if you search the Astropy issue tracker under the io.fits label there are a couple issues I've written pertaining to plans for reworking the compressed image support.

There are two main reasons (both documented, I think) that the implementation is so complicated: the first is that it wraps CFITSIO for compression support. This is the *only* area where CFITSIO is used, and it's being wrapped in a way that was never really intended, as PYFITS was never design around CFITSIO in the first place. There's no reason it needs to use CFITSIO though. The compression algorithms can be taken on their own, and the rest of the relevant bits rewritten in some combination of Python and Cython. It's not particularly hard to implement, it's just never been a priority.

The second thing that makes things complicated is a design mistake in PyFITS which originally made the CompImageHDU class a subclass of BinTableHDU. On its face that makes sense since the compressed image convention is implemented in a binary table. However, the class takes great pains to make it look to the user transparently like a normal image HDU. The class really should have been based on ImageHDU, and instead contain a wrapped instance of a BinTableHDU for interaction with the low-level format. That would probably make the code a fair bit simpler.

As FITS conventions go I don't think the tile compression convention is *too* terribly complicated. The biggest offender here is once again unsigned integers, and the general badness of FITS design.

Demitri Muna

unread,
Jun 20, 2016, 9:47:39 AM6/20/16
to astro...@googlegroups.com
Hi Erik,

Thanks for your work on this.

On Jun 20, 2016, at 8:38 AM, Erik Bray <erik....@gmail.com> wrote:

> There are two main reasons (both documented, I think) that the implementation is so complicated: the first is that it wraps CFITSIO for compression support. This is the *only* area where CFITSIO is used, and it's being wrapped in a way that was never really intended, as PYFITS was never design around CFITSIO in the first place. There's no reason it needs to use CFITSIO though. The compression algorithms can be taken on their own, and the rest of the relevant bits rewritten in some combination of Python and Cython. It's not particularly hard to implement, it's just never been a priority.

I'm going to play devil's advocate - why *not* wrap cfitsio? It will be faster than a pure Python implementation, easier than coding it in Cython, and we know it will work. What is the benefit to not have cfitsio as a dependency? Astropy has many examples of C code underneath.

Cheers,
Demitri

_________________________________________
Demitri Muna
http://muna.com

Department of Astronomy
Il Ohio State University

My Projects:
http://nightlightapp.io
http://trillianverse.org
http://scicoder.org


Erik Bray

unread,
Jun 21, 2016, 10:29:19 AM6/21/16
to astropy-dev
On Mon, Jun 20, 2016 at 3:47 PM, Demitri Muna <demitr...@gmail.com> wrote:
> Hi Erik,
>
> Thanks for your work on this.
>
> On Jun 20, 2016, at 8:38 AM, Erik Bray <erik....@gmail.com> wrote:
>
>> There are two main reasons (both documented, I think) that the implementation is so complicated: the first is that it wraps CFITSIO for compression support. This is the *only* area where CFITSIO is used, and it's being wrapped in a way that was never really intended, as PYFITS was never design around CFITSIO in the first place. There's no reason it needs to use CFITSIO though. The compression algorithms can be taken on their own, and the rest of the relevant bits rewritten in some combination of Python and Cython. It's not particularly hard to implement, it's just never been a priority.
>
> I'm going to play devil's advocate - why *not* wrap cfitsio? It will be faster than a pure Python implementation, easier than coding it in Cython, and we know it will work. What is the benefit to not have cfitsio as a dependency? Astropy has many examples of C code underneath.

As I wrote, the way CFITSIO is designed doesn't mesh well with the
rest of PyFITS. Just to give one example, it manages the entire FITS
file on its own, including the header, whereas PyFITS manages the
header separately. It's not easy to wrap CFITSIO in such a way that
it can interact with raw data in memory. That's partly why the
existing interface to CFITSIO is so complicated--it really just needs
to be able to compress / decompress a data array using the FITS tile
compression algorithms, without the assumption that those arrays are
wrapped in a FITS file. In other words, CFITSIO does not do enough to
abstract the internal data structure away from FITS.

Other than the compression algorithms themselves, which are written in
C and easy enough to wrap, the main functionality extracted from
CFITSIO is the actual code for dividing an array up into tiles of a
given tile size (determined from the header). This is fairly easy to
rewrite--in fact easier to write in Python or Cython than it is in C.

At the end of the day it would be less of a maintenance hassle.
CFITSIO is also a moving target without well defined versioning
semantics. Its ABI can change from version to version. That said,
for most of the time I maintained it, maintaining that interface was
still less hassle than rewriting it, though I would have preferred to
rewrite it if I had the time.

Best,
Erik

Thomas Robitaille

unread,
Jun 24, 2016, 1:28:59 PM6/24/16
to astropy-dev mailing list, William Pence, Erik Bray, stephe...@lbl.gov
Hi all,

We have included John's fix for the bug along with a regression test
in Astropy 1.2.1 (https://github.com/astropy/astropy/pull/5125).

Stephen: could you confirm whether this also fixes the issues you were
seeing with DESI?

Thanks,
Tom

Demitri Muna

unread,
Jun 28, 2016, 5:22:47 AM6/28/16
to astro...@googlegroups.com, Erik Bray
Hi Erik,

Comments below.

On Jun 21, 2016, at 3:29 PM, Erik Bray <erik....@gmail.com> wrote:

As I wrote, the way CFITSIO is designed doesn't mesh well with the
rest of PyFITS.  Just to give one example, it manages the entire FITS
file on its own, including the header, whereas PyFITS manages the
header separately.  It's not easy to wrap CFITSIO in such a way that
it can interact with raw data in memory.

I don't understand this comment. I am of course using CFITSIO under the hood for Nightlight, but I've wrapped all of the functionality around Objective-C classes. I definitely don't expose anything specific to that library (data structures, etc.) to my higher-level code. I don't see why the same thing can't be done in Python. In fact Erin Sheldon more or less does this with this fitsio library. This will allow much more fine grained control by both the library and the user - see the recent thread on the difficulty on opening extremely large files.

"Managing the header" should mean reading it from the file (using Python or CFITSIO, shouldn't matter), keeping and manipulating it in memory, then writing it back to disk when needed by the wrapper routine. One chooses the details of that independently of everything else.

That's partly why the
existing interface to CFITSIO is so complicated--it really just needs
to be able to compress / decompress a data array using the FITS tile
compression algorithms, without the assumption that those arrays are
wrapped in a FITS file.  In other words, CFITSIO does not do enough to
abstract the internal data structure away from FITS.

I agree, there is serious need for the library to be factored into a library for just this sort of use case. I think a good way forward would be for us to advocate to those responsible to free up some funding for this to benefit the whole community. That will require a little organization - I don't even know who to start this conversation with.

At the end of the day it would be less of a maintenance hassle.
CFITSIO is also a moving target without well defined versioning
semantics.  Its ABI can change from version to version.

Presumably you compile CFITSIO each time there is a new version.

That said,
for most of the time I maintained it, maintaining that interface was
still less hassle than rewriting it, though I would have preferred to
rewrite it if I had the time.

That's one of my main points - the interface (however not ideal) is far, far easier to maintain. The library is tested and does work, and creating your own implementation means you have to repeat that work. Your job (nor mine) is not to rewrite/refactor/whatever CFITSIO, but it is someone's, and we should start the conversation to get them to do that. (I want to specifically emphasize that I'm not referring to Bill Pence here who has put a tremendous effort into the library. He has been and continues to be extremely helpful in supporting the library. I am referring to the lack of available funding to modernize the library.)

Cheers,
Demitri



Reply all
Reply to author
Forward
0 new messages