Perry already addressed some of the political/practical issues. So I
just want to respond a bit more to some of the technical issues you
raised:
On Thu, Jul 18, 2013 at 2:59 PM, Demitri Muna <demitr...@gmail.com> wrote:
> Hi,
>
> On Jul 18, 2013, at 11:38 AM, Erik Bray <erik....@gmail.com> wrote:
>
> That sounds a little better! Thanks for this data point... Now that I
> think about it, this is a dev issue, as it's relevant to issue #520
> [1]. Though you never told me whether this was Python 2 or Python 3,
> which is also sadly relevant in this case.
>
>
> Sorry, Python 2.7.
That's *really* odd then, because the string-related issue I was
thinking of applies only to Python 3. It has to do with the fact that
on Python 3, Numpy string arrays (that is, with the S dtype) are
treated as arrays of byte strings. In order to provide a consistent
interface these need to be converted to arrays of unicode strings
(the U dtype). Unfortunately, so far I've found no way to do this
without copying the entire array, even if the text is all ASCII. I
don't remember for sure, but I think this is because Numpy's unicode
dtype internally encodes all characters as four-byte UCS-4 (allowing
variable-width encodings like UTF-8 would be difficult for Numpy since
it would involve variable strides, but for encodings with a fixed
character width I think Numpy should allow more flexibility *sigh*).
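Just to illustrate the copy I mean (a toy example, not PyFITS
internals; the array contents here are made up):

    import numpy as np

    # Converting an S-dtype array of byte strings to the U dtype always
    # allocates a new array, even when the contents are pure ASCII.
    b = np.array([b'NGC1275', b'M31', b'M87'], dtype='S7')
    u = b.astype('U7')     # full copy; the U dtype stores UCS-4 internally
    print(u.base is b)     # False -- u is a fresh array, not a view of b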
So I will have to do a little more exploring as to what exactly was
slowing down the reading of your table's rows, since your columns
weren't using BSCALE/BZERO either.
> To explain a little more about what this line means: When loading a
> table in PyFITS it has always used the numpy.rec.recarray format for
> accessing the table as a record array. But as we all know all too
> well, FITS has a number of quirks in its data format, and when just
> reading the raw data directly off disk (as we do with a recarray) we
> have to use some hacks that are implemented in the FITS_rec class,
> which is a subclass of recarray. Some of these hacks (currently)
> require reading an entire column into memory in order to apply a
> transformation to the data in that column. String columns often need
> to be transformed in this manner, especially on Python 3, since they
> need to be transformed from arrays of byte strings to arrays of
> unicode strings.
>
>
> It sounds to me like this is less of a quirk of the FITS format and more of
> trying to shoehorn the data into a numpy recarray. (No judgement!) Obviously
> trying to read the entire table (here, a 1.8GB table) is going to be slow.
> And we know that astronomers aren't going to stop making gigantic FITS
> files… ever.
As Perry wrote, it's a little of both. For the more common data types
(unscaled floats and integers, and strings, unicode issues aside) the
use of recarray is sufficient and easy to implement. You run into
trouble, though, when it comes to data that needs to be rescaled, or
quirks that really shouldn't exist in this day and age like zero-width
columns. Those things break some of Numpy's basic assumptions, and it
gets even worse when you're trying to read from a file with mmap to
avoid using up too much physical memory.
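As a rough sketch of why the rescaled case is the awkward one (the
dtype and scale values here are made up):

    import numpy as np

    # A raw big-endian 16-bit integer column, as it would be stored on disk...
    raw = np.array([100, 200, 300], dtype='>i2')
    bscale, bzero = 1.5, 32768.0

    # ...has to be scaled to its physical values.  The result is float64,
    # so it can no longer be a view of the mmapped data -- the whole column
    # ends up copied and converted in memory.
    physical = raw * bscale + bzero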
> Not that I don't agree with you that the FITS format is awful (I'm
> writing a desktop FITS viewer, so believe me, I know).
>
> Thanks for the explanation. I tried to go through the documentation and
> found the "view" numpy method, but I still don't really see how I would have
> come up with that solution (or how to explain it to my students).
I don't know either. I suppose this could be discussed in the
documentation somewhere, but one needs to be aware of the
implications. If you re-view the table with the recarray type then
you lose automatic conversion from the raw values in the FITS file to
their actual logical values. In the case of simpler tables like yours
it shouldn't have been an issue at all, so there might be a bug
somewhere causing unnecessary slowdowns.
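For reference, the workaround I suggested boils down to something like
this (the file name is made up; memmap keeps the whole table from being
read up front):

    import numpy as np
    from astropy.io import fits

    hdul = fits.open('big_table.fits', memmap=True)
    table = hdul[1].data                 # FITS_rec, a recarray subclass
    raw = table.view(np.rec.recarray)    # raw stored values, bypassing the
                                         # FITS_rec conversions described above
    hdul.close()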
> The main thrust of my plans to overhaul the PyFITS (and by extension
> astropy.io.fits) table code is to do away with the use of
> numpy.rec.recarray and treat each column as a separate array wrapped
> in a class that can perform any necessary transformations on the fly
> rather than over the entire column at once (more like how CFITSIO
> works).
>
>
> I don't know how you are reading the FITS file under the hood, but cfitsio
> can't read columns; data is written as rows, so reading a column means
> sequentially reading the whole file. This is grossly inefficient in cfitsio:
>
> for idx, c in enumerate(<fits columns>):
>     array[c] = <read column idx>
Sorry, my writing was unclear. What I was saying is that CFITSIO will
convert the column data on the fly as you read over the rows. That
is, by default it actually reads several rows at a time into a buffer,
and if any transformations need to be performed (such as BSCALE) it
also applies them in that buffer. PyFITS' most severe limitation, on
the other hand, is that if a column needs to be transformed it reads
the *entire* column into memory and performs the transformation on it,
even if all you need is one row! This is fine for tables under, say,
100 MB but beyond that it gets pretty bad. This is the main issue I
want to fix but it's a rather involved undertaking since Numpy is not
very good at this sort of thing by itself. Though I do have ideas on
how to address the issue.
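To sketch the difference (purely illustrative; the function, names, and
buffer size are all made up):

    def iter_scaled(raw_column, bscale, bzero, buffer_rows=1000):
        # Convert only `buffer_rows` values at a time, the way CFITSIO
        # buffers rows, instead of transforming the entire column up front.
        for start in range(0, len(raw_column), buffer_rows):
            chunk = raw_column[start:start + buffer_rows]
            yield chunk * bscale + bzero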
> I see different use cases here: iterating by column and iterating by row.
> How you read the file (since it's huge and we're not reading it into memory)
> depends on what you want. I'd suggest that astropy.io.fits provide two
> iterators, a row-based iterator and a column-based iterator. While that's
> not the most elegant thing here, having the user provide the hint on how
> they are iterating over the file can tell us how to optimize the reading. My
> case was *very* simple: read one row at a time, but it "failed".
You can basically already do this in pyfits. But you don't gain
anything because the data in FITS files is stored in rows. A future
FITS replacement format should allow column-based or row-based storage
depending on how the data is more likely to be accessed.
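In current pyfits/astropy.io.fits terms that looks roughly like this
(file name made up); note that the column access still has to stride
across every row on disk:

    from astropy.io import fits

    hdul = fits.open('survey_table.fits', memmap=True)
    data = hdul[1].data
    some_rows = data[:10]        # row-based access: cheap, contiguous reads
    one_column = data.field(0)   # column-based: strides over the whole file,
                                 # since the data is stored row by row
    hdul.close()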
Erik