Fastest way to read from large FITS file?


Demitri Muna

Jul 17, 2013, 5:34:25 PM
to astro...@googlegroups.com
Hi,

I'm writing a script to convert a 1.9GB FITS file into an SQLite file. The file is a single table of ~1.8M rows. The fastest I've been able to import into the database is about 82 records/sec, which strikes me as pretty low. I've squeezed everything I can out of the database side (written with SQLAlchemy and raw SQL INSERT statements, which, done right, perform about the same). I'm pasting my code below, but I've left out the database stuff so it won't actually run as-is.

The datafile is here if you want to play with it:

http://data.sdss3.org/sas/dr9/sdss/sspp/ssppOut-dr9.fits

Is there a faster way to read the data? Reading the whole file into memory is a non-starter here (even if I do have it).

Cheers,
Demitri

_________________________________________
Demitri Muna

Department of Astronomy
Ohio State University

http://scicoder.org/

---

from astropy.io import fits

sspp_hdu_list = fits.open(filename, memmap=True)
field_names = [x.lower() for x in sspp_hdu_list[1].columns.names]
table = sspp_hdu_list[1].data  # for convenience

id = 0  # (shadows the built-in id(), but kept for brevity)
rows = []
for row in table[0:1000+1]:
    id += 1
    row_dict = dict(zip(field_names, row))
    row_dict["id"] = id

    rows.append(row_dict)

    if id % 500 == 0:
        print("Importing row {0}...".format(id))
        engine.execute(
            SSPP.__table__.insert(),
            rows
        )
        rows = []

if len(rows) > 0:
    engine.execute(
        SSPP.__table__.insert(),
        rows
    )


Wolfgang Kerzendorf

Jul 17, 2013, 5:48:50 PM
to astro...@googlegroups.com
The I/O of SQLite is pretty slow (especially, it seems, the creation of tuples) - I suggest dumping things into HDF5 and then using pandas for the querying if speed is an issue.

Cheers,
Wolfgang
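[For readers landing on this thread: the HDF5 route described above starts from the table's record array. A minimal sketch, assuming the table has already been read into a structured numpy array; the column names here are invented.]

```python
import numpy as np
import pandas as pd

# Stand-in for the FITS table's record array (column names invented).
table = np.array([(1, 5000.0), (2, 5100.0)],
                 dtype=[("id", "i4"), ("teff", "f8")])

# pandas can wrap a structured array directly...
df = pd.DataFrame.from_records(table)

# ...and writing to HDF5 is one call, though it needs the optional
# PyTables dependency, so it is left commented out here:
# df.to_hdf("sspp.h5", key="sspp", mode="w")
```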

Demitri Muna

Jul 17, 2013, 6:05:01 PM
to astro...@googlegroups.com

On Jul 17, 2013, at 5:48 PM, Wolfgang Kerzendorf <wkerz...@gmail.com> wrote:

The I/O of SQLite is pretty slow (especially, it seems, the creation of tuples) - I suggest dumping things into HDF5 and then using pandas for the querying if speed is an issue.

I'm very happy with SQLite, but thanks for the suggestions. The problem I'm having is with reading the FITS file.

Wolfgang Kerzendorf

Jul 17, 2013, 6:15:24 PM
to astro...@googlegroups.com
Sorry, I should have made this clearer: I doubt that the problem is on the FITS reading end - I suspect it's in the importing into the SQLite database. A way out is other storage formats (however, if you need to do large joins/queries on disk then SQLite is probably your best shot).

Cheers
 W

Erik Bray

Jul 17, 2013, 6:34:33 PM
to astro...@googlegroups.com
I think the astropy mailing list (not astropy-dev) would probably be a
better place for this discussion. But is this Python 2 or Python 3?

Erik Bray

Jul 17, 2013, 6:40:51 PM
to astro...@googlegroups.com
In any case--this is a bit of a long shot (the file is still
downloading for me) but first try doing something like

table = table.view(np.recarray)

in order to bypass some of the FITS-specific absurdity.
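[To see why that one line helps: FITS_rec is a subclass of numpy.rec.recarray, and `.view(np.recarray)` re-views the same buffer as the plain base class, bypassing the subclass's per-access conversion hooks. A small sketch with an ordinary structured array standing in for the FITS data:]

```python
import numpy as np

# A structured array standing in for the FITS table's underlying buffer.
table = np.array([(1, 2.5), (2, 3.5)],
                 dtype=[("id", "i4"), ("val", "f8")])

# .view() reinterprets the same memory as another array class: no copy,
# same fields, but any subclass-specific access logic is skipped.
plain = table.view(np.recarray)

assert plain.id[0] == 1        # recarray exposes fields as attributes too
assert plain.base is table     # shares the original buffer, nothing copied
```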

Demitri Muna

Jul 17, 2013, 7:15:49 PM
to astro...@googlegroups.com
Hi,

On 17 Jul 2013, at 6:40 PM, Erik Bray <erik....@gmail.com> wrote:

In any case--this is a bit of a long shot (the file is still
downloading for me) but first try doing something like

table = table.view(np.recarray)

in order to bypass some of the FITS-specific absurdity.

Well, adding that one line jumped the rate from 82 records/sec to 1440 records/sec, dropping my import from 6 hours to 21 minutes. Thanks!

So... what does that line mean?


On 17 Jul 2013, at 6:15 PM, Wolfgang Kerzendorf <wkerz...@gmail.com> wrote:

Sorry, I should have made this clearer: I doubt that the problem is on the FITS reading end - I suspect it's in the importing into the SQLite database.

I timed the actual import itself (excluding the FITS processing) and was getting ~2000 records/sec. SQLite when run properly is actually pretty fast.
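[For anyone wondering what "run properly" means here: batch the INSERTs and wrap them in a single transaction. A sketch with the stdlib driver; the table and column names are invented:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sspp (id INTEGER PRIMARY KEY, teff REAL)")

rows = [(i, 5000.0 + i) for i in range(1000)]

# One transaction for the whole batch: without it, SQLite commits
# (and syncs to disk) after every single INSERT.
with conn:
    conn.executemany("INSERT INTO sspp (id, teff) VALUES (?, ?)", rows)
```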

John K. Parejko

Jul 17, 2013, 7:19:50 PM
to astro...@googlegroups.com
(agree that astropy is better than astropy-dev for this question).

Another possibility is to try Erin's fitsio.py:

https://github.com/esheldon/fitsio

It's very fast, and you could read slices at a time, and even try buffering them.

John

Erik Bray

Jul 18, 2013, 11:38:21 AM
to astro...@googlegroups.com
On Wed, Jul 17, 2013 at 7:15 PM, Demitri Muna <demitr...@gmail.com> wrote:
> Hi,
>
> On 17 Jul 2013, at 6:40 PM, Erik Bray <erik....@gmail.com> wrote:
>
> In any case--this is a bit of a long shot (the file is still
> downloading for me) but first try doing something like
>
> table = table.view(np.recarray)
>
> in order to bypass some of the FITS-specific absurdity.
>
>
> Well, adding that one line jumped the rate from 82 records/sec to 1440
> records/sec, dropping my import from 6 hours to 21 minutes. Thanks!
>
> So... what does that line mean?

That sounds a little better! Thanks for this data point...now that I
think about it this is a dev issue, as it's relevant to issue #520
[1]. Though you never did tell me if this was Python 2 or Python 3,
which is also sadly relevant in this case.

To explain a little more about what this line means: When loading a
table in PyFITS it has always used the numpy.rec.recarray format for
accessing the table as a record array. But as we all know all too
well FITS has a number of quirks in its data format, and just reading
the raw data directly off disk (as we are doing with a recarray), we
have to use some hacks that are implemented in the FITS_rec class,
which is a subclass of recarray. Some of these hacks (currently)
require reading an entire column into memory in order to apply a
transformation to the data in that column. String columns often need
to be transformed in this manner, especially on Python 3, since they
need to be converted from arrays of byte strings to arrays of
unicode strings.

The main thrust of my plans to overhaul the PyFITS (and by extension
astropy.io.fits) table code is to do away with the use of
numpy.rec.recarray and treat each column as a separate array wrapped
in a class that can perform any necessary transformations on the fly
rather than over the entire column at once (more like how CFITSIO
works). Unfortunately the time available to me is limited and I've
been asked to focus on other work, which is fine because we want to
replace FITS anyways...

In light of this I think I'll hold off on merging #520 until I can do
more exploration of the performance impact of string columns, and
maybe find a better way to handle them.


On Wed, Jul 17, 2013 at 7:19 PM, John K. Parejko <pare...@gmail.com> wrote:
> (agree that astropy is better than astropy-dev for this question).
>
> Another possibility is to try Erin's fitsio.py:
>
> https://github.com/esheldon/fitsio
>
> It's very fast, and you could read slices at a time, and even try buffering them.

+1 to fitsio for applications where performance is critical. It's a
Python wrapper for CFITSIO, so it's lower level and not as advanced
at dealing with things like headers, but if raw FITS I/O performance
is the primary concern it's very good.

Erik



[1] https://github.com/astropy/astropy/pull/520

Demitri Muna

Jul 18, 2013, 2:59:29 PM
to astro...@googlegroups.com
Hi,

On Jul 18, 2013, at 11:38 AM, Erik Bray <erik....@gmail.com> wrote:

That sounds a little better!  Thanks for this data point...now that I
think about it this is a dev issue, as it's relevant to issue #520
[1].  Though you never did tell me if this was Python 2 or Python 3,
which is also sadly relevant in this case.

Sorry, Python 2.7.

To explain a little more about what this line means: When loading a
table in PyFITS it has always used the numpy.rec.recarray format for
accessing the table as a record array.  But as we all know all too
well FITS has a number of quirks in its data format, and just reading
the raw data directly off disk (as we are doing with a recarray), we
have to use some hacks that are implemented in the FITS_rec class,
which is a subclass of recarray.  Some of these hacks (currently)
require reading an entire column into memory in order to apply a
transformation to the data in that column.  String columns often need
to be transformed in this manner, especially on Python 3, since they
need to be converted from arrays of byte strings to arrays of
unicode strings.

It sounds to me like this is less of a quirk of the FITS format and more of trying to shoehorn the data into a numpy recarray. (No judgement!) Obviously trying to read the entire table (here, a 1.8GB table) is going to be slow. And we know that astronomers aren't going to stop making gigantic FITS files… ever.

Not that I disagree that the FITS format is awful (I'm writing a desktop FITS viewer, so believe me, I know).

Thanks for the explanation. I tried to go through the documentation and found the "view" numpy method, but I still don't really see how I would have come up with that solution (or how to explain it to my students).

The main thrust of my plans to overhaul the PyFITS (and by extension
astropy.io.fits) table code is to do away with the use of
numpy.rec.recarray and treat each column as a separate array wrapped
in a class that can perform any necessary transformations on the fly
rather than over the entire column at once (more like how CFITSIO
works).

I don't know how you are reading the FITS file under the hood, but cfitsio can't read columns; data is written as rows, so reading a column means sequentially reading the whole file. This is grossly inefficient in cfitsio:

for idx, c in enumerate(<fits columns>):
    array[c] = <read column idx>

I see different use cases here: iterating by column and iterating by row. How you read the file (since it's huge and we're not reading it into memory) depends on what you want. I'd suggest that astropy.io.fits provide two iterators, a row-based iterator and a column-based iterator. While that's not the most elegant thing here, having the user provide the hint on how they are iterating over the file can tell us how to optimize the reading. My case was *very* simple: read one row at a time, but it "failed".
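[In the meantime, the row-based case can be approximated with a small wrapper that reads the table in contiguous slices; `iter_rows` below is a hypothetical helper, not an astropy API:]

```python
import numpy as np

def iter_rows(table, chunk_size=10000):
    """Yield rows one at a time while reading `table` (anything
    sliceable: a FITS table's .data, a structured array, ...) in
    contiguous chunks, so a memory-mapped file is touched in large
    sequential reads rather than row by row."""
    for start in range(0, len(table), chunk_size):
        chunk = np.asarray(table[start:start + chunk_size])
        for row in chunk:
            yield row

# Toy table: 25 rows, read in chunks of 10.
data = np.zeros(25, dtype=[("id", "i4"), ("val", "f8")])
assert sum(1 for _ in iter_rows(data, chunk_size=10)) == 25
```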

Unfortunately the time available to me is limited and I've
been asked to focus on other work, which is fine because we want to
replace FITS anyways…

The format?! It's not going away for a long time, so we must support it. I maintain that while FITS is not perfect under the hood, the creation of good tools makes it usable. We can't exactly say to end users, "Oh, a FITS file? You should use another format." :)

And I still think we (as a community) should "fix" FITS and propose changes to the standard.

In light of this I think I'll hold off on merging #520 until I can do
more exploration of the performance impact of string columns, and
maybe find a better way to handle them.

I think this is good, but again, my use case was not column based at all.

On Wed, Jul 17, 2013 at 7:19 PM, John K. Parejko <pare...@gmail.com> wrote:
(agree that astropy is better than astropy-dev for this question).

Another possibility is to try Erin's fitsio.py:

https://github.com/esheldon/fitsio

It's very fast, and you could read slices at a time, and even try buffering them.

+1 to fitsio for applications where performance is critical.  It's a
Python wrapper for CFITSIO, so it's lower level and not as advanced
at dealing with things like headers, but if raw FITS I/O performance
is the primary concern it's very good.

This is good to know about (and I did). However, the program I was writing was a solution to an exercise I gave last week at SciCoder, and I don't think it's ideal to tell my students (or the average user) that if they want to iterate row by row over a ~2GB file they shouldn't use astropyfits. (I'm going to start calling it that since it's no longer PyFITS and I can't just call it "fits".)

Perry Greenfield

Jul 18, 2013, 3:46:03 PM
to astro...@googlegroups.com

On Jul 18, 2013, at 2:59 PM, Demitri Muna wrote:
>
> It sounds to me like this is less of a quirk of the FITS format and more of trying to shoehorn the data into a numpy recarray. (No judgement!) Obviously trying to read the entire table (here, a 1.8GB table) is going to be slow. And we know that astronomers aren't going to stop making gigantic FITS files… ever.
>
It all depends on your viewpoint. If you don't use a numpy recarray, then you introduce new problems, complexities, or performance problems. It's a matter of tradeoffs. There are ways of using FITS tables that avoid these problems. Basically the most efficient thing to do is avoid use of these special cases. It is possible (IIRC) to access the raw recarray and deal with the issues directly. Don't use BSCALE, BZERO for columns (there's rarely a good reason to these days). Leave booleans in the character format and test for them directly. Test for special invalid values directly. Do this, and you avoid a lot of the problems.
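[For binary-table columns the scaling keywords are TSCALn/TZEROn; BSCALE/BZERO are their image-HDU counterparts. A quick, hypothetical check for whether a table will hit the slow scaled-column path, treating the header as a plain mapping, as an astropy.io.fits Header also allows:]

```python
def has_scaled_columns(header, n_columns):
    # `header` is any mapping of FITS keyword -> value; an
    # astropy.io.fits Header works, a plain dict is used below.
    return any("TSCAL{0}".format(i) in header or "TZERO{0}".format(i) in header
               for i in range(1, n_columns + 1))

assert has_scaled_columns({"TTYPE1": "teff", "TSCAL1": 0.01}, 1)
assert not has_scaled_columns({"TTYPE1": "teff"}, 1)
```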

>
>
>> Unfortunately the time available to me is limited and I've
>> been asked for focus on other work, which is fine because we want to
>> replace FITS anyways…
>
> The format?! It's not going away for a long time, so we must support it. I maintain that while FITS is not perfect under the hood, the creation of good tools makes it usable. We can't exactly say to end users, "Oh, a FITS file? You should use another format." :)
>
I'm the one asking Erik to reduce the time spent supporting FITS. We are looking for a different format to solve some of the problems. We expect FITS to be around a long time too, but I think the effort is more profitably spent exploring better alternatives than sinking huge efforts into a very inadequate format. We aren't going to tell people they can't use FITS, of course. We'll be using it too for data files (we are required to). But for some problems, it just isn't sufficient.

> And I still think we (as a community) should "fix" FITS and propose changes to the standard.
>
Good luck. I (and many others) have had little success in doing that. You are welcome to try. But FITS needs a lot more than small tweaks to fix it. After stagnating for so many years, it's my humble opinion that a completely different approach is needed (I think they missed the opportunity to evolve).

Perry



Demitri Muna

Jul 18, 2013, 4:28:59 PM
to astro...@googlegroups.com
Hi Perry,

On Jul 18, 2013, at 3:46 PM, Perry Greenfield <stsci...@gmail.com> wrote:

It sounds to me like this is less of a quirk of the FITS format and more of trying to shoehorn the data into a numpy recarray. (No judgement!) Obviously trying to read the entire table (here, a 1.8GB table) is going to be slow. And we know that astronomers aren't going to stop making gigantic FITS files… ever.

It all depends on your viewpoint. If you don't use a numpy recarray, then you introduce new problems, complexities, or performance problems. It's a matter of tradeoffs. There are ways of using FITS tables that avoid these problems. Basically the most efficient thing to do is avoid use of these special cases. It is possible (IIRC) to access the raw recarray and deal with the issues directly. Don't use BSCALE, BZERO for columns (there's rarely a good reason to these days). Leave booleans in the character format and test for them directly. Test for special invalid values directly. Do this, and you avoid a lot of the problems.

I think our job in the astro-dev community is to shield the poor graduate student astronomer who has to deal with these issues as much as possible. This is unfortunately the equivalent of writing a proper Javascript library so that individual developers don't have to waste their lives writing ("if IE do this if Safari do that… etc."). If you want to do something complex then yes, peek under the hood. But simple things should be simple, and iterating row by row of a 2GB FITS file is an extremely simple/common request. As I mentioned, I'm OK saying "use this python iterator method if you are reading row by row, this one by columns, etc.". I'm less OK saying "learn the intricacies of recarrays", because frankly, people will just stick with IDL, and that's the bigger beastie than FITS.

Unfortunately the time available to me is limited and I've
been asked to focus on other work, which is fine because we want to
replace FITS anyways…

The format?! It's not going away for a long time, so we must support it. I maintain that while FITS is not perfect under the hood, the creation of good tools makes it usable. We can't exactly say to end users, "Oh, a FITS file? You should use another format." :)

I'm the one asking Erik to reduce the time spent supporting FITS. We are looking for a different format to solve some of the problems. We expect FITS to be around a long time too, but I think the effort is more profitably spent exploring better alternatives than sinking huge efforts into a very inadequate format. We aren't going to tell people they can't use FITS, of course. We'll be using it too for data files  (we are required to). But for some problems, it just isn't sufficient.

I know, we discussed this last year at the Astropy meeting. I don't think the aims are mutually exclusive - we should encourage people to use a better format upstream, but the time saved across the community by making better (Python) tools in the meantime is a *huge* multiplier across the astronomy community.

Yes, for some problems, the effort isn't worth it. But let's get the simple/common use cases well sorted out. Saying "well, performance is bad with BSCALE/BZERO columns because of the nature of the format" is reasonable to say; "well, there are strings in the file" is not so much.

Is there a clear horse in the lead in your opinion as a replacement? If so, are the tools to create/use the format *top notch*? Are there IDL equivalents? I think all these things need to be in place before we make recommendations (*especially* the IDL tools so that the format can be integrated into existing pipelines/workflows). The people on this list are the early adopters, coders, and influencers. The people I teach in my SciCoder workshop are the same. Give something that we all can proselytize, and I'm sure we will. But the tools have to be *complete* first.

I claim that's one reason that VO as a format isn't used all that much. The specs are over-engineered, but much more importantly, the tools to create/use the formats are pretty bad (apologies if there are those on this list that use/write them). The tools to read/create FITS files are much better/well known, so there's no impetus to move. (And they're not great.)

All I ask is that we don't make things more painful than they have to be for those in the trenches who have no choice in the matter.

And I still think we (as a community) should "fix" FITS and propose changes to the standard.

Good luck. I (and many others) have had little success in doing that. You are welcome to try. But FITS needs a lot more than small tweaks to fix it. After stagnating for so many years, it's my humble opinion that a completely different approach is needed (I think they missed the opportunity to evolve).

I once asked the cfitsio author (whose name is easily looked up) about the reasoning behind the default internal buffer size chosen in a particular low-level read function. The size looked like it was appropriate for computers two decades ago (or more), and I wanted to see what might be a better choice on modern computers, particularly with SSD drives. I was getting significant improvement by increasing the value, but wanted to see how it might be more intelligently chosen given the internal algorithm. The answer was "it depends on the data", which is hard to turn into an algorithm. He also admitted that he has limited time to work on cfitsio.

So yes, FITS isn't going to get better. I'd be open to creating/proposing a FITS2 format independent of FITS, or something better, but we as a dev community need something to hang our hat on first. FITS has been great for a very long time and seasoned astronomers will be loath to give it up, so the alternative has to be amazing and well supported. As Mike Blanton once said to me, "I can open an ASCII file from the 1970s, and I can open a FITS file from the 1970s." This is what we are competing with.

Perry Greenfield

Jul 18, 2013, 5:09:13 PM
to astro...@googlegroups.com

On Jul 18, 2013, at 4:28 PM, Demitri Muna wrote:

> Hi Perry,
>
> On Jul 18, 2013, at 3:46 PM, Perry Greenfield <stsci...@gmail.com> wrote:
>
>>> It sounds to me like this is less of a quirk of the FITS format and more of trying to shoehorn the data into a numpy recarray. (No judgement!) Obviously trying to read the entire table (here, a 1.8GB table) is going to be slow. And we know that astronomers aren't going to stop making gigantic FITS files… ever.
>>>
>> It all depends on your viewpoint. If you don't use a numpy recarray, then you introduce new problems, complexities, or performance problems. It's a matter of tradeoffs. There are ways of using FITS tables that avoid these problems. Basically the most efficient thing to do is avoid use of these special cases. It is possible (IIRC) to access the raw recarray and deal with the issues directly. Don't use BSCALE, BZERO for columns (there's rarely a good reason to these days). Leave booleans in the character format and test for them directly. Test for special invalid values directly. Do this, and you avoid a lot of the problems.
>
> I think our job in the astro-dev community is to shield the poor graduate student astronomer who has to deal with these issues as much as possible. This is unfortunately the equivalent of writing a proper Javascript library so that individual developers don't have to waste their lives writing ("if IE do this if Safari do that… etc."). If you want to do something complex then yes, peek under the hood. But simple things should be simple, and iterating row by row of a 2GB FITS file is an extremely simple/common request. As I mentioned, I'm OK saying "use this python iterator method if you are reading row by row, this one by columns, etc.". I'm less OK saying "learn the intricacies of recarrays", because frankly, people will just stick with IDL, and that's the bigger beastie than FITS.
>
I understand. But we have limited resources, and right now, we can't do both (fix every FITS issue, and work on a usable alternative that we need very soon for other purposes). If someone else would like to tackle this without making pyfits too hard to maintain, we can consider that (but it's likely a significant job). As far as I can tell, there isn't any perfect FITS library; each has its limitations, even the IDL one (there are a number of advantages that pyfits has over the IDL libraries).


>>>> Unfortunately the time available to me is limited and I've
>>>> been asked for focus on other work, which is fine because we want to
>>>> replace FITS anyways…
>>>
>>> The format?! It's not going away for a long time, so we must support it. I maintain that while FITS is not perfect under the hood, the creation of good tools makes it usable. We can't exactly say to end users, "Oh, a FITS file? You should use another format." :)
>>>
>> I'm the one asking Erik to reduce the time spent supporting FITS. We are looking for a different format to solve some of the problems. We expect FITS to be around a long time too, but I think the effort is more profitably spent exploring better alternatives than sinking huge efforts into a very inadequate format. We aren't going to tell people they can't use FITS, of course. We'll be using it too for data files (we are required to). But for some problems, it just isn't sufficient.
>
> I know, we discussed this last year at the Astropy meeting. I don't think the aims are mutually exclusive - we should encourage people to use a better format upstream, but the time saved across the community by making better (Python) tools in the meantime is a *huge* multiplier across the astronomy community.
>
The aims aren't exclusive, but the resources available make it so in the short run. We have to weigh the relative priorities.

> Yes, for some problems, the effort isn't worth it. But let's get the simple/common use cases well sorted out. Saying "well, performance is bad with BSCALE/BZERO columns because of the nature of the format" is reasonable to say; "well, there are strings in the file" is not so much.
>
> Is there a clear horse in the lead in your opinion as a replacement? If so, are the tools to create/use the format *top notch*? Are there IDL equivalents? I think all these things need to be in place before we make recommendations (*especially* the IDL tools so that the format can be integrated into existing pipelines/workflows). The people on this list are the early adopters, coders, and influencers. The people I teach in my SciCoder workshop are the same. Give something that we all can proselytize, and I'm sure we will. But the tools have to be *complete* first.
>
What we will try is not an existing format, but will be based on existing standards for the basic components, and use standard libraries for those components. Will it have an immediate IDL library? No. This is at least partly exploratory. It may end up used only for limited purposes, but if we do a reasonable job, it could be used for everything FITS is, and more. So initially, Python only. If it seems sensible, then support will appear in other languages. We, of course, are in no position to ensure that. But at least it is an attempt to show an alternative approach. It may fail. But at least we are going to try. FITS has failed us for certain needs. (I can explain if it is desired; perhaps I have already on this list). Regardless of whether it is successful in being viewed as a replacement for FITS, it will be used for at least some information we have to generate (and may end up as a component of FITS files).

> I claim that's one reason that VO as a format isn't used all that much. The specs are over-engineered, but much more importantly, the tools to create/use the formats are pretty bad (apologies if there are those on this list that use/write them). The tools to read/create FITS files are much better/well known, so there's no impetus to move. (And they're not great.)
>
I'd argue that the biggest reason is that VOTable doesn't sufficiently support binary data. That killed it right off the bat (killed in the sense that it is not, and couldn't be, a replacement for FITS).

> All I ask is that we don't make things more painful for the trenches that have no choice in the matter than they have to be.
>
I wish it were so, but evolving software and requirements have continually meant some pain to keep up. We are far from nirvana, still. Sticking with FITS involves pain of its own though. It may not be visible in simple things like opening FITS files, but it shows up in lots of places, and isn't always recognized as being a consequence of the problems with FITS. Things are already painful in the trenches.

>>> And I still think we (as a community) should "fix" FITS and propose changes to the standard.
>>>
>> Good luck. I (and many others) have had little success in doing that. You are welcome to try. But FITS needs a lot more than small tweaks to fix it. After stagnating for so many years, it's my humble opinion that a completely different approach is needed (I think they missed the opportunity to evolve).
>
> I once asked the cfitstio author (whose name is easily looked up) about the reasoning behind the default internal buffer size chosen in a particular low-level read function. The size looked like it was appropriate for computers two decades ago (or more), and I wanted to see what might be a better choice on modern computers, particularly with SSD drives. I was getting significant improvement by increasing the value, but wanted to see how it might be more intelligently chosen given the internal algorithm. The answer was "it depends on the data", which is hard to turn into an algorithm. He also admitted as well that he has limited time to work on cfitsio.
>
> So yes, FITS isn't going to get better. I'd be open to creating/proposing a FITS2 format independent of FITS, or something better, but we as a dev community need something to hang our hat on first. FITS has been great for a very long time and seasoned astronomers will be loathe to give it up, so the alternative has to be amazing and well supported. As Mike Blanton once said to me, "I can open an ASCII file from the 1970s, and I can open a FITS file from the 1970s." This is what we are competing with.
>
Replacing FITS will be difficult, I'm aware :-)

Perry

Erik Bray

Jul 18, 2013, 5:37:36 PM
to astro...@googlegroups.com
Perry already addressed some of the political/practical issues. So I
just want to respond a bit more to some of the technical issues you
raised:

On Thu, Jul 18, 2013 at 2:59 PM, Demitri Muna <demitr...@gmail.com> wrote:
> Hi,
>
> On Jul 18, 2013, at 11:38 AM, Erik Bray <erik....@gmail.com> wrote:
>
> That sounds a little better! Thanks for this data point...now that I
> think about it this is a dev issue, as it's relevant to issue #520
> [1]. Though you did never tell me if this was Python 2 or Python 3,
> which is also sadly relevant in this case.
>
>
> Sorry, Python 2.7.

That's *really* odd then, because the string-related issue I was
thinking of applies only to Python 3. It has to do with the fact that
on Python 3 Numpy string arrays (that is, with the S dtype) are
treated as arrays of bytes strings. In order to provide a consistent
interface this needs to be converted to an array of unicode strings
(the U dtype). Unfortunately, so far I've found no way to do this
without copying the entire array even if the text is all ASCII. I
don't remember for sure, but I think this is because Numpy's unicode
dtype internally encodes all characters as UTF-16 (allowing arbitrary
encodings like UTF-8 would be difficult for Numpy since it would
involve variable strides, but for encodings with fixed character width
I think Numpy should allow more flexibility *sigh*).
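[The cost is easy to demonstrate. One correction worth noting: numpy's fixed-width unicode dtype stores 4 bytes per character (UCS-4), so the converted array is four times larger as well as being a copy:]

```python
import numpy as np

# FITS string columns come off disk as fixed-width bytes ('S' dtype);
# presenting them as text means converting to the 'U' dtype, which is
# unavoidably a full copy.
raw = np.array([b"alpha", b"beta"], dtype="S5")
text = raw.astype("U5")

assert text[0] == "alpha"
assert raw.dtype.itemsize == 5
assert text.dtype.itemsize == 20            # 4 bytes per character (UCS-4)
assert not np.shares_memory(raw, text)      # the data was copied
```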

So I will have to do a little more exploring as to what exactly was
slowing down the reading of your table's rows, because they weren't
using BSCALE/BZERO either.

> To explain a little more about what this line means: When loading a
> table in PyFITS it has always used the numpy.rec.recarray format for
> accessing the table as a record array. But as we all know all too
> well FITS has a number of quirks in its data format, and just reading
> the raw data directly off disk (as we are doing with a recarray), we
> have to use some hacks that are implemented in the FITS_rec class,
> which is a subclass of recarray. Some of these hacks (currently)
> require reading an entire column into memory in order to apply a
> transformation to the data in that column. String columns often need
> to be transformed in this manner, especially on Python 3 since the
> need to be transformed from arrays for bytes strings to arrays of
> unicode strings.
>
>
> It sounds to me like this is less of a quirk of the FITS format and more of
> trying to shoehorn the data into a numpy recarray. (No judgement!) Obviously
> trying to read the entire table (here, a 1.8GB table) is going to be slow.
> And we know that astronomers aren't going to stop making gigantic FITS
> files… ever.

As Perry wrote, it's a little of both. For the more common data types
(unscaled floats and integers, and strings, unicode issues aside) the
use of recarray is sufficient and easy to implement. You run into
trouble, though, with data that needs to be rescaled, or with quirks
that really shouldn't exist in this day and age, like zero-width columns.
Those things break some of Numpy's basic assumptions--even worse when
you're trying to read from a file with mmap and avoid using up too
much physical memory.
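As a pure-Numpy sketch of the rescaling involved (the BSCALE/BZERO values here are made up): the physical value is BZERO + BSCALE * stored, and computing it forces a copy because both the dtype and the contents change, which is exactly what mmap alone can't give you.

```python
import numpy as np

# Hypothetical raw 16-bit column with FITS-style scaling keywords.
stored = np.array([100, 200, 300], dtype=np.int16)
BSCALE, BZERO = 0.5, 1000.0

# The logical (physical) values can't share memory with the raw ones:
# the transformation changes both dtype and contents, so a copy is made.
physical = BZERO + BSCALE * stored.astype(np.float64)

print(physical)  # [1050. 1100. 1150.]
```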

> Not to say I disagree with you that the FITS format is awful (I'm
> writing a desktop FITS viewer, so believe me, I know).
>
> Thanks for the explanation. I tried to go through the documentation and
> found the "view" numpy method, but I still don't really see how I would have
> come up with that solution (or how to explain it to my students).

I don't know either. I suppose this could be discussed in the
documentation somewhere, but one needs to be aware of the
implications. If you re-view the table with the recarray type then
you lose automatic conversion from the raw values in the FITS file to
their actual logical values. In the case of simpler tables like yours
it shouldn't have been an issue at all, so there might be a bug
somewhere causing unnecessary slowdowns.
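For reference, here is what the `view` trick does in pure Numpy (a sketch, not astropy's actual code path): `view` reinterprets the same buffer under a different array class without copying, which is why re-viewing a FITS_rec (a recarray subclass) as a plain recarray drops the conversion layer.

```python
import numpy as np

# A structured array standing in for raw FITS table data.
data = np.array([(1, 2.5), (3, 4.5)], dtype=[("id", "i4"), ("val", "f8")])

# .view changes only the wrapping class, not the bytes: no copy is made,
# and any behavior the subclass added (like FITS_rec's value conversion)
# is bypassed.
rec = data.view(np.recarray)

print(rec.val)           # attribute-style column access: [2.5 4.5]
print(rec.base is data)  # True: shares the original buffer
```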

> The main thrust of my plans to overhaul the PyFITS (and by extension
> astropy.io.fits) table code is to do away with the use of
> numpy.rec.recarray and treat each column as a separate array wrapped
> in a class that can perform any necessary transformations on the fly
> rather than over the entire column at once (more like how CFITSIO
> works).
>
>
> I don't know how you are reading the FITS file under the hood, but cfitsio
> can't read columns; data is written as rows, so reading a column means
> sequentially reading the whole file. This is grossly inefficient in cfitsio:
>
> for idx, c in enumerate(<fits columns>):
>     array[c] = <read column idx>

Sorry, my writing was unclear. What I was saying is that CFITSIO will
convert the column data on the fly as you read over the rows. That
is, by default it actually reads several rows at a time into a buffer,
and if any transformations need to be performed (such as BSCALE) it
also applies them in that buffer. PyFITS' most severe limitation, on
the other hand, is that if a column needs to be transformed it reads
the *entire* column into memory and performs the transformation on it,
even if all you need is one row! This is fine for tables under, say,
100 MB but beyond that it gets pretty bad. This is the main issue I
want to fix but it's a rather involved undertaking since Numpy is not
very good at this sort of thing by itself. Though I do have ideas on
how to address the issue.
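The buffered strategy described above can be sketched in a few lines of Python (the function name, chunk size, and scaling values are hypothetical, not CFITSIO's API): only a bounded number of rows is ever transformed at once, no matter how large the table.

```python
import numpy as np

def iter_scaled_rows(raw, bscale=1.0, bzero=0.0, buffer_rows=1000):
    """Yield physical values chunk by chunk, CFITSIO-style:
    at most `buffer_rows` rows are transformed at any one time."""
    for start in range(0, len(raw), buffer_rows):
        chunk = raw[start:start + buffer_rows]
        yield bzero + bscale * chunk.astype(np.float64)

raw = np.arange(10, dtype=np.int16)
out = np.concatenate(list(iter_scaled_rows(raw, bscale=2.0, buffer_rows=4)))
print(out)  # 0.0 through 18.0 in steps of 2
```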

> I see different use cases here: iterating by column and iterating by row.
> How you read the file (since it's huge and we're not reading it into memory)
> depends on what you want. I'd suggest that astropy.io.fits provide two
> iterators, a row-based iterator and a column-based iterator. While that's
> not the most elegant thing here, having the user provide the hint on how
> they are iterating over the file can tell us how to optimize the reading. My
> case was *very* simple: read one row at a time, but it "failed".

You can basically already do this in pyfits. But you don't gain
anything because the data in FITS files is stored in rows. A future
FITS replacement format should allow column-based or row-based storage
depending on how the data is more likely to be accessed.
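A small Numpy demonstration of why column access over row-stored data is inherently strided (a stand-in for FITS's on-disk layout): the elements of one column are not adjacent in memory, so reading a column means skipping over every row.

```python
import numpy as np

# Row-major records, as FITS stores table data on disk.
rows = np.zeros(4, dtype=[("a", "f8"), ("b", "i4")])

col = rows["a"]
# Each element of the column is separated by a full 12-byte record,
# so the "column" is a non-contiguous, strided view over the rows.
print(col.strides, rows.dtype.itemsize)  # (12,) 12
print(col.flags["C_CONTIGUOUS"])         # False
```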

Erik

John K. Parejko

unread,
Jul 18, 2013, 6:15:45 PM7/18/13
to astro...@googlegroups.com
On 18Jul 2013, at 17:09, Perry Greenfield wrote:

> What we will try is not an existing format, but will be based on existing standards for the basic components, and use standard libraries for those components.

I'm pretty sure introducing an entirely new format is not a good route. The list of dead formats is very, very long.

What do people on this list think of HDF5? I've heard from several people that they find it much better for large files than FITS, and it seems quite flexible (admittedly, I haven't used it myself). There's already PyTables for dealing with HDF5, and IDL handles it natively.

John

Christopher Hanley

unread,
Jul 18, 2013, 6:30:11 PM7/18/13
to astro...@googlegroups.com
If this is really a conversation you wish to have, please start a new thread. Replacing FITS is not a natural extension of the current Subject line for those just skimming messages.

Thanks,
Chris

Perry Greenfield

unread,
Jul 18, 2013, 7:10:07 PM7/18/13
to astro...@googlegroups.com

On Jul 18, 2013, at 6:15 PM, John K. Parejko wrote:

> On 18Jul 2013, at 17:09, Perry Greenfield wrote:
>
>> What we will try is not an existing format, but will be based on existing standards for the basic components, and use standard libraries for those components.
>
> I'm pretty sure introducing an entirely new format is not a good route. The list of dead formats is very, very long.
>
I'm not sure that there are all that many dead astronomy formats.

> What do people on this list think of HDF5? I've heard from several people that they find it much better for large files than FITS, and it seems quite flexible (admittedly, I haven't used it myself). There's already PyTables for dealing with HDF5, and IDL handles it natively.
>

It certainly solves many of the problems with FITS. Yet, we are going to look at an alternative. It wasn't my intent to explain the details of the alternative now. I think when we show the proposal or prototype format, it will be more obvious why it is preferable to HDF5. But give us a little time.

Perry


Marten van Kerkwijk

unread,
Jul 19, 2013, 11:59:20 AM7/19/13
to astro...@googlegroups.com
Hi Perry,

Like John, the idea of a new format set off immediate alarm bells for me. You may well be right that it is necessary, but then it would also be important to ensure it is much more broadly useful. Could you perhaps post a link to your requirements? (In a new thread -- I answer here only because I'm not sure whether you'd see one.)

I think between the people here there is experience with an enormous range of the types of data produced in astronomy, from photon event tables to complex voltages. Most of these are very poorly captured by the idea of "an image", but much more easily by "a table" (while an image really is just a table -- even if perhaps a 1-column, 1-row one holding elements that are very large arrays...)

Thanks,

Marten


Erik Bray

unread,
Jul 19, 2013, 1:53:14 PM7/19/13
to astro...@googlegroups.com
Right--I make no distinction between an image/array and a table--a
table to me is nothing more than a 1-D array of homogeneous structured
records. An array of complex values is just an array of fairly simple
records consisting of two floats. An array of floats is just an array
of one-float records.
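That equivalence can be shown directly in Numpy: a complex array re-views losslessly, and without copying, as an array of two-float records (the field names "re" and "im" here are just illustrative).

```python
import numpy as np

c = np.array([1 + 2j, 3 + 4j])  # complex128: two float64s per element

# Same bytes, reinterpreted as records of two float64 fields: no copy.
rec = c.view([("re", "f8"), ("im", "f8")])

print(rec["re"], rec["im"])  # [1. 3.] [2. 4.]
```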

One feature of FITS that complicates this view is variable length
array columns. I'm not sure that that's a feature worth keeping...
it is used heavily in tile-compressed images but there's nothing
stopping one from defining a binary format specifically for that
purpose rather than having VLAs be a general feature of tables.

I know I've seen other uses for them but I still consider it a special
case distinct from normal tables.


(Note: in the data format I'm working on, the current draft does
distinguish between tables and arrays, but the only differences are
that 'arrays' don't allow compound types and that tables are
one-dimensional by default; the table format also makes it easier to
attach metadata to each column. Later it might be worth relaxing or
even doing away with the distinction.)