Creating a fits data cube using streaming techniques

478 views
Skip to first unread message

hans smit

unread,
Nov 13, 2012, 8:48:12 AM11/13/12
to su...@googlegroups.com
I am currently working on a problem which has stumped me for the past 4 hours. A colleague of mine mentioned the sunpy group and said they may be interested in this. So here I am.

The problem is as follows: I need to create a fits data cube (N x 2D images), using as input N separate 2D fits files. I am currently using the pyfits module to create fits formatted files, and this works very well when creating a data cube with limited memory usage, but when the image resolution is high, and the number of images is large, memory exhaustion occurs. So, the technique that should be implement is a file streaming technique. The easiest way to explain what I need is by example:

import pyfits
import numpy
import os

def memory_exhaustive_cube_creation(fnames):
    data = []

    for fname in fnames:
        img = pyfits.getdata(fname)
        data.append(img)

    fits = pyfits.PrimaryHDU(numpy.array(data))
    fits.writeto("cube.fits", clobber=True)


def streaming_cube_creation(fnames):

    oname = "cube2.fits"
    if os.path.exists(oname):
        os.remove(oname)

    for fname in fnames:
        img = pyfits.getdata(fname)
        # THIS DOES NOT CREATE A DATA CUBE!!!
        # but I would like it to...
        pyfits.append(oname, img)


if __name__ == '__main__':
    fnames = [
        r"C:\data\CDSReference\20120824093658\H2RG_R01_M01_N01.fits",
        r"C:\data\CDSReference\20120824093658\H2RG_R01_M01_N02.fits"
    ]
    memory_exhaustive_cube_creation(fnames)
    streaming_cube_creation(fnames)

The first method works well (memory_exhaustive_cube_creation), but when your dealing with 2K x 2K 16 bit images, memory exhaustion occurs very quickly, i.e. on my computer after 14 images. The 2nd function (streaming_cube_creation) does not work. It creates separate header units for each image appended to the file, which is not what we want (all frames should be navigable in ds9).

Things I have tried are the following:
1) create a streaming cube class that inherits from numpy.array, and implement all the methods required by pyfits. FAILED: cannot inherit from numpy.array
2) binary hacking. Generate a 2 x 2D fits file using pyfits and append the next N-2 images to it. INSANITY: binary hacking sends compatibility chills down my spine.
3) extend the pyfits module to allow a streaming cube class as input. This would require a major hack in an external module (pyfits), and some major effort, but do-able. 

I noticed the sunpy.MapCube class does it by any chance have a tofits method?
Has anyone else had to deal with 3D image creation, and if so, how did they tackle it? Buying more memory is not an acceptable answer :-)

Any additional insight would be greatly appreciated. Thanks.

DVD PS

unread,
Nov 13, 2012, 9:51:02 AM11/13/12
to sunpy
Hi Hans,

I'm not answering your question because I have not a clue... but I remember that I, for a while, wanted to keep all the files from an observation into the same one... Just for having an organized file structure... but I gave up.  And then I learnt about Eclipse (ESO fits tools, nothing to do with the development environment) [1] and I
did all I wanted from there (mainly extract statistics from regions of interest, extract header information, etc.) it was a lot better than the option I had with IDL at the moment.  If you have used IRAF you would find that these tools are somehow similar, but without having to use an environment, straight from the shell.

Now, back to the cube thing.  Eclipse has also the possibility of doing cubes... again, don't know if it does in the way you want.  If not, you may have a look to the Ftools [2], they do as well, and they work from the shell as well.

Finally...  will DS9 be able to open the big cube? Probably it does... but it's worth to check before :-)  Also... cannot you just open them all as ds9 *fits and have the same control? [3]

[1] http://www.eso.org/sci/software/eclipse/
[2] http://heasarc.gsfc.nasa.gov/ftools/ftools_menu.html
[3] http://casa.colorado.edu/~ginsbura/ds9tips.htm#body6

Again, sorry for not answer your question, but hopefully you find something of that useful, at least till you buy more memory :-P

David



--
You received this message because you are subscribed to the Google Groups "SunPy" group.
To view this discussion on the web visit https://groups.google.com/d/msg/sunpy/-/anG8Q1GooTEJ.
To post to this group, send email to su...@googlegroups.com.
To unsubscribe from this group, send email to sunpy+un...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/sunpy?hl=en.

hans smit

unread,
Nov 13, 2012, 10:25:26 AM11/13/12
to su...@googlegroups.com
Hi David,

Thanks for the quick reply, and the links to some useful tools. You're right about DS9 not being able to load the mega-cube, but that does not really matter, what the scientists here want is a cube to validate against something similar they wrote in C many years ago. I've been trying to get them away from this cube business for exactly this memory reason. In the past, I had developed a Cube class that acts exactly as a numpy 3D array, but loads the data from the file system when it is de-referenced. This worked very well for my purposes (photon transfer curve calculations).

What surprised me the most while tackling this problem, is that pyfits does not include such a streaming technique out of the box. Of course, I could be wrong, the pyfits core module is only 12000 lines of code :-O . May be I'm missing something. Regardless, I find this an interesting problem, and will put it on the back burner for now. I suspect it will come back to visit me in future discussions with the data-analysis team, but for now, it's a low priority.

Thanks again!

Cheers,

Hans

Stuart Mumford

unread,
Nov 13, 2012, 10:40:54 AM11/13/12
to su...@googlegroups.com
Hello,

This is an interesting probelm. At least more interesting that what I am supposed to be working on!!

Firstly I would like to point out that what you are doing is probably a terrible idea, having an array that size will cause the same kind of problems reading it in and could cause you more headaches than this one. If you really want them in one fits file you could think about having a HDU for each time step.

That being said, I think I have found a solution! It uses numpy memmapped files to create an on disk array, which it seems pyfits will save to a file without copying it into memory. This is the reason my fits save commands are a little weird at the end of the script so it doesn't try and copy the memmapped array into memory when it creates a HDU.

You can see my solution here: https://gist.github.com/4066376 I tested it for 20 AIA images and it appeared to not munch RAM but I can't be sure I haven't profiled it or anything.

Hope that helps
Stuart

hans smit

unread,
Nov 13, 2012, 11:38:05 AM11/13/12
to su...@googlegroups.com
Hi Stuart,

I just took your test code for a spin, it seems to WORK!!! I will be doing some validation on large data sets later tomorrow (probably tonight - I'm rubbing my hands in anticipation).

Your solution is very interesting, I did not realize numpy had these capabilities. After years of python programming I never cease to be amazed.

You are absolutely right about this being a terrible idea (large data sets in a single file). This is one of those cases where I do what I'm told, but make a terrible idea into an interesting problem. I will let the powers that be know that I'm not the only one that advocates the use of many files over a single giga-file.


>>  If you really want them in one fits file you could think about having a HDU for each time step.

Yep, that's exactly what the pyfits.append method does when executed multiple times.

Thanks for the solution. I will let you know how it works out.

Cheers,

Hans

hans smit

unread,
Nov 13, 2012, 2:47:53 PM11/13/12
to su...@googlegroups.com
Hi Stuart,

I've tested your code with 156 images, and it works! However, this is the limit on my laptop. Then the numpy.lib.format.open_memmap call throws a "Not enough storage is available to process this command". I also noticed that if I replace dtype=numpy.int16 with dtype=numpy.uint16 then things also go belly up in the pyfits module in the writeHDUdata method, during an unsigned to signed array conversion (it's the memory copy that kills it). Not really a big deal, I can now get the job done I was requested to complete. Yeah!

I've included the final code for future reference.

Many thanks!

Hans

from glob import glob
import numpy
import pyfits

files = glob("./data/*fits")
n = len(files)

print "Allocate memory mapped array for %d 2Kx2K 16 bit images." % n
dataTmp = numpy.lib.format.open_memmap(
    filename="./tmp.npy",
    mode='w+',
    dtype=numpy.int16,
    shape=(n, 2048, 2048)
)

print "Collecting data into a memory mapped array"
for i, fname in enumerate(files):
    dataTmp[i] = pyfits.getdata(fname)

dataTmp.close()

print "Writing data cube to fits file"
hdu = pyfits.PrimaryHDU(dataTmp)
hdu.writeto("./results.fits", clobber=True)
print "Finished"

Tiago Pereira

unread,
Nov 13, 2012, 6:00:01 PM11/13/12
to su...@googlegroups.com
Hi,

Pyfits.open() supports memmap for reading FITS files (and this is the
default). However, you'll still need to have the memory to write the array.

I seriously doubt that this is a memory exhaustion problem. Your initial
14 2Kx2K images in int16 occupy only 112 Mb, so unless your computer is
really old, this is not the issue...

Looking in your initial code it seems the problem is that you're
appending arrays to a list and then converting to an array again before
writing. This is very inefficient.

In your last example you hit the limit at 156 images, which is 1248 Mb
-- a much more plausible value for a RAM limit. Given that you still hit
a limit, I would say that the memmap is not being useful. It is still
having to copy the full datacube in memory when you do the write. If the
memmap was working properly, there would be no limit.

For simplicity and future reference I suggest doing away with the memmap
in this case and just do something like:

dataTmp = numpy.empty((N, 2048, 2048), dtype='int16')

for i, fname in enumerate(files):
dataTmp[i] = pyfits.getdata(fname)

pyfits.writeto("results.fits", clobber=True)
del dataTmp

As others mentioned, large FITS datacubes are not a good idea. I much
prefer having several extensions for different images. Also, if you have
a datacube with more than 2Gb, people with 32-bit systems won't be able
to read them.

For a true memory-independent solution to your problem, I would be
tempted to do what you call 'binary hacking'. It wouldn't be very hard
because a fits file is essentially an ascii header with the binary data.
I suspect it would take less than 20 lines of python code to implement
something like that, write the header and then use memmap for the binary
part.

Cheers,

Tiago
> --
> You received this message because you are subscribed to the Google
> Groups "SunPy" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/sunpy/-/niibICWsU44J.

hans smit

unread,
Nov 14, 2012, 6:52:12 AM11/14/12
to su...@googlegroups.com
Hi Tiago,

Thanks for the insight. I had given some incorrect information in a previous post. My initial 14 2K x 2K test was done on my colleague's (old) machine, with his code. After closer inspection, I noticed his data was being saved as float64, plus there was additional copying taking place. My mistake.

I had implemented and tested your,

  dataTmp = numpy.empty((N, 2048, 2048), dtype='int16') 

code, and this works just as well as the memmap method. So in effect, it had no effect.

I took your 20 lines of 'binary hacking' code as a challenge, and implemented it in 13 (minus headers array). :-) Here it is:

# memory-independent solution to creating a very large data cube
# input: files
headers = [
    ("SIMPLE", "T", " / conforms to FITS standard"),
    ("BITPIX", "16", " / array data type"),
    ("NAXIS", "3", " / number of array dimensions"),
    ("NAXIS1", "2048", ""),
    ("NAXIS2", "2048", ""),
    ("NAXIS3", str(len(files)), ""),
    ("BSCALE", "1", " / No scaling"),
    ("BZERO", "32768", " / Data is Unsigned Integer"),
    ("EXTEND", "T", ""),
]
blockLen = 2880         # the FITS block size
f = open("test.fits", "wb")
for header in headers:
    f.write("%-8s=%21s%-50s" % header)
f.write("%-80s" % "END")
f.write(" " * ((blockLen - (f.tell() % blockLen)) % blockLen))
for fname in files:
    data = pyfits.getdata(fname).astype(numpy.int16) + 32768
    data.byteswap(True)
    f.write(data.tostring())
f.write('\0' * ((blockLen - (f.tell() % blockLen)) % blockLen))
f.close()


As you explained, it was much easier that I had expected. The block padding was the most difficult trick to figure out, including the byteswap. The pyfits module code helped out here. Please note: the above routine is very "hardcoded" for my needs, it can easily be generalized into a function with appropriate arguments. I must admit, there is much less 'binary hacking' going on than expected. Thanks for motivating me to implement this.

Cheers,

Hans

Tiago Pereira

unread,
Nov 14, 2012, 7:10:02 PM11/14/12
to su...@googlegroups.com
Great hacking! Thank you for sharing your code, I'm sure it will come in
handy for some. Your persistence and search for a better solution is
admirable, in particular in times when so many people just want a quick
fix.

Cheers,

Tiago
> <https://groups.google.com/d/msg/sunpy/-/niibICWsU44J>.
> > To post to this group, send email to su...@googlegroups.com
> <javascript:>.
> > To unsubscribe from this group, send email to
> > sunpy+un...@googlegroups.com <javascript:>.
> > For more options, visit this group at
> > http://groups.google.com/group/sunpy?hl=en
> <http://groups.google.com/group/sunpy?hl=en>.
>
> --
> You received this message because you are subscribed to the Google
> Groups "SunPy" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/sunpy/-/1feLkmcEZzcJ.

hans smit

unread,
Nov 15, 2012, 2:21:59 AM11/15/12
to su...@googlegroups.com
Thanks! It's a bit of a mundane solution, but did teach me a bit about the fits format. To be honest though, I was hoping for a quick solution to this, going on the assumption there was some pyfits function / method I was missing. When a problem becomes elusive, it tends to become more interesting. Then it sinks it's claws into me, and doesn't let go. Coding is the endless puzzle with infinite possibilities. I can't get enough. I'm getting off topic... 

Cheers,

Hans

Erik Bray

unread,
Dec 28, 2012, 8:36:29 PM12/28/12
to su...@googlegroups.com
Hi Hans,

Sorry to revive an old thread. I just ran across this one cleaning
out my e-mail. I'm the maintainer of PyFITS. This kind of problem
does come up every now and then, and I've thought about adding some
way to make it easier to build very large files. The problem is that
there are many scenarios in which one might want to do something like
this and there's no one solution that would fit them all.

For something like your particular case, I would just figure out how
much space I need (n * 4096 * 4096) where n is the number of time
slices, and preallocate a large FITS file with that amount of empty
space, then fill it in one slice at a time with your data. The
solution that Stuart gave you is one perfectly good way to do this.
There's also a (partial) solution in the PyFITS FAQ:
http://packages.python.org/pyfits/appendix/faq.html#how-can-i-create-a-very-large-fits-file-from-scratch

The problem you had with unsigned ints is avoidable. You wrote that
there was an error in the "writeHDUdata" method, which suggests you're
using a very old version of PyFITS, and which doesn't fully support
mmap. Much of that logic has been completely rewritten--you should
give PyFITS 3.1 a try.

Thanks,
Erik
Reply all
Reply to author
Forward
0 new messages