Yes, the ZIP file format has a 4GB limit. Unfortunately, Python does
not yet support the ZIP64 format.
> Is there any way I can recover the data (I
> guess I could try decompressing the file with 7z and extracting the
> individual npy files?)
Possibly. However, if the normal zip utility isn't working, 7z
probably won't, either. Worth a try, though.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
>> I need to save a fairly large set of arrays to disk. I have saved it using
>> numpy.savez, and the resulting file is around 11Gb (yes, I did say fairly
>> large ;D). When I try to load it using numpy.load, the zipfile module
>> complains about
>> BadZipfile: Bad magic number for file header
>>
>> I can't open it with the normal zip utility present on the system, but it
>> could be that it's barfing about files being larger than 2Gb.
>> Is there some file limit for npzs?
>
> Yes, the ZIP file format has a 4GB limit. Unfortunately, Python does
> not yet support the ZIP64 format.
>
>> Is there any way I can recover the data (I
>> guess I could try decompressing the file with 7z and extracting the
>> individual npy files?)
>
> Possibly. However, if the normal zip utility isn't working, 7z
> probably won't, either. Worth a try, though.
I've had similar problems; my solution was to move to HDF5. There are
two options for accessing and working with HDF5 files from Python: h5py
(http://code.google.com/p/h5py/) and PyTables
(http://www.pytables.org/). Both packages have built-in numpy support.
Regards,
Lafras
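For anyone who wants a concrete starting point, here is a rough PyTables save/load sketch (the file name is illustrative; openFile/createArray are the PyTables 2.x spellings, renamed open_file/create_array in 3.x):

import numpy as np
import tables

a = np.random.rand(1000, 1000)

h5 = tables.openFile('arrays.h5', mode='w')   # PyTables 2.x API; 3.x: tables.open_file
h5.createArray(h5.root, 'a', a)               # 3.x: create_array
h5.close()

h5 = tables.openFile('arrays.h5', mode='r')
b = h5.root.a.read()                          # read the whole array back as a numpy array
h5.close()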
>>> I need to save a fairly large set of arrays to disk. I have saved it using
>>> numpy.savez, and the resulting file is around 11Gb (yes, I did say fairly
>>> large ;D). When I try to load it using numpy.load, the zipfile module
>>> complains about
>>> BadZipfile: Bad magic number for file header
>>>
>>> I can't open it with the normal zip utility present on the system, but it
>>> could be that it's barfing about files being larger than 2Gb.
>>> Is there some file limit for npzs?
>>
>> Yes, the ZIP file format has a 4GB limit. Unfortunately, Python does
>> not yet support the ZIP64 format.
>>
>>> Is there any way I can recover the data (I
>>> guess I could try decompressing the file with 7z and extracting the
>>> individual npy files?)
>>
>> Possibly. However, if the normal zip utility isn't working, 7z
>> probably won't, either. Worth a try, though.
>
> I've had similar problems, my solution was to move to HDF5. There are
> two options for accessing and working with HDF files from python: h5py
> (http://code.google.com/p/h5py/) and pytables
> (http://www.pytables.org/). Both packages have built in numpy support.
>
> Regards,
> Lafras
I've experienced similar issues too, but I moved to NetCDF. The only disadvantage was that I did not find any python modules that work well _and_ support numpy. Hence, I am considering moving to HDF5. Which python module would people here recommend? (Or, alternatively, did I miss a great netCDF python module that someone could tell me about?)
Cheers,
Paul.
I use h5py. I think it is great. It gives you a dictionary-like
interface to your archive. Here's a quick example:
>>> import numpy as np
>>> import h5py
>>> a = np.random.rand(1000, 1000)
>>> f = h5py.File('/tmp/myfile.hdf5')
>>> f['a'] = a  # <-- Save
>>> f.keys()
['a']
>>> f.filename
'/tmp/myfile.hdf5'
>>> b = f['a']  # <-- Load (a lazy dataset handle; index it, e.g. f['a'][...], to pull data into memory)
I don't know any particular advantages of the file format itself. There are, however, several python modules for hdf5 that use numpy. Your suggestion for a netcdf module might be a good one, but it does not build on my system: it does not find the netcdf library, only the hdf5 lib - even if they reside in the same folder... I'll see if it works out eventually!
-Paul
Could it be arranged that an exception is raised when creating a >4GB
.npz file, so people do not find themselves with unrecoverable data?
>> Is there any way I can recover the data (I
>> guess I could try decompressing the file with 7z and extracting the
>> individual npy files?)
>
> Possibly. However, if the normal zip utility isn't working, 7z
> probably won't, either. Worth a try, though.
If your data is valuable enough — irreplaceable space mission results,
say — some careful spelunking in the code combined with some knowledge
about your data might allow you to semi-manually reconstruct it from
the damaged zip file. This would be a long and painful labour, but
would probably produce a python module that supported zip64 as an
incidental side effect.
Anne
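As a rough starting point for that kind of spelunking, here is an untested sketch that leans on np.savez storing its members uncompressed, so each embedded .npy can be located by its magic bytes and parsed with numpy.lib.format (scrape_npz is a made-up name, and it assumes the raw bytes on disk are intact even though the zip bookkeeping is not):

import mmap
import numpy as np
from numpy.lib import format as npformat

MAGIC = b'\x93NUMPY'  # every .npy member starts with this

def scrape_npz(path):
    # Scan a damaged (but uncompressed) .npz for .npy headers and pull the arrays out by hand.
    arrays = []
    with open(path, 'rb') as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        pos = buf.find(MAGIC)
        while pos != -1:
            f.seek(pos)
            try:
                npformat.read_magic(f)
                shape, fortran, dtype = npformat.read_array_header_1_0(f)
                count = int(np.prod(shape))
                a = np.fromfile(f, dtype=dtype, count=count)
                arrays.append(a.reshape(shape, order='F' if fortran else 'C'))
            except Exception:
                pass  # damaged or false-positive header; keep scanning
            pos = buf.find(MAGIC, pos + 1)
        buf.close()
    return arrays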
If you can arrange it, sure.
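A thin wrapper shows the idea; savez_checked below is a hypothetical helper, not part of numpy, and it ignores the small per-member header overhead:

import numpy as np

ZIP32_LIMIT = 2**32 - 1  # ~4 GiB: the largest size a non-ZIP64 archive can record

def savez_checked(file, **arrays):
    # Refuse to write an .npz that the plain (non-ZIP64) zip format cannot represent.
    total = sum(np.asarray(a).nbytes for a in arrays.values())
    if total >= ZIP32_LIMIT:
        raise ValueError("arrays total %d bytes; a non-ZIP64 .npz cannot hold "
                         "more than 4 GiB" % total)
    np.savez(file, **arrays)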
yes, numpy support is critical -- why anyone would write a netcdf
wrapper and not use numpy is beyond me.
>> There is http://code.google.com/p/netcdf4-python/
>>
>> I know netcdf4 is a subset of HDF5. What advantages are there to using HDF5 rather than NetCDF4?
The way I think about it is that netcdf is a more structured subset of
hdf5. If the structure imposed by netcdf works well for your needs, it's
a good option. There are also things like the CF metadata standard that
make it easier to exchange data with others.
However, if you are using it only to save and reload data for your own
app -- pytables may be a better bet.
I've found the netcdf4-python package to be robust and to have a nice
interface -- and it certainly works well with numpy. My only build issue
with it has actually been getting HDF5 built right; the netcdf part has
been easier, and the python bindings very easy -- at least on OS-X.
Windows support is not in as good shape, though if you don't need
opendap support, I think there are Windows binaries for netcdf4 you
could use.
Also, I think the python bindings still support netcdf3, if you don't
need the extra stuff netcdf4 gives you (you may, if you need big files
-- not sure about that).
If netcdf seems like the right way to go for you, then I'm sure you can
get netcdf4-python working -- and Jeff Whitaker is very helpful if you
have trouble.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
You can do netcdf3 without hdf and with big files by
compiling netcdf4-python (on linux) via:
> export NETCDF3_DIR=/home/phil/usr64/netcdf_3.6.3
> ~/usr64/bin/python setup-nc3.py install
then in python open a dataset for writing with:
theFile = Dataset(ncfilename, 'w', format='NETCDF3_64BIT')
-- Phil
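For reference, a minimal sketch of writing and reading back an array with that format (the file and variable names are just for illustration):

import numpy as np
from netCDF4 import Dataset

ds = Dataset('big_arrays.nc', 'w', format='NETCDF3_64BIT')  # 64-bit offsets, no HDF5 required
ds.createDimension('row', 1000)
ds.createDimension('col', 1000)
v = ds.createVariable('a', 'f8', ('row', 'col'))
v[:] = np.random.rand(1000, 1000)
ds.close()

ds = Dataset('big_arrays.nc')           # reopen read-only
a = ds.variables['a'][:]
ds.close()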
You could try pupynere, which is pure Python and just a single file
(netcdf 3 only).
http://pypi.python.org/pypi/pupynere/
Ryan
--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
Just to add, h5py (and PyTables also) allows you to read/write subsets
of your data:
>>> import numpy as np
>>> import h5py
>>> f = h5py.File('foo.hdf5', 'w')
>>> f['a'] = np.random.rand(1000, 1000)
>>> subset = f['a'][200:300, 400:500:2]  # only reads this slice from the file
H5py also supports transparent compression on a per-dataset basis,
with no limits on the size of the datasets or files. Slicing is still
efficient for compressed datasets since HDF5 supports a chunked
storage model. There's a general introduction to h5py here:
http://h5py.alfven.org/docs/guide/quick.html
Andrew Collette
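A small illustrative sketch of a compressed, chunked dataset (the name, shape, and chunk size are just examples):

import numpy as np
import h5py

f = h5py.File('compressed.hdf5', 'w')
dset = f.create_dataset('a', shape=(10000, 1000), dtype='f8',
                        compression='gzip', chunks=(1000, 1000))
for i in range(10):                          # write block by block; nothing huge in memory at once
    dset[i*1000:(i+1)*1000, :] = np.random.rand(1000, 1000)
part = dset[2500:2600, ::2]                  # slicing only touches the chunks it needs
f.close()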
On 12. mars 2010, at 16.35, Ryan May wrote:
>> I've experienced similar issues too, but I moved to NetCDF. The only disadvantage was that I did not find
>> any python modules that work well _and_ support numpy. Hence, I am considering moving to HDF5.
>> Which python module would people here recommend? (Or, alternatively, did I miss a great netCDF
>> python module that someone could tell me about?)
>
> You could try pupynere, which is pure python, only a single file
> (netcdf 3 only).
>
> http://pypi.python.org/pypi/pupynere/
>
> Ryan
pupynere is read-only, which of course is a show-stopper.
Everyone else, thanks for the good advice. I still can't get netcdf4-python to work with my netcdf4 library, which it won't detect for some mysterious reason... Anyway, h5py seems like a nice module - thanks Keith! I think I might go that route instead.
Paul
No, it does allow writing. At the top of the link I sent before
(which I'm guessing you didn't read :) ):
"Pupynere is a Python module for reading and writing NetCDF files...."
It works pretty well for me. The only problem is that it doesn't
allow modifying files, but that's not too bad of a limitation. The
pure Python part makes it really simple to install (it doesn't even
rely on having the official netcdf libraries installed).
Ryan
--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
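If anyone wants to try it, here is a rough pupynere sketch, assuming the scipy.io.netcdf-style createDimension/createVariable interface that pupynere provides (file and variable names are illustrative):

import numpy as np
from pupynere import netcdf_file   # pure Python, netCDF-3 only

f = netcdf_file('simple.nc', 'w')
f.createDimension('x', 500)
v = f.createVariable('a', 'd', ('x',))
v[:] = np.random.rand(500)
f.close()

f = netcdf_file('simple.nc', 'r')
a = f.variables['a'][:]
f.close()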
I see Bruce dug up http://projects.scipy.org/numpy/ticket/991. Is
this the right route to go, or do we need a more sophisticated
solution?
Regards
Stéfan
It might be nice to build something on libbzip2 -- it looks like the
license is right, it's got good compression qualities, supports 64-bit
sizes, and it's getting pretty widely used.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
bzip2 does not support random access. It just compresses a single
file. It is not a replacement for zipfile.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
If we're going to pull in a new C library for this purpose, then maybe
we should just use libhdf5? :-).
-- Nathaniel
Jose, what Python version are you using, and does the 'normal zip utility present on the system' actually support zip64?
You should try seeing whether 7z can open it.
Except that HDF5 is one big honking pain in the butt to build.
Robert's right, bzip2 isn't a great option anyway -- it can do multiple
files, but it just concatenates them, and doesn't appear to provide an
index. I used to use afio, which would zip each file first, then put
them all in an archive -- I liked that approach. We could, of course,
build something like that with bzip2, but it looks like Python's zipfile
will handle it (via ZIP64) for >= 2.6, so there's no need for something new.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
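For what it's worth, here is a rough sketch of writing a ZIP64-capable archive of .npy files with the >= 2.6 stdlib zipfile (file names are illustrative; something like this, with allowZip64 switched on, is essentially what a fixed savez would have to do):

import os
import numpy as np
import zipfile

arrays = {'a': np.random.rand(1000, 1000), 'b': np.arange(10)}
zf = zipfile.ZipFile('big.zip', 'w', zipfile.ZIP_STORED, allowZip64=True)
for name, arr in arrays.items():
    tmp = name + '.npy'
    np.save(tmp, arr)            # stage each array as a .npy file ...
    zf.write(tmp, arcname=tmp)   # ... then store it (uncompressed) in the archive
    os.remove(tmp)
zf.close()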
> (Or, alternatively, did I miss a great netCDF python module that someone could tell me about?)
No. The above is only an API that emulates the NetCDF interface of the
Scientific package, and it does not create (nor read) pure NetCDF files.
For creating/reading NetCDF files, better use the netcdf4-python project:
http://code.google.com/p/netcdf4-python/
--
Francesc Alted
I wouldn't say that HDF5 is very difficult to build/install. In fact, it
is a matter of "./configure; make install" -- and that only if there is not a
binary package available for your OS, which there usually is. It is just
that it is another dependency to add to numpy/scipy ...and a *big* one.
--
Francesc Alted
Except on Windows, of course, where you need to have Visual Studio and
a lot of patience. :) Even on UNIX one of the major support issues
I've had with h5py is that everyone has a slightly different version
of HDF5, built in a slightly different way.
Andrew
> I wouldn't say that HDF5 it is very difficult to build/install. In fact, it
> is a matter of "./configure; make install" --and that only if there is not a
> binary package available for your SO, which is usually the case.
Clearly, you've never tried to build a Universal binary on OS-X ;-)
> It is just
> that it is another dependency to add to numpy/scipy ...and a *big* one.
yeah, that too.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Yeah, Windows, but this is a platform where everybody expects to have a
binary ...and fortunately HDF5 is not an exception there. :)
> Even on UNIX one of the major support issues
> I've had with h5py is that everyone has a slightly different version
> of HDF5, built in a slightly different way.
Well, I cannot say the same for PyTables, but that could be somewhat
expected, as you try to support many more low-level HDF5 features than I do.
--
Francesc Alted
Ok, touché. But is there any package for which it is easy to build a
Universal binary on Mac OS-X? ;-)
--
Francesc Alted
numpy.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco
Good! So, I should start by looking at how it achieves that then :)
--
Francesc Alted
On 2010-03-15, at 4:16 PM, Francesc Alted <fal...@pytables.org> wrote:
> Good! So, I should start by looking at how it achieves that then :)
Actually I did build a 4-way universal binary of HDF5. Long story
short, because of some quirks of the HDF5 build process it isn't
possible to simply add -arch flags to CFLAGS -- you actually have to
build it separately for each architecture and then do something like
make install DESTDIR=path/for/arch
and then stitch the results together manually with lipo. I'm also not sure
of the correct way, if there is one, to handle "hdf5.settings" -- it's
unclear to me whether programs ever look at this at build time/
runtime, or if it's just there for the user's convenience.
An installer for the binaries I made is at
http://www.cs.toronto.edu/~dwf/mirror/hdf5-1.8.4-quad.pkg
I can write up some instructions if that would be helpful.
David
Yes, I think "hdf5.settings" it is for convenience purposes only (a fast way
to look for configuration of the compiler flags for the library). So you
don't need to worry too much about this.
> An installer for the binaries I made is at
>
> http://www.cs.toronto.edu/~dwf/mirror/hdf5-1.8.4-quad.pkg
Hey, that's great.
> I can write up some instructions if that would be helpful.
Please do. I'm definitely interested!
Thanks,
--
Francesc Alted
> An installer for the binaries I made is at
>
> http://www.cs.toronto.edu/~dwf/mirror/hdf5-1.8.4-quad.pkg
>
> I can write up some instructions if that would be helpful.
Wonderful! Any chance you've tackled netcdf4 as well?
Thanks,
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Actually, that's a much better way of doing things generally. I wish
this were the standard way of building universal binaries; it is
usually more robust than using multiple -arch arguments, especially
when you export a public API with C headers. I use this technique
myself for all my packages' pure C libraries (audiolab and samplerate
in particular),
cheers,
Sorry, I haven't, but it should be roughly the same procedure. I'll write up
some instructions.
One thing with HDF5 was that I believe I had to compile i386 and x86_64
binaries on an Intel Mac and ppc/ppc64 on a PowerPC Mac; there was some issue
with cross-compiling that I never quite got sorted out. It may compile a
binary which it then wants to actually *run* for another step of the build
process. Rosetta should take care of this on Intel but the machine I was
sitting in front of was a G5, and the Intel build on that machine failed
miserably, so I logged in remotely to an Intel Mac. I will try compiling all
4 on an Intel machine and see if that works.
David