Saving a large number of training images


Geetank Raipuria

Aug 23, 2016, 7:28:30 AM
to lasagne-users

Hi,

For a semantic segmentation task, I am saving high-resolution images (after resizing) as numpy arrays. However, the arrays get quite big (around 10 GB for 250 images and their corresponding dense labels). Does anyone have experience with saving image data at a reduced size?

I already use "np.savez_compressed".

Thanks in advance

Jan Schlüter

Aug 23, 2016, 8:48:18 AM
to lasagne-users
However, the arrays get quite big (around 10 GB for 250 images and their corresponding dense labels). Does anyone have experience with saving image data at a reduced size?

a) It may sound silly, but image compression is a well-studied problem -- if you need to reduce storage space (or I/O bandwidth), use PNG, or JPEG/JPEG2000 if you can tolerate small differences. The PIL module allows you to compress and decompress in memory (see the first sketch below). If you prepare mini-batches in a separate thread, decompression should not slow you down by much.
b) If you're worried about keeping everything in memory, note that you can load numpy arrays (in .npy format) as memory-mapped files: np.load(fn, mmap_mode='r'). This offloads the burden of loading from disk and/or caching to the operating system. To write a .npy file that is too large to fit into main memory, open it with np.lib.format.open_memmap(fn, mode='w+', dtype=..., shape=...) (see the second sketch below).
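
For a), here is a minimal sketch of compressing and decompressing PNG in memory with PIL/Pillow; the image shape, dtype, and helper names are illustrative assumptions:

```python
import io

import numpy as np
from PIL import Image

def compress_png(img):
    """Compress a uint8 HxWx3 array to PNG bytes in memory (lossless)."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format='PNG')
    return buf.getvalue()

def decompress_png(data):
    """Decompress PNG bytes back into a numpy array."""
    return np.asarray(Image.open(io.BytesIO(data)))

img = np.random.randint(0, 256, size=(512, 512, 3), dtype=np.uint8)
data = compress_png(img)
assert np.array_equal(decompress_png(data), img)  # PNG round-trips exactly
```

For JPEG, replace format='PNG' with format='JPEG' (and drop the assert, since JPEG is lossy). Natural photos also compress far better than this random test image.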
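
For b), a minimal sketch of filling a .npy file chunk by chunk via np.lib.format.open_memmap and reading it back memory-mapped; the file name and shape are illustrative assumptions:

```python
import numpy as np

n, h, w, c = 250, 512, 512, 3
arr = np.lib.format.open_memmap('images.npy', mode='w+',
                                dtype=np.uint8, shape=(n, h, w, c))
for i in range(n):
    arr[i] = i % 256  # placeholder: write each resized image here
arr.flush()  # make sure everything has hit the disk
del arr      # close the memmap

# Later, memory-map instead of loading everything into RAM:
images = np.load('images.npy', mmap_mode='r')
batch = np.asarray(images[:16])  # only the pages you touch are read
```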

Hope this helps!
Best, Jan

Geetank Raipuria

Aug 25, 2016, 5:18:30 AM
to lasagne-users
Thanks for the advice.
Cheers!

Sander Dieleman

Aug 28, 2016, 6:31:07 PM
to lasagne-users
I knew about mmap_mode='r', but not about np.lib.format.open_memmap(). Cool! Seems like there's lots of nice but poorly documented stuff in np.lib.

Jan Schlüter

Aug 29, 2016, 8:59:52 AM
to lasagne-users
I knew about mmap_mode='r', but not about np.lib.format.open_memmap(). Cool!

Had to figure this out when doing my MSc thesis on a 4 GiB computer :)


Seems like there's lots of nice but poorly documented stuff in np.lib.

You're right, we could hint at np.lib.format.open_memmap() in the mmap_mode description, or under "See also". Yeah, let's try this! https://github.com/numpy/numpy/pull/7987

Sander Dieleman

Aug 29, 2016, 10:54:09 AM
to lasagne-users


On Monday, August 29, 2016 at 1:59:52 PM UTC+1, Jan Schlüter wrote:
I knew about mmap_mode='r', but not about np.lib.format.open_memmap(). Cool!

Had to figure this out when doing my MSc thesis on a 4 GiB computer :)

I mostly ended up using HDF5 / h5py for that kind of stuff, mainly because I had no idea that numpy also supports it. 

Jan Schlüter

Aug 29, 2016, 12:51:04 PM
to lasagne-users
Had to figure this out when doing my MSc thesis on a 4 GiB computer :)

I mostly ended up using HDF5 / h5py for that kind of stuff, mainly because I had no idea that numpy also supports it.

Which is nice until you need to go faster and find out that h5py does not support multithreading in Python. (I know you know that, just wanted to put this down for anybody reading it: .npy has a substantial performance advantage when you want to read your data in a background thread. Yes, you can do multiprocessing instead, but that incurs inter-process communication overhead for your batches; see the sketch below.)
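
To make this concrete, a minimal sketch of the background-thread pattern with a memory-mapped .npy file; the file name, batch size, and the train_fn placeholder are illustrative assumptions:

```python
import queue
import threading

import numpy as np

images = np.load('images.npy', mmap_mode='r')
batches = queue.Queue(maxsize=4)  # bounded, so the producer cannot run ahead

def producer(batchsize=16):
    for start in range(0, len(images), batchsize):
        # Copying out of the memmap performs the disk reads in this
        # thread, keeping them out of the training loop.
        batches.put(np.asarray(images[start:start + batchsize]))
    batches.put(None)  # sentinel: no more batches

threading.Thread(target=producer, daemon=True).start()

while True:
    batch = batches.get()
    if batch is None:
        break
    # train_fn(batch)  # hypothetical training step goes here
```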

We should probably put this kind of knowledge somewhere...

Sander Dieleman

Aug 30, 2016, 4:54:45 PM
to lasagne-users
Good call! That has bitten me pretty hard in the past: when you try to use an HDF5 file from multiple threads, it doesn't immediately throw an exception; it just returns wrong results (and then eventually crashes).

Sander

Jan Schlüter

Aug 31, 2016, 12:29:39 PM
to lasagne-users
Good call! That has bitten me pretty hard in the past: when you try to use an HDF5 file from multiple threads, it doesn't immediately throw an exception; it just returns wrong results (and then eventually crashes).

Oh! When I tried that from within Python, it just wouldn't multi-thread (it blocked on the GIL). Was that from Python or something else? Maybe it also depends on your libhdf5 and h5py versions.

Sander Dieleman

Aug 31, 2016, 8:19:07 PM
to lasagne-users
Come to think of it, this may have been in the multiprocessing case, where the file handle was created before the fork happened.
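
For anybody running into this, a minimal sketch of the fork-safe pattern: open the HDF5 file inside each worker process, after the fork. The file name data.h5 and dataset name 'images' are illustrative assumptions:

```python
import multiprocessing as mp

import h5py

def worker(start, stop, out):
    # Each process opens its own handle here, inside the child, so no
    # HDF5 state is shared across the fork boundary.
    with h5py.File('data.h5', 'r') as f:
        out.put(f['images'][start:stop])

if __name__ == '__main__':
    out = mp.Queue()
    procs = [mp.Process(target=worker, args=(i * 16, (i + 1) * 16, out))
             for i in range(4)]
    for p in procs:
        p.start()
    chunks = [out.get() for _ in procs]  # drain the queue before joining
    for p in procs:
        p.join()
```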