Working with a large-scale dataset that does not fit in memory


saeed.i...@gmail.com

unread,
Mar 1, 2017, 1:58:07 PM3/1/17
to lasagne-users
Hello,

I am working with Lasagne and intend to train my model on a huge dataset (~65 GB) which does not fit into my RAM. The dataset has been serialized and saved as a single pickle file on an HDD. How can I shuffle my dataset in different epochs and get batches to feed the network?

My whole project is on hold because of this problem, so please give me some guidance as soon as you can.
Thanks
Saeed

Jan Schlüter

unread,
Mar 2, 2017, 6:59:25 AM3/2/17
to lasagne-users, saeed.i...@gmail.com
I am working with Lasagne and intend to train my model on a huge dataset (~65 GB) which does not fit into my RAM. The dataset has been serialized and saved as a single pickle file on an HDD. How can I shuffle my dataset in different epochs and get batches to feed the network?

If the dataset does not fit into RAM, how did you save it as a single pickle file? The first step would be to split it up, otherwise you cannot even unpickle it.

My whole project is on hold because of this problem, so please give me some guidance as soon as you can.

There are several things you can do. One option is to split up the file into multiple chunks of, say, 4 GiB. Make sure the data is randomly divided between chunks, not ordered in any way. Then you can load a single chunk at a time, shuffle the examples within this chunk and process them. If the dataset is large enough, it won't make much of a difference if you shuffle *all* examples or shuffle them within a chunk only.
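
To make this concrete, here is a minimal sketch of the chunked scheme. It is purely illustrative: the chunk file names and the 'inputs'/'targets' keys are assumptions, and it presumes each chunk was saved with np.savez and fits in RAM on its own.

import glob
import numpy as np

def iterate_chunked(pattern='chunk_*.npz', batchsize=128):
    files = glob.glob(pattern)
    np.random.shuffle(files)  # visit the chunks in a different order each epoch
    for fname in files:
        chunk = np.load(fname)
        inputs, targets = chunk['inputs'], chunk['targets']
        order = np.random.permutation(len(inputs))  # shuffle within the chunk
        for start in range(0, len(order) - batchsize + 1, batchsize):
            excerpt = order[start:start + batchsize]
            yield inputs[excerpt], targets[excerpt]

You would then loop over iterate_chunked() once per epoch and pass each (inputs, targets) pair to your training function.
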
Another option is to save the data as a .npy file rather than a pickle (if you have multiple tensors in there, like inputs and targets, you'd use a separate .npy file per modality). You can then load the .npy file with np.load(filename, mmap_mode='r') to get a memory-mapped array. It will look like a regular numpy array, so you can randomly access its entries, and the OS will take care of loading them from disk (and caching them) as needed. You can then just use the same batch iterator as in Lasagne's MNIST example. Make sure to put your data on an SSD so reading (and random access in particular!) is fast.
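
For reference, a minimal sketch of this variant, assuming the data sits in two files inputs.npy and targets.npy (names are just placeholders) that were written with np.save:

import numpy as np

# nothing is read into RAM here, only the array headers
inputs = np.load('inputs.npy', mmap_mode='r')
targets = np.load('targets.npy', mmap_mode='r')

def iterate_minibatches(inputs, targets, batchsize, shuffle=False):
    # same shape as the iterator in the MNIST example; fancy indexing on the
    # memmap copies only the requested rows from disk into an ordinary array
    assert len(inputs) == len(targets)
    indices = np.arange(len(inputs))
    if shuffle:
        np.random.shuffle(indices)
    for start in range(0, len(inputs) - batchsize + 1, batchsize):
        excerpt = indices[start:start + batchsize]
        yield inputs[excerpt], targets[excerpt]

for x_batch, y_batch in iterate_minibatches(inputs, targets, 128, shuffle=True):
    train_fn(x_batch, y_batch)  # train_fn being your compiled training function
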
If you need to create the large .npy file(s) on a computer that doesn't have enough RAM to hold them all at once, you can create a writeable memory-mapped .npy file using np.lib.format.open_memmap and then assign values to slices or rows of it.
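
Roughly like this (a sketch only; the shapes and the load_parts() helper that yields smaller arrays are made up for illustration):

import numpy as np

n_examples, sample_shape = 1000000, (3, 64, 64)
out = np.lib.format.open_memmap('inputs.npy', mode='w+', dtype=np.float32,
                                shape=(n_examples,) + sample_shape)
row = 0
for part in load_parts():           # yields small ndarrays, one piece at a time
    out[row:row + len(part)] = part # written straight through to disk
    row += len(part)
out.flush()                         # make sure everything has hit the disk
del out
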

Hope this helps!
Best, Jan

isen...@googlemail.com

unread,
Mar 3, 2017, 7:44:14 AM3/3/17
to lasagne-users, saeed.i...@gmail.com
I can only emphasize the usefulness of numpy arrays with mmap_mode='r'. I use them all the time for all my datasets, even the ones that would theoretically fit in RAM. My batch generator (with on-the-fly data augmentation) is multithreaded, and due to the nature of Python multithreading, even smaller datasets become impossible to use if everything is kept in RAM.
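
In case it helps, a minimal sketch of such a background-thread wrapper (names are illustrative only; it wraps any batch generator, e.g. one reading from memory-mapped arrays):

import threading
try:
    import Queue as queue   # Python 2
except ImportError:
    import queue            # Python 3

def threaded_generator(generator, max_prefetch=4):
    q = queue.Queue(maxsize=max_prefetch)
    sentinel = object()

    def producer():
        for item in generator:
            q.put(item)     # blocks once max_prefetch batches are waiting
        q.put(sentinel)

    thread = threading.Thread(target=producer)
    thread.daemon = True    # don't keep the interpreter alive on exit
    thread.start()
    item = q.get()
    while item is not sentinel:
        yield item
        item = q.get()

The augmentation then runs in the producer thread while the main thread is busy with the GPU.
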

saeed.i...@gmail.com

unread,
Mar 5, 2017, 1:29:27 AM3/5/17
to lasagne-users
Hello guys,
Thanks for your tips.
The code is now working. 