Lazy-loading text files for training on a GPU


John King

Mar 28, 2015, 2:51:40 PM
to tor...@googlegroups.com
I'm looking at Torch7 as an option for several ML text tasks. Is it possible to lazy-load large text files (>20 GB) for training on a GPU with Torch? I understand that I'll have to go through the OS file system, but is this something commonly done? How would I go about doing it in Torch?

I don't have much experience with GPUs either.

soumith

Mar 28, 2015, 2:54:57 PM
to torch7 on behalf of John King
By lazy-load, do you mean loading parts of the text files (indexed or at random)? You can do that using standard Lua file I/O.

If you use the high-level cutorch/cunn interface, you don't need to have experience with GPUs; it's all taken care of for you.
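For example, something along these lines; a minimal sketch, where the file name, chunk size, and the byte-to-tensor step are placeholders:

require 'torch'
require 'cutorch'

-- Minimal sketch: read a big file in fixed-size chunks with plain Lua I/O,
-- then move each chunk to the GPU via cutorch.
local CHUNK_BYTES = 16 * 1024 * 1024           -- 16 MB per read; tune as needed

local f = assert(io.open('corpus.txt', 'rb'))  -- placeholder file name
while true do
   local chunk = f:read(CHUNK_BYTES)           -- reads up to CHUNK_BYTES, nil at EOF
   if not chunk then break end
   -- copy the raw bytes into a tensor without a per-byte Lua loop
   local bytes = torch.ByteTensor(torch.ByteStorage():string(chunk))
   local gpuBatch = bytes:float():cuda()       -- now resident in GPU memory
   -- ... tokenize / preprocess gpuBatch and run a training step ...
end
f:close()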

John King

Mar 28, 2015, 3:51:30 PM
to tor...@googlegroups.com
Both, actually: in large chunks, enough to fit in the GPU's memory, and possibly, though less likely, at random. I'm looking to do the first at the moment.

How would having to go back to disk affect performance?

soumith

Mar 28, 2015, 4:07:42 PM
to torch7 on behalf of John King
If we have to hit the disk, we usually do data-loading on separate threads using the threads package (https://github.com/torch/threads-ffi). An example is here: https://github.com/soumith/imagenet-multiGPU.torch
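Roughly like this; a minimal sketch, where the file name, offsets, and batch count are placeholders:

local threads = require 'threads'

-- Minimal sketch: fetch data on worker threads while the main thread trains.
local pool = threads.Threads(
   2,                                -- two loader threads
   function() require 'torch' end   -- runs once in each worker
)

local BATCH_BYTES = 1024 * 1024     -- 1 MB per batch; placeholder
for batch = 1, 100 do               -- placeholder batch count
   pool:addjob(
      function()                    -- runs on a worker thread
         local f = assert(io.open('corpus.txt', 'rb'))
         f:seek('set', (batch - 1) * BATCH_BYTES)  -- jump to this batch's offset
         local chunk = f:read(BATCH_BYTES)
         f:close()
         return chunk
      end,
      function(chunk)               -- runs back on the main thread
         if chunk then
            -- ... build tensors from chunk and do the training step ...
         end
      end
   )
end
pool:synchronize()                  -- wait for all outstanding jobs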

UserNotDefined

Mar 28, 2015, 5:05:09 PM
to tor...@googlegroups.com
Great! I'll take a look. Also, just asking your opinion based on your experience: do you think it's worth trying to work with large amounts of text in Torch?

soumith

Mar 28, 2015, 5:24:10 PM
to torch7 on behalf of UserNotDefined
I routinely work with very large amounts of text.

One thing to keep in mind is to use the right data structures and efficient data-loading routines for such huge blobs of text.
The most common issue is people using a simplistic package for large-scale work and then not being sure what's going on.

For large text, use a data structure like tds.hash, which does not put pressure on the garbage collector:
https://github.com/torch/tds#example
It works (almost) exactly like a Lua table for data-storage purposes.
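For instance, a minimal sketch with placeholder keys and values:

local tds = require 'tds'

-- tds.hash keeps its entries outside the Lua heap, so the garbage
-- collector never has to scan millions of them.
local vocab = tds.hash()       -- used (almost) like a plain Lua table
vocab['the'] = 1
vocab['cat'] = 2
vocab[3] = 'sat'               -- mixed key types are fine

print(#vocab)                  -- 3
for k, v in pairs(vocab) do    -- iterates like a table
   print(k, v)
end
vocab['the'] = nil             -- deletes the entry, as with a table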

To load the data, open a file handle and read it with offsets. DON'T use a small-scale package like https://github.com/clementfarabet/lua---csv or a naive approach like io.read("*all").
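Something like this; a minimal sketch, where the offsets and lengths are placeholders that in practice would come from an index built in one pass over the file:

-- Minimal sketch: seek to a known byte offset and read only that record,
-- instead of slurping the whole file with io.read("*all").
local f = assert(io.open('corpus.txt', 'rb'))  -- placeholder file name

local function readRecord(offset, length)
   f:seek('set', offset)       -- jump straight to the record
   return f:read(length)       -- read just those bytes
end

local sample = readRecord(0, 4096)   -- e.g. the first 4 KB
-- ... tokenize sample, fill tensors, train ...
f:close()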

UserNotDefined

Mar 28, 2015, 5:36:43 PM
to tor...@googlegroups.com
Thanks for the info. Also, there are techniques in NLP where the computation can be parallelized. I understand one can use threads in Lua, but does something similar exist in Torch for GPUs? I'm not even sure whether the concept of threading applies to GPUs, aside from having multiple GPUs.

soumith

Mar 28, 2015, 5:44:09 PM
to torch7 on behalf of UserNotDefined
You can have streams in CUDA. They are not yet exposed at the Lua level, but our GPU code is pretty efficient and parallelizes things well. We are working on exposing streams at the Lua level, to give users the ability to parallelize computation even further.