Pre-fetching data


Ian Goodfellow

Oct 15, 2010, 4:38:09 PM
to theano...@googlegroups.com
Is there any way to set up a "producer thread, consumer thread" scenario for use with theano? Most of my work involves datasets where the individual examples are very large, and a significant amount of the time it takes to process them is just file I/O. It would be nice if I could hide some of that latency by having one thread load examples while another thread does stochastic gradient descent on the loaded examples and then discards them. It seems like I'm out of luck because there is no such thing as threading in python. Is there any way of working around that restriction?

Josh Bleecher Snyder

Oct 15, 2010, 4:46:30 PM
to theano...@googlegroups.com

Python does support threading (albeit not all *that* thoroughly, but I
think it handles file i/o ok). It's not too tough to put together a
simple work queue using e.g. the queue module:
http://docs.python.org/library/queue.html
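
For instance, a minimal producer/consumer sketch along those lines (the
file list and the train_on() step are made-up placeholders):

import threading
import Queue  # the stdlib module is named 'queue' in Python 3
import cPickle  # 'pickle' in Python 3

filenames = ['examples_%03d.pkl' % i for i in range(100)]  # made up
q = Queue.Queue(maxsize=2)  # small bound so the loader can't run far ahead

def loader():
    for fname in filenames:
        f = open(fname, 'rb')
        example = cPickle.load(f)
        f.close()
        q.put(example)  # blocks while the queue is full
    q.put(None)  # sentinel marking the end of the data

t = threading.Thread(target=loader)
t.daemon = True
t.start()

while True:
    example = q.get()
    if example is None:
        break
    train_on(example)  # made-up stand-in for the SGD update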

If that doesn't work for you, providing some details as to why might
lead the way to a more helpful answer.

-josh

Ian Goodfellow

Oct 15, 2010, 4:48:30 PM
to theano...@googlegroups.com
OK, I was under the impression that only stackless python allowed threads, but I was wrong.

Dumitru Erhan

Oct 15, 2010, 5:21:07 PM
to theano...@googlegroups.com
On Fri, Oct 15, 2010 at 16:46, Josh Bleecher Snyder <josh...@gmail.com> wrote:
>> Is there any way to set up a "producer thread, consumer thread" scenario for
>> use with theano? Most of my work involves datasets where the individual
>> examples are very large, and a significant amount of the time it takes to
>> process them is just file I/O. It would be nice if I could hide some of that
>> latency by having one thread load examples while another thread does
>> stochastic gradient descent on the loaded examples and then discards them.
>> It seems like I'm out of luck because there is no such thing as threading in
>> python. Is there any way of working around that restriction?
>
> Python does support threading (albeit not all *that* thoroughly, but I
> think it handles file i/o ok). It's not too tough to put together a
> simple work queue using e.g. the queue module:
> http://docs.python.org/library/queue.html


Do calls to C functions in Python block I/O? (I had read that somewhere, but I might have gotten it wrong.) I tried doing what Ian wants (with this particular module), but I never had any luck actually gaining performance. I never investigated too much, though.

Dumitru

Ian Goodfellow

Oct 15, 2010, 5:33:59 PM
to theano...@googlegroups.com
What exactly do you mean by block I/O? Do you mean you think a call to a C function in one thread might prevent an I/O operation from taking place in another thread?

From what I've read in the last few minutes, it sounds like while threads are supported, every subexpression that touches a python object needs to get the global lock. This lock can be dropped during some I/O operations, but presumably initiating each I/O operation requires touching the python object that you want the result written to. This means you would not be able to start any new I/O operations while your theano function is running (the C code for theano ops definitely touches python objects without acquiring any lock so I'm assuming we must hold the lock throughout the entire execution of the theano function). Do you think that could be what prevented you from getting performance improvements, Dumitru?

Dumitru Erhan

Oct 15, 2010, 5:41:50 PM
to theano...@googlegroups.com
On Fri, Oct 15, 2010 at 17:33, Ian Goodfellow <goodfel...@gmail.com> wrote:
> What exactly do you mean by block I/O? Do you mean you think a call to a C function in one thread might prevent an I/O operation from taking place in another thread?
>
> From what I've read in the last few minutes, it sounds like while threads are supported, every subexpression that touches a python object needs to get the global lock. This lock can be dropped during some I/O operations, but presumably initiating each I/O operation requires touching the python object that you want the result written to. This means you would not be able to start any new I/O operations while your theano function is running (the C code for theano ops definitely touches python objects without acquiring any lock so I'm assuming we must hold the lock throughout the entire execution of the theano function). Do you think that could be what prevented you from getting performance improvements, Dumitru?


In a nutshell, yes :)
Dumitru
 

--
http://dumitru.ca, +1-514-432-8435

Ian Goodfellow

Oct 15, 2010, 5:52:49 PM
to theano...@googlegroups.com
Is theano compatible with any implementations of stackless python?

Josh Bleecher Snyder

Oct 15, 2010, 6:31:35 PM
to theano...@googlegroups.com
>> What exactly do you mean by block I/O? Do you mean you think a call to a C
>> function in one thread might prevent an I/O operation from taking place in
>> another thread?
>>
>> From what I've read in the last few minutes, it sounds like while threads
>> are supported, every subexpression that touches a python object needs to get
>> the global lock. This lock can be dropped during some I/O operations, but
>> presumably initiating each I/O operation requires touching the python object
>> that you want the result written to. This means you would not be able to
>> start any new I/O operations while your theano function is running (the C
>> code for theano ops definitely touches python objects without acquiring any
>> lock so I'm assuming we must hold the lock throughout the entire execution
>> of the theano function). Do you think that could be what prevented you from
>> getting performance improvements, Dumitru?
>
> In a nutshell, yes :)

Hmmm...bummer.

Another option to look into is the multiprocessing module:
http://docs.python.org/library/multiprocessing.html -- basically
multithreading but via processes, thus avoiding the GIL. It looks like
it might offer a decent alternative, as long as the IPC doesn't prove
to be too slow and/or the shared memory facilities not helpful.
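
A rough sketch of that approach (again, the file list and train_on()
are placeholders):

import multiprocessing
import cPickle  # 'pickle' in Python 3

def loader(filenames, q):
    # Runs in a child process with its own interpreter (and GIL), so it
    # can block on disk I/O without slowing down the parent at all.
    for fname in filenames:
        f = open(fname, 'rb')
        q.put(cPickle.load(f))  # the example gets re-pickled for the IPC
        f.close()
    q.put(None)  # sentinel

if __name__ == '__main__':
    filenames = ['examples_%03d.pkl' % i for i in range(100)]  # made up
    q = multiprocessing.Queue(maxsize=2)
    p = multiprocessing.Process(target=loader, args=(filenames, q))
    p.daemon = True
    p.start()
    while True:
        example = q.get()
        if example is None:
            break
        train_on(example)  # placeholder for the SGD step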

-josh

Nicolas Pinto

Oct 15, 2010, 6:39:58 PM
to theano...@googlegroups.com
You may also want to consider threads plus Cython's 'nogil' option:
http://docs.cython.org/src/userguide/external_C_code.html#acquiring-and-releasing-the-gil

There was an example in Brian Granger's HPC Tutorial at Scipy'10. If
you can't find it, let me know.

HTH

N

--
Nicolas Pinto
Ph.D. Candidate, Brain & Computer Sciences
Massachusetts Institute of Technology, USA
http://web.mit.edu/pinto

Razvan Pascanu

Oct 15, 2010, 8:26:21 PM
to theano...@googlegroups.com
It is, Arnaud tried it. He would know better if there are any issues though.

Arnaud Bergeron

Oct 18, 2010, 3:36:57 PM
to theano...@googlegroups.com
2010/10/15 Razvan Pascanu <r.pa...@gmail.com>:

> It is, Arnaud tried it. He would know better if there are any issues though.

Yes, stackless is compatible. However, I don't think they lifted the
GIL restriction, so no performance improvement there.

Your other option would be to code an extension module that releases
the GIL and does I/O in the background. It shouldn't be too hard.

--
The SnW brigade wants to recruit you - http://www.brigadesnw.com

Ian Goodfellow

Oct 18, 2010, 3:39:36 PM
to theano...@googlegroups.com
If my files were made with cPickle, can I actually do I/O without holding the GIL?

Dmitry Chichkov

Oct 18, 2010, 4:13:01 PM
to theano...@googlegroups.com
Hi Ian,

Are you sure it's actually file I/O that slows you down? From my experience with Python, pickling data is quite fast and usually not a limiting factor.
It runs pretty much at disk speed for sequential reads: 100 MB/sec to 1 GB/sec. Have you measured your disk read / pickling throughput?
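
For example (with a made-up filename), something like this would
separate the raw read time from the unpickling time:

import time
import cPickle  # 'pickle' in Python 3

t0 = time.time()
f = open('examples_000.pkl', 'rb')  # made-up name for one of your files
raw = f.read()  # pure disk read
f.close()
t1 = time.time()
data = cPickle.loads(raw)  # pure unpickling, no I/O involved
t2 = time.time()
print('read: %.2fs, unpickle: %.2fs' % (t1 - t0, t2 - t1))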

And just in case, you keep all the input data in one large pickled [or marshal-ed] file, right? 

-- Dmitry

Ian Goodfellow

Oct 18, 2010, 4:19:46 PM
to theano...@googlegroups.com
I don't think pickling is the limiting factor. I think the limiting factor is disk / network access. So I guess one solution could be to have a thread that copies files into a ram disk and then let the main thread de-pickle them.
They're not all in one file; they're in hundreds of different files, each of which is around 500 MB.

Dmitry Chichkov

Oct 18, 2010, 4:42:34 PM
to theano...@googlegroups.com
Mmm... are you running it on a cluster, or is it a single box? And again, have you actually measured your disk read / pickling throughput? And you're actually loading the data with cPickle, not reading it using some other mechanism, right?

If it is a single box: 1000 files x 500 MB each = 500 GB. If this is indeed the case, it might be worth investing in a 10,000 RPM 1 TB drive and forgetting about slow disk reads for a while. You'd be able to read your 500 GB in a few minutes :)

-- Dmitry

Ian Goodfellow

Oct 18, 2010, 4:45:37 PM
to theano...@googlegroups.com
Cluster, and I'm loading them by passing the file object to cPickle.

Josh Bleecher Snyder

Jan 3, 2011, 4:30:03 PM
to theano...@googlegroups.com
Hi Ian,

> I'm loading them by passing the file object to cPickle.

I know that this is a really old thread, and you may have already
found a good solution, but just in case, I thought I'd share something
I just discovered: Python's gzip file wrapper *really* slows things
down.

Of course, without the gzip wrapper, files are much larger. However,
courtesy of a recent email from David Warde-Farley about carray, I've
been playing with another alternative, blosc
(https://github.com/FrancescAlted/python-blosc).

I've been experimenting with data compression options on the mnist
pickle file used with the tutorials. I've tried three things so far:

(1) Use the gzipped pickle file exactly as it comes, with gzip.open.
(2) gunzip the pickle file, and just use plain open().
(3) Use blosc.pack_array on each numpy array in the file, and then
pickle the results. When loading the file, use blosc.unpack_array to
restore each numpy array.
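
In case it helps, (3) boils down to something like this (the array
names are made up):

import cPickle  # 'pickle' in Python 3
import blosc

# Saving: pack each numpy array into a compressed string, then pickle
# the list of packed strings.
packed = [blosc.pack_array(a) for a in (train_x, train_y)]
f = open('mnist_blosc.pkl', 'wb')
cPickle.dump(packed, f, cPickle.HIGHEST_PROTOCOL)
f.close()

# Loading: unpickle the packed strings, then unpack each one back into
# a numpy array.
f = open('mnist_blosc.pkl', 'rb')
train_x, train_y = [blosc.unpack_array(s) for s in cPickle.load(f)]
f.close()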

Summary:

Method   File size   Total file load time (including all decompression)
gzip     220.0 MB    6.76s
open     16.2 MB     0.52s
blosc    26.4 MB     0.87s

So it looks like blosc might actually offer a nice middle ground, in
terms of keeping file sizes small while still offering fast read times.

Of course, if you're not currently using the gzip wrapper, this
doesn't help much...

I found using blosc to be quite straightforward, but I can share the
crude code I cobbled together to test this, if it would be of any use.

-josh

James Bergstra

Jan 3, 2011, 5:57:48 PM
to theano...@googlegroups.com
Interesting - but did you accidentally switch some numbers in the table? It doesn't make sense to me.
--
http://www-etud.iro.umontreal.ca/~bergstrj

Josh Bleecher Snyder

Jan 3, 2011, 7:00:31 PM
to theano...@googlegroups.com
> Interesting - but did you accidentally switch some numbers in the table? It
> doesn't make sense to me.

I did indeed! I transposed the file sizes for gzip/open. Good catch.
Fixed version:

Method   File size   Total file load time (including all decompression)
open     220.0 MB    0.52s
gzip     16.2 MB     6.76s
blosc    26.4 MB     0.87s

It looks like future versions of carray will avoid the need for
manually managing blosc compression and will make it a bit easier --
see http://groups.google.com/group/carray/browse_thread/thread/1aadced6eefb359.
When 0.4 comes out, I plan to revisit. Adding theano support for
carrays (if only via triggering automatic exporting of slices to numpy
arrays) could make this almost entirely transparent.

-josh

Josh Bleecher Snyder

Jan 3, 2011, 7:03:09 PM
to theano...@googlegroups.com
And in case you're curious, the breakdown for blosc is 0.06s to read
the file contents, and 0.81s to decompress.

(This is in keeping with the straightforward open case -- 0.52s to
read 220 MB is roughly 420 MB/s, which matches pretty closely the
0.06s to read 26.4 MB, roughly 440 MB/s.)

-josh


Frédéric Bastien

Jan 4, 2011, 1:38:13 PM
to theano...@googlegroups.com
Hi,

Interesting. Did you try the transpose trick with gzip too, or only
with carray? It could help me make the file smaller and make the
decompression time smaller too.

Don't forget that the compression ratio of carray vs gzip will
change. carray is specialized for data with low entropy; gzip is not.
So you must do the file size comparison for each dataset that you
will use. Also, for use on a cluster with only 1 file server serving
~350 jobs running at the same time, the file size is more important
than the decompression time. So people in our lab don't use it!

We need a better way to deal in Theano with datasets that don't fit in
memory. We also need a way to generate, directly in Theano, output
that doesn't fit into memory.

For this I plan to test, at the end of this week and next week,
something that will use PyTables [1]. It is from the same author as
carray and allows using the same compression algorithm, as well as
gzip and lzo. It also handles the case where not all the data fits in
memory.

I think I read in a paper on PyTables that lzo gives approximately
the same file size as gzip, but is faster at decompression. PyTables
uses gzip by default, as it is installed by default, but lzo is not.

[1] http://www.pytables.org/
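
For what it's worth, the kind of usage I have in mind looks roughly
like this (the file name, shapes and compression level are invented
for the example):

import numpy
import tables

# Writing: an extendable, compressed array on disk, filled chunk by chunk.
h5 = tables.openFile('dataset.h5', 'w')
filters = tables.Filters(complevel=5, complib='zlib')  # or 'lzo'/'blosc'
X = h5.createEArray(h5.root, 'X', tables.Float32Atom(), shape=(0, 784),
                    filters=filters)
for i in range(10):
    X.append(numpy.random.rand(1000, 784).astype('float32'))  # fake data
h5.close()

# Reading: slice minibatches without loading the whole array in memory.
h5 = tables.openFile('dataset.h5', 'r')
X = h5.root.X
minibatch = X[0:100]  # comes back as an ordinary numpy array
h5.close()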

Fred

p.s. I will try to remember the trick to transpose the input. I think
it will be applicable to many algorithms.



Josh Bleecher Snyder

Jan 4, 2011, 6:56:12 PM
to theano...@googlegroups.com
> Interesting. Did you try the transpose trick with gzip too, or only
> with carray? It could help me make the file smaller and make the
> decompression time smaller too.

Actually, for the numbers I gave in this email thread, I didn't do any
transposition at all. (Sorry for any confusion. When I said in
response to James that I transposed the numbers, I meant the numbers
in the final results table, not the actual datasets.)


> Don't forget that the compression ratio of carray vs gzip will
> change. carray is specialized for data with low entropy; gzip is not.
> So you must do the file size comparison for each dataset that you
> will use.

There will of course be variation per dataset, although gzip also
won't work well on data with high entropy (almost by definition). And
actually, the way that blosc's pack_array works (which is what I was
using), is by pickling the numpy array and then compressing the
resulting string. So it is actually treating the ndarray as an opaque
string, much like gzip. (Interestingly, the pickling/unpickling
accounts for the vast majority of blosc's pack_array and unpack_array
run time.) So while I agree that it is definitely worth experimenting
with each new data set, I think blosc has decent odds of performing
well across the board, at least as compared with gzip.


> We need a better way to deal in Theano with datasets that don't fit in
> memory. We also need a way to generate, directly in Theano, output
> that doesn't fit into memory.
>
> For this I plan to test, at the end of this week and next week,
> something that will use PyTables [1]. It is from the same author as
> carray and allows using the same compression algorithm, as well as
> gzip and lzo. It also handles the case where not all the data fits in
> memory.

That'd definitely be handy for me. Part of the reason that I've been
poking around at all these is that my dataset will soon not fit in
host memory either, so I'm looking at either keeping it compressed in
memory or reading it in chunks from the filesystem as needed. I look
forward to seeing what you come up with (some form of transparent
compression and decompression, I presume?), particularly as it strikes
me as being a hard problem to solve generally. Please do let me know
if I can be of assistance on this front.

Frédéric Bastien

Jan 6, 2011, 9:47:28 AM
to theano...@googlegroups.com
On Tue, Jan 4, 2011 at 6:56 PM, Josh Bleecher Snyder
<josh...@gmail.com> wrote:
>> We need a better way to deal in Theano with datasets that don't fit in
>> memory. We also need a way to generate, directly in Theano, output
>> that doesn't fit into memory.
>>
>> For this I plan to test, at the end of this week and next week,
>> something that will use PyTables [1]. It is from the same author as
>> carray and allows using the same compression algorithm, as well as
>> gzip and lzo. It also handles the case where not all the data fits in
>> memory.
>
> That'd definitely be handy for me. Part of the reason that I've been
> poking around at all these is that my dataset will soon not fit in
> host memory either, so I'm looking at either keeping it compressed in
> memory or reading it in chunks from the filesystem as needed. I look
> forward to seeing what you come up with (some form of transparent
> compression and decompression, I presume?), particularly as it strikes
> me as being a hard problem to solve generally. Please do let me know
> if I can be of assistance on this front.

I don't plan to support compressed in-memory data directly in Theano,
but I think it is possible to do so with PyTables :) It is to keep all
that complicated stuff outside Theano as long as possible that I will
try PyTables first :)

I will keep the list updated when I have something working.

Fred
