Pattern for persistent examples streams?

Joseph Turian

Dec 21, 2009, 2:04:17 PM
to pylea...@googlegroups.com
I use the following code (specifically get_train_minibatch()) to
read minibatches of examples.
(The examples are too big to shuffle, so I have to read them in
order.)
I would like to make my example stream persistent, so that I can stop
training and restart from where I left off.

However, you cannot pickle a generator.

In my specific code, what is the cleanest way to save and load the
state of the example stream?
More generally, do you have a pattern for this that would work as a
pylearn-wide concept?

Thanks!

# HYPERPARAMETERS and wordmap are defined elsewhere in my code.

def get_train_example():
    # Yield one window of word ids per position, line by line.
    for l in open(HYPERPARAMETERS["TRAIN_SENTENCES"]):
        prevwords = []
        for w in l.split():
            prevwords.append(wordmap.id(w))
            if len(prevwords) >= HYPERPARAMETERS["WINDOW_SIZE"]:
                yield prevwords[-HYPERPARAMETERS["WINDOW_SIZE"]:]

def get_train_minibatch():
    # Group examples into fixed-size minibatches.
    # A trailing partial minibatch is never yielded.
    minibatch = []
    for e in get_train_example():
        minibatch.append(e)
        if len(minibatch) >= HYPERPARAMETERS["MINIBATCH SIZE"]:
            assert len(minibatch) == HYPERPARAMETERS["MINIBATCH SIZE"]
            yield minibatch
            minibatch = []

James Bergstra

Dec 21, 2009, 9:17:00 PM
to pylea...@googlegroups.com
If you want to persist something by pickling it, then that thing
should be a class instance.

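class ExampleStream(object):
    def __init__(self):
        self.file = open(HYPERPARAMETERS["TRAIN_SENTENCES"])
        self.prevwords = []
        self.line = []
        self.line_pos = 0
        self.line_no = 0
        ...
    def __iter__(self): return self
    def next(self):
        # Read lines until one has an unconsumed token (skips blank lines).
        while self.line_pos == len(self.line):
            raw = self.file.readline()
            if not raw:
                raise StopIteration  # end of file
            self.line = raw.split()
            self.line_pos = 0
            self.line_no += 1
        token = self.line[self.line_pos]
        self.line_pos += 1  # advance, or the same token is returned forever
        return token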


Rewrite your other function as another class with a __iter__ function
and a next() function.

Now you can use "for token in ExampleStream()" type syntax, but you
can also make your class picklable (overriding __setstate__ and
__getstate__ as necessary, to seek through the example file when
reloading the stream).

There may be an easier way if you permit yourself more freedom in
terms of restructuring the code, but that would be the persistable way
of doing what you're already doing with generators.
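
To make the __getstate__ / __setstate__ part concrete, here is a minimal sketch, not a definitive implementation. It assumes a plain UTF-8 text file of whitespace-separated tokens; the class name TokenStream and the line_start attribute are made up for illustration and are not part of pylearn. The open file is dropped when pickling and reopened on load, then repositioned with seek() at the byte offset of the line that was being consumed.

```python
class TokenStream(object):
    """Token iterator that can be pickled mid-stream (illustrative sketch)."""

    def __init__(self, path):
        self.path = path
        self.file = open(path, "rb")  # binary mode keeps tell()/seek() simple
        self.line_start = 0           # byte offset of the line being consumed
        self.line = []
        self.line_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        # Read lines until one has an unconsumed token (skips blank lines).
        while self.line_pos == len(self.line):
            self.line_start = self.file.tell()
            raw = self.file.readline()
            self.line = raw.decode("utf-8").split()
            self.line_pos = 0
            if not raw:
                raise StopIteration  # end of file
        token = self.line[self.line_pos]
        self.line_pos += 1
        return token

    next = __next__  # Python 2 spelling of the same method

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["file"]  # open file objects cannot be pickled
        del state["line"]  # re-read from disk when loading
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.file = open(self.path, "rb")
        self.file.seek(self.line_start)
        self.line = self.file.readline().decode("utf-8").split()
```

Only the path, the byte offset, and the position within the line go into the pickle, so the saved state stays small however large the example file is. The sketch also assumes the file has not changed between saving and loading.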

James

--
http://www-etud.iro.umontreal.ca/~bergstrj

Joseph Turian

Dec 22, 2009, 2:36:16 AM
to pylearn-dev
James,

Thanks.

> Rewrite your other function as another class with a __iter__ function
> and a next() function.

Turns out that you can just put the current generator code into
__iter__ and 'yield' the result.
You then don't even need the next method.
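
A rough sketch of that idea, under some assumptions: the position is stored on the object rather than inside the generator, so the object itself pickles with no extra methods. WindowStream, offset and emitted are names invented here for illustration. The sketch yields raw words where the original code yields wordmap ids, and it takes the file path and window size as arguments instead of reading HYPERPARAMETERS.

```python
class WindowStream(object):
    """Yields fixed-size word windows; position lives on self (sketch)."""

    def __init__(self, path, window_size):
        self.path = path
        self.window_size = window_size
        self.offset = 0   # byte offset of the line currently being read
        self.emitted = 0  # windows already yielded from that line

    def __iter__(self):
        # The file handle is local to the generator, so nothing
        # unpicklable is ever stored on self.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            while True:
                line = f.readline()
                if not line:
                    return
                words = line.decode("utf-8").split()
                n_windows = max(0, len(words) - self.window_size + 1)
                # Skip windows that were yielded before the last save.
                for i in range(self.emitted, n_windows):
                    self.emitted = i + 1
                    yield words[i:i + self.window_size]
                self.offset = f.tell()
                self.emitted = 0
```

Pickle the WindowStream object, not the generator returned by iter(). After unpickling, a new "for window in stream" loop continues with the window after the last one yielded. One caveat: if a wrapper such as a minibatch builder buffers examples, the saved position reflects what was yielded, not what was actually trained on, so it is safest to save right after a full minibatch has been consumed.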

Joseph Turian

Dec 22, 2009, 4:05:24 AM
to pylearn-dev