Interface change - Iterators when no data is required


Pascal Lamblin

May 10, 2013, 2:39:09 PM
to pylearn-dev
Hi,

The mechanism for using data_specs (Spaces and sources) in costs,
models and monitoring channels, to request the appropriate data in the
appropriate format from data sets, is making progress; it now almost
works with SGD.

I'm having a design problem in the case where no data from a given
data set is actually needed: for instance, if the only costs used for
learning and monitoring are penalties over the model's parameters.

I've created a Space (NullSpace) for that, so that costs can specify
that they do not use any data, which avoids that data being
generated, copied, reshaped, etc. For code that actually needs an
object to be passed, I've used None as a placeholder value for batches.

The problem occurs when trying to compute the size of such a batch.
Since the data is not actually read from the data set, I don't know
what the batch size would have been (the last batch of the set, for
instance, may be smaller than the requested size). I talked with
David W-F about that when I first encountered the problem, and we
decided to use 0, since no actual sample was returned, and a 0 cannot
be returned in any other situation. The iterator would still produce
as many values as if actual data were requested.

However, Monitor actually checks that the sum of the returned batch
sizes corresponds to the total size the data set advertised, and it
complains (see error below).

I see different possible solutions:
- Make the data set iterate over the data, even if none is requested,
so we have the right batch size, and find a way to convey that information
to get_batch_size();
- Make the iterator not return any items, preventing iterating over
a data set when no data is returned, and change the existing tests
that use this feature;
- Make Monitor accept 0 as the number of examples, provided the Space
does not contain any data.

I would favour the third solution, but I'm open to discussion and other
suggestions.
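
To make the third option concrete, here is a rough sketch, not the actual Monitor code. NullSpace below is a local stand-in for the real class, and check_num_examples is a hypothetical helper that only mimics the consistency check in monitor.py:

```python
class NullSpace(object):
    """Local stand-in for the real NullSpace: a space holding no data."""


def check_num_examples(expected, actual, space):
    """Hypothetical helper mimicking Monitor's consistency check.

    Raises RuntimeError when the iterator delivered a different number
    of examples than it advertised, except when the space carries no
    data and the iterator reported 0 examples.
    """
    if isinstance(space, NullSpace) and actual == 0:
        # Nothing was read from the data set, so there is nothing to compare.
        return
    if actual != expected:
        raise RuntimeError(
            "At compile time, your iterator said it had " + str(expected)
            + " examples total, but at runtime it gave us " + str(actual)
            + ".")
```

With this, a count of 0 is tolerated only for a NullSpace; any other mismatch still raises.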


======================================================================
ERROR: test_monitor.test_dont_serialize_dataset
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/lisa/os/epd-7.1.2/lib/python2.7/site-packages/nose/case.py", line 187, in runTest
    self.test(*self.arg)
  File "/u/lamblinp/code/Pylearn2/pylearn2/tests/test_monitor.py", line 333, in test_dont_serialize_dataset
    monitor()
  File "/u/lamblinp/code/Pylearn2/pylearn2/monitor.py", line 221, in __call__
    + str(actual_ne) + ".")
RuntimeError: At compile time, your iterator said it had 4.0 examples total, but at runtime it gave us 0.

----------------------------------------------------------------------


--
Pascal

Pascal Lamblin

May 10, 2013, 2:48:30 PM
to pylearn-dev
On Fri, May 10, 2013, Pascal Lamblin wrote:
> I see different possible solutions:
> - Make the data set iterate over the data, even if none is requested,
> so we have the right batch size, and find a way to convey that information
> to get_batch_size();
> - Make the iterator not return any items, preventing iterating over
> a data set when no data is returned, and change the existing tests
> that use this feature;
> - Make Monitor accept 0 as the number of examples, provided the Space
> does not contain any data.
>
> I would favour the third solution, but I'm open to discussion and other
> suggestions.

Actually, another solution would be to modify FiniteDatasetIterator or
SubsetIterator to have num_examples return 0 when the data_specs is empty.
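
Roughly, and only as a sketch: the class below is hypothetical and is not FiniteDatasetIterator or SubsetIterator; it just shows num_examples reporting 0 when no data is requested while the number of batches stays the same.

```python
class CountingIterator(object):
    """Hypothetical iterator bookkeeping, for illustration only."""

    def __init__(self, dataset_size, batch_size, data_is_requested):
        self.dataset_size = dataset_size
        self.batch_size = batch_size
        self.data_is_requested = data_is_requested

    @property
    def num_batches(self):
        # Ceiling division: the last batch may be smaller than batch_size.
        return -(-self.dataset_size // self.batch_size)

    @property
    def num_examples(self):
        # Advertise 0 examples when the data_specs are empty, so the
        # advertised total matches what the batches will report.
        return self.dataset_size if self.data_is_requested else 0
```

The advertised total and the sum of the per-batch sizes would then both be 0, so the existing check in Monitor would not need to change.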

Ian Goodfellow

May 10, 2013, 2:50:54 PM
to pylearn-dev
Ideally Monitor should special-case the non-data-dependent monitoring
channels and just not do any iterating; it's a waste of computation to
average together several observations of the same constant.

The reason Monitor rejects iterators over 0 examples is that it has to
divide by the number of examples to compute the average, and this will
give a NaN result.
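
A small sketch of both points, under assumptions: channel_average and its arguments are made-up names, not Monitor's API. Plain Python raises on 0.0 / 0.0, so the NaN that array arithmetic would produce is returned explicitly here.

```python
def channel_average(evaluate, batches, uses_data=True):
    """Example-weighted average of a channel over batches.

    `evaluate(batch)` is assumed to return (value, batch_size).
    """
    if not uses_data:
        # Constant channel: evaluate it once instead of averaging
        # several observations of the same value.
        value, _ = evaluate(None)
        return value
    total = 0.0
    count = 0
    for batch in batches:
        value, size = evaluate(batch)
        total += value * size
        count += size
    if count == 0:
        # Dividing by 0 examples has no meaningful result.
        return float('nan')
    return total / count
```

Evaluating a data-independent channel once avoids both the wasted computation and the division by zero.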

Pascal Lamblin

May 10, 2013, 3:51:39 PM
to pylea...@googlegroups.com
On Fri, May 10, 2013, Ian Goodfellow wrote:
> Ideally Monitor should special-case the non-data-dependent monitoring
> channels and just not do any iterating; it's a waste of computation to
> average together several observations of the same constant.

For monitoring, I agree, and it might not be that hard to do.

For training, though, there is currently a test case where the training
cost is some penalty on the model's parameters, and does not use data at
all. The test expects (and so would I) the same number of updates to the
parameters to be made as if the cost included a data-dependent term.
Is it correct to continue reporting "n batches, 0 examples" in that case?

> The reason Monitor rejects iterators over 0 examples is that it has to
> divide by the number of examples to compute the average, and this will
> give a NaN result.

We could also consider that a special case.
--
Pascal

Ian Goodfellow

May 10, 2013, 4:11:32 PM
to pylearn-dev
which test?

Pascal Lamblin

May 10, 2013, 4:58:32 PM
to pylea...@googlegroups.com
On Fri, May 10, 2013, Ian Goodfellow wrote:
> which test?

I think it was training_algorithms/tests/test_sgd.py:test_lr_scalers

--
Pascal

Ian Goodfellow

May 10, 2013, 4:56:09 PM
to pylearn-dev
How about if we just modify the test to use data, and set sgd.py to
raise NotImplementedError if it gets a NullSpace for now?
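
Something along these lines, as a sketch only: NullSpace here is a local stand-in, and check_data_specs is a hypothetical function, not what sgd.py actually contains. It assumes data_specs is a (space, source) pair.

```python
class NullSpace(object):
    """Local stand-in for the real NullSpace: a space holding no data."""


def check_data_specs(data_specs):
    """Hypothetical guard for a training algorithm's setup step."""
    space, source = data_specs
    if isinstance(space, NullSpace):
        # Training on a cost that uses no data is not supported for now.
        raise NotImplementedError(
            "Training with a cost that requests no data (NullSpace) "
            "is not implemented yet.")
    return space, source
```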

Pascal Lamblin

May 10, 2013, 5:05:07 PM
to pylea...@googlegroups.com
On Fri, May 10, 2013, Ian Goodfellow wrote:
> How about if we just modify the test to use data, and set sgd.py to
> raise NotImplementedError if it gets a NullSpace for now?

Works for me. I'll let you know if there are other problems, for
instance with BGD.

--
Pascal