Bug in DenseDesignMatrix's _apply_holdout?

Gerrit Kieffer

Nov 20, 2015, 9:32:23 AM
to pylearn-dev
Dear pylearn-dev team,

I was recently trying to split my dataset into training and test/validation sets using bootstrap_holdout(...) when I encountered the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[...]/pylearn2.git/pylearn2/datasets/dense_design_matrix.py", line 597, in bootstrap_holdout
    return self._apply_holdout("random_slice", train_size, train_prop)
  File "[...]/pylearn2.git/pylearn2/datasets/dense_design_matrix.py", line 520, in _apply_holdout
  File "[...]/pylearn2.git/pylearn2/datasets/dense_design_matrix.py", line 301, in iterator
  File "[...]/pylearn2.git/pylearn2/utils/iteration.py", line 557, in __init__
    raise ValueError("num_batches cannot be None for random slice "
ValueError: num_batches cannot be None for random slice iteration

I believe this is a bug introduced by commit b5082926151c2b3b94159c614e7ef5e0adbb8b35 (https://github.com/lisa-lab/pylearn2/commit/b5082926151c2b3b94159c614e7ef5e0adbb8b35), in which the argument num_batches=2 was removed from the call to self.iterator.

Can you confirm this? 

Furthermore, is there a way to split a dataset randomly -without- resampling? As far as I can tell, split_dataset_*(...) splits the dataset but keeps the examples in order, so the same examples always end up in the training set and the same ones in the test set, whereas bootstrap_*(...) samples randomly with replacement, so some examples will most likely appear in both the training and the test set.
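
To make the question concrete, here is a minimal sketch of the behaviour I have in mind, written with plain NumPy rather than any pylearn2 API. The function name random_split_without_replacement and its parameters train_prop and seed are hypothetical and chosen only for illustration; the idea is to shuffle the example indices once and cut the permutation in two, so every example lands in exactly one of the two sets:

```python
import numpy as np


def random_split_without_replacement(X, y, train_prop=0.8, seed=0):
    """Shuffle example indices once, then cut them into two disjoint sets.

    Hypothetical helper, not part of pylearn2. X holds one example per row
    and y holds the matching targets.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y must have the same number of examples")
    if not 0.0 < train_prop < 1.0:
        raise ValueError("train_prop must lie strictly between 0 and 1")

    n = X.shape[0]
    rng = np.random.RandomState(seed)
    perm = rng.permutation(n)  # each index appears exactly once

    n_train = int(round(train_prop * n))
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx]), train_idx, test_idx
```

The two (X, y) pairs could then presumably each be wrapped in their own DenseDesignMatrix, but I would prefer a built-in method if one exists.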

Thanks in advance.

Kind regards,
Gerrit Kieffer
