Hi everybody,
I am quite new to Blocks and Fuel and I would really like to solve a small problem. The goal is to 1) shuffle the stream each epoch and 2) read a few batches ahead so the data can be sorted by length (which reduces the amount of padding and therefore memory).
The following code solves 2). First, it creates an IterableDataset from NumPy arrays x and y stored in memory. Second, a DataStream is created and batched into chunks of k * b examples, where b is the batch size and k is the number of batches to read ahead. Finally, if k > 1, the read-ahead examples are sorted by length, unpacked, and re-packed into batches of size b. The last line adds padding (with a mask for 'x') to the resulting batches.
from fuel.datasets import IterableDataset
from fuel.schemes import ConstantScheme
from fuel.streams import DataStream
from fuel.transformers import Batch, Mapping, Padding, SortMapping, Unpack

dataset = IterableDataset({'x': x, 'y': y})
stream = DataStream(dataset=dataset)
# Read k batches ahead, i.e. chunks of k * b examples.
stream = Batch(stream, iteration_scheme=ConstantScheme(k * b))
if k > 1:
    # Sort each chunk by length (_length wraps len), unpack, and re-batch into batches of b.
    stream = Mapping(stream, SortMapping(_length))
    stream = Unpack(stream)
    stream = Batch(stream, iteration_scheme=ConstantScheme(b))
stream = Padding(stream, mask_sources=['x'])
This actually works well. My problem is 1), i.e. how to randomly shuffle the stream each training epoch. I hoped that setting ShuffledScheme or ShuffledExampleScheme as the iteration_scheme would do it, but for some reason it didn't work: either additional arguments were required or ValueErrors were raised.
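For reference, here is a minimal sketch of roughly what I tried (the exact scheme arguments are a reconstruction, not the code I actually ran):

from fuel.schemes import ShuffledExampleScheme

# Attempt: give the stream a shuffled iteration scheme so that examples
# come out in a different order each epoch. With the IterableDataset
# above, this is where the errors described appear.
stream = DataStream(dataset=dataset,
                    iteration_scheme=ShuffledExampleScheme(len(x)))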
Can I somehow use one of these schemes to achieve shuffling? Could you please tell me how? I believe this is a common problem, so it should be very straightforward. Thanks in advance!
Petr