Unreasonably slow data access from streams.get_epoch_iterator()


ppk...@gmail.com

Feb 28, 2016, 7:00:33 AM
to fuel-users
I set up a data stream to get batches of images from DogsVsCats. Then I used the following code to measure how long it takes to get each batch from the stream over an epoch.

import timeit
start_time = timeit.default_timer()
for x, y in data_stream.get_epoch_iterator():
    print(timeit.default_timer()-start_time)
    start_time = timeit.default_timer()

And this is what I got as the time between batches:

0.115445137024
0.0785758495331
0.066997051239
0.0754771232605
0.0781049728394
0.0756139755249
0.0682671070099
0.087070941925
0.0688378810883
0.0686841011047
0.0885288715363
0.0747940540314
0.0728580951691
0.0670669078827
0.0878319740295
0.0848870277405
0.0787830352783
0.0777719020844
0.0734760761261
7.66090202332
12.0970921516
13.2528531551
13.7659142017
13.7945039272
11.2318739891
12.5979430676

At the beginning of the iteration it was fast and then all of a sudden it hit a wall and became really slow.

I am not using ServerDataStream, so there shouldn't be a buffering issue there.

This is the data_stream in question:

# Let's load and process the dataset
from fuel.datasets.dogs_vs_cats import DogsVsCats
from fuel.streams import DataStream
from fuel.schemes import ShuffledScheme
from fuel.transformers.image import RandomFixedSizeCrop
from fuel.transformers.image import MinimumImageDimensions
from fuel.transformers import Flatten
from fuel.transformers import ScaleAndShift

train_set = DogsVsCats(('train',), subset=slice(0, 20000))
valid_set = DogsVsCats(('train',), subset=slice(20000, 25000))
test_set = DogsVsCats(('test',))

batch_size = 128
n_train_batches = train_set.num_examples // batch_size

###############################################################################
# Train Stream
# We now create a "stream" over the dataset which will return shuffled batches
# of size 128. By using the DataStream constructor instead of
# DataStream.default_stream, we get our images back exactly as stored.
stream = DataStream(
    train_set,
    iteration_scheme=ShuffledScheme(train_set.num_examples, batch_size)
)

# Our images are of different sizes, so we'll use a Fuel transformer
# to upscale each image to at least (256 x 256)
upscale_stream = MinimumImageDimensions(
    stream, (256, 256), which_sources=('image_features',))

# Take random crops of (32 x 32) from each image
cropped_stream = RandomFixedSizeCrop(
    upscale_stream, (32, 32), which_sources=('image_features',))

# Convert images to the [0, 1] scale
default_cropped_stream = ScaleAndShift(
    cropped_stream, 1.0 / 255.0, 0., which_sources=('image_features',))

# We'll use a simple MLP, so we need to flatten the images
# from (channel, width, height) to simply (features,)
train_stream = Flatten(
    default_cropped_stream, which_sources=('image_features',))


What could possibly be the problem here?


ppk...@gmail.com

Mar 3, 2016, 2:40:37 PM
to fuel-users, ppk...@gmail.com
It seems the slowdown was caused by slow local storage. Once I moved fuel_data to /scratch, the problem was gone.
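For anyone hitting the same thing: a quick way to check whether the drive itself is the bottleneck, independent of Fuel and the transformers, is to time raw sequential reads from the file that holds the dataset. This is just a sketch I put together, not Fuel code; the helper name, chunk size, and demo file are mine (point the function at the dataset's HDF5 file to test a real drive, and note the first read may be served from the OS page cache).

```python
import os
import tempfile
import timeit


def read_throughput_mb_s(path, chunk_size=1 << 20):
    """Read `path` sequentially in `chunk_size` chunks; return MB/s."""
    total = 0
    start = timeit.default_timer()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = timeit.default_timer() - start
    return (total / 1e6) / max(elapsed, 1e-9)


# Demo on a throwaway 8 MB scratch file; swap in the dataset's HDF5
# file (e.g. the one under fuel_data) to benchmark the actual drive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))
    path = tmp.name

print('%.1f MB/s' % read_throughput_mb_s(path))
os.remove(path)
```

If the throughput from the dataset's location is an order of magnitude below the local scratch disk's, the storage is the problem, not the stream.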