Unreasonably slow data access from streams.get_epoch_iterator()


ppk...@gmail.com

Feb 28, 2016, 7:00:33 AM
to fuel-users
I set up a data stream to get batches of images from DogsVsCats. Then I used the following code to measure how long it takes to get each batch from the stream over an epoch.

import timeit
start_time = timeit.default_timer()
for x, y in data_stream.get_epoch_iterator():
    print(timeit.default_timer()-start_time)
    start_time = timeit.default_timer()

And this is what I got as the time between batches:

0.115445137024
0.0785758495331
0.066997051239
0.0754771232605
0.0781049728394
0.0756139755249
0.0682671070099
0.087070941925
0.0688378810883
0.0686841011047
0.0885288715363
0.0747940540314
0.0728580951691
0.0670669078827
0.0878319740295
0.0848870277405
0.0787830352783
0.0777719020844
0.0734760761261
7.66090202332
12.0970921516
13.2528531551
13.7659142017
13.7945039272
11.2318739891
12.5979430676

At the beginning of the iteration it was fast and then all of a sudden it hit a wall and became really slow.

I am not using ServerDataStream, so there shouldn't be a buffering issue there.

This is the data_stream in question:

# Let's load and process the dataset
from fuel.datasets.dogs_vs_cats import DogsVsCats
from fuel.streams import DataStream
from fuel.schemes import ShuffledScheme
from fuel.transformers.image import RandomFixedSizeCrop
from fuel.transformers.image import MinimumImageDimensions
from fuel.transformers import Flatten
from fuel.transformers import ScaleAndShift

train_set = DogsVsCats(('train',), subset=slice(0, 20000))
valid_set = DogsVsCats(('train',), subset=slice(20000, 25000))
test_set = DogsVsCats(('test',))

batch_size = 128
n_train_batches = train_set.num_examples // batch_size

###############################################################################
# Train Stream
# We now create a "stream" over the dataset which will return shuffled batches
# of size 128. By using the DataStream constructor instead of
# DataStream.default_stream, we get our images back exactly as stored.
stream = DataStream(
    train_set,
    iteration_scheme=ShuffledScheme(train_set.num_examples, batch_size)
)

# Our images are of different sizes, so we'll use a Fuel transformer
# to upscale each image to at least (256 x 256)
upscale_stream = MinimumImageDimensions(
    stream, (256, 256), which_sources=('image_features',))

# Take random crops of (32 x 32) from each image
cropped_stream = RandomFixedSizeCrop(
    upscale_stream, (32, 32), which_sources=('image_features',))

# Convert images to the [0, 1] scale
default_cropped_stream = ScaleAndShift(
    cropped_stream, 1.0 / 255.0, 0., which_sources=('image_features',))

# We'll use a simple MLP, so we need to flatten the images
# from (channel, width, height) to simply (features,)
train_stream = Flatten(
    default_cropped_stream, which_sources=('image_features',))


What could possibly be the problem here?


ppk...@gmail.com

Mar 3, 2016, 2:40:37 PM
to fuel-users, ppk...@gmail.com
It seems the slowdown was caused by slow local storage. Once I moved fuel_data to /scratch, the problem was gone.
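For anyone hitting the same thing: a quick way to check whether the drive itself is the bottleneck, independent of Fuel and the transformers, is to time raw sequential reads from the file that holds the dataset. This is just a sketch I put together, not Fuel code; the helper name, chunk size, and demo file are mine (point the function at the dataset's HDF5 file to test a real drive, and note the first read may be served from the OS page cache).

```python
import os
import tempfile
import timeit


def read_throughput_mb_s(path, chunk_size=1 << 20):
    """Read `path` sequentially in `chunk_size` chunks; return MB/s."""
    total = 0
    start = timeit.default_timer()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = timeit.default_timer() - start
    return (total / 1e6) / max(elapsed, 1e-9)


# Demo on a throwaway 8 MB scratch file; swap in the dataset's HDF5
# file (e.g. the one under fuel_data) to benchmark the actual drive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))
    path = tmp.name

print('%.1f MB/s' % read_throughput_mb_s(path))
os.remove(path)
```

If the throughput from the dataset's location is an order of magnitude below the local scratch disk's, the storage is the problem, not the stream.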