1. With a 20-layer ResNet, my test accuracy is around 89%, which is still ~2% behind the result reported in the paper. I wonder what I can do to further improve this result.
The training set is split into 45,000 training examples and 5,000 validation examples.
I used 'he_normal' initialization with no bias for every convolution layer.
Data augmentation is performed using Keras's ImageDataGenerator.
I used a batch size of 100. Changing it to 200 does not change the performance much.
I used the 'adam' optimizer.
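One change worth trying to close the gap: the paper does not use Adam but plain SGD with momentum 0.9, weight decay 1e-4, and a step learning-rate schedule (start at 0.1, divide by 10 at 32k and 48k iterations in the CIFAR-10 runs). A minimal sketch of that schedule (the function name is my own; the boundaries are my reading of the paper):

```python
def paper_lr_schedule(iteration, base_lr=0.1):
    # Step schedule from the ResNet paper's CIFAR-10 experiments
    # (assumption: divide by 10 at 32k and 48k mini-batch iterations).
    if iteration < 32000:
        return base_lr
    elif iteration < 48000:
        return base_lr / 10
    else:
        return base_lr / 100
```

A function like this can be hooked into training via Keras's LearningRateScheduler callback, converting iterations to epochs for your batch size.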
I implemented my own ResNet. It is a straightforward implementation that follows the structure described in the paper. The following code constructs the residual basic block:
def create_res_basicblock(input_shape, ksize, n_feature_maps, reduce_first):
    x = Input(shape=input_shape)

    # identity path
    ss = (1, 1)
    xx = x
    if reduce_first:
        ss = (2, 2)
        xx = AveragePooling2D(pool_size=(2, 2), dim_ordering='th')(x)

    # pad with zero channels when the number of feature maps grows
    if n_feature_maps > input_shape[0]:
        pad = Convolution2D(n_feature_maps - input_shape[0], 1, 1,
                            border_mode='same', bias=False, init='zero')
        # trainable must be set on the layer object before it is called;
        # setting it on the output tensor has no effect
        pad.trainable = False  # frozen zero weights: no training, just zero padding
        xx = merge([xx, pad(xx)], mode='concat', concat_axis=1)

    # residual path
    # note: mode=0 uses running averages at test time; mode=2 normalizes with
    # per-batch statistics even at test time, which hurts test accuracy
    residual = Convolution2D(n_feature_maps, ksize, ksize, border_mode='same',
                             init='he_normal', bias=False, subsample=ss)(x)
    residual = BatchNormalization(axis=1, mode=0)(residual)
    residual = Activation('relu')(residual)
    residual = Convolution2D(n_feature_maps, ksize, ksize, border_mode='same',
                             init='he_normal', bias=False)(residual)
    residual = BatchNormalization(axis=1, mode=0)(residual)

    y = merge([xx, residual], mode='sum')
    z = Activation('relu')(y)
    return Model(input=x, output=z)
2. I googled and found several ResNet implementations in Torch. I wonder if the two-percent difference is due to differences between Keras and Torch?
3. I used a machine with a K20 card, and it took about 6-8 hours to train for 200 epochs (I am not alone on the machine). In the paper, the authors trained ResNet for more than 30,000 "iterations". I wonder if an "iteration" in the paper is the same as an epoch in Keras/Theano. Do they use a really powerful computer, or is Torch much faster than Keras/Theano?
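For what it's worth, an "iteration" in the paper is one mini-batch update, not a full pass over the data, so with the split and batch size described above the conversion is simple arithmetic:

```python
# One "iteration" = one mini-batch update, not an epoch.
train_size = 45000        # training examples after the validation split above
batch_size = 100
iters_per_epoch = train_size // batch_size   # mini-batch updates per epoch

# 30,000+ iterations then corresponds to roughly this many epochs:
epochs = 30000 / iters_per_epoch
print(iters_per_epoch, epochs)
```

So 30,000 iterations is on the order of 67 epochs here, i.e. far fewer passes over the data than 30,000 epochs would be.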