I am trying to use the VGG19 implementation from
keras.applications to extract feature maps from images during preprocessing for the "Show, Attend & Tell" image captioning model (Paper, GitHub).
I've run into an issue with the dimensionality of the VGG output that makes the code crash during training.
According to the image captioning paper, they use "the 14×14×512 feature map of the fourth convolutional layer before max pooling" (p. 6). I find this a little ambiguous when comparing it to the VGG19 architecture (Paper, p. 3), but the very last convolutional layer, "block5_conv4", gets me closest to the expected dimensions. This layer (and the other layers in that block), however, returns a 15×15×512 tensor.
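For reference, here is a minimal sketch of how I set up the feature extractor (assuming the standard keras.applications VGG19 with its default 224×224 input shape; my actual preprocessing may feed images of a different size):

```python
from keras.applications import VGG19
from keras.models import Model

# VGG19 without the classifier head, using the standard 224x224 input size
base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Expose the last convolutional layer of block 5 as the feature map output
extractor = Model(inputs=base.input, outputs=base.get_layer("block5_conv4").output)

print(extractor.output_shape)  # (None, 14, 14, 512) for a 224x224 input
```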
I feel like I am missing something obvious. Does anyone have an idea where the extra row and column come from? How would I identify which rows/columns to exclude to match the expected 14×14×512 shape?
All best,
Matthias