Hi everyone,
I was wondering whether it is possible to train the VGGish architecture from scratch. Was the network trained on sounds of a fixed length, or did you use variable-length sounds? If the lengths were variable, I am trying to do the same, but I can't deal with the different input shapes (#patch, #frame, #bands): the number of patches changes from one sound to the next. Did you treat each patch as a separate sound, so that the number of patches was effectively the "batch size", or am I missing something?
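To make the question concrete, here is a minimal sketch of what I mean by treating the patches as the batch dimension. It assumes each sound has already been converted to an array of shape (#patch, #frame, #bands). The 96 x 64 patch size, the helper name flatten_patches, and the dummy clips are my own assumptions for illustration, not a description of how the model was actually trained.

```python
import numpy as np

NUM_FRAMES = 96  # assumed number of frames per patch
NUM_BANDS = 64   # assumed number of mel bands per frame


def flatten_patches(clips):
    """Stack clips of shape (num_patches_i, NUM_FRAMES, NUM_BANDS) along axis 0.

    Returns the stacked patches and, for every patch, the index of the clip
    it came from, so per-clip labels or predictions can be recovered later.
    """
    patches = np.concatenate(clips, axis=0)
    clip_ids = np.concatenate([np.full(len(c), i) for i, c in enumerate(clips)])
    return patches, clip_ids


# Three hypothetical sounds of different lengths, hence different patch counts.
clips = [np.zeros((n, NUM_FRAMES, NUM_BANDS), dtype=np.float32) for n in (3, 10, 1)]
patches, clip_ids = flatten_patches(clips)
print(patches.shape)  # (14, 96, 64)
```

With this layout every patch would be an independent training example, and clip_ids would only be needed to map patch-level outputs back to the original sound. Is that roughly what you did?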
Thank you for all your work,
Michele