Any time you add a blob, memory usage will increase. The layer introduces the blob "dataaugmentation". I'm guessing "data" is first loaded onto the GPU; the data augmentation layer then copies the data back to the CPU to perform its augmentation, and the result is copied to the GPU again. I'm not sure that's the entire story, and I don't know by how much the memory usage increases.
You might want to try having a Python layer read in the data itself. It might be difficult to make that code as efficient as the C++ code, though.
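As a rough sketch of the CPU-side augmentation such a Python layer could do (the helper name and the mirror-only augmentation are mine, not from your setup; inside a Caffe Python layer you would call something like this in forward() before filling top[0].data):

```python
import numpy as np

def augment_batch(batch, rng=None):
    """Randomly mirror each image in an (N, C, H, W) batch on the CPU.

    Hypothetical helper: the idea is to do the augmentation once, in
    NumPy, before the data is ever copied to the GPU, avoiding the
    GPU -> CPU -> GPU round trip.
    """
    rng = rng or np.random.default_rng()
    out = batch.copy()
    flip = rng.random(batch.shape[0]) < 0.5  # mirror ~half the images
    out[flip] = out[flip, :, :, ::-1]        # horizontal flip (width axis)
    return out

batch = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)
aug = augment_batch(batch)
print(aug.shape)
```

Since everything stays in host memory until the net's data blob is filled, no extra GPU blob is needed for the augmented copy.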
You could also try making the computation in-place, although I think the data would still be copied to the GPU before the augmentation.
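The usual Caffe convention for an in-place layer is to give it the same bottom and top blob name, so no second blob is allocated. A hypothetical prototxt fragment (layer name and type are illustrative, and a Python layer would also need its python_param block):

```protobuf
layer {
  name: "dataaugmentation"
  type: "Python"     # or whatever type your augmentation layer is
  bottom: "data"
  top: "data"        # same name as bottom => in-place, no extra blob
}
```

This removes the extra blob's memory, but as said above it doesn't avoid the copies between CPU and GPU.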
Cheers,
Jonathan