I think we can understand the 1×1 convolution as a pixel-wise linear classifier.
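For example, suppose the input feature map has shape (128, 500, 500).
If the output has shape (1, 500, 500), the convolution kernel would have shape (1, 128, 1, 1). This 128-dimensional vector is effectively a linear classifier applied to the 128 channel values at each pixel, with the same weights shared across all pixels.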
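
A quick way to check this numerically is the NumPy sketch below. It is only an illustration under a few assumptions of mine: the spatial size is shrunk to 50×50 so it runs fast, there is no bias term, and the names `x`, `kernel`, and `w` are made up for the example. It computes the 1×1 convolution as a weighted sum over channels, then computes a dot product between each pixel's 128-dimensional feature vector and the same weights, and compares the two.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: 50x50 instead of 500x500, just to keep the check quick.
C, H, W = 128, 50, 50
x = rng.standard_normal((C, H, W))           # input feature map
kernel = rng.standard_normal((1, C, 1, 1))   # (out_channels, in_channels, kH, kW)

# 1x1 convolution (no bias): at each pixel, a weighted sum over input channels.
conv_out = np.einsum("oc,chw->ohw", kernel[:, :, 0, 0], x)   # shape (1, H, W)

# Pixel-wise linear model: treat every pixel as a 128-d feature vector.
w = kernel.reshape(C)               # the 128-dimensional weight vector
pixels = x.reshape(C, H * W).T      # shape (H*W, C), one row per pixel
scores = pixels @ w                 # one score per pixel
linear_out = scores.reshape(1, H, W)

same = np.allclose(conv_out, linear_out)
print(conv_out.shape, same)
```

Strictly speaking, the output here is a per-pixel linear score. To get an actual class decision you would still threshold it or pass it through something like a sigmoid, and a real layer would usually add a bias as well.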
Correct me if I am wrong please.