This was on the MNIST dataset. It was unsupervised, so I only used the images and ignored the ground-truth labels.
Given an image, the activations of the nodes in the output layer would not be identical, thanks to random weight initialization. The WTA would pick the most active node. I wanted the winner to specialize in the pattern that let it win while suppressing the others. The idea is that as a node specializes in one pattern, its activation for other patterns drops, so a different node wins the competition for those. Eventually each node specializes in a different pattern. You can think of it as a form of clustering.
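The post doesn't include the update rule, but the hard-WTA idea can be sketched as a standard competitive-learning step: the winning node's weights move toward the input so it responds even more strongly to similar patterns next time. This is a minimal sketch under my own assumptions (single linear layer, move-toward-input update, hypothetical names like `wta_step`), not the author's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_nodes = 784, 10  # MNIST-sized input, a handful of output nodes
# random initialization breaks ties between otherwise identical nodes
W = rng.normal(scale=0.01, size=(n_nodes, n_inputs))

def wta_step(x, W, lr=0.05):
    """One hard-WTA update: only the most active node learns."""
    activations = W @ x
    winner = int(np.argmax(activations))   # hard winner-take-all
    W[winner] += lr * (x - W[winner])      # winner moves toward the input
    return winner

# toy usage with random "images" standing in for MNIST digits
for _ in range(100):
    x = rng.random(n_inputs)
    wta_step(x, W)
```

With this update only the winner changes, which is what's supposed to drive the clustering behavior described above.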
What happened is that at first the winner of the WTA changed with different inputs, which makes sense. After some more iterations, though, one node would win more and more often, and eventually it always won. Whatever the pattern, only that one node would fire. It's as if that node had learned the average of all patterns, keeping its activation higher than every other node's. I checked: the activations of the other nodes were pretty low.
This behavior didn't make much sense. There might have been an implementation error somewhere.
I also tried a soft WTA: the activations would form a distribution from which the WTA would sample the winner. This introduced stochastic noise into the competition. Unfortunately, it didn't change the odd behavior.
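A common way to turn activations into a sampling distribution is a temperature-scaled softmax; this is a sketch of that assumption (the post doesn't say which distribution was used, and the function name `soft_wta` is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_wta(activations, temperature=1.0):
    """Sample the winner from a softmax over activations instead of taking argmax.

    Lower temperature -> closer to hard WTA; higher -> more random winners.
    """
    z = np.asarray(activations, dtype=float) / temperature
    z = z - z.max()                        # stabilize the exponentials
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

# usage: at low temperature the most active node nearly always wins
winner = soft_wta([0.1, 2.0, 0.3], temperature=0.05)
```

Even with this noise injected, the post reports the same collapse onto a single winner, which suggests the problem was in the weight updates rather than in the selection rule.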
I call this a hack because I didn't implement it as a pooling layer; I actually did it in the Conv layer. When initializing the label variable in the forward() and backward() methods, I ignored the actual ground-truth label and instead set the label to the argmax of the activations. That argmax was my hard WTA.
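Structurally, the hack amounts to substituting a pseudo-label inside the layer. A minimal sketch of that pattern, under my own assumptions (a toy class with a `forward()` that computes per-node activations; the class and attribute names are hypothetical, not the author's code):

```python
import numpy as np

class ConvWithWTAHack:
    """Toy layer illustrating the label-substitution hack: inside forward(),
    the ground-truth label is ignored and the argmax of the activations is
    stored as the label, turning the layer into a hard WTA."""

    def __init__(self, n_nodes, n_inputs, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_nodes, n_inputs))
        self.label = None

    def forward(self, x, label=None):
        activations = self.W @ x
        # the hack: ignore the `label` argument entirely and use the
        # argmax of the activations as the pseudo-label (the hard WTA)
        self.label = int(np.argmax(activations))
        return activations
```

Because the layer's learning signal now comes from its own argmax rather than the data's labels, any update rule downstream sees a self-reinforcing target, which may be related to the collapse onto a single always-winning node.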