I've also struggled with this. It helps to encode the colors as words in your mind if you are only doing position/color, but that might become difficult in higher-modality n-back, where you also need to rely on your auditory processing for the sounds.
I suspect there might also be a way to "visualize" the colors in some spatial memory buffer, but either my buffer is terribly bad or it's not really possible, since my scores drop drastically whenever I move away from encoding the colors with words.