1. Yes, this is exactly how it works!
2. No difference at all from the mathematical point of view (see the first sketch below). It may be more convenient to concatenate, as you have fewer blobs to care about (and to access from Python, for example). On the backend side, Concat means an extra data copy and more RAM used (for the intermediate, concatenated blob); however, the loss-function GPU kernel may execute slightly more efficiently, as a single kernel over a larger tensor rather than two kernel launches over smaller tensors.
3. If you concatenate A with itself, it receives the sum of the gradients coming from the top (though technically it is still updated just once). The actual value of the update will of course depend on what happens above the concat, but in the trivial case of concatenating A with itself along dim 0 and then immediately computing the loss against labels concatenated the same way, the update is simply twice as large (see the second sketch below).
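Here is a minimal sketch of the equivalence in point 2. PyTorch is used only as a stand-in (an assumption; the original question is framework-agnostic), with a sum-reduced squared error so the numbers match exactly: one loss over the concatenated blob equals the sum of the two separate losses.

```python
import torch

# Two "blobs" and their targets.
A, B = torch.randn(4, 3), torch.randn(2, 3)
tA, tB = torch.randn(4, 3), torch.randn(2, 3)

# One loss over the concatenated blob ...
loss_concat = ((torch.cat([A, B]) - torch.cat([tA, tB])) ** 2).sum()

# ... versus two separate losses summed afterwards.
loss_split = ((A - tA) ** 2).sum() + ((B - tB) ** 2).sum()

print(torch.allclose(loss_concat, loss_split))  # True
```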
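And a sketch of point 3, again with PyTorch as an illustrative stand-in: concatenating A with itself along dim 0 and computing the loss against similarly concatenated labels gives A exactly twice the gradient it would get from the single-copy loss.

```python
import torch

# A small parameter tensor that we concatenate with itself along dim 0.
A = torch.randn(3, 2, requires_grad=True)
cat = torch.cat([A, A], dim=0)

# Labels concatenated the same way, so each copy sees the same target.
labels = torch.randn(3, 2)
target = torch.cat([labels, labels], dim=0)

loss = ((cat - target) ** 2).sum()  # trivial loss over the concatenated blob
loss.backward()                     # A.grad accumulates gradients from both copies

# Same loss computed on a single copy, for comparison.
B = A.detach().clone().requires_grad_(True)
loss_single = ((B - labels) ** 2).sum()
loss_single.backward()

print(torch.allclose(A.grad, 2 * B.grad))  # True: the gradient is exactly doubled
```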
Hope that helps!