That's really a question for the hardware folks... Relaxed memory ordering performs better because the cores in a manycore machine can avoid doing extra work, such as stalling until their earlier stores have become visible to the other cores.
Suppose an architecture implemented a sequentially consistent memory model. You would only need to perform atomic loads and stores, with no memory barriers at all, so it is easier for you, the programmer. BUT the hardware underneath then has to enforce that ordering for you on EVERY load and store you perform, as if a fence were attached to each one.
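To make the "easy for the programmer" side concrete, here is a minimal C++ sketch. It uses `std::atomic` with its default ordering (`memory_order_seq_cst`) to stand in for a sequentially consistent machine. The names `data_sc`, `ready_sc` and `run_seq_cst` are made up for illustration. No explicit fences appear anywhere, yet the consumer is guaranteed to see the value written before the flag.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
std::atomic<int>  data_sc{0};
std::atomic<bool> ready_sc{false};

// Publishes a value from one thread to another using only
// default (sequentially consistent) atomic loads and stores.
int run_seq_cst() {
    data_sc.store(0);
    ready_sc.store(false);
    int seen = -1;

    std::thread producer([] {
        data_sc.store(42);     // seq_cst by default
        ready_sc.store(true);  // seq_cst by default
    });
    std::thread consumer([&seen] {
        // Spin until the flag is set; no explicit barrier needed.
        while (!ready_sc.load()) {
            std::this_thread::yield();
        }
        seen = data_sc.load();
    });

    producer.join();
    consumer.join();
    assert(seen == 42);  // ordering is guaranteed by seq_cst
    return seen;
}
```

Every one of those loads and stores carries the full ordering guarantee, whether or not the algorithm needs it at that point.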
Next, one realizes that ordering memory actions between different processors is not essential for every load and store, only for the few places where you actually need it...
So the hardware guys say to you:
Well, you get the cheap loads and stores you want. And if you want to order memory actions, I give you these extra memory barrier instructions (which are costly compared to other instructions), and you have to place them where they are needed.
So, the hardware runs fast in the common case and CAN also run fast in the case where synchronization is needed if you succeed in placing only the required memory fences and no more.
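The same idea can be sketched in C++, assuming `std::atomic_thread_fence` as the "extra memory barrier instruction" and relaxed atomics as the "cheap loads and stores". The names `payload`, `flag` and `run_with_fences` are hypothetical. Only two fences are placed, exactly where the ordering matters: one after writing the data, one before reading it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                 // plain, non-atomic data
std::atomic<bool> flag{false};   // accessed with relaxed ordering

// Publishes `payload` using cheap relaxed accesses plus two
// explicitly placed fences.
int run_with_fences() {
    payload = 0;
    flag.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;  // ordinary store
        // Release fence: the write to payload cannot be
        // reordered after the flag store below.
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(true, std::memory_order_relaxed);
    });
    std::thread consumer([&seen] {
        // Cheap relaxed loads while spinning.
        while (!flag.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acquire fence: the read of payload cannot be
        // reordered before the flag load above.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(seen == 42);  // guaranteed by the fence pairing
    return seen;
}
```

Dropping either fence makes the read of `payload` a data race, which is the kind of mistake this model leaves up to you to avoid.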
This certainly adds to the complexity a programmer has to face when implementing a lock-free algorithm, and that's why you don't like it! :)