The cgo solution with embedded assembly adds far too much overhead. I managed to implement RDTSC both via direct asm and via cgo, and the difference is overwhelming: call-to-call, go-asm averages about 90 cycles, while cgo-asm averages about 1300 cycles.
Also, is there a way to allocate memory with a specific alignment (16 bytes, to be precise)? One solution I see is to allocate it in cgo and pass it back as an unsafe.Pointer. Is this the way to go?
I need all this to implement an ultra-low-latency interface to the VIA PadLock cryptographic coprocessor.