Debunking the 100X GPU vs. CPU Myth:
An Evaluation of Throughput Computing on CPU and GPU
Interesting discussion.
In Borkar's model, future performance gains come mainly from having a very large number of threads. This will cause a lot of software problems, especially in cases of fine-grained parallelism, where frequent communication and synchronization between threads will degrade performance. I think such applications call for longer vector registers instead. The most important problems with long vector registers are, as I see it:
I have designed a new instruction set architecture to address these
problems. It has variable-length vector registers and a special
addressing mode and loop structure that make sure the same
software can run optimally on different CPUs with different vector
lengths without recompiling. It also takes data locality into
account. Moving data from one vector to the same position in
another vector typically takes one clock cycle, while the time for
horizontal movement of data from one vector position to another
depends on the vector length or the distance of movement.
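
To illustrate the idea of one binary running on CPUs with different
vector lengths, here is a minimal sketch in plain C. It is not the
actual instruction set or loop structure, only a software model of the
principle: the maximum vector length is a run-time value, and each pass
of the loop handles as many elements as the hardware allows. The names
vec_add, chunk_len and max_vector_len are made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: number of elements handled in one pass.
   Models a vector register whose used length is chosen at run time. */
static size_t chunk_len(size_t remaining, size_t max_vector_len)
{
    return remaining < max_vector_len ? remaining : max_vector_len;
}

/* Computes c[i] = a[i] + b[i] for n elements.
   max_vector_len stands in for the vector length of the CPU the code
   happens to run on (an assumption of this sketch).
   Returns the number of passes, i.e. simulated vector instructions. */
size_t vec_add(const float *a, const float *b, float *c,
               size_t n, size_t max_vector_len)
{
    size_t passes = 0;
    if (max_vector_len == 0)
        return 0;                      /* no usable vector unit */

    for (size_t i = 0; i < n; ) {
        size_t len = chunk_len(n - i, max_vector_len);

        /* Stands in for a single vector add of `len` elements.
           Each lane only touches its own position (vertical data
           movement), so no data crosses between lanes. */
        for (size_t lane = 0; lane < len; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];

        i += len;
        passes++;
    }
    return passes;
}
```

Calling vec_add with max_vector_len set to 4 or to 16 gives the same
contents in c; only the number of passes changes. The last pass simply
uses a shorter length, so no separate scalar loop is needed for the
remaining elements.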