
Federico Lucifredi

Mar 9, 2017, 11:28:07 PM
to computing-...@googlegroups.com
Another interesting paper out in recent months.

Debunking the 100X GPU vs. CPU Myth:
An Evaluation of Throughput Computing on CPU and GPU 

http://sbel.wisc.edu/Courses/ME964/Literature/LeeDebunkGPU2010.pdf



Kurt Keville

Mar 10, 2017, 9:10:28 AM
to computing-...@googlegroups.com
This is a nice study. I would submit that the GPU wall is a
relatively new phenomenon. Indeed, massive parallelism has allowed us
to navigate around the other three walls (Patterson's) for years. The
clever algorithm people (like Hank Dietz at Kentucky and Rich Brower at
BU) come up with 4x speed improvements on the CPU every 3 years, rather
than 2x every 18 months, so we jump back onto the Moore's Law roadmap,
although it is more punctuated equilibrium than classical Moore's-style
evolution.

Check out Borkar's 3rd slide at
https://www.nextplatform.com/2015/08/12/future-systems-intel-fellow-conjures-the-perfect-exascale-machine/
It is clear that to get back on a 45-degree trajectory we need to fix
the CPU side somehow.
> --
> You received this message because you are subscribed to the Google Groups
> "Computing Performance" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to computing-perfor...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> Best-F
>

Agner Fog

Mar 10, 2017, 12:34:31 PM
to computing-...@googlegroups.com

Interesting discussion.

In Borkar's model, the future increase in performance comes mainly from a very large number of threads. This will cause a lot of software problems, especially in cases of fine-grained parallelism, where heavy communication and synchronization between threads spoils the performance. I think such applications would call for longer vector registers instead. The most important problems with long vector registers are, as I see them:

  1. Longer physical distances on the CPU core make data transfer across a vector slower.
      
  2. Intel invents a new instruction set extension every time they increase the length of the vector registers. The software has to be recompiled for each new vector length. The cost of developing, testing, and maintaining a separate version of your software for each vector length is so high that it is rarely done. Most software on the market lags 5 years or more behind the hardware for this reason.
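The cost of fine-grained synchronization mentioned above can be sketched in a toy example (my own illustration, not from Borkar's slides; names like `fine_grained` are made up). Both variants compute the same sum, but the fine-grained version performs one lock operation per element, while the coarse-grained version communicates only once per thread:

```python
import threading

N, NTHREADS = 100_000, 4
lock = threading.Lock()
state = {"shared": 0, "locks_taken": 0}
partials = [0] * NTHREADS

def fine_grained(t):
    # Fine-grained parallelism: synchronize on every single element.
    for i in range(t, N, NTHREADS):
        with lock:
            state["shared"] += i
            state["locks_taken"] += 1

def coarse_grained(t):
    # Coarse-grained parallelism: no communication inside the loop,
    # one write per thread at the end.
    local = 0
    for i in range(t, N, NTHREADS):
        local += i
    partials[t] = local

for fn in (fine_grained, coarse_grained):
    threads = [threading.Thread(target=fn, args=(t,)) for t in range(NTHREADS)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

# Identical results, but N lock operations versus NTHREADS merges.
assert state["shared"] == sum(partials) == N * (N - 1) // 2
assert state["locks_taken"] == N
```

On real hardware the difference is far larger than this sketch suggests, since each synchronization also forces cache-line traffic between cores.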

I have designed a new instruction set architecture to address these problems. It has variable-length vector registers and a special addressing mode and loop structure that make sure the same software can run optimally on different CPUs with different vector lengths, without recompiling. It also takes data locality into account: moving data from one vector to the same position in another vector typically takes one clock cycle, while horizontal movement of data from one vector position to another depends on the vector length or the distance of the move.
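A rough emulation of that loop structure in plain Python (my own sketch of the idea as I understand it from the documentation; slices stand in for vector registers whose length the hardware would choose): the pointer addresses the end of the array and a negative index counts up toward zero in steps of the vector length, so the last iteration automatically gets a partial vector and the same loop works for any vector length.

```python
def vector_loop_sum(data, vl):
    """Sum `data`, processing up to `vl` elements per 'vector' iteration."""
    n = len(data)
    total = 0
    i = -n                       # negative index, counts up toward zero
    while i < 0:
        chunk = min(vl, -i)      # final iteration may be a partial vector
        # data[n + i : n + i + chunk] plays the role of a vector register
        total += sum(data[n + i : n + i + chunk])
        i += chunk
    return total

data = list(range(10))
# Same loop, different "hardware" vector lengths, same answer:
assert vector_loop_sum(data, 4) == sum(data)
assert vector_loop_sum(data, 8) == sum(data)
```

The point of the pattern is that the vector length appears only as a runtime quantity, never baked into the compiled binary, which is what removes the recompile-per-extension problem described above.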

This instruction set also has many other features for improved performance and security. It is all at an early stage of development: the instruction set has been defined, as well as ABI standards, etc., but nothing has been implemented yet. All documentation is at http://www.forwardcom.info