[apologies, max, re-posting as the google group list isn't configured
correctly. i've seen this before: the headers aren't set up with the
right "Reply-To", so when i hit reply - ironically with gmail - gmail
picks my *default* From address... which is not one that's subscribed
to the list! doh!]
On Sun, Jan 21, 2018 at 6:35 PM, Max Hayden Chiz <
max....@gmail.com> wrote:
> Creating a RISC-V GPU with GPGPU capabilities is a good idea.
it would be absolutely amazing, wouldn't it?
> But I was
> putting it under the general "make it better" category since I was confining
> my list to microarchitectural things you could do for our existing
> processors. And because designing a whole new chip is a lot more work than
> adding a microarchitectural refinement, I think it might be more of a PhD
> thing. (But could be wrong.)
the... ah... what's the name... the ORSoC Graphics Accelerator guys -
they were a two-man MSc team, and got a heck of a lot done.
i'm a big fan of not redoing work that's already been done, i tend to
find wildly disparate systems and work out the minimum amount of work
needed to join them together. that way you draw on the expertise of
more people, and they appreciate it a lot.
> Also, for GPGPU, you'd ideally want it to be as
> compatible as possible with the forthcoming vector instruction set.
yyeah that's a tough ask in the case of nyuzi. more on this below.
> So you'd
> want to look at Hwacha and our eventual vector unit in addition to Nyuzi
> when you were coming up with the design.
... except nyuzi has been published, and hwacha hasn't. at least,
there are no public repositories that i can easily find - just a
couple of web pages. nyuzi, on the other hand, has two follow-on
research papers, a complete repository, and full documentation.
i've been in touch with jeff and he's a lot of fun to talk to,
extremely knowledgeable. he's done some amazing analysis and has a
clear understanding and breakdown of the tasks and number of
instructions per task in each phase of modern (shader-based) 3D
Graphics Processing. he also makes it clear that the metric to focus
on, for optimisation and evaluation purposes, is "instructions per
pixel".
> For non-vectorizable loops, we have lots of different possible options and
> no one knows the best way to handle them or what the trade-offs are.
well, this is where jeff's approach - and focus - would come in
handy. and whilst i appreciate that hwacha may, technically, have a
better approach, the fact that nobody outside of the group can *look*
at what they're doing means that, in my mind, sadly it is off the
table for consideration.
> The area is largely unexplored
*precisely*. it's... *sigh* we (royal we) kinda left it a bit late
in the game, letting the incumbent proprietary companies get what... a
20-year head start? it reminds me of a professor i met once who left
Seagate: within that *one* company - and all the hard drive
manufacturers reverse-engineer each others' drives down to the
molecular level - they're *literally* 20 years ahead of academia in
electro-magnetism. he said he just couldn't stand how they were
keeping all that knowledge secret. so... he left, and has been
publishing papers ever since.
> and getting the various research ideas into the
> tree would help future researchers because it would make it possible to give
> a solid comparison.
very much so.
> E.g. Ideally you'd want to know how these accelerators stack up to something
> simpler like generalized chaining
> (https://snehasish.net/docs/sharifian-micro16.pdf) which could either be an
> ISA extension or could just be an under-the-hood thing for hot traces picked
> up by a trace-cache. (The idea here is that you group instructions with
> single dependencies into chains so that you amortize the decoding cost
> and avoid the communication and scheduling cost of handling them
> separately.)
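niiice. if i've understood it right - and this is purely my own rough
python sketch of the *concept*, with an invented instruction format,
not the paper's actual algorithm - the grouping would look something
like this:

# rough sketch of the "chaining" idea as i understand it: when an
# instruction's result has exactly one consumer, and that consumer
# has exactly one in-trace producer, fuse the pair into one chain so
# it can be decoded and scheduled as a single unit. the
# (dest, src1, src2) tuple format is invented; register re-definition
# is ignored for simplicity.
from collections import defaultdict

trace = [                      # a made-up hot trace
    ("r1", "r0", "r0"),        # 0
    ("r2", "r1", "r9"),        # 1
    ("r3", "r2", "r2"),        # 2
    ("r4", "r0", "r9"),        # 3
    ("r5", "r3", "r4"),        # 4
]

consumers = defaultdict(list)  # producer index -> consumer indices
producers = defaultdict(set)   # consumer index -> in-trace producers
for i, (dest, _, _) in enumerate(trace):
    for j in range(i + 1, len(trace)):
        if dest in trace[j][1:]:
            consumers[i].append(j)
            producers[j].add(i)

chains, used = [], set()
for i in range(len(trace)):
    if i in used:
        continue
    chain = [i]
    used.add(i)
    # each single-consumer / single-producer hop is a link whose
    # decode, scheduling and bypass cost can be amortised away
    while (len(consumers[chain[-1]]) == 1
           and len(producers[consumers[chain[-1]][0]]) == 1
           and consumers[chain[-1]][0] not in used):
        nxt = consumers[chain[-1]][0]
        chain.append(nxt)
        used.add(nxt)
    chains.append(chain)

print(chains)  # -> [[0, 1, 2], [3], [4]]

(the point being: everything inside a chain gets decoded and scheduled
once, as a unit, instead of instruction-by-instruction.)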
ha, i told jeff about how Ingenic did it when they added X-Burst to
their ultra-low-power MIPS processor. he loved it, and had a lot of
respect for what they did. i'll explain why: X-Burst is a Vector SIMD
pipeline that they usually run at 500mhz while the main processor runs
at 1ghz. get this: they get 30 million triangles per second... by
running awk/sed macros over the standard mesagl software library,
doing pattern-matching on the c code and grafting / substituting
X-Burst assembly code in its place! frickin awesome hack or _what_? :)
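(to be clear, i've never seen their actual scripts. but the *style* of
the hack, mocked up in python instead of awk/sed, would be something
like the following - the loop pattern and the "xburst_transform_verts"
replacement are entirely invented by me; the real scripts spliced in
actual X-Burst assembly:)

# toy mock-up of the ingenic-style source-grafting hack: scan C
# source for a recognisable scalar inner loop and splice an
# accelerated replacement in its place. pattern and replacement
# are hypothetical.
import re

c_source = """
for (i = 0; i < count; i++) {
    out[i] = m[0] * v[i].x + m[1] * v[i].y + m[2] * v[i].z + m[3];
}
"""

scalar_loop = re.compile(
    r"for \(i = 0; i < count; i\+\+\) \{\s*"
    r"out\[i\] = m\[0\] \* v\[i\]\.x.*?\}",
    re.DOTALL,
)

patched = scalar_loop.sub(
    "xburst_transform_verts(out, m, v, count);  /* grafted */",
    c_source,
)
print(patched)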
but here's the thing: when you try adding this stuff "properly" to,
say, gcc, you only have to look at how long it took to get altivec
support into gcc to know that it's just... yeah, it's too much. so
instead jeff focussed on a specialised set of tools, and on adding
LLVM support for the Nyuzi (general-purpose) instruction set, and left
it at that. sometimes it's easier to do that, y'know?
> The "behavior specialized accelerators" in the above papers get more
> performance, save more power, and work for more loops (in particular loops
> with unpredictable branches), but as-is they require a compiler to take some
> profiling information and then generate special dataflow instructions for
> the loop accelerators.
yehyeh, which is a whooole research area on its own. this is one of
the reasons why i suggested nyuzi, because jeff and the team he's
worked with did all that, already, some years back. that's not to say
that *everything* is done - far from it: for the published papers the
team *specifically* focussed on the core algorithms of 3D engines (the
inner loops). but they also got actual 3D rendering demos up and
running which is a huge achievement (teapot, quake, others).
so "bang-per-buck" wise (and also "getting up and running fast"-wise)
a conversion of nyuzi's processing front-end to RISC-V would be a
higher "return on investment" than anything else i know of [that's
publicly available]. MIAOW has a totally different focus: it's
*specifically* compatible with ATI (now AMD)'s OpenCL Engine, and it
would be... unfair to take that achievement away, because you'd be
throwing away the opportunity to utilise an entire pre-existing
*well-tested* - and proven - toolchain.
apologies - i think in these kinds of pragmatic, practical terms,
taking into consideration both the hardware *and* software aspects,
based on what's available and has already been done, *right now*.
> That's probably fine for embedded use. But for
> general purpose use, you either need to come up with a good ISA extension
> that isn't directly tied to the hardware implementation or (ideally) we'd
> have some kind of loop detector that would detect when we were in an
> acceleratable loop and do the conversion itself.
the warning here, if i may make one, comes in the form of Imagination
Technologies' absolute f****** dog's-dinner
cluster****-of-an-architecture. it's. universally. HATED. by.
engineers.
luc verhaegen's "state of free software graphics" talk from, i think,
2014 is the most informative and insightful, but i also know a little
bit about the architecture's background. it was developed as a
general-purpose processor by an Imperial College professor, some time
over 20 years ago, and it was supposed to be "flexible" as well as
powerful. however, the level of control-freak-ism adopted by ImgTec,
in combination with the many, many changes made *per customer*, left
the code in such an absolute mess that not even ImgTec's own engineers
could properly understand it... *even though* they charged customers
USD $150,000 for access to the source code... under NDA... with nobody
outside the company permitted to talk about it.
it's the absolute worst of all worlds, and it comes down to an attempt
to turn a general-purpose processor into a specialist,
heavily-customisable, 3D-capable one - then trying to keep it
proprietary, preventing and prohibiting any and all discussion amongst
the *top* researchers and experts in the world who could... y'know...
actually HELP?!?!
so there is a warning there. it could all be absolutely fine, *as long
as people communicate* - but also: do not, for goodness sake, expect
general-purpose compilers like gcc to do all the heavy lifting.
bottom line is, for a first implementation it's *okay* to use
hilarious awk/sed scripts, or raw assembly blocks, or to call out to
DMA-based hard macros. this _is_ 3D after all, y'know?
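(for the "DMA-based hard macro" option, the calling convention really
is just: fill a buffer, poke a few registers, wait for a completion
bit. a python-flavoured sketch, against a *completely hypothetical*
register map - every address and offset below is invented, and a real
system would do this from a kernel driver:)

# sketch of driving a hypothetical DMA-based rasteriser hard macro.
# addresses and register offsets are invented for illustration;
# needs root and real hardware behind /dev/mem to actually run.
import mmap, os, struct, time

RASTER_BASE  = 0x4000_0000   # hypothetical MMIO base of the block
REG_SRC_ADDR = 0x00          # physical address of the triangle list
REG_COUNT    = 0x04          # number of triangles
REG_CTRL     = 0x08          # write 1 to kick off the DMA
REG_STATUS   = 0x0C          # bit 0 set when the block is done

fd   = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
regs = mmap.mmap(fd, 4096, offset=RASTER_BASE)

def write_reg(off, val):
    regs[off:off + 4] = struct.pack("<I", val)

def read_reg(off):
    return struct.unpack("<I", regs[off:off + 4])[0]

write_reg(REG_SRC_ADDR, 0x8010_0000)  # triangles already placed here
write_reg(REG_COUNT, 1024)
write_reg(REG_CTRL, 1)                # go!
while not read_reg(REG_STATUS) & 1:   # poll until the macro is done
    time.sleep(0.001)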
> OTOH, from an amdahl's law perspective the performance boost being reported
> is too low. They are being selected for 60-80% of the computation time but
> are throughput saturated. While this gives a significant performance boost
> for smaller processors, the numbers indicate that provisioning them with
> more hardware is a net-win. (And doing that might make them beneficial to
> larger processors as well.)
die photos of GPUs show that they're absolutely eeenooormous.
> So I'd like to see some better treatment of the idea and think we'd be able
> to do that in the RISC-V context.
oh, the other thing is: nyuzi does *not* have any specialist optimised
acceleration blocks. it's *very* deliberately focussed on being a
*software-only* general-purpose processor, with performance on OpenCL
that rivals MALI engines.
> The accelerators clearly have benefits in
> the embedded context.
oh! yes, for things like crypto, definitely: an embedded co-processor
is essential, as a general-purpose low-power 1ghz processor would just
be completely overwhelmed otherwise.
but for video it's really, really important too: the power savings
from using vaapi on an intel laptop - particularly this one i'm using,
which has a 3000x1800 LCD - are insane. i get something like 25% CPU
usage @ 800mhz per core with vaapi; without it, it's more like 50-60%
CPU usage, and cpufreq has to bump things up to 1.2-1.6ghz. it means i
can watch a 2-hour 720p film without the battery running out - or
without getting burned by the back of the laptop!
... and that's a *modern* skylake quad-core i7. accelerators aren't
just good for embedded use-cases, is my point.
> But working out what to do with them in the context of
> say 4-way BOOM is going to need more research.
yyeah, which is, i feel, a good reason for keeping things separated in
some way. i'm *partly* talking my way out of recommending the idea /
suggestion that i had, but if someone doesn't start a 3D / OpenCL
Vector Processor for RISC-V, we'll never _get_ to the point where
4-way BOOM research into Vector Processing could even start, ehn?
:)
oh, before i forget, that reminds me of something, which i should
start as a separate thread, if that's ok.