The Hunt for a Universal Compiler Gets $16M


Klaus D. Witzel

Apr 10, 2009, 4:17:47 PM
to Moebius project discussion
Hi List,

just for the record, have a look at their goals below, especially
where it begins with: attempting to automatically ...

Cheers,
Klaus

- http://gigaom.com/2009/04/07/the-hunt-for-a-universal-compiler-gets-16m/

Rice University researchers have received a $16 million grant from the
Defense Advanced Research Projects Agency to develop a universal
compiler that can run on heterogeneous hardware and multicore
platforms, which are found in supercomputers and embedded systems,
such as those used in routers and game consoles. If the effort is
successful it could result in software that takes advantage of the
underlying hardware to create faster and more energy-efficient
computers and devices. As devices evolve, they contain more
semiconductors and use more power, and writing code that takes full
advantage of the hardware becomes more difficult. Programmers can
either customize code running on the heterogeneous hardware or use a
compiler that can translate the code into the 0s and 1s the specific
type of chip can process. However, developing custom code is time
consuming, expensive, and can only be applied to one specific device,
so most chipmakers or outside vendors develop compilers for each chip,
which can take three to five years, but can be used with the
chip in a variety of devices. The Rice-led project hopes to build a
universal compiler that will improve software performance by
developing a software suite that maps out the limitations and
capabilities of the hardware it is running on; creating a planned
compiler to examine the source code of the application and attempting
to automatically partition the source code to run on multicore
processors; and creating a runtime tool that measures the performance
of the application on the system, and possibly even changes the code
on the fly if the software is written for an x86 processor. Rice
University professor Keith Cooper says the researchers will have 54
months to achieve this objective.

Igor Stasenko

Apr 10, 2009, 4:43:24 PM
to moebius-proje...@googlegroups.com
2009/4/10 Klaus D. Witzel <klaus....@cobss.com>:
>
> Hi List,
>
> just for the record, have a look at their goals below, especially
> where it begins with: attempting to automatically ...
>

i have a feeling that this post lacks technical expertise.
There are zillions of algorithms which can't be parallelised, and I
fear badly written code will still use a single core
despite the compiler's "attempts to automatically partition the source
code to run on multicore processors". From that point, I can't
understand what they really want to achieve, except scoring a $16M
budget :)

Sidenote: I've just found that my native code model doesn't take
arithmetic overflows into account, especially integer multiplication,
where two 32-bit words could produce a 64-bit result.
It's easy to write:
result64 := arg32 * arg23.
and let the compiler determine what it needs to do to make it work properly,
but this is not so easy to implement :) I'm a bit stuck thinking about how
to properly introduce the type system into native code, and to what
extent...
--
Best regards,
Igor Stasenko AKA sig.

Igor Stasenko

Apr 10, 2009, 5:37:06 PM
to moebius-proje...@googlegroups.com
2009/4/10 Igor Stasenko <sigu...@gmail.com>:

> 2009/4/10 Klaus D. Witzel <klaus....@cobss.com>:
>>
>> Hi List,
>>
> > just for the record, have a look at their goals below, especially
>> where it begins with: attempting to automatically ...
>>
>
> i have a feeling that this post lacks technical expertise.
> There are zillions of algorithms which can't be parallelised, and I
> fear badly written code will still use a single core
> despite the compiler's "attempts to automatically partition the source
> code to run on multicore processors". From that point, I can't
> understand what they really want to achieve, except scoring a $16M
> budget :)
>
An example: let's find the maximum number in an array of 10M elements.

max := array first.
array do: [ :each | max < each ifTrue: [ max := each ]].

It can be easily parallelised by splitting the loop into a number of smaller loops.
But you lose most of the benefits of parallelism because of contention
due to writing/reading the shared max value, especially when the number
of cores is MANY.

My guess is that the following code will beat the above, because it
eliminates the contention:

max := array first.
ranges := array getSlicedRanges: System optimalNumberOfCores.
intermediateMaximums := Array new: ranges size.
intermediateMaximums setAll: max.

ranges parallelWithIndexDo: [:i :range |
    range start to: range end do: [:j |
        (intermediateMaximums at: i) < (array at: j)
            ifTrue: [ intermediateMaximums at: i put: (array at: j) ]
    ]].

intermediateMaximums do: [:each | max < each ifTrue: [ max := each ]].

Now, let me ask: is it possible to write such a sophisticated compiler
which, given the simple code (the first example), will produce something
like the latter, which is more appropriate for multicore? If they
are able to do that, then we will have no more job(s), because
they can adapt their compiler to many other computing problems besides
parallelism :)
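
Just to illustrate the shape such a compiler would have to produce, here
is a rough sketch of the same intermediate-maximums idea in C with OpenMP
(purely illustrative, made-up function name, nothing to do with our code):

int parallelMax(const int *array, int n)
{
    int max = array[0];
    #pragma omp parallel
    {
        int i;
        int localMax = array[0];       /* per-thread intermediate maximum */
        #pragma omp for nowait
        for (i = 0; i < n; i++)
            if (array[i] > localMax)
                localMax = array[i];
        #pragma omp critical           /* combine step, runs once per thread */
        if (localMax > max)
            max = localMax;
    }
    return max;
}

Each thread keeps its own intermediate maximum, so the shared max is only
touched once per thread in the final combine step.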

Klaus D. Witzel

Apr 11, 2009, 10:09:33 AM
to Moebius project discussion
On Apr 10, 10:43 pm, Igor Stasenko wrote:
> 2009/4/10 Klaus D. Witzel:
>
>
>
> > Hi List,
>
> > just for the record, have a look at their goals below, especially
> > where it begins with: attempting to automatically ...
>
> i have a feeling that this post lacks technical expertise.
s/post/project/
> There are zillions of algorithms which can't be parallelised, and I
> fear badly written code will still use a single core
> despite the compiler's "attempts to automatically partition the source
> code to run on multicore processors". From that point, I can't
> understand what they really want to achieve, except scoring a $16M
> budget :)

I think that the comparison (compiler A betterThan: compiler B) will be
made with existing products (like GCC) and not with what would be
theoretically possible to achieve, so: there is hope that they
succeed ;)

> Sidenote: I've just found that my native code model doesn't take
> arithmetic overflows into account, especially integer multiplication,
> where two 32-bit words could produce a 64-bit result.
> It's easy to write:
> result64 := arg32 * arg23.

| sum oFlag |
"all oops are either 32 or 64 bits here"
oFlag := (sum := oopA with0Tag + oopB with0Tag) overflow.
(oFlag + (1 - oopA tagBit) + (1 - oopB tagBit)) = 0
    ifFalse: ["either an operand didn't have the smallInt tag or an
              overflow occurred" ... ]

in which #overflow is a message (like your #word message) for
accessing CPU flags and the other messages are self-explanatory.

I have x86 GCC asm ("...") code which does SmallInteger>>#+ exactly
like the above, all in registers, but it is based on 0-tagged smis, like
Huemul, as discussed with Guille, and so saves #with0Tag, and (1 -
xyz tagBit) becomes just (xyz tagBit).
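
Purely as an illustration (this is not that code; a minimal C sketch with
made-up names, assuming the SmallInteger tag is the low bit and is 0),
the fast path looks roughly like:

#include <stdint.h>

/* sketch: SmallInteger #+ on 0-tagged smis, all in registers.
   With a 0 tag in the low bit the tagged values can be added as-is;
   OF set means the sum no longer fits a SmallInteger. */
static inline int smiAdd(int32_t oopA, int32_t oopB, int32_t *sum)
{
    uint8_t overflow;
    int32_t r = oopA;

    if ((oopA | oopB) & 1)             /* an operand is not a tagged smi */
        return 0;                      /* caller takes the slow path */

    __asm__ ("addl %2, %0\n\t"
             "seto %1"
             : "+r"(r), "=q"(overflow)
             : "r"(oopB)
             : "cc");
    *sum = r;
    return !overflow;                  /* 0: redo with LargeIntegers */
}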

> and let the compiler determine what it needs to do to make it work properly,

The above works in simulation (given proper method implementations)
and also has sufficient hints for efficient compilation.

> but this is not so easy to implement :) I'm a bit stuck thinking about how
> to properly introduce the type system into native code, and to what
> extent...
>
> > Cheers,
> > Klaus
...

Igor Stasenko

Apr 11, 2009, 3:23:42 PM
to moebius-proje...@googlegroups.com
2009/4/11 Klaus D. Witzel <klaus....@cobss.com>:
Yeah, sum is pretty easy to implement using overflow flag.
But my original question was about multiplication. What about it?

Klaus D. Witzel

Apr 11, 2009, 4:24:10 PM
to Moebius project discussion
On Apr 11, 9:23 pm, Igor Stasenko wrote:
> 2009/4/11 Klaus D. Witzel :
...
> >> Sidenote: I've just found that my native code model doesn't take
> >> arithmetic overflows into account, especially integer multiplication,
> >> where two 32-bit words could produce a 64-bit result.
> >> It's easy to write:
> >> result64 := arg32 * arg23.
>
> >  | sum oFlag |
> >  "all oops either 32 or 64 bits bits here"
> >  oFlag := (sum := oopA with0Tag + oopB with0Tag) overflow.
> >  (oFlag + (1 - oopA tagBit) + (1 - oopB tagBit)) = 0
> >  ifFalse: ["either an operand didn't have the smallInt tag or an
> > overflow occurred" ... ]
>
> > in which #overflow is a message (like your #word message) for
> > accessing CPU flags and the other messages are self-explanatory.
>
> > I have x86 GCC asm ("...") code which does SmallInteger>>#+ exactly
> > like the above, all in registers, but it is based on 0-tagged smis, like
> > Huemul, as discussed with Guille, and so saves #with0Tag, and (1 -
> > xyz tagBit) becomes just (xyz tagBit).
>
> Yeah, sum is pretty easy to implement using overflow flag.
> But my original question was about multiplication. What about it?

Same as in my #+ example. TFM[0] says for IMUL - Signed Multiply: With
the two- and three- operand forms, however, result is truncated to the
length of the destination before it is stored in the destination
register. Because of this truncation, the CF or OF flag should be
tested to ensure that no significant bits are lost.

[0] - http://www.intel.com/design/pentiumii/manuals/243191.htm
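
As a C sketch (plain 32-bit values, tag handling left out, names made
up), the two-operand form with the flag test would be roughly:

#include <stdint.h>

/* sketch: r32 := r32 * r32 with overflow detection.  Two-operand IMUL
   truncates the product to 32 bits and sets OF/CF when significant
   bits were lost, so the test is the same as for #+. */
static inline int mul32Overflows(int32_t a, int32_t b, int32_t *product)
{
    uint8_t of;
    int32_t r = a;
    __asm__ ("imull %2, %0\n\t"        /* r := truncate32(r * b) */
             "seto %1"                 /* OF set => bits were lost */
             : "+r"(r), "=q"(of)
             : "r"(b)
             : "cc");
    *product = r;
    return of;                         /* non-zero: fall back to LargeIntegers */
}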

Igor Stasenko

Apr 12, 2009, 4:59:24 AM
to moebius-proje...@googlegroups.com
2009/4/11 Klaus D. Witzel <klaus....@cobss.com>:
>
Suppose an overflow occurred, what now?
If we have 64-bit registers, then it's trivial to convert the
result to a big integer instance, but if we don't, then the multiplication
needs to be split into multiple stages (by splitting the operands into
16-bit pairs to make sure that no 32-bit overflow occurs for any 16-bit
pair).
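
For the 32-bit-only case the splitting would look roughly like this (a
C sketch with made-up names, unsigned magnitudes only; signs would be
handled separately):

#include <stdint.h>

/* sketch: 32x32 -> 64 built from 16-bit halves, using nothing wider
   than 32-bit arithmetic. */
static void mul32x32to64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t aLo = a & 0xFFFF, aHi = a >> 16;
    uint32_t bLo = b & 0xFFFF, bHi = b >> 16;

    uint32_t ll = aLo * bLo;           /* partial products, each fits 32 bits */
    uint32_t lh = aLo * bHi;
    uint32_t hl = aHi * bLo;
    uint32_t hh = aHi * bHi;

    /* middle column plus the carry out of the low word */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFF) + (hl & 0xFFFF);

    *lo = (ll & 0xFFFF) | (mid << 16);
    *hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}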

Klaus D. Witzel

Apr 13, 2009, 5:11:01 AM
to Moebius project discussion
On Apr 12, 10:59 am, Igor Stasenko wrote:
> 2009/4/11 Klaus D. Witzel :
...
> >> Yeah, sum is pretty easy to implement using overflow flag.
> >> But my original question was about multiplication. What about it?
>
> > Same as in my #+ example. TFM[0] says for IMUL - Signed Multiply: With
> > the two- and three- operand forms, however, result is truncated to the
> > length of the destination before it is stored in the destination
> > register. Because of this truncation, the CF or OF flag should be
> > tested to ensure that no significant bits are lost.
>
> Suppose an overflow occurred, what now?

You probably lost me now.

> If we have 64-bit registers, then it's trivial to convert the
> result to a big integer instance,

No, (oper32 * oper32) will not result in result64 when the
instruction also has to detect overflow. The result in this case is

a) result32 and zero overflow
b) result undefined and overflow is set

So for b) you have to redo the operation *and* now must use *higher*
order objects, the very same procedure as if #primitiveFailed on
*lower* order objects.

> but if we don't, then the multiplication
> needs to be split into multiple stages (by splitting the operands into
> 16-bit pairs to make sure that no 32-bit overflow occurs for any 16-bit
> pair).
>
> > [0] - http://www.intel.com/design/pentiumii/manuals/243191.htm

Igor Stasenko

Apr 13, 2009, 8:50:57 AM
to moebius-proje...@googlegroups.com
2009/4/13 Klaus D. Witzel <klaus....@cobss.com>:
>
> On Apr 12, 10:59 am, Igor Stasenko wrote:
>> 2009/4/11 Klaus D. Witzel :
> ...
>> >> Yeah, sum is pretty easy to implement using overflow flag.
>> >> But my original question was about multiplication. What about it?
>>
>> > Same as in my #+ example. TFM[0] says for IMUL - Signed Multiply: With
>> > the two- and three- operand forms, however, result is truncated to the
>> > length of the destination before it is stored in the destination
>> > register. Because of this truncation, the CF or OF flag should be
>> > tested to ensure that no significant bits are lost.
>>
>> Suppose an overflow occurred, what now?
>
> You probably lost me now.
>
>> If we have 64-bit registers, then it's trivial to convert the
>> result to a big integer instance,
>
> No, (oper32 * oper32) will not result in result64 when the
> instruction also has to detect overflow. The result in this case is
>
> a) result32 and zero overflow
> b) result undefined and overflow is set
c) IMUL r/m32 9-38/12-41 EDX:EAX := EAX * r/m dword
EDX contains high 32bits
EAX contains low 32bits of result
>
> So for b) you have to redo the operation *and* now must use *higher*
> order objects, the very same procedure as if #primitiveFailed on
> *lower* order objects.

In the case of using c) I don't have to redo the operation, just store the
result into a newly created instance of a bigint.
But I agree that it's easier to use r32 := r32 * r32 and detect
overflow, and not mess with 64-bit results.
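
For completeness, the c) variant in C is just a widening multiply (a
sketch, made-up name); on x86, GCC typically compiles it to the
one-operand IMUL, so the full 64-bit product is already in EDX:EAX and
only needs boxing when it doesn't fit:

#include <stdint.h>

/* sketch: widening multiply plus a "does it still fit in 32 bits" test.
   If it fits, re-tag it as a SmallInteger; if not, the 64-bit value is
   already there to be stored into a LargeInteger, no redo needed. */
static inline int mulFits32(int32_t a, int32_t b, int64_t *wide)
{
    *wide = (int64_t)a * (int64_t)b;   /* exact 64-bit product */
    return *wide == (int32_t)*wide;    /* 1: still a 32-bit value */
}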