
Someone's Trying Again (Ascenium)


Quadibloc

Jul 12, 2021, 5:17:29 PM
Came across this item:

https://www.nextplatform.com/2021/07/12/gutting-decades-of-architecture-to-build-a-new-kind-of-processor/

about an attempt to design a processor that does away with instruction
sets and all those instructions that just move data around.

John Savard

MitchAlsup

Jul 12, 2021, 5:56:19 PM
Sounds like a cross between the Transputer and a systolic array.

luke.l...@gmail.com

Jul 12, 2021, 6:28:25 PM
On Monday, July 12, 2021 at 10:17:29 PM UTC+1, Quadibloc wrote:

> about an attempt to design a processor that does away with instruction
> sets and all those instructions that just move data around.

reminds me of Elixent's stuff. they did NSEW neighbour processing.
fascinating design: 4-bit ALUs. tens of thousands of them. programming
this style of processor is an absolute pig.

l.

John Dallman

Jul 13, 2021, 3:38:14 AM
In article <9b868d91-8bb9-41b8...@googlegroups.com>,
luke.l...@gmail.com () wrote:

> reminds me of Elixent's stuff. they did NSEW neighbour processing.
> fascinating design: 4-bit ALUs. tens of thousands of them.
> programming this style of processor is an absolute pig.

I once had to respond to a pitch from a company who wanted to build big
grids of 8-bit processors along similar lines. They claimed to be able to
compile C and C++ code into highly optimised programs for this hardware,
and were trying to get ISVs to commit to supporting it to help their case
for venture capital.

It looked rather hard to debug on, which they admitted. I asked how they
were planning to offer support on bugs in their compiler and the response
"Why would there be any?" doomed the pitch.

John

David Brown

Jul 13, 2021, 5:55:52 AM
Didn't the Itanium demonstrate that "magic compilers" don't work?
Compilers (and other tools) along with massive arrays of limited cores
are fine for code that has a lot of calculations but little variation
and few conditionals - so they are useful for graphics, AI, physical
simulations, etc. But they fall apart as soon as you do a few "if"
statements and get beyond the possibility of running all paths at
once and doing a conditional move at the end.

Marcus

Jul 13, 2021, 7:25:54 AM
If at first you don't succeed...?

I'm personally quite skeptical about relying too much on compiler
technology advancements. Itanium proved it's a bad strategy. Auto-
vectorization for SIMD ISA:s proved it's a bad strategy.

OTOH we have the end of Moore's Law around the corner, and we've been
trying to crack the problem of massive parallelism for decades now, so
I really hope that we'll see some (useful) paradigm shifts in the near
future.

To me it seems that it's more of a programming problem than a HW
problem, and it feels like we're using the wrong tools to describe
solutions to our problems. At some point in time serial & branchy
instruction streams derived from traditional programming languages
will be harder and more expensive to construct and run than to simply
throw massively parallel compute at the problem.

I work on a product where we use a mix of hand-made algorithms and
deep learning, so I'm front row witnessing the transition from well
defined logic & algebra to fuzzy neural networks. And how successful
that transition is. And it's frankly quite scary IMO.

Or maybe we should just accept the end of Moore's Law and require
programmers to go back to being great again, like they were a few
decades ago. ;-) I wonder how many modern day programmers would pull
off something like a graphical action/adventure game with a 1 MHz
8-bit CPU and 128 bytes of RAM (e.g. Atari 2600 Pitfall).

David Brown

Jul 13, 2021, 8:13:13 AM
What seems to be missing here (at least, /I/ have missed it) is a
discussion about what people actually want to do with computing power.
Is it fair to suggest that most tasks that people currently want to run
faster, are actually reasonably parallel? And that most tasks that
people want to run a /lot/ faster are /very/ parallel?

For the majority of users, about the only thing that needs to go faster
than today's machines is games. Moore's Law is not a problem for them -
graphics cards that do more in parallel make the games faster. Some
aspects are still cpu-bound, but often multiple cpu cores will scale as
well as single-threaded performance. For more professional work, 3D
CAD, software builds, modelling work, simulations, etc., - much of it
can use multiple cpu cores just as much as fast single cores.

On servers, it's usually lots of semi-independent tasks that can be
spread amongst many cores. For HPC, massively parallel is the norm.

And for the current fad of AI, it's inherently very parallel.

So I agree with you that it is a primarily a programming problem, and
the answer lies in better tools, training, languages, etc., aimed at
parallelisation of code.

Where hardware can help is to make multi-core processors that have
features to aid parallel coding and synchronisation. Instead of the
absurdly inefficient and complicated memory models, barriers, fences,
software synchronisation, and overly general bus locks, cache snooping
and the rest of it, we need hardware that supports key OS features.
Cores should know the id of the thread they are running. Processors
should have a shared block of locks, immediately accessible by all cores
rather than putting locks in main memory - so taking a lock will be a
few cycles if it is uncontested. Processors should support mailboxes
for rapid messaging. It is time for processors to support
multi-threading OS's, instead of the OS having to deal with awkward
processors.
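
A minimal sketch of how that kind of support might look from the software
side, assuming a hypothetical part that exposes a core-ID register, a bank
of hardware locks, and per-core mailboxes as memory-mapped I/O. Every
address and register convention below is invented for illustration; no
existing processor is being described.

#include <stdint.h>

/* Hypothetical memory-mapped register banks. */
#define HW_CORE_ID  ((volatile uint32_t *)0x40000000u)  /* "who am I" reg   */
#define HW_LOCKS    ((volatile uint32_t *)0x40001000u)  /* lock bank        */
#define HW_MBOX     ((volatile uint32_t *)0x40002000u)  /* per-core mailboxes */

static inline uint32_t core_id(void)
{
    return *HW_CORE_ID;          /* each core reads its own thread/core id */
}

/* Convention assumed here: reading lock n returns 0 when the hardware
 * grants the lock to the reading core, nonzero when another core holds
 * it; writing 0 releases it.  An uncontested acquisition is then a
 * single bus transaction - a few cycles, no trip to main memory. */
static inline void hw_lock(unsigned n)
{
    while (HW_LOCKS[n] != 0)
        ;                        /* spin until the hardware grants it */
}

static inline void hw_unlock(unsigned n)
{
    HW_LOCKS[n] = 0;
}

/* Post a word to another core's mailbox; the hardware is assumed to
 * raise an interrupt or wake the receiving core. */
static inline void hw_mbox_send(unsigned core, uint32_t msg)
{
    HW_MBOX[core] = msg;
}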


Theo Markettos

Jul 13, 2021, 9:14:19 AM
David Brown <david...@hesbynett.no> wrote:
> What seems to be missing here (at least, /I/ have missed it) is a
> discusion about what people actually want to do with computing power.
> Is it fair to suggest that most tasks that people currently want to run
> faster, are actually reasonably parallel? And that most tasks that
> people want to run a /lot/ faster are /very/ parallel?

I think the question to ask is: what tasks are we /not/ doing, because we
don't have the computing power to handle them?

Artificial neural networks were invented a long time ago, but were not
feasible to deploy because they needed what was, at the time, infeasible
amounts of computing power. Now the computing power has caught up and
they're feasible - and so we're using them as a hammer to hit every problem.
The hardware is hugely worse in size and power efficiency than the neural
network between our ears, but we pay that cost.

So what fields have dismissed problems because there is at present
insufficient computing resource?

I don't buy the 'throw infinite parallelism at it' argument BTW - even if
you build such a processor you still have to feed it from a memory that
represents the shared state of the system (and manage its coherence).
Again, we can cream off those problems with a limited amount of shared state
that can be parcelled up to individual cores or computers that interact
minimally (the classic scale out) and those are the problems that have seen
the most attention. But I wonder about the problems nobody's trying because
they're too hard to do in that way.

Theo

George Neuner

Jul 13, 2021, 9:31:13 AM
Dunno. I programmed Connection Machines. Admittedly a problem had to
be embarrassingly parallel to fit with the hardware ... but given
that, the CM was quite easy to work with.

George

MitchAlsup

Jul 13, 2021, 12:31:53 PM
A) Bingo:: this has proven to be a SW problem
B) we do not yet have a "Turing" model of interacting threads
C) One big problem is that HW does not provide proper/adequate primitives
(instructions or streams of instructions) that deliver the kind of ATOMICITY
SW requires.
<
As to C:: My 66000 (and apparently MILL) provide multiple location ATOMICs
With significantly better semantics than DCAS or LL/SC........
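
As a rough illustration of point C, in plain C11 atomics (which only give
single-location compare-and-swap): moving a node from one lock-free stack
to another takes two separate CAS operations with nothing tying them
together, which is exactly the gap a multiple-location atomic would close.
The code is only a sketch, ignoring ABA and memory-ordering fine points.

#include <stdatomic.h>
#include <stddef.h>

struct node { struct node *next; int payload; };

/* Two Treiber stacks; each individual push/pop is atomic via CAS. */
static _Atomic(struct node *) stack_a, stack_b;

static struct node *pop(_Atomic(struct node *) *top)
{
    struct node *old = atomic_load(top);
    while (old && !atomic_compare_exchange_weak(top, &old, old->next))
        ;                          /* retry until the CAS succeeds */
    return old;
}

static void push(_Atomic(struct node *) *top, struct node *n)
{
    struct node *old = atomic_load(top);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(top, &old, n));
}

/* The "move" is NOT atomic: between the pop and the push the node is
 * visible to nobody, and other threads can observe the intermediate
 * state.  A multiple-location atomic would let both updates commit as
 * one event. */
static void move_one(void)
{
    struct node *n = pop(&stack_a);
    if (n)
        push(&stack_b, n);
}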
<
> At some point in time serial & branchy
> instruction streams derived from traditional programming languages
> will be harder and more expensive to construct and run than to simply
> throw massively parallel compute at the problem.
>
> I work on a product where we use a mix of hand-made algorithms and
> deep learning, so I'm front row witnessing the transition from well
> defined logic & algebra to fuzzy neural networks. And how successful
> that transition is. And it's frankly quite scary IMO.
>
> Or maybe we should just accept the end of Moore's Law and require
> programmers to got back to being great again, like they were a few
> decades ago. ;-) I wonder how many modern day programmers would pull
> off something like a graphical action/adventure game with a 1 MHz
> 8-bit CPU and 128 bytes of RAM (e.g. Atari 2600 Pitfall).
<
1.238% of them.

Terje Mathisen

Jul 13, 2021, 3:14:13 PM
The fun part for me is that the founders are all Norwegian (until they
brought in Peter Toley), which is interesting...

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

John Levine

Jul 13, 2021, 4:25:45 PM
According to Marcus <m.de...@this.bitsnbites.eu>:
>I'm personally quite skeptical about relying too much on compiler
>technology advancements. Itanium proved it's a bad strategy. Auto-
>vectorization for SIMD ISA:s proved it's a bad strategy.

I dunno, the IBM 801 and PL.8 proved it is a good strategy, at least if your name is John Cocke or Fran Allen.

Perhaps the lesson is to be realistic in your plan for how much hardware ugliness you can paper over with
compiler cleverness, and to remember that the tradeoffs change over time.

Itanium was based on work at Multiflow in the 1980s, when the amount of stuff you could do in hardware was a lot
less than it was a decade or two later. The compiler did what it could to schedule memory references statically,
but once you could do that in hardware, dynamic hardware scheduling worked a lot better.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Quadibloc

Jul 13, 2021, 5:58:49 PM
On Tuesday, July 13, 2021 at 6:13:13 AM UTC-6, David Brown wrote:

> What seems to be missing here (at least, /I/ have missed it) is a
> discusion about what people actually want to do with computing power.
> Is it fair to suggest that most tasks that people currently want to run
> faster, are actually reasonably parallel? And that most tasks that
> people want to run a /lot/ faster are /very/ parallel?

Unfortunately, no, it isn't.

Many of the tasks that people want to run faster are parallel, but some of
them are not.

Also, it's at least possible that *some* of the tasks people want to run
faster... are perhaps more parallel than people realize, and some new
programming paradigm might enable this parallelism to be brought to
light. At least that's the hope that fuels attempts to design compilers
that will bring out extra parallelism that's not obvious to human programmers.

John Savard

Theo Markettos

Jul 13, 2021, 6:12:09 PM
I'm curious about this:

"We can run 700,000 lines of code, which includes standard C libraries used
in SPEC, and we compile that and run that on our FPGA testbed, which is not
the full architecture, but a big chunk of it, and get functionally correct
results. We have a full symbolic debugger and other infrastructure to
actually make something like that work."


I wonder what's in those 700KLOC. Because frequently existing codebases are
very CPU-oriented. For example, lots of control flow, doing memory
allocations all over the place because that's what CPUs do, but things that
don't lend themselves to parallelism. And then those codebases expect to do
things like fopen() and printf() and mmap() and other things which don't sit
very well in a data-oriented machine. So I'm curious as to what OS support
they have, or whether they're mining some codebase that intentionally
doesn't have all the pesky I/O and control flow that today's real software
does.

Theo

John Levine

Jul 13, 2021, 6:22:28 PM
According to Theo Markettos <theom...@chiark.greenend.org.uk>:
>Quadibloc <jsa...@ecn.ab.ca> wrote:
>> https://www.nextplatform.com/2021/07/12/gutting-decades-of-architecture-to-build-a-new-kind-of-processor/
>>
>> about an attempt to design a processor that does away with instruction
>> sets and all those instructions that just move data around.
>
>I'm curious about this:
>
>"We can run 700,000 lines of code, which includes standard C libraries used
>in SPEC, and we compile that and run that on our FPGA testbed, which is not
>the full architecture, but a big chunk of it, and get functionally correct
>results. We have a full symbolic debugger and other infrastructure to
>actually make something like that work."
>
>I wonder what's in those 700KLOC. Because frequently existing codebases are
>very CPU-oriented.

Standard C libraries are written to be portable and to abstract away the details of
the machine into macros and parameter definitions.

If I were looking for test code for a wacky architecture, it would be a good choice.
One thing that's non-negotiable is that data has to be 8-bit byte addressable. That's
built into too many places.

chris

Jul 13, 2021, 6:51:46 PM
A convergence with AI techniques, perhaps?...

MitchAlsup

Jul 13, 2021, 7:54:37 PM
On Tuesday, July 13, 2021 at 3:25:45 PM UTC-5, John Levine wrote:
> According to Marcus <m.de...@this.bitsnbites.eu>:
> >I'm personally quite skeptical about relying too much on compiler
> >technology advancements. Itanium proved it's a bad strategy. Auto-
> >vectorization for SIMD ISA:s proved it's a bad strategy.
> I dunno, the IBM 801 and PL.8 proved it is a good strategy, at least if your name is John Cocke or Fram Allen.
>
> Perhaps the lesson is to be realistic in your plan for how much hardware ugliness you can paper over with
> compiler cleverness, and to remeber that the tradeoffs change over time.
>
> Itanium was based on work at Multiflow in the 1980s, when the amount of stuff you could do in hardware was a lot
> less than it was a decade or two later. The compiler did what it could to schedule memory references statically,
> but once you could do that in hardware, dynamic hardware scheduling worked a lot better.
<
Except for that power thing............

MitchAlsup

Jul 13, 2021, 7:55:14 PM
On Tuesday, July 13, 2021 at 5:12:09 PM UTC-5, Theo Markettos wrote:
> Quadibloc <jsa...@ecn.ab.ca> wrote:
> > Came across this item:
> >
> > https://www.nextplatform.com/2021/07/12/gutting-decades-of-architecture-to-build-a-new-kind-of-processor/
> >
> > about an attempt to design a processor that does away with instruction
> > sets and all those instructions that just move data around.
> I'm curious about this:
>
> "We can run 700,000 lines of code, which includes standard C libraries used
> in SPEC, and we compile that and run that on our FPGA testbed, which is not
> the full architecture, but a big chunk of it, and get functionally correct
> results. We have a full symbolic debugger and other infrastructure to
> actually make something like that work."
<
Ask them about their SPECint score ??

Marcus

Jul 14, 2021, 2:01:05 AM
On 2021-07-13, Theo Markettos wrote:

[snip]

> The hardware is hugely worse in size and power efficiency than the neural
> network between our ears, but we pay that cost.
>

I personally think that the proper implementation of a power efficient
neural network requires two things:

1) Memory (signals and weights) should be co-located with the ALU:s (or
distributed across the compute matrix, if you will).

2) Compute cells should only be active when activated by a signal.
Perhaps the design should not be clocked by a global clock at all?

And if the speed and power efficiency advantage is big enough, some may
even accept a design with non-deterministic output (example: the neural
network between our ears) - e.g. as a result of an asynchronous design.
That could further open up for things like cheaper / denser memory
structures where bit-flips could be accepted (or maybe even desired?),
etc.

That would be a massive paradigm shift.

/Marcus

David Brown

Jul 14, 2021, 3:02:01 AM
On 14/07/2021 08:01, Marcus wrote:
> On 2021-07-13, Theo Markettos wrote:
>
> [snip]
>
>> The hardware is hugely worse in size and power efficiency than the neural
>> network between our ears, but we pay that cost.
>>
>
> I personally think that the proper implementation of a power efficient
> neural network requires two things:
>
> 1) Memory (signals and weights) should be co-located with the ALU:s (or
>    distributed across the compute matrix, if you will).
>
> 2) Compute cells should only be active when activated by a signal.
>    Perhaps the design should not be clocked by a global clock at all?
>

I agree on both accounts. (I haven't looked much at neural networks
since university, but I assume the principles haven't changed.)

A biological neuron encompasses its own memory (weights), its own
processing, its own IO, its own learning system. To make really
powerful artificial neural networks, the component parts need that too.
Then you can scale the whole thing by adding more of the same.

> And if the speed and power efficiency advantage is big enough, some may
> even accept a design with non-deterministic output (example: the neural
> network between our ears) - e.g. as a result of an asynchronous design.
> That could further open up for things like cheaper / denser memory
> structures where bit-flips could be accepted (or maybe even desired?),
> etc.
>
> That would be a massive paradigm shift.
>

Accepting non-deterministic output, or imperfect results, would be a big
change. It would not be suitable for general computing - but could be
fine for some specialised tasks. A good neural network architecture
could be vastly more efficient for vision processing, just as a good
quantum architecture could be efficient for some kinds of optimisation
problems - but neither would be any use for a Usenet client!

Marcus

Jul 14, 2021, 3:57:22 AM
On 2021-07-14 09:01, David Brown wrote:
> On 14/07/2021 08:01, Marcus wrote:
>> On 2021-07-13, Theo Markettos wrote:
>>
>> [snip]
>>
>>> The hardware is hugely worse in size and power efficiency than the neural
>>> network between our ears, but we pay that cost.
>>>
>>
>> I personally think that the proper implementation of a power efficient
>> neural network requires two things:
>>
>> 1) Memory (signals and weights) should be co-located with the ALU:s (or
>>    distributed across the compute matrix, if you will).
>>
>> 2) Compute cells should only be active when activated by a signal.
>>    Perhaps the design should not be clocked by a global clock at all?
>>
>
> I agree on both accounts. (I haven't looked much at neural networks
> since university, but I assume the principles haven't changed.)
>
> A biological neuron encompasses its own memory (weights), its own
> processing, its own IO, its own learning system. To make really
> powerful artificial neural networks, the component parts need that too.
> Then you can scale the whole thing by adding more of the same.

Exactly. You'll not be bounded by memory bandwidth or similar - it's a
truly distributed system that should scale very well.

>
>> And if the speed and power efficiency advantage is big enough, some may
>> even accept a design with non-deterministic output (example: the neural
>> network between our ears) - e.g. as a result of an asynchronous design.
>> That could further open up for things like cheaper / denser memory
>> structures where bit-flips could be accepted (or maybe even desired?),
>> etc.
>>
>> That would be a massive paradigm shift.
>>
>
> Accepting non-deterministic output, or imperfect results, would be a big
> change. It would not be suitable for general computing - but could be
> fine for some specialised tasks. A good neural network architecture
> could be vastly more efficient for vision processing, just as a good
> quantum architecture could be efficient for some kinds of optimisation
> problems - but neither would be any use for a Usenet client!
>

I think it's kind of like lossless vs. lossy compression. Once you
accept imperfection, you get orders of magnitude wins. For some
applications this will be fine. For some applications where we currently
think that determinism is required, it will be fine too. But for most
of the software that we're used to (OS:es, compilers, Usenet readers,
text editors, Web browsers etc) it will be of little use.

/Marcus

David Brown

Jul 14, 2021, 5:26:32 AM
That's a good analogy.

But I also think there is scope for big improvements even within
"normal" deterministic code. As I mentioned elsewhere in the thread, I
think there are features that could be added to current processors that
could greatly improve and simplify parallel coding, and thereby make
better use of the architectures we have.

Marcus

Jul 14, 2021, 8:20:33 AM
On 2021-07-13 15:14, Theo Markettos wrote:
> David Brown <david...@hesbynett.no> wrote:
>> What seems to be missing here (at least, /I/ have missed it) is a
>> discusion about what people actually want to do with computing power.
>> Is it fair to suggest that most tasks that people currently want to run
>> faster, are actually reasonably parallel? And that most tasks that
>> people want to run a /lot/ faster are /very/ parallel?
>
> I think the question to ask is: what tasks are we /not/ doing, because we
> don't have the computing power to handle them?
>
> Artificial neural networks were invented a long time ago, but were not
> feasible to deploy because they needed what was, at the time, infeasible
> amounts of computing power. Now the computing power has caught up and
> they're feasible - and so we're using them as a hammer to hit every problem.
> The hardware is hugely worse in size and power efficiency than the neural
> network between our ears, but we pay that cost.
>
> So what fields have dismissed problems because there is at present
> insufficient computing resource?

I think that the most interesting problems & solutions are the ones that
most of us never thought of, until there was adequate hardware to
implement them.

A lot of research went into image denoising a few years ago (I happened
to work at Autodesk when they implemented a couple of denoising
algorithms for ray tracers), and then suddenly there was a GPU that was
capable of doing it in real time using neural networks. I think that
very few people expected that.

To quote John Carmack (regarding ray tracing in games), [1]:

"One significant thing I didn't have on my radar back then [2013] was
neural network denoising / image enhancement."

/Marcus


[1] https://twitter.com/ID_AA_Carmack/status/1098687168443240450

Stefan Monnier

Jul 14, 2021, 9:23:42 AM
> 2) Compute cells should only be active when activated by a signal.
> Perhaps the design should not be clocked by a global clock at all?

I recentlyish saw an article (in ACM Communications maybe?) about the
use of delay to encode values, in order to significantly lower power
consumption. The data representation is analog but it uses standard
digital logic elements (e.g. the AND gate performs the `min` operation
and the OR gate performs the `max` operation).


Stefan

MitchAlsup

Jul 14, 2021, 12:26:49 PM
On Wednesday, July 14, 2021 at 1:01:05 AM UTC-5, Marcus wrote:
> On 2021-07-13, Theo Markettos wrote:
>
> [snip]
> > The hardware is hugely worse in size and power efficiency than the neural
> > network between our ears, but we pay that cost.
> >
> I personally think that the proper implementation of a power efficient
> neural network requires two things:
>
> 1) Memory (signals and weights) should be co-located with the ALU:s (or
> distributed across the compute matrix, if you will).
>
> 2) Compute cells should only be active when activated by a signal.
> Perhaps the design should not be clocked by a global clock at all?
<
The calculations are signaled by the arrival of the data. No clock, local or
global.
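
A rough software analogue of that firing rule, with every structure and
threshold invented for illustration: each cell holds its own weights and
state, and work happens only when an input event actually arrives, never
on a clock tick.

#include <stdio.h>

#define FANOUT 2

/* Each cell keeps its own weights, state and fan-out list; nothing is
 * evaluated on a clock tick, work happens only when an input arrives. */
struct cell {
    float weight[FANOUT];
    float acc;
    float threshold;
    int   out[FANOUT];          /* downstream cell indices, -1 = none */
};

struct event { int target; int port; float value; };

/* Deliver one event; if the cell crosses its threshold it emits events
 * to its fan-out and resets.  Idle cells cost nothing. */
static int deliver(struct cell *cells, struct event ev, struct event *outq)
{
    struct cell *c = &cells[ev.target];
    int emitted = 0;

    c->acc += c->weight[ev.port] * ev.value;
    if (c->acc >= c->threshold) {
        for (int i = 0; i < FANOUT; i++)
            if (c->out[i] >= 0)
                outq[emitted++] = (struct event){ c->out[i], i, 1.0f };
        c->acc = 0.0f;
    }
    return emitted;
}

int main(void)
{
    struct cell cells[2] = {
        { { 0.6f, 0.6f }, 0.0f, 1.0f, {  1, -1 } },   /* feeds cell 1 */
        { { 1.0f, 1.0f }, 0.0f, 1.0f, { -1, -1 } },
    };
    struct event q[8] = { { 0, 0, 1.0f }, { 0, 1, 1.0f } };
    int head = 0, tail = 2;

    while (head < tail)                   /* drain the event queue */
        tail += deliver(cells, q[head++], &q[tail]);

    printf("events processed: %d\n", head);
    return 0;
}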

Stephen Fuld

Jul 14, 2021, 7:41:31 PM
On 7/14/2021 12:57 AM, Marcus wrote:
> On 2021-07-14 09:01, David Brown wrote:
>> On 14/07/2021 08:01, Marcus wrote:
>>> On 2021-07-13, Theo Markettos wrote:
>>>
>>> [snip]
>>>
>>>> The hardware is hugely worse in size and power efficiency than the
>>>> neural
>>>> network between our ears, but we pay that cost.
>>>>
>>>
>>> I personally think that the proper implementation of a power efficient
>>> neural network requires two things:
>>>
>>> 1) Memory (signals and weights) should be co-located with the ALU:s (or
>>>     distributed across the compute matrix, if you will).
>>>
>>> 2) Compute cells should only be active when activated by a signal.
>>>     Perhaps the design should not be clocked by a global clock at all?
>>>
>>
>> I agree on both accounts.  (I haven't looked much at neural networks
>> since university, but I assume the principles haven't changed.)
>>
>> A biological neuron encompasses its own memory (weights), its own
>> processing, its own IO, its own learning system.  To make really
>> powerful artificial neural networks, the component parts need that too.
>>   Then you can scale the whole thing by adding more of the same.
>
> Exactly. You'll not be bounded by memory bandwidth or similar - it's a
> truly distributed system that should scale very well.

The problem is the interconnect. In true, i.e. biological, neural nets,
a neuron receives input from thousands of other (apparently, but not
really, random) neurons. The "wires", AKA axons, are each insulated (by
glial cells) to prevent cross talk. You can't replicate this at scale
in a silicon chip without thousands of layers of interconnect.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

Jul 14, 2021, 7:56:47 PM
Let's make that thousands of layers of transistors! You need to make both
the transistors and the interconnect 3D.

Quadibloc

Jul 14, 2021, 9:01:49 PM
Ideally. But with current technology, thermal issues militate against that.

And with a hundred layers of interconnect, and enough transistors on the
substrate, one could make the same circuit as one could with a hundred layers
of transistors - the interconnect runs would just be ten times longer.

John Savard

Quadibloc

Jul 14, 2021, 9:11:33 PM
On Wednesday, July 14, 2021 at 7:01:49 PM UTC-6, Quadibloc wrote:
> On Wednesday, July 14, 2021 at 5:56:47 PM UTC-6, MitchAlsup wrote:

> > Lets make that thousands of layers of transistors ! You need to make both
> > the transistors and the interconnect 3D.
> Ideally. But with current technology, thermal issues mitigate against that.

Of course, that might be changing soon:

https://www.extremetech.com/computing/324625-tsmc-mulls-on-chip-water-cooling-for-future-high-performance-silicon

John Savard

MitchAlsup

Jul 14, 2021, 9:45:58 PM
This sounds a lot like what Stanford was trying to do (?experimenting?) on
in 1992-ish time frame. Stanford ultimately got 1000W/sq-cm ±
>
> John Savard

Stefan Monnier

Jul 14, 2021, 9:53:27 PM
That seems irrelevant: the real problem is not how to move power away
from the chip, but how to reduce the power per unit of work so as to
last longer on the same battery, or so as to use a smaller battery and
make the device lighter.

I'm thinking here about mobile devices, but the same is true for
data centers. The only exceptions seem to be the desktops, where people
don't seem to care about the cost of the power consumption, so they're
willing to waste money, space, and decibels on heat removal.
But I hear that the market for desktops is shrinking pretty fast.


Stefan

Quadibloc

Jul 14, 2021, 10:39:07 PM
On Wednesday, July 14, 2021 at 7:53:27 PM UTC-6, Stefan Monnier wrote:

> That seems irrelevant: the real problem is not how to move power away
> from the chip, but how to reduce the power per unit of work so as to
> last longer on the same battery, or so as to use a smaller battery and
> make the device lighter.

Irrelevant? The real problem is how to get the work *done*. If power consumption
can be reduced, great. But failing that, removing more heat is the second-best
way to allow more transistors to switch in a tiny space in a given time.

The goal is... to solve problems. To find answers.

John Savard

Quadibloc

Jul 14, 2021, 10:41:45 PM
And, *of course*, even after one has reduced the amount of power
a processor needs a hundredfold, one _still_ can benefit from some method
of removing heat faster that allows a hundred times as many processors to
be packed into a tiny space so that they can communicate quickly.

There is no end to the problems that need solving.

John Savard

Stephen Fuld

Jul 15, 2021, 12:14:49 AM
Perhaps. But other approaches are possible. IBM has done several,
including an analog chip that uses charge in a capacitor, sort of like a
real neuron uses,

https://research.ibm.com/publications/unassisted-true-analog-neural-network-training-chip

but the paper is behind a paywall,


and their TrueNorth chip

https://research.ibm.com/articles/brain-chip.shtml

David Brown

Jul 15, 2021, 4:02:15 AM
That is a problem if you are trying to fully replicate biological neural
systems. But that is not a practical aim - at least, not for a long
time yet. In particular, current neural networks are based on layers,
as a compromise between the capability of the networks and our
understanding of algorithms to teach and tune them. I think you can
come a /long/ way with layers that have a lot of interconnects within
the layer, but only connect to adjacent layers rather than having
connections throughout the system.

When you look at biological neural networks, the great majority of
connections are quite local. The number of long-distance connections
drops rapidly with the distance. After all, the scaling, spacing,
power-management and heat management challenges in biology are not much
different from those in silicon. The key difference is that details of
where these connections are made can change somewhat in a biological
system, while they are fixed in silicon.

So you could make your artificial neural networks with a small number of
fixed long-distance connections, rather than trying to support arbitrary
connections across the network.
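
A small sketch of that connectivity pattern, with all sizes invented for
illustration: each unit in a 1-D layer listens to a local window of
neighbours plus a couple of fixed long-range links, and the resulting
connection matrix stays sparse.

#include <stdio.h>

#define N      256   /* units in the layer               */
#define RADIUS 4     /* local neighbourhood half-width   */
#define LONG_L 2     /* fixed long-range links per unit  */

int main(void)
{
    static unsigned char conn[N][N];   /* conn[i][j]: unit i listens to j */
    long edges = 0;

    for (int i = 0; i < N; i++) {
        /* dense local window: most links are short, as in biology */
        for (int d = -RADIUS; d <= RADIUS; d++)
            if (d != 0)
                conn[i][(i + d + N) % N] = 1;
        /* a few fixed long-distance links at wide strides */
        for (int k = 1; k <= LONG_L; k++)
            conn[i][(i + k * (N / (LONG_L + 1))) % N] = 1;
    }

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            edges += conn[i][j];

    printf("units=%d edges=%ld density=%.3f\n",
           N, edges, (double)edges / ((double)N * N));
    return 0;
}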

Anton Ertl

Jul 15, 2021, 5:20:50 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Tuesday, July 13, 2021 at 3:25:45 PM UTC-5, John Levine wrote:
>> The compiler did what it could to schedule memory references statically,
>> but once you could do that in hardware, dynamic hardware scheduling worked a lot better.
><
>Except for that power thing............

Including that power thing.

E.g., looking at

<https://images.anandtech.com/doci/14072/Exynos9820-Perf-Eff-Estimated.png>

we see that the OoO Cortex-A75 has better Perf/W than the in-order A55
as soon as you need more than 1.5 SPEC2006 Int+FP of performance, and
even if you need less, you fare hardly better.

Intel tried in-order for their low-power line (Atom) with Bonnell,
while AMD went for OoO with Bobcat. Bobcat is twice as fast per cycle
in my testing as Bonnell. Looking at dual-core chips with integrated
graphics, we see

core     proc.  chip          CPU clock  TDP
Bobcat   40nm   AMD C-70      1333MHz     9W
Bobcat   40nm   AMD E2-2000   1750MHz    18W
Bonnell  45nm   Atom D525     1833MHz    13W
Bonnell  32nm   Atom D2700    2133MHz    10W

Taking the 2x IPC advantage of Bobcat into account, Bobcat in 40nm in
a 9W power bracket outperforms Bonnell in 32nm in a 10W power bracket.

Later Intel switched their low-power line to OoO with Silvermont, and
has stayed with OoO since.

Apple uses OoO for their energy-efficient cores (Icestorm in the A14).

The only ones who still seem to believe in in-order for
energy-efficient computing are ARM with their Cortex-A510.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Quadibloc

Jul 15, 2021, 5:29:15 AM
On Thursday, July 15, 2021 at 3:20:50 AM UTC-6, Anton Ertl wrote:

> Later Intel switched their low-power line to OoO with Silvermont, and
> has stayed with OoO since.

Couldn't that just be a consequence of process improvements?

A laptop has a certain power budget; it's bigger than that of a
smartphone.

So, at one point, OoO was only within the power budget of a desktop
computer. When it became possible to make an OoO processor with
a power budget suitable for laptops, of course this was faster than
perhaps having more cores that were in-order.

Today, we're at the point where processors in smartphones are
usually OoO. In-order is now used only for very small embedded
processors. This isn't because OoO doesn't need more transistors
and more power. It's because transistors got smaller and better,
so it was easier and easier to come up with the power that OoO
needed.

John Savard

Anton Ertl

Jul 15, 2021, 5:32:46 AM
MitchAlsup <Mitch...@aol.com> writes:
>This sounds a lot like what Stanford was trying to do (?experimenting?) on
>in 1992-ish time frame. Stanford ultimately got 1000W/sq-cm ±

<Heavy speculation>It seems to me that efforts like your AMD K9 and
Intel Tejas were designed with that cooling capacity in mind, and in
2005 it turned out to be not practically doable, and both projects
were canceled.</>

Currently, the power density of hot chips seems to be at maybe
200W/cm^2 (140W for a Ryzen 3800XT with 0.7cm^2 chiplet area), with
higher power density in various spots on the die.

Thomas Koenig

Jul 15, 2021, 6:44:47 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

> Currently, the power density of hot chips seems to be at maybe
> 200W/cm^2 (140W for a Ryzen 3800XT with 0.7cm^2 chiplet area), with
> higher power density in various spots on the die.

That's a high energy density, but it is within the range that
can be removed by pool boiling, certainly with water, possibly
with refrigerants (especially if you have some space between
the chips).

Mechanical stresses on the delicate chips could be another matter,
though.

Ivan Godard

Jul 15, 2021, 6:51:59 AM
Did you used to work on pressurized water reactors?

Thomas Koenig

Jul 15, 2021, 7:04:36 AM
Ivan Godard <iv...@millcomputing.com> schrieb:
No, but my diploma thesis was in the field of pool boiling,
and the subject of how much heat you can remove by boiling
(a.k.a. the critical heat flux) is a standard topic in studying
chemical engineering.

Ivan Godard

Jul 15, 2021, 7:16:17 AM
My second guess was superheaters in steam locomotives :-)

George Neuner

Jul 15, 2021, 10:32:15 AM
On Tue, 13 Jul 2021 14:58:48 -0700 (PDT), Quadibloc
<jsa...@ecn.ab.ca> wrote:

>On Tuesday, July 13, 2021 at 6:13:13 AM UTC-6, David Brown wrote:
>
>> What seems to be missing here (at least, /I/ have missed it) is a
>> discusion about what people actually want to do with computing power.
>> Is it fair to suggest that most tasks that people currently want to run
>> faster, are actually reasonably parallel? And that most tasks that
>> people want to run a /lot/ faster are /very/ parallel?
>
>Unfortunately, no, it isn't.
>
>Many of the tasks that people want to run faster are parallel, but some of
>them are not.
>
>Also, it's at least possible that *some* of the tasks people want to run
>faster... are perhaps more parallel than people realize, and some new
>programming paradigm might enable this parallelism to be brought to
>light. At least that's the hope that fuels attempts to design compilers
>that will bring out extra parallelism that's not obvious to human programmers.
>
>John Savard

The problem - at least with current hardware - is that programmers are
much better at identifying what CAN be done in parallel than what
SHOULD be done in parallel.

Starting scads of threads, many (or most) of which will end up blocked
due to lack of memory bandwidth to feed the processor(s), is not a
good idea.


I really like Mitch's VVM. I'm primarily a software guy, but from what
I've understood of it, VVM seems to address a number of the problems
that plague auto-vectorization. Waiting for actual hardware. 8-)

But vectorization is just one aspect of parallelism. There's quite a
lot of micro-thread parallelism inherent in many programs, but getting
compilers to extract it is not easy, and current hardware really is
not designed with micro-threads in mind. Watching the Mill and hoping
it succeeds.

YMMV,
George

Quadibloc

Jul 15, 2021, 11:11:00 AM
On Thursday, July 15, 2021 at 8:32:15 AM UTC-6, George Neuner wrote:

> Starting scads of threads, many (or most) of which will end up blocked
> due to lack of memory bandwidth to feed the processor(s), is not a
> good idea.

That's true. But you can buy bandwidth, while latency tends to be a hard
limit. So that isn't a fatal obstacle to doing things faster, whereas not
knowing any way to do things in parallel would be.

John Savard

Stephen Fuld

Jul 15, 2021, 11:12:35 AM
Certainly true for humans. Getting closer for flies! :-)


> In particular, current neural networks are based on layers,
> as a compromise between the capability of the networks and our
> understanding of algorithms to teach and tune them. I think you can
> come a /long/ way with layers that have a lot of interconnects within
> the layer, but only connect to adjacent layers rather than having
> connections throughout the system.

"long way" toward what goal? If you want to do engineering, that is, do
useful things, then I certainly agree. If you want to do science, that
is investigate more realistic models to better understand how brains
work, then not so much. (I don't mean to imply that science is useless
- far from it.)


> When you look at biological neural networks, the great majority of
> connections are quite local. The number of long-distance connections
> drops rapidly with the distance. After all, the scaling, spacing,
> power-management and heat management challenges in biology are not much
> different from those in silicon. The key difference is that details of
> where these connections are made can change somewhat in a biological
> system, while they are fixed in silicon.

Yes. Silicon systems emulate the changing connections by having lots of
"excess" connections, many of which have weights such that they are
essentially never used. By changing the weights, you emulate making new
and breaking connections. But this means you still have to have more
long distance connections than are "in use" at any particular time.


> So you could make your artificial neural networks with a small number of
> fixed long-distance connections, rather than trying to support arbitrary
> connections across the network.

Works within the limitations implied by the above.

Thomas Koenig

Jul 15, 2021, 11:28:30 AM
My name is not, and has never been, David Wardale :-) (who, AFAIK,
was the last person to do serious steam locomotive engineering).

#ifdef PEDANTIC

Superheaters do what the name says, they superheat steam.
When steam leaves a boiler, it is in equilibrium with the liquid it
is in close contact with. A superheater is a heat exchanger which
increases the temperature further. This is a pure gas-phase heat
exchanger, with much lower heat transfer coefficients, but also
with a much lower heat load, so this is not a big problem.

So, you could in principle use a boiling liquid for cooling
computer chips if you can solve the mechanical and other assorted
problems, but using a steam superheater would make little sense.

#endif

Quadibloc

Jul 15, 2021, 1:11:31 PM
On Thursday, July 15, 2021 at 9:28:30 AM UTC-6, Thomas Koenig wrote:

> So, you could in principle use a boiling liquid for cooling
> compouter chips if you can solve the mechanical and other assorted
> problems, but using a steam superheater would make little sense.

Yes; even if one uses a working fluid with a sufficiently low boiling
point to be helpful, the point is that one is no longer benefiting from
the large latent heat of condensation. Just like the latent heat of
freezing makes ice so useful in cooling things (but sadly not computer
chips, as the solid phase is inconvenient to move around).

John Savard

Quadibloc

Jul 15, 2021, 1:17:47 PM
On Thursday, July 15, 2021 at 9:28:30 AM UTC-6, Thomas Koenig wrote:

> So, you could in principle use a boiling liquid for cooling
> compouter chips if you can solve the mechanical and other assorted
> problems,

Isn't that what heat pipes use *already*?

John Savard

Thomas Koenig

Jul 15, 2021, 1:57:10 PM
Quadibloc <jsa...@ecn.ab.ca> schrieb:
Yep, but I was thinking of direct contact of the chips with the
boiling liquid (with a thin insulating layer, presumably).

_Much_ more efficient. The nice thing is that your temperature
stays pretty much constant as long as there is enough liquid -
no hot edges.

Another nice property is that, if your chips run at 70°C (let's
say) and your vapor comes out the system at 60°C, the vapor is
easy and cheap to condense - even cooling water at 45°C can do it.

Water would be ideal because of its high enthalpy of vaporization
and because you can realize the highest heat fluxes with it.
It is also non-toxic.

However, it has some unpleasant properties for electronics, such
as being rather conductive with only trace amount of ions needed,
so that is probably out. You would also have to run it in a slight
vacuum to get below 100°C, which could be problematic.

So, another liquid would be called for, preferably something legal,
moral and non-fattening, with a boiling point of around 60°C, or
maybe a bit lower.

Hmm... a bit of a search turns up a Wikipedia article on
https://en.wikipedia.org/wiki/Novec_649/1230 which cites a source
that this has already been tried by Intel and SGI, so the
idea isn't new (and frankly, I would have been surprised if it was).

https://multimedia.3m.com/mws/media/569865O/3m-novec-engineered-fluid-649.pdf
tells me that the fluid they used has a rather low enthalpy of
vaporization, only 88 kJ/kg. That's not so great, water has
2100 kJ/kg and many other organic liquids have around 300.

Soo... maybe do a better insulation, put the chips on stacks and
build it all up like a plate heat exchanger.

(And no, I'm not 100% serious.)
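
For scale, a back-of-the-envelope comparison using the enthalpy figures
above and the 140 W chiplet number quoted earlier in the thread:

#include <stdio.h>

int main(void)
{
    const double power_w = 140.0;    /* heat per chiplet, W                  */
    const double h_water = 2100e3;   /* J/kg, enthalpy of vaporization       */
    const double h_novec = 88e3;     /* J/kg, Novec 649 per the datasheet    */

    /* Mass of coolant that must boil off per second to carry the heat:
       mdot = P / h_vap */
    printf("water:     %.2f g/s\n", power_w / h_water * 1000.0);
    printf("Novec 649: %.2f g/s\n", power_w / h_novec * 1000.0);
    return 0;
}

Roughly 0.07 g/s of boiling water carries 140 W away, versus about 1.6 g/s
of the Novec fluid.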

MitchAlsup

Jul 15, 2021, 2:13:50 PM
On Thursday, July 15, 2021 at 12:57:10 PM UTC-5, Thomas Koenig wrote:
> Quadibloc <jsa...@ecn.ab.ca> schrieb:
> > On Thursday, July 15, 2021 at 9:28:30 AM UTC-6, Thomas Koenig wrote:
> >
> >> So, you could in principle use a boiling liquid for cooling
> >> compouter chips if you can solve the mechanical and other assorted
> >> problems,
> >
> > Isn't that what heat pipes use *already*?
> Yep, but I was thinking of direct contact of the chips with the
> boiling liquid (with a thin insulating laryer, presumably).
>
> _Much_ more efficient. The nice thing is that your temperature
> stays pretty much constant as long as there is enough liquid -
> no hot edges.
<
And a bit more dangerous. Boiling a liquid requires the liquid to
create bubbles (of the gas) and these bubbles invariably form at
the solid liquid interface. These microscopic bubbles can lift the
solid surface one atom at a time causing the surface to decompose
over time--limiting lifetime.
<
Non-boiling temperature transfer has much longer actual lifetimes.

John Dallman

Jul 15, 2021, 5:10:20 PM
In article <scpstk$rm3$1...@newsreader4.netcologne.de>,
tko...@netcologne.de (Thomas Koenig) wrote:

> Hmm... a bit of a search turns up a Wikipedia aricle on
> https://en.wikipedia.org/wiki/Novec_649/1230 which cites a source
> that this has already been tried by Intel and SGI, so the
> idea isn't new ...

The original "Merced" Itanium had internal liquid cooling. If you shook
the module that contained the CPU and its cache, you could hear sloshing
IIRC.

John

Ivan Godard

Jul 15, 2021, 5:38:47 PM
Not my understanding, nor Wikipedia's. The input to the superheater is
wet steam, with liquid droplets embedded in the gas phase carrier with
the temperature at the phase boundary for the pressure. A superheater
moves the whole thing into gas phase. The benefit is partly the
additional energy, but they were first introduced for a different
reason: the wet steam would condense in the power cylinders and valves,
leading to maintenance problems. https://en.wikipedia.org/wiki/Superheater

Quadibloc

Jul 15, 2021, 9:15:00 PM
On Thursday, July 15, 2021 at 11:57:10 AM UTC-6, Thomas Koenig wrote:

> So, another liquid would be called for, preferably something legal,
> moral and non-fattening, with a boiling point of around 60°C, or
> maybe a bit lower.

And here I thought it was due to Oscar Wilde, but I see it actually
originated with Alexander Woolcott, and then was taken up by
W. C. Fields.

However, being non-conductive is probably more important in that
application.

John Savard

Thomas Koenig

Jul 16, 2021, 2:18:37 AM
MitchAlsup <Mitch...@aol.com> schrieb:
> On Thursday, July 15, 2021 at 12:57:10 PM UTC-5, Thomas Koenig wrote:
>> Quadibloc <jsa...@ecn.ab.ca> schrieb:
>> > On Thursday, July 15, 2021 at 9:28:30 AM UTC-6, Thomas Koenig wrote:
>> >
>> >> So, you could in principle use a boiling liquid for cooling
>> >> compouter chips if you can solve the mechanical and other assorted
>> >> problems,
>> >
>> > Isn't that what heat pipes use *already*?
>> Yep, but I was thinking of direct contact of the chips with the
>> boiling liquid (with a thin insulating laryer, presumably).
>>
>> _Much_ more efficient. The nice thing is that your temperature
>> stays pretty much constant as long as there is enough liquid -
>> no hot edges.
><
> And a bit more dangerous. Boiling a liquid requires the liquid to
> create bubbles (of the gas) and these bubbles invariably form at
> the solid liquid interface. These microscopic bubbles can lift the
> solid surface one atom at a time causing the surface to decompose
> over time--limiting lifetime.

Abrasion is not a big issue in pool boiling, especially with
the right choice of materials and design.

Like I said, I would be more concerned about vibrations, but
that could also be reduced by mounting the chips on the
back of, let's say, a thin metal plate with reasonably high
conductivity, and have the boiling liquid on the other side.

On the other hand, if your system has high-velocity two-phase flow,
that can be quite abrasive. The solution is simple, then: Just don't
design the system that way.

> Non-boiling temperature transfer has much longer actual lifetimes.

For a misdesigned boiling heat transfer system vs. a well-designed
single-phase system, this is certainly true.

If you go to the turbulent regime, you also have vibrations.
In the laminar regime, you have _much_ lower heat transfer
coefficients unless you go micro heat exchanger, and then
you have large pressure drops.

And so on... there's a huge design space with heat exchangers,
and people have been very busy the last 150 years or so.

Question: Is there any actual experience with a cooling through
boiling system, apart from the (apparently) short trials mentioned
at Wikipedia?

Thomas Koenig

Jul 16, 2021, 2:38:09 AM
"Wet steam" is just a nickname for saturated steam.

What you write about maintenance being the reason for introduction
is not what I read in the article, which only mentions maintenance
in the context of maintenance on the superheater, maybe you have
some other source or somebody changed this five minutes ago :-)

The reason why a superheater increases efficiency is that steam
gives off mechanical work during expansion, so it's basically
pressure times volume difference (I will spare you the integrals).
If part of the steam condenses, then there is less volume, and
therefore less mechanical work is done. Bringing the steam away from the
saturation line means that there will be less or no condensation
during expansion, leading to more work done and (on the whole)
better efficiency.

You can also do the calculation with an enthalpy-entropy diagram
for water if you're really interested, but you may not be :-)
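
Spelling out the relation being skipped over, the expansion work is the
standard textbook integral

    W = \int_{V_1}^{V_2} p \,\mathrm{d}V

so any condensation during expansion, by shrinking the volume swept out,
directly shrinks the work delivered; superheating keeps the steam away
from the saturation line so that loss does not occur.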

For piston engines, the problem is worse than for a turbine.
During the expansion of the steam in the cylinder, it cools,
cooling the walls of the cylinder with it. When the hot steam
enters the cylinder in the next cycle, part of it will be
cooled and may condense, so superheating also helps against
those losses.

This is one reason why double or even triple expansion steam
engines were used - the temperature differences in a single
expansion were smaller, thus less cooling down of hot steam.

For turbines, a single spot is always at the same temperature.

Turbines were a failure for steam locomotives because of their
poor characteristics when starting, and their poor partial load
characteristics.

There's a book about "failed innovations" that I mislaid about
20 years ago which gives details on this particular failure,
among others. I may have to buy it again, if it is still available.

Ivan Godard

Jul 16, 2021, 3:52:11 AM
Was the ETA systems CPU (liquid nitrogen coolant) boiling the nitrogen?

Anton Ertl

Jul 17, 2021, 12:46:06 PM
Thomas Koenig <tko...@netcologne.de> writes:
>Quadibloc <jsa...@ecn.ab.ca> schrieb:
>> On Thursday, July 15, 2021 at 9:28:30 AM UTC-6, Thomas Koenig wrote:
>>
>>> So, you could in principle use a boiling liquid for cooling
>>> compouter chips if you can solve the mechanical and other assorted
>>> problems,
>>
>> Isn't that what heat pipes use *already*?
>
>Yep, but I was thinking of direct contact of the chips with the
>boiling liquid (with a thin insulating laryer, presumably).
>
>_Much_ more efficient.

If you can avoid the Leidenfrost effect (where the gaseous phase
insulates the heat source from the cooling liquid), which is well known
from experiments with liquid nitrogen.

There are systems that use this idea (e.g.,
<https://www.computerbase.de/2017-08/der8auer-aqua-exhalare-zwei-phasen-kuehlung/>),
but for now it does not seem to offer enough advantages to justify the
effort.

>The nice thing is that your temperature
>stays pretty much constant as long as there is enough liquid -
>no hot edges.
>
>Another nice property is that, if your chips run at 70°C (let's
>say) and your vapor comes out the system at 60°C, the vapor is
>easy and cheap to condense - even cooling water at 45°C can do it.
>
>Water would be ideal because of its high enthalpy of vaporization
>and because you can realize the highest heat fluxes with it.
>It is also non-toxic.
>
>However, it has some unpleasant properties for electronics, such
>as being rather conductive with only trace amount of ions needed,
>so that is probably out. You would also have to run it in a slight
>vacuum to get below 100°C, which could be problematic.

All of that is not a problem with heat pipes (or more generally, vapor
chambers): The water is nicely enclosed in the heat pipe, lower
pressure is not a problem, and AFAIK they contain solid stuff that
helps avoid the gas bubble insulation effect. That's why we see heat
pipes used in practice, but not direct-contact systems.

One other interesting aspect is that one might expect heat sinks with
direct-touch heat pipes (direct contact between the heat pipe and the
heat spreader of the CPU) to work better than heat sinks with a base
plate with soldered-in heat pipes, but the better (and more expensive)
heat sinks use a base plate.

One might also expect that it's better if the heat sink touches the
die directly, but that's only done for (low-power) mobile CPUs and
(less power-dense) GPUs; desktop and server CPUs put a heat spreader
and solder (or some liquid interface material) between the die and the
heat sink.

Thomas Koenig

Jul 17, 2021, 1:47:25 PM
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tko...@netcologne.de> writes:
>>Quadibloc <jsa...@ecn.ab.ca> schrieb:
>>> On Thursday, July 15, 2021 at 9:28:30 AM UTC-6, Thomas Koenig wrote:
>>>
>>>> So, you could in principle use a boiling liquid for cooling
>>>> compouter chips if you can solve the mechanical and other assorted
>>>> problems,
>>>
>>> Isn't that what heat pipes use *already*?
>>
>>Yep, but I was thinking of direct contact of the chips with the
>>boiling liquid (with a thin insulating laryer, presumably).
>>
>>_Much_ more efficient.
>
> If you can avoid the Leidenfrost effect (where the gaseous phase
> insulates the heat source from the cooling liquid (well known from
> experiments with liquid nitrogen).

That's the critical heat flux I was writing about - you go above
that, you get the Leidenfrost effect (and usually a burnout of your
heating surface, which is one reason why it is called "critical"
heat flux).

>
> There are systems that use this idea (e.g.,
><https://www.computerbase.de/2017-08/der8auer-aqua-exhalare-zwei-phasen-kuehlung/>),
> but for now it does not seem to offer enough advantages to justify the
> effort.

Interesting, thanks!

This seems to be more of a traditional system with extra cooling.

I was thinking of a CPU wall, so to speak - think big :-)

Anton Ertl

Jul 18, 2021, 12:47:49 PM
George Neuner <gneu...@comcast.net> writes:
>On Tue, 13 Jul 2021 14:58:48 -0700 (PDT), Quadibloc
><jsa...@ecn.ab.ca> wrote:
>
>>On Tuesday, July 13, 2021 at 6:13:13 AM UTC-6, David Brown wrote:
>>
>>> What seems to be missing here (at least, /I/ have missed it) is a
>>> discusion about what people actually want to do with computing power.
>>> Is it fair to suggest that most tasks that people currently want to run
>>> faster, are actually reasonably parallel? And that most tasks that
>>> people want to run a /lot/ faster are /very/ parallel?
>>
>>Unfortunately, no, it isn't.
>>
>>Many of the tasks that people want to run faster are parallel, but some of
>>them are not.
>>
>>Also, it's at least possible that *some* of the tasks people want to run
>>faster... are perhaps more parallel than people realize, and some new
>>programming paradigm might enable this parallelism to be brought to
>>light. At least that's the hope that fuels attempts to design compilers
>>that will bring out extra parallelism that's not obvious to human programmers.
>>
>>John Savard
>
>The problem - at least with current hardware - is that programmers are
>much better at identifying what CAN be done in parallel than what
>SHOULD be done in parallel.

You make it sound as if that's a problem with the programmers, not
with the hardware. But it's fundamental to programming (at least in
areas affected by the software crisis, i.e., not supercomputers), so
it has to be solved at the system level (i.e., hardware, compiler,
etc.).

Why is it fundamental? Because we build maintainable software by
splitting it into mostly-independent parts. Deciding how much to
parallelize on current hardware needs a global view of the program,
which programmers usually do not have; and even when they have it,
their decisions will probably be outdated after a while of maintaining
the program.

We have similar problems with explicitly managed fast memory, which is
why we don't see that in general-purpose computers; instead, we see
caches (a software-crisis-compatible variant of fast memory).

Yet another problem of this kind is fixed-point scaling. That's why
we have floating-point.

So what do we need of the system? Ideally having more parallel parts
than needed should not cause a slowdown. This has two aspects:

1) Thread creation and destruction should be cheap.

2) The harder part is memory locality: Sequential code often works
very well on caches because it has a lot of temporal and spatial
locality. If the code is split into more tasks than necessary, how do
we avoid losing locality and thus losing some of the benefits of
caching?
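
A small illustration of that locality point, assuming OpenMP (the pragmas
are simply ignored without -fopenmp): both loops below do identical work,
but the second hands it out in single-iteration pieces, so every thread
ends up walking the whole array and sharing every cache line instead of
streaming through one contiguous chunk.

#include <stdio.h>

#define N (1 << 22)
static double a[N], b[N];

int main(void)
{
    double s1 = 0.0, s2 = 0.0;

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* One contiguous chunk per thread: good spatial locality. */
    #pragma omp parallel for schedule(static) reduction(+:s1)
    for (int i = 0; i < N; i++)
        s1 += a[i] * b[i];

    /* Same work cut into single-iteration pieces handed out round-robin:
       every thread touches cache lines spread across the whole array. */
    #pragma omp parallel for schedule(static, 1) reduction(+:s2)
    for (int i = 0; i < N; i++)
        s2 += a[i] * b[i];

    printf("%f %f\n", s1, s2);
    return 0;
}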

>I really like Mitch's VVM. I'm primarily a software guy, but from what
>I've understood of it, VVM seems to address a number of the problems
>that plague auto-vectorization. Waiting for actual hardware. 8-)

It certainly is better than compiler auto-vectorization (at least it
will be when we see it in hardware). But it also follows the (IMO
wrong-headed) auto-vectorization approach of trying to derive
vectorized code/execution from sequential code. When writing
sequential code, programmers tend to inadvertently put in stuff that
prevents auto-vectorization; VVM may be able to vectorize parts of the
execution, but still: It does not make good use of the ability of
programmers to (as you wrote) identify what can be done in parallel.

Programmers can vectorize manually (just take a look at an APL
program), and reorganize programs into vectorized form that is far out
of reach of auto-vectorizing compilers, and, I think, also
out-of-reach for VVM. As an example, Bernd Paysan once claimed that
TeX's paragraph-formatting can be vectorized (but he did not do it).
And then you can think of what the compiler and hardware have to do to
make it run fast.

MitchAlsup

unread,
Jul 18, 2021, 1:03:58 PM7/18/21
to
Where "we" is the 1% who are both good programmers and look at the
long term rather than the deadline of the day/week.
<
> splitting it into mostly-independent parts.
<
Which inlining compilers whack back into the calling function.
<
> Deciding how much to
> parallelize on current hardware needs a global view of the program,
> which programmers usually do not have; and even when they have it,
> their decisions will probably be outdated after a while of maintaining
> the program.
<
That global view changes over time, so any one programmer is not in a
position to make assumptions that will survive over time.
>
> We have similar problems with explicitly managed fast memory, which is
> why we don't see that in general-purpose computers; instead, we see
> caches (a software-crisis-compatible variant of fast memory).
>
> Yet another problem of this kind is fixed-point scaling. That's why
> we have floating-point.
>
> So what do we need of the system? Ideally having more parallel parts
> than needed should not cause a slowdown. This has two aspects:
>
> 1) Thread creation and destruction should be cheap.
<
Depends on what you mean by cheap:: 1-cycle cheap is vastly
different than 1,000 cycle cheap.
>
> 2) The harder part is memory locality: Sequential code often works
> very well on caches because it has a lot of temporal and spatial
> locality. If the code is split into more tasks than necessary, how do
> we avoid losing locality and thus losing some of the benefits of
> caching?
<
It is a delicate tradeoff that changes every generation. It would be
better if HW could make these decisions. But HW is currently lacking
a model under which it could wrest control from the humans in the loop.
<
> >I really like Mitch's VVM. I'm primarily a software guy, but from what
> >I've understood of it, VVM seems to address a number of the problems
> >that plague auto-vectorization. Waiting for actual hardware. 8-)
<
> It certainly is better than compiler auto-vectorization (at least it
> will be when we see it in hardware). But it also follows the (IMO
> wrong-headed) auto-vectorization approach of trying to derive
> vectorized code/execution from sequential code.
<
At least the compiler makes the choice on vectorizing the loop (or not).
But unlike instructions vectorization, the compiler can make its choice
based on simple rules without exotic analysis. And the vectorized code
achieves the same results on non-vector capable HW as on GBOoO
multi-lane vector HW.
<
> When writing
> sequential code, programmers tend to inadvertantly put in stuff that
> prevents auto-vectorization; VVM may be able to vectorize parts of the
> execution, but still: It does not make good use of the ability of
> programmers to (as you wrote) identify what can be done in parallel.
<
The biggest VVM issue is that one cannot put calls inside a vectorized
loop. But to a large extent the calls to functions like SIN,COS,EXP,Ln are
all instructions in My 66000, so a large part of this is ameliorated.
>
> Programmers can vectorize manually (just take a look at an APL
> program), and reorganize programs into vectorized form that is far out
> of reach of auto-vectorizing compilers, and, I think, also
> out-of-reach for VVM.
<
By design VVM vectorizes inner loops--this is where the biggest bang for
the buck is found, and punts entirely on outer loops and whole programs.
<
> As an example, Bernd Paysan once claimed that
> TeX's paragraph-formatting can be vectorized (but he did not do it).
> And then you can think of what the compiler and hardware have to do to
> make it run fast.
<
I once heard that the CRAY 1 ASM symbol table lookup was vectorized.

George Neuner

unread,
Jul 18, 2021, 3:09:47 PM7/18/21
to
On Sun, 18 Jul 2021 15:55:24 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>George Neuner <gneu...@comcast.net> writes:
>
>>The problem - at least with current hardware - is that programmers are
>>much better at identifying what CAN be done in parallel than what
>>SHOULD be done in parallel.
>
>You make it sound as if that's a problem with the programmers, not
>with the hardware. But it's fundamental to programming (at least in
>areas affected by the software crisis, i.e., not supercomputers), so
>it has to be solved at the system level (i.e., hardware, compiler,
>etc.).

It /IS/ a problem with the programmers. The average "developer" now
has no CS, engineering, or (advanced) mathematics education, and their
programming skills are pitiful - only slightly above "script kiddie".
This is an unfortunate fact of life that I think too often is lost on
some denizens of this group.


Given the ability to create "parallel" tasks, an /average/ programmer
is very likely to naively create large numbers of new tasks regardless
of resources being available to actually execute them.

Which maybe is fine if the number of tasks (relatively) is small, or
if many of them are I/O bound and the use is for /concurrency/. But
most programmers do not understand the difference between "parallel"
and "concurrent", and too many don't understand why spawning large
numbers of tasks can slow down the program.

Outside of DBMS and web servers most threads in most programs are
either compute or memory bound. So if a lot of threads get started, a
handful of them run, but most will sit idle for long periods waiting
for the CPU.

Go lurk in some of the language forums for a while. Far too often
there will be some question posed for which the ensuing discussion
reveals that the programmer is attempting to do something that is
(sometimes far) beyond their ability.

Even discounting newbie questions - which can be forgiven - the
numbers of "experienced" programmers who are ignorant about languages
or tools they are trying to use, or are lacking domain knowledge about
the problem they are expected/expecting to solve, is staggering.

To me anyway. YMMV.


>Why is it fundamental? Because we build maintainable software by
>splitting it into mostly-independent parts. Deciding how much to
>parallelize on current hardware needs a global view of the program,
>which programmers usually do not have; and even when they have it,
>their decisions will probably be outdated after a while of maintaining
>the program.
>
>We have similar problems with explicitly managed fast memory, which is
>why we don't see that in general-purpose computers; instead, we see
>caches (a software-crisis-compatible variant of fast memory).

We have similar problems with programmer managed dynamic allocation.
All the modern languages use GC /because/ repeated studies have shown
that average programmers largely are incapable of writing leak-proof
code without it.
[And they /still/ have problems with other unmanaged resources.]


>Yet another problem of this kind is fixed-point scaling. That's why
>we have floating-point.

And the same people who, in the past, would not have understood the
issues of using fixed-point now don't understand the issues of using
floating point.

Unless it is "software arbitrary precision decimal" floating point -
i.e. "Big Decimal" - then odds are good it is being used incorrectly
and the answers it yields probably should be suspect.


>So what do we need of the system? Ideally having more parallel parts
>than needed should not cause a slowdown. This has two aspects:
>
>1) Thread creation and destruction should be cheap.
>
>2) The harder part is memory locality: Sequential code often works
>very well on caches because it has a lot of temporal and spatial
>locality. If the code is split into more tasks than necessary, how do
>we avoid losing locality and thus losing some of the benefits of
>caching?

Agreed! But this has little to do with any of my points.


>- anton
George

MitchAlsup

unread,
Jul 18, 2021, 5:06:45 PM7/18/21
to
On Sunday, July 18, 2021 at 2:09:47 PM UTC-5, George Neuner wrote:
> On Sun, 18 Jul 2021 15:55:24 GMT, an...@mips.complang.tuwien.ac.at
> (Anton Ertl) wrote:
>
> >George Neuner <gneu...@comcast.net> writes:
>snip>
> >Why is it fundamental? Because we build maintainable software by
> >splitting it into mostly-independent parts. Deciding how much to
> >parallelize on current hardware needs a global view of the program,
> >which programmers usually do not have; and even when they have it,
> >their decisions will probably be outdated after a while of maintaining
> >the program.
> >
> >We have similar problems with explicitly managed fast memory, which is
> >why we don't see that in general-purpose computers; instead, we see
> >caches (a software-crisis-compatible variant of fast memory).
<
> We have similar problems with programmer managed dynamic allocation.
> All the modern languages use GC /because/ repeated studies have shown
> that average programmers largely are incapable of writing leak-proof
> code without it.
<
And this is AFTER programming languages created constructors and
destructors which the programmer is allowed to basically "forget" about
their workings and manage memory for them..........
<
> [And they /still/ have problems with other unmanaged resources.]
> >Yet another problem of this kind is fixed-point scaling. That's why
> >we have floating-point.
> And the same people who, in the past, would not have understood the
> issues of using fixed-point now don't understand the issues of using
> floating point.
>
> Unless it is "software arbitrary precision decimal" floating point -
> i.e. "Big Decimal" - then odds are good it is being used incorrectly
> and the answers it yields probably should be suspect.
<
> >So what do we need of the system? Ideally having more parallel parts
> >than needed should not cause a slowdown. This has two aspects:
> >
> >1) Thread creation and destruction should be cheap.
> >
> >2) The harder part is memory locality: Sequential code often works
> >very well on caches because it has a lot of temporal and spatial
> >locality. If the code is split into more tasks than necessary, how do
> >we avoid losing locality and thus losing some of the benefits of
> >caching?
> Agreed! But this has little to do with any of my points.
>
>
> >- anton
> George
<
I should relate a problem I advised a mid-level programmer about a
few years ago when debugging a memory leak problem. He had spent
a large amount of effort tracking down memory leaks in a data base.
After listening to the problem space, I advised him to fork off a (unix)
task, sharing memory, perform the work, and let the OS clean up the
memory leaks (terminate cleanup). Not only did this eliminate* the
memory leak, it ran 3× faster !! even with the task creation and tear
down overheads.
<
(*) It did not get rid of the memory leak, it isolated the memory leak
into a section of memory that was cleaned up in its entirety rather
than piecemeal.
<
But I wholly agree with George--the problem is the programmers.
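
Going back to the fork trick: below is a minimal sketch of the
containment idea in plain C. It ignores the shared-memory detail above,
and do_leaky_work() is a stand-in for the real job; this is an
illustration, not the code from that debugging session.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Do the leak-prone work in a forked child; the OS reclaims the whole
   address space when the child exits, leaks and all. */
static void do_leaky_work (void)
{
    for (int i = 0; i < 1000; i++) {
        void *p = malloc (1024);    /* deliberately never freed */
        (void) p;
    }
}

int main (void)
{
    pid_t pid = fork ();
    if (pid < 0)
        return 1;                   /* fork failed */
    if (pid == 0) {                 /* child: leaks freely ... */
        do_leaky_work ();
        _exit (0);                  /* ... kernel frees everything here */
    }
    int status;
    waitpid (pid, &status, 0);      /* parent: unaffected by child leaks */
    printf ("child done, its leaks reclaimed by the kernel\n");
    return 0;
}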

Quadibloc

unread,
Jul 18, 2021, 5:10:03 PM7/18/21
to
On Sunday, July 18, 2021 at 11:03:58 AM UTC-6, MitchAlsup wrote:
> On Sunday, July 18, 2021 at 11:47:49 AM UTC-5, Anton Ertl wrote:

> > 1) Thread creation and destruction should be cheap.

> Depends on what you mean by cheap:: 1-cycle cheap is vastly
> different than 1,000 cycle cheap.

Now that is just a totally unfair criticism. Why couldn't he have
meant by "cheap" the following: sufficiently cheap to actually be
usefully cheap in the context of the issue in question?

Leaving the detail of how many cycles that might be for later.

John Savard

MitchAlsup

unread,
Jul 18, 2021, 5:23:18 PM7/18/21
to
On Sunday, July 18, 2021 at 4:10:03 PM UTC-5, Quadibloc wrote:
> On Sunday, July 18, 2021 at 11:03:58 AM UTC-6, MitchAlsup wrote:
> > On Sunday, July 18, 2021 at 11:47:49 AM UTC-5, Anton Ertl wrote:
>
> > > 1) Thread creation and destruction should be cheap.
>
> > Depends on what you mean by cheap:: 1-cycle cheap is vastly
> > different than 1,000 cycle cheap.
<
> Now that is just a totally unfair criticism. Why couldn't he have
> meant by "cheap" the following: sufficiently cheap to actually be
> usefully cheap in the context of the issue in question?
<
I did the HEP thread creation and deletion software.
<
I had a task sitting by collecting threads which died, and a thread
arranging a thread from a pool and getting it ready should anyone
need a new thread created. I could create a new thread for a given
task in 12 instructions. I considered this fast.
<
GPUs have a block of logic that allocate up to 32 threads to a WARP
and can perform this bundling operation every 4 cycles. I consider this
fast.
<
Your typical CPU performs exec() in a handful of thousands of cycles
(after you consider all of the copy-on-write semantics it may be even
worse.) I do not consider this fast. I have no idea as to how many cycles
are consumed tearing apart such a task.
<
Your typical CPU performs thread creations in on the order of 1,000
instructions. I do not consider these fast, either. I have no idea whether
this includes (or excludes) thread tear-down, either.

Thomas Koenig

unread,
Jul 18, 2021, 5:48:22 PM7/18/21
to
MitchAlsup <Mitch...@aol.com> schrieb:

> By design VVM vectorizes inner loops--this is where the biggest bang for
> the buck is found, and punts entirely on outer loops and whole programs.

I know exactly one example of time-critical real-time code where
the inner loop is quite cold (well, it's quite a few examples,
but the pattern is always the same).

This occurs in array processing when the code in question has to
handle a number of dimensions that is not known at compile time.
The library function is then given a list of extents along each
dimension.

The algorithm works like an old-style mechanical odometer, or a
ripple-carry incrementer (with a different base for each digit).
In C, it is (slightly shortened from gfortran's library): base is
the pointer for the next iteration of something to do, count is the
array which shows how far progress has been made along each dimension,
and extent gives the number of elements along each dimension.

n = 0;
do
  {
    /* When we get to the end of a dimension, reset it and increment
       the next dimension.  */
    count[n] = 0;
    base -= sstride[n] * extent[n];
    n++;
    if (n >= rank)
      return;
    else
      {
        count[n]++;
        base += sstride[n];
      }
  } while (count[n] == extent[n]);

This do-while loop is executed only once in most cases, but once
it has reached the end of one dimension, it will be executed more
than once, leading to (probable) branch mispredicts.

It's time-critical because... well, almost all Fortran array
intrinsics run through this kind of code (matmul doesn't :-)
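
For context, here is a sketch (mine, not the gfortran source) of how
such an odometer step typically sits inside an element-wise traversal;
the summation, the rank-16 limit and the example strides are
illustrative assumptions:

#include <stddef.h>
#include <stdio.h>

/* Illustrative sketch: sum every element of a rank-dimensional array by
   advancing a multi-dimensional counter the same "odometer" way as the
   snippet above. */
static double sum_all (const double *base, int rank,
                       const ptrdiff_t *sstride, const ptrdiff_t *extent)
{
  ptrdiff_t count[16] = { 0 };      /* assumes rank <= 16 for this sketch */
  double s = 0.0;

  for (;;)
    {
      s += *base;                   /* the per-element work */

      /* Advance the innermost index; carry into outer dimensions. */
      count[0]++;
      base += sstride[0];
      int n = 0;
      while (count[n] == extent[n])
        {
          count[n] = 0;
          base -= sstride[n] * extent[n];
          n++;
          if (n >= rank)
            return s;
          count[n]++;
          base += sstride[n];
        }
    }
}

int main (void)
{
  double a[2][3] = { { 1, 2, 3 }, { 4, 5, 6 } };
  ptrdiff_t extent[2]  = { 3, 2 };  /* innermost dimension first */
  ptrdiff_t sstride[2] = { 1, 3 };  /* element strides for a row-major 2x3 */
  printf ("%g\n", sum_all (&a[0][0], 2, sstride, extent));   /* prints 21 */
  return 0;
}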

Marcus

unread,
Jul 19, 2021, 2:06:35 AM7/19/21
to
Some hard numbers on thread and process creation + tear-down on
different machines and OS:es:

https://www.bitsnbites.eu/benchmarking-os-primitives/

The fastest machines (Linux) do create + tear-down in just less than
10 us. On a 3 GHz CPU that corresponds to about 20,000 - 30,000
clock cycles. Please note that it includes all the layers of pthread and
OS thread initialization.

On Windows (same machine), the figure is 3-4 times worse.

This is the reason that when you need instant threads, you set up a
thread pool of dormant threads instead of spawning a new OS thread
every time you need one. I personally consider that an ugly work-around
for inefficient thread creation.
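
For reference, a minimal sketch of the kind of measurement behind such
numbers (my illustration, not the code behind the linked benchmark; the
iteration count and the serialized create+join pattern are arbitrary
choices):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

static void *worker (void *arg) { return arg; }   /* trivial thread body */

int main (void)
{
    enum { N = 10000 };
    struct timespec t0, t1;

    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        pthread_t t;
        pthread_create (&t, NULL, worker, NULL);
        pthread_join (t, NULL);        /* create + tear-down, serialized */
    }
    clock_gettime (CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf ("%.1f us per create+join\n", ns / N / 1e3);
    return 0;
}

Compile with -pthread; the per-iteration figure it prints is the
create + tear-down time discussed above.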

/Marcus

Marcus

unread,
Jul 19, 2021, 2:30:13 AM7/19/21
to
On 2021-07-18, George Neuner wrote:
> On Sun, 18 Jul 2021 15:55:24 GMT, an...@mips.complang.tuwien.ac.at
> (Anton Ertl) wrote:
>
>> George Neuner <gneu...@comcast.net> writes:
>>
>>> The problem - at least with current hardware - is that programmers are
>>> much better at identifying what CAN be done in parallel than what
>>> SHOULD be done in parallel.
>>
>> You make it sound as if that's a problem with the programmers, not
>> with the hardware. But it's fundamental to programming (at least in
>> areas affected by the software crisis, i.e., not supercomputers), so
>> it has to be solved at the system level (i.e., hardware, compiler,
>> etc.).
>
> It /IS/ a problem with the programmers. The average "developer" now
> has no CS, engineering, or (advanced) mathematics education, and their
> programming skills are pitiful - only slightly above "script kiddie".
> This is an unfortunate fact of life that I think too often is lost on
> some denizens of this group.
>

I would guess that the majority of programmers today do not even reflect
on the difference in computational effort (as in number of CPU cycles
required) for the following lines of code (Python):

1) foo = 39+3

2) foo = f'The answer is {39+3}'

3) foo = requests.get('https://en.wikipedia.org/wiki/42')

After all, it's just a single line of code...

I remember the first time I wrote a 6502 assembler language loop on my
C=64: The program changed the character value of the upper left
character of the screen a few hundred times (probably 255), and at first
I thought I had made an error because I only saw the final character,
not the other hundreds of characters, until it dawned on me:

"It's THAT fast!"

I doubt that many programmers these days have these moments when they
actually understand how fast a computer really is.

/Marcus

Thomas Koenig

unread,
Jul 19, 2021, 2:39:46 AM7/19/21
to
Marcus <m.de...@this.bitsnbites.eu> schrieb:

> I remember the first time I wrote a 6502 assembler language loop on my
> C=64: The program changed the character value of the upper left
> character of the screen a few hundred times (probably 255), and at first
> I thought I had made an error because I only saw the final character,
> not the other hundreds of characters, until it dawned on me:
>
> "It's THAT fast!"
>
> I doubt that many programmers these days have these moments when they
> actually understand how fast a computer really is.

I certainly had a moment when I understood how much faster computers
had become.

It was a little problem to find four prices (i.e. decimal numbers
with a maximum of two digits after the decimal separator) a,b,c,d
so that a*b*c*d = a+b+c+d = 7.47 .

On a C 64 using Basic, months to hours depending on the cleverness
of the algorithm.

On a modern computer using a compiled language, too fast to notice
even when choosing a rather stupid algorithm.

Quadibloc

unread,
Jul 19, 2021, 4:08:03 AM7/19/21
to
On Monday, July 19, 2021 at 12:39:46 AM UTC-6, Thomas Koenig wrote:

> It was a little problem to find four prices (i.e. decimal numbers
> with a maximum of two digits after the decimal separator) a,b,c,d
> so that a*b*c*d = a+b+c+d = 7.47 .

I see that seven hundred and forty-seven is divisible by nine.

Perhaps that leaves open the possibility of a solution.

I suppose the problem was phrased something like this:

A clerk at a grocery store, when adding up the prices of
four items, accidentally pressed the multiplication key
by mistake. But when he repeated the calculation correctly,
he got exactly the same answer. What were the prices of
the four items, which totalled to $7.47?

7.47 is 2.49 times 3, or 0.83 times 9.

2.49 * 1.50 * 2.00 * 1.00 doesn't add up to anything
with a 7 on the end.

0.83 * 2.25 * 4.00 * 1.00 when added up ends in an 8,
not in a 7.

Hmm. 7.47 minus 2.49 is 4.98. What could add to 4.98
and yet multiply to 3?

Or, 7.47 minus 0.83 is 6.64. What could add to 6.64, and
multiply to 9?

9 is 3 * 3, or 4.5 * 2, or 2.25 * 4. Or it is 1.8 * 5. Or 3.6 * 2.5.
Or 7.2 times 1.25. Or 1.44 times 6.25.

1.25 is 0.25 times 5. 6.25 is 1.25 times 5.

0.05 + 1.25 + 1.44 + 0.83 would end in a 7.

But it would add up to 3.57 and multiply to .0747, so
that's clearly not the answer.

Quite a puzzling problem.

John Savard

Quadibloc

unread,
Jul 19, 2021, 4:12:24 AM7/19/21
to
Attempting to Google the problem and presumably its
solution, I found this:

https://math.stackexchange.com/questions/66302/4-items-add-up-to-and-multiply-to-7-11-what-are-the-value-of-the-items

Apparently the actual total was $7.11 rather than $7.47, which, of course,
fits with the name of a popular convenience store.

John Savard

Terje Mathisen

unread,
Jul 19, 2021, 4:17:52 AM7/19/21
to
Hmmm.

None of them can be zero or negative, so by specifying that a,b,c,d is
in increasing order, a can be in the range 0.01 to 7.47/4=1.86.
b has a slightly larger possible range (0.01 to (7.47-a)/3), the same
for c (0.01 to (7.47-a-b)/2), while d is always (7.47-a-b-c).

(I would do this all in cents of course!)

This should be ~200^3 so 8M iterations, doable in a small fraction of a
second.

Next insight is the fact that the cents values must be such that 6 of
the 8 multiplication decimals end up as zero, i.e. the last digits of
each price must have a lot of '0', '2' or '5', I suspect this reduces
the search space by at least an order of magnitude?
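
A minimal brute-force sketch of that search in C, all in cents: the sum
target is 747 and the product target is 747 * 10^6, since
7.47 = 747000000 / 10^8. The bounds follow the reasoning above; this is
an illustration, not a tuned solver.

#include <stdint.h>
#include <stdio.h>

int main (void)
{
    const int64_t sum = 747;                /* 7.47 in cents */
    const int64_t target = 747000000LL;     /* product target, in cents^4 */

    for (int64_t a = 1; a <= sum / 4; a++)
        for (int64_t b = a; b <= (sum - a) / 3; b++)
            for (int64_t c = b; c <= (sum - a - b) / 2; c++) {
                int64_t d = sum - a - b - c;       /* so a <= b <= c <= d */
                if (a * b * c * d == target)
                    printf ("%.2f %.2f %.2f %.2f\n",
                            a / 100.0, b / 100.0, c / 100.0, d / 100.0);
            }
    return 0;
}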

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Chris M. Thomasson

unread,
Jul 19, 2021, 4:22:01 AM7/19/21
to
On 7/13/2021 6:31 AM, George Neuner wrote:
> On Mon, 12 Jul 2021 15:28:23 -0700 (PDT), "luke.l...@gmail.com"
> <luke.l...@gmail.com> wrote:
>
>> On Monday, July 12, 2021 at 10:17:29 PM UTC+1, Quadibloc wrote:
>>
>>> about an attempt to design a processor that does away with instruction
>>> sets and all those instructions that just move data around.
>>
>> reminds me of Elixent's stuff. they did NSEW neighbour processing.
>> fascinating design: 4-bit ALUs. tens of thousands of them. programming
>> this style of processor is an absolute pig.
>>
>> l.
>
> Dunno. I programmed Connection Machines. Admittedly a problem had to
> be embarrassingly parallel to fit with the hardware ... but given
> that, the CM was quite easy to work with.

Has anybody here programmed for a sicortex?

https://en.wikipedia.org/wiki/SiCortex

Thomas Koenig

unread,
Jul 19, 2021, 7:09:49 AM7/19/21
to
Quadibloc <jsa...@ecn.ab.ca> schrieb:

> Attempting to Google the problem and presumably its
> solution, I found this:
>
> https://math.stackexchange.com/questions/66302/4-items-add-up-to-and-multiply-to-7-11-what-are-the-value-of-the-items
>
> Apparently the actual total was $7.11 rather than $7.47,

In the computer magazine I read this puzzle ("Chip"), it was
actually 7,47 DM. And the issue must have been around October or
November of that year.

7-11 was not a common number combination in Germany in 1986,
but the same puzzle can be set up with many different results.

> which, of course,
> fits with the name of a popular convenience store.

I suspect they chose it for recognition value of the airplane type
(and, of course, the uniqueness of the solution). I would post
it, but rot13 does not work for numbers :-)

Maybe the original was with 7-11. Source prior to 1986, anybody?

Thomas Koenig

unread,
Jul 19, 2021, 8:35:40 AM7/19/21
to
Terje Mathisen <terje.m...@tmsw.no> schrieb:
> Thomas Koenig wrote:

>> It was a little problem to find four prices (i.e. decimal numbers
>> with a maximum of two digits after the decimal separator) a,b,c,d
>> so that a*b*c*d = a+b+c+d = 7.47 .
>>
>> On a C 64 using Basic, months to hours depending on the cleverness
>> of the algorithm.
>>
>> On a modern computer using a compiled language, too fast to notice
>> even when choosing a rather stupid algorithm.
>>
> Hmmm.
>
> None of them can be zero or negative, so by specifying that a,b,c,d is
> in increasing order, a can be in the range 0.01 to 7.47/4=1.86.
> b has a slightly larger possible range (0.01 to (7.47-a)/3), the same
> for c (0.01 to (7.47-a-b)/2), while d is always (7.47-a-b-c).
>
> (I would do this all in cents of course!)
>
> This should be ~200^3 so 8M iterations, doable in a small fraction of a
> second.

That's a pretty good guess: going through three variables i <=
j <= k and setting m = 7.47-i-j-k, while removing duplicates by setting
the bounds so that k <= m, gives around 2.9e6 iterations.

> Next insight is the fact that the cents values must be such that 6 of
> the 8 multiplication decimals end up as zero, i.e. the last digits of
> each price must have a lot of '0', '2' or '5', I suspect this reduces
> the search space by at least an order of magnitude?

In the evening that we (myself, plus some people in the same
flat) solved this on a C-64, we spent a few hours making these
simplifications. It's more tricky because, as soon as you start
making assumptions about the divisibility, you have to make sure
not to lose cases with one variable being smaller than another.

Stefan Monnier

unread,
Jul 19, 2021, 9:08:31 AM7/19/21
to
> This is the reason that when you need instant threads, you set up a
> thread pool of dormant threads instead of spawning a new OS thread
> every time you need one. I personally consider that an ugly work-around
> for inefficient thread creation.

Part of the problem is linked to the precise meaning of "thread" (as in
whether or not you care about all the different features offered for
threads by the OS) and the devil in the details.

Also "instant threads" is still a lie for thread created from a thread
pool of dormant threads, so it would be valuable to have benchmark
numbers to see whether those are closer to "0 cycles" or to "1000
cycle".


Stefan

Marcus

unread,
Jul 19, 2021, 9:21:32 AM7/19/21
to
On 2021-07-19, Stefan Monnier wrote:
>> This is the reason that when you need instant threads, you set up a
>> thread pool of dormant threads instead of spawning a new OS thread
>> every time you need one. I personally consider that an ugly work-around
>> for inefficient thread creation.
>
> Part of the problem is linked to the precise meaning of "thread" (as in
> whether or not you care about all the different features offered for
> threads by the OS) and the devil in the details.

Yes, and short of language support for "thin" threads (e.g. without
proper OS support) most solutions are based on full OS threads.

>
> Also "instant threads" is still a lie for thread created from a thread
> pool of dormant threads, so it would be valuable to have benchmark
> numbers to see whether those are closer to "0 cycles" or to "1000
> cycle".

True. Starting up a thread usually requires things like passing a
function pointer in a message queue and sending a message.
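
A minimal sketch of that hand-off: one dormant pooled worker woken
through a mutex/condvar pair, with the work passed as a function
pointer. Illustrative only; a real pool has many workers, a real queue,
and would not run the job while holding the lock.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static void (*job) (void *);           /* the "message": what to run */
static void *job_arg;
static int shutting_down;

static void *worker (void *unused)
{
    (void) unused;
    pthread_mutex_lock (&lock);
    for (;;) {
        while (job == NULL && !shutting_down)
            pthread_cond_wait (&cond, &lock);   /* dormant until woken */
        if (job) {
            job (job_arg);                      /* run the submitted work */
            job = NULL;
        } else if (shutting_down)
            break;
    }
    pthread_mutex_unlock (&lock);
    return NULL;
}

static void submit (void (*fn) (void *), void *arg)
{
    pthread_mutex_lock (&lock);
    job = fn;                          /* pass the function pointer ... */
    job_arg = arg;
    pthread_cond_signal (&cond);       /* ... and send the message */
    pthread_mutex_unlock (&lock);
}

static void hello (void *arg)
{
    printf ("pooled worker says: %s\n", (const char *) arg);
}

int main (void)
{
    pthread_t t;
    pthread_create (&t, NULL, worker, NULL);
    submit (hello, "no new thread was created for this task");
    pthread_mutex_lock (&lock);
    shutting_down = 1;
    pthread_cond_signal (&cond);
    pthread_mutex_unlock (&lock);
    pthread_join (t, NULL);
    return 0;
}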

/Marcus

Quadibloc

unread,
Jul 20, 2021, 2:53:12 AM7/20/21
to
On Monday, July 19, 2021 at 5:09:49 AM UTC-6, Thomas Koenig wrote:

> In the computer magazine I read this puzzle ("Chip"), it was
> actually 7,47 DM. And the issue must have been around October or
> November of that year.

There were issues of that magazine on the Internet Archive...

John Savard

Quadibloc

unread,
Jul 20, 2021, 3:11:09 AM7/20/21
to
Or so I thought. Actually, only the very first issue, plus some of their
CHIP Specials, including one on computerizing one's Marklin
model railroad, are there.

Plus, there is another computer magazine of the same name
which publishes in Hungary, Romania, and Russia... and yet
another one which is in English and in Malaysia.

John Savard

pec...@gmail.com

unread,
Jul 20, 2021, 4:30:14 AM7/20/21
to
poniedziałek, 19 lipca 2021 o 14:35:40 UTC+2 Thomas Koenig napisał(a):
> That's a pretty good guess, going through three variables i <=
> j <= k and setting m = 7-i-j-k, while removing duplicates setting
> the bounds so that k <= m, gives around 2.9e6 iterations.

You are not serious...
from math import sqrt
def brahmagupta(c):
    return (c - sqrt(c**2 - 4*c))/2, (c + sqrt(c**2 - 4*c))/2
brahmagupta(7.47) = (1.1893713939382278, 6.2806286060617715)
brahmagupta(6.2806286060617715) = (1.2480169133948371, 5.031983086605163)
brahmagupta(5.031983086605163)=(1.3764817684693538, 3.656171118366921)
"real" solution:
1.1893713939382278,1.2480169133948371,1.3764817684693538,3.656171118366921
starting point:
1.19+1.25+1.37+3.66=7.470000000000001
1.19*1.25*1.37*3.66 = 7.458622500000001
few manual iterations:
1.20*1.25*1.37*3.65=7.50075
1.19*1.25*1.38*3.65=7.492537499999998
1.18*1.25*1.39*3.65=7.483412499999998
1.17*1.25*1.40*3.65=7.473374999999999 - "upper bound"
1.16*1.25*1.41*3.65=7.462424999999999
1.17*1.24*1.41*3.65=7.466542199999999 - "lower bound"

Thomas Koenig

unread,
Jul 20, 2021, 4:53:19 AM7/20/21
to
pec...@gmail.com <pec...@gmail.com> schrieb:
I'm not sure what you calculated here.

However, as stated, this is a Diophantine equation (integer solutions
only), so approximate solutions are not valid.

Diophantine equations are generally much harder to solve than
equations that involve real numbers that can be solved approximately
using floating point values.

Terje Mathisen

unread,
Jul 20, 2021, 7:48:08 AM7/20/21
to
I broke down and wrote a perl solver for this question:

Even with perl's slow interpreter, and no attempt to extract prime
factors from 747000000, just a brute force scan, it took about 0.2
seconds to find the solutions for either 7.47 or 7.11 as the total sum.
(Verifying that they were in fact unique took another 40 ms.)

C(++) would almost certainly run this at least an order of magnitude
faster, but using the prime factors as the starting point, noting that
there is a single large factor (747/9=83), and then using that as the
first term, would help even more:
...
Yes indeed!

Starting the search with n*0.83 as one of the item prices reduced the
search time from 200ms to less than 9, and full verification took just
11 ms.

Thomas Koenig

unread,
Jul 20, 2021, 1:32:19 PM7/20/21
to
Terje Mathisen <terje.m...@tmsw.no> schrieb:

> Even with perl's slow interpreter, and no attempt to extract prime
> factors from 747000000, just a brute force scan, it took about 0.2
> seconds to find the solutions for either 7.47 or 7.11 as the total sum.
> (Verifying that they were in fact unique took another 40 ms.)

Which shows that an interpreter on a (I assume) relatively modern
machine in 2021 is _much_ faster than an interpreter on a machine
introduced in 1982, whose CPU ran at ~ 0.3% of the clock speed of
today's machine, which had an 8-bit processor doing floating point
on a 40-bit format with a 32-bit mantissa without even instructions
for an 8-bit integer multiply.

It was _all_ shift and add for multiplication.
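
For anyone who never had to do it, a minimal C sketch of the
shift-and-add scheme such a CPU falls back on (C standing in for the
6502 code; the 8x8-bit width is just for illustration):

#include <stdint.h>
#include <stdio.h>

/* 8 x 8 -> 16 bit multiply by shift-and-add, as on a CPU without a
   hardware multiply instruction. */
static uint16_t mul8 (uint8_t a, uint8_t b)
{
    uint16_t product = 0;
    uint16_t addend = a;
    while (b) {
        if (b & 1)              /* for each set bit of b ...            */
            product += addend;  /* ... add the suitably shifted copy of a */
        addend <<= 1;
        b >>= 1;
    }
    return product;
}

int main (void)
{
    printf ("%u\n", mul8 (200, 123));   /* prints 24600 */
    return 0;
}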

>
> C(++) would almost certainly run this at least an order of magnitude
> faster, but using the prime roots as the starting point, noting that
> there is a single large factor (of 747/9=83), and then use that as the
> first term would help even more:
> ...
> Yes indeed!

Ah, I don't think we noticed that at the time. Good catch!

> Starting the search with n*0.83 as one of the item prices reduced the
> search time from 200ms to less than 9, and full verification took just
> 11 ms.

There is actually a bit more to the story. It was one of the
first days after I had started studying, which is why I remember
the approximate date so well. The people I shared a flat with had
looked at the problem for a short time without even hitting on the
rather obvious fact that, for a+b+c+d=s, you only need three loops.

We ran a few benchmarks and concluded that a run would take a
few months on the C-64, and gave up for a time.

One of us, a first-semester computer science student, then left.
The rest of us looked at the problem again, noticed d=s-a-b-c, and
used a few more simplifications, which brought down the calculation
time to around half an hour.

One of us (not me) then had an idea. He sat down and wrote down
random formulas from the "Bronstein". The formulas looked impressive,
but had absolutely no bearing on the problem. When the C.S. student
returned late in the evening, we gave him the sheets of paper
and told him this was the analytical solution. The unsuspecting
C.S. student believed us for a few weeks, because he still thought
that a calculation would have taken months, and was quite impressed.

Anton Ertl

unread,
Jul 20, 2021, 1:50:59 PM7/20/21
to
George Neuner <gneu...@comcast.net> writes:
>On Sun, 18 Jul 2021 15:55:24 GMT, an...@mips.complang.tuwien.ac.at
>(Anton Ertl) wrote:
>
>>George Neuner <gneu...@comcast.net> writes:
>>
>>>The problem - at least with current hardware - is that programmers are
>>>much better at identifying what CAN be done in parallel than what
>>>SHOULD be done in parallel.
>>
>>You make it sound as if that's a problem with the programmers, not
>>with the hardware. But it's fundamental to programming (at least in
>>areas affected by the software crisis, i.e., not supercomputers), so
>>it has to be solved at the system level (i.e., hardware, compiler,
>>etc.).
>
>It /IS/ a problem with the programmers. The average "developer" now
>has no CS, engineering, or (advanced) mathematics education, and their
>programming skills are pitiful - only slightly above "script kiddie".
>This is an unfortunate fact of life that I think too often is lost on
>some denizens of this group.

One could have an interesting discussion about that, but that's
besides the point wrt the parallelization problem. Even if the
developers have all the education one could wish for, if they have to
produce a maintainable program for a big problem (resulting in a big
program) with the minimal development effort, they will divide the
problem into subproblems and divide the program into parts for dealing
with the subproblems etc. But parallelization with the current cost
structure has to be done for the whole program and cannot be
subdivided in the same way.

>Given the ability to create "parallel" tasks, an /average/ programmer
>is very likely to naively create large numbers of new tasks regardless
>of resources being available to actually execute them.

Yes, if you tell them to create parallel tasks. And good programmers
will do so, too, unless you tell them that efficient parallelization
is more important than maintainability.

>Which maybe is fine if the number of tasks (relatively) is small, or
>if many of them are I/O bound and the use is for /concurrency/. But
>most programmers do not understand the difference between "parallel"
>and "concurrent", and too many don't understand why spawning large
>numbers of tasks can slow down the program.

Sure. That's the way to write parallel programs that is in line with
the divide-and-conquer approach we have established for writing
programs for big problems. So if it slows down programs, the solution
is not to tell the programmers not to do that, but to make systems
that run such programs efficiently. E.g., have hardware where having
many more tasks than hardware threads does not slow down the program.
Or have a compiler and run-time system that combines the many tasks
written by the programmer into so few intermediate tasks that the
overheads of having more tasks than threads play little role. Or
both.

>>Why is it fundamental? Because we build maintainable software by
>>splitting it into mostly-independent parts. Deciding how much to
>>parallelize on current hardware needs a global view of the program,
>>which programmers usually do not have; and even when they have it,
>>their decisions will probably be outdated after a while of maintaining
>>the program.
>>
>>We have similar problems with explicitly managed fast memory, which is
>>why we don't see that in general-purpose computers; instead, we see
>>caches (a software-crisis-compatible variant of fast memory).
>
>We have similar problems with programmer managed dynamic allocation.
>All the modern languages use GC /because/ repeated studies have shown
>that average programmers largely are incapable of writing leak-proof
>code without it.

Good example. Garbage collection is a good solution to the dynamic
memory allocation problem, including for good programmers. Now we
need such a solution for the parallelization problem.

>>Yet another problem of this kind is fixed-point scaling. That's why
>>we have floating-point.
>
>And the same people who, in the past, would not have understood the
>issues of using fixed-point now don't understand the issues of using
>floating point.

Sure, FP has its pitfalls, but it's possible to write, say, a general
FP matrix multiplication subroutine (and the pitfalls of FP typically
don't play much role in that), while for fixed-point you would have to
write one with the right scaling for every application it is used in;
or maybe these days have a templated C++ library, and instantiate it
with the appropriate scalings for each use.

>>So what do we need of the system? Ideally having more parallel parts
>>than needed should not cause a slowdown. This has two aspects:
>>
>>1) Thread creation and destruction should be cheap.
>>
>>2) The harder part is memory locality: Sequential code often works
>>very well on caches because it has a lot of temporal and spatial
>>locality. If the code is split into more tasks than necessary, how do
>>we avoid losing locality and thus losing some of the benefits of
>>caching?
>
>Agreed! But this has little to do with any of my points.

But it has to do with the parallelization problem, and more realistic
solutions for it than perfect programmers with unlimited time on their
hands.

Stefan Monnier

unread,
Jul 20, 2021, 2:05:04 PM7/20/21
to
Anton Ertl [2021-07-20 16:54:54] wrote:
> programs for big problems. So if it slows down programs, the solution
> is not to tell the programmers not to do that, but to make systems
> that run such programs efficiently. E.g., have hardware where having

Indeed, I think the only viable "solution" is to ask the programmers to
write code in a way that exposes as much parallelism as possible, and
then have the compiler "auto-sequentialize" the code.

It should be a bit easier for the compiler: at least it's easy to
sequentialize correctly, so the main difficulty is to sequentialize in
a way that maximizes the performance.


Stefan

Branimir Maksimovic

unread,
Jul 20, 2021, 3:17:57 PM7/20/21
to
Well, almost all new languages have an async/await mechanism which makes
concurrent programming trivial...
Swift got it in 5.5, which is still in beta...

>
>>Given the ability to create "parallel" tasks, an /average/ programmer
>>is very likely to naively create large numbers of new tasks regardless
>>of resources being available to actually execute them.
>
> Yes, if you tell them to create parallel tasks. And good programmers
> will do so, too, unless you tell them that efficient parallelization
> is more important than maintainability.

Not an issue any more...

>
>>Which maybe is fine if the number of tasks (relatively) is small, or
>>if many of them are I/O bound and the use is for /concurrency/. But
>>most programmers do not understand the difference between "parallel"
>>and "concurrent", and too many don't understand why spawning large
>>numbers of tasks can slow down the program.
>
> Sure. That's the way to write parallel programs that is in line with
> the divide-and-conquer approach we have established for writing
> programs for big problems. So if it slows down programs, the solution
> is not to tell the programmers not to do that, but to make systems
> that run such programs efficiently. E.g., have hardware where having
> many more tasks than hardware threads does not slow down the program.
> Or have a compiler and run-time system that combines the many tasks
> written by the programmer into so few intermediate tasks that the
> overheads of having more tasks than threads play little role. Or
> both.

Parallel and concurrent is same thing :P
Someone told me that depends on definition, but it is dim :P

>
>>>Why is it fundamental? Because we build maintainable software by
>>>splitting it into mostly-independent parts. Deciding how much to
>>>parallelize on current hardware needs a global view of the program,
>>>which programmers usually do not have; and even when they have it,
>>>their decisions will probably be outdated after a while of maintaining
>>>the program.
>>>
>>>We have similar problems with explicitly managed fast memory, which is
>>>why we don't see that in general-purpose computers; instead, we see
>>>caches (a software-crisis-compatible variant of fast memory).
>>
>>We have similar problems with programmer managed dynamic allocation.
>>All the modern languages use GC /because/ repeated studies have shown
>>that average programmers largely are incapable of writing leak-proof
>>code without it.
>
> Good example. Garbage collection is a good solution to the dynamic
> memory allocation problem, including for good programmers. Now we
> need such a solution for the parallelization problem.

C++, Rust, Swift all do not have GC...
They use RAII and reference counting (shared_ptr and such)

>
>>>Yet another problem of this kind is fixed-point scaling. That's why
>>>we have floating-point.
>>
>>And the same people who, in the past, would not have understood the
>>issues of using fixed-point now don't understand the issues of using
>>floating point.
>
> Sure, FP has its pitfalls, but it's possible to write, say, a general
> FP matrix multiplication subroutine (and the pitfalls of FP typically
> don't play much role in that), while for fixed-point you would have to
> write one with the right scaling for every application it is used in;
> or maybe these days have a templated C++ library, and instantiate it
> with the appropriate scalings for each use.
>
>>>So what do we need of the system? Ideally having more parallel parts
>>>than needed should not cause a slowdown. This has two aspects:
>>>
>>>1) Thread creation and destruction should be cheap.
>>>
>>>2) The harder part is memory locality: Sequential code often works
>>>very well on caches because it has a lot of temporal and spatial
>>>locality. If the code is split into more tasks than necessary, how do
>>>we avoid losing locality and thus losing some of the benefits of
>>>caching?
>>
>>Agreed! But this has little to do with any of my points.
>
> But it has to do with the parallelization problem, and more realistic
> solutions for it than perfect programmers with unlimited time on their
> hands.
Well, Rust and Swift made efforts to allow fewer bugs with concurrent tasks,
that is, with shared state :P


>
> - anton


--
bmaxa now listens Arguments by Robots in disguise from Robots in disguise

Chris M. Thomasson

unread,
Jul 20, 2021, 4:21:01 PM7/20/21
to
[...]

I remember way back when somebody was having a performance issue in a
GC'ed environment when the system was under a great deal of sustained
load. Memory would grow and grow, and the GC was spending a lot of time
trying to handle all of it. So, I suggested using distributed lock-free
node caches, it sped things up by orders of magnitude. However,
pushing/popping from the node cache is a form of manual memory
management. They did not like it because of that, but ended up using it
anyway. So, manual memory management can "help" a GC under a large
amount of stress.
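
Chris's caches were lock-free and shared; the sketch below is a simpler
per-thread variant of the same recycling idea (no atomics needed), just
to make the node-cache notion concrete. Names and sizes are illustrative.

#include <stdlib.h>

/* Per-thread node cache: freed nodes go on a thread-local freelist and
   are reused instead of going back to the allocator (or, in a GC'd
   setting, instead of creating more collector work). */
struct node { struct node *next; char payload[56]; };

static _Thread_local struct node *cache;

static struct node *node_get (void)
{
    if (cache) {                      /* pop from the local freelist */
        struct node *n = cache;
        cache = n->next;
        return n;
    }
    return malloc (sizeof (struct node));
}

static void node_put (struct node *n)
{
    n->next = cache;                  /* push back for later reuse */
    cache = n;
}

int main (void)
{
    struct node *a = node_get ();
    node_put (a);
    struct node *b = node_get ();     /* reuses a's memory, no malloc call */
    free (b);
    return 0;
}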

Branimir Maksimovic

unread,
Jul 20, 2021, 4:53:51 PM7/20/21
to
Rust had GC in early versions. They ditched it in favor of reference
counting because of performance problems. Swift, which is derived from Rust,
never had GC :P
Simply put GC is hassle in concurrent loads...

--
bmaxa now listens Boys by Robots in disguise from Robots in disguise

Quadibloc

unread,
Jul 20, 2021, 6:07:33 PM7/20/21
to
On Tuesday, July 20, 2021 at 1:17:57 PM UTC-6, Branimir Maksimovic wrote:

> Parallel and concurrent is same thing :P
> Someone told me that depends on definition, but it is dim :P

"Concurrent" refers to tasks which _could_ be done in parallel, but becaue
you only have one serial processor available, it just does them one at a time,
switching back and forth between them.

So now the distinction is no longer dim, but glowing as brightly as the
noonday sun!

John Savard

Chris M. Thomasson

unread,
Jul 20, 2021, 6:11:21 PM7/20/21
to
Now, one "difference" can be that several parallel computations can
never possibly interfere with each other as they run at the same time.
Several concurrent computations might mean that they can interfere with
one another as they run at the same time.

Branimir Maksimovic

unread,
Jul 20, 2021, 6:12:32 PM7/20/21
to
That's your definition. Such tasks are in no way concurrent.
What you are describing is coroutines that must yield for other
tasks to work :p

>
> John Savard


--
bmaxa now listens Woman In Disguise by Angelic Upstarts from The Independent Punk Singles Collection

Branimir Maksimovic

unread,
Jul 20, 2021, 6:19:22 PM7/20/21
to
On 2021-07-20, Chris M. Thomasson <chris.m.t...@gmail.com> wrote:
That is another definition, which is another interpretation. In reality
nobody has made a clear definition that all will follow :P
My attempt is this: Parallel execution can be done on several different
machines, like MPI, and concurrent means on a single computer. I think that
this was the original definition and distinction.

Stefan Monnier

unread,
Jul 20, 2021, 6:32:38 PM7/20/21
to
Quadibloc [2021-07-20 15:07:31] wrote:
> On Tuesday, July 20, 2021 at 1:17:57 PM UTC-6, Branimir Maksimovic wrote:
>> Parallel and concurrent is same thing :P
>> Someone told me that depends on definition, but it is dim :P
> "Concurrent" refers to tasks which _could_ be done in parallel, but
> because you only have one serial processor available, it just does them
> one at a time, switching back and forth between them.

That's definitely not my understanding of the term.

To me the difference between the two is that parallelism is concerned
with dividing a task into subtasks that can be performed concurrently so
as to reduce the latency of its execution. Concurrency OTOH starts with
concurrent tasks and is concerned with how to schedule and synchronize
them such that the result is correct and to obey various other
timing constraints like fairness.


Stefan

Branimir Maksimovic

unread,
Jul 20, 2021, 6:36:57 PM7/20/21
to
Well, parallel computation (several computers), concurrent execution (single computer)

--
bmaxa now listens Solidarity by Angelic Upstarts from The Independent Punk Singles Collection

Bill Findlay

unread,
Jul 20, 2021, 7:04:49 PM7/20/21
to
On 20 Jul 2021, Thomas Koenig wrote
(in article <sd62ts$19o$1...@newsreader4.netcologne.de>):

> Diophantine equations are generally much harder to solve than
> equations that involve real numbers that can be solved approximately
> using floating point values.

It's worse than that, Jim ...
The roots of Diophantine equations are not computable
(in the same sense as the TM halting problem).

--
Bill Findlay

pec...@gmail.com

unread,
Jul 20, 2021, 7:07:07 PM7/20/21
to
wtorek, 20 lipca 2021 o 10:53:19 UTC+2 Thomas Koenig napisał(a):

> I'm not sure what you calculated here.
The point is to have a good starting point.
> However, as stated, this is a Diophantine equation (integer solutions
> only), so approximate solutions are not valid.
> Diophantine equations are generally much harder to solve than
> equations that involve real numbers that can be solved approximately
> using floating point values.
It is not a Diophantine equation.
The numbers are decimals with fixed-point arithmetic.
Numerical analysis guarantees that the fixed-point product cannot be too far from the real one, so you can dramatically restrict the search space: approximately +-0.03 around every value, but we need to consider all permutations of values (multiplication with rounding is not associative).

Stefan Monnier

unread,
Jul 20, 2021, 9:45:01 PM7/20/21
to
Branimir Maksimovic [2021-07-20 22:36:55] wrote:
> On 2021-07-20, Stefan Monnier <mon...@iro.umontreal.ca> wrote:
>> Quadibloc [2021-07-20 15:07:31] wrote:
>>> On Tuesday, July 20, 2021 at 1:17:57 PM UTC-6, Branimir Maksimovic wrote:
>>>> Parallel and concurrent is same thing :P
>>>> Someone told me that depends on definition, but it is dim :P
>>> "Concurrent" refers to tasks which _could_ be done in parallel, but
>>> because you only have one serial processor available, it just does them
>>> one at a time, switching back and forth between them.
>>
>> That's definitely not my understanding of the term.
>>
>> To me the difference between the two is that parallelism is concerned
>> with dividing a task into subtasks that can be performed concurrently so
>> as to reduce the latency of its execution. Concurrency OTOH starts with
>> concurrent tasks and is concerned with how to schedule and synchronize
>> them such that the result is correct and to obey various other
>> timing constraints like fairness.
> Well, parallel computation (several computers), concurrent execution
> (single computer)

I don't see where my description made you think concurrency was limited
to the "single computer" case. But of course, parallel computation is
generally a losing proposition if you only have a single compute element
(tho a "single computer" can contain several CPUs, and a single CPU can
also have several concurrent compute elements, e.g. via superscalar
execution, SIMD, you name it).


Stefan

Branimir Maksimovic

unread,
Jul 20, 2021, 11:06:11 PM7/20/21
to
That was the original definition. Parallel was distributed computing, via MPI e.g.,
while concurrent meant via threads on a single computer.
Just sayin' :P


>
>
> Stefan


--
bmaxa now listens MARTHA & THE MUFFINS - ECHO BEACH.ogg

Quadibloc

unread,
Jul 21, 2021, 2:24:49 AM7/21/21
to
But that only means that _some_ Diophantine equations can't be solved.

Many of them can still be solved, but with difficulty.

John Savard

Marcus

unread,
Jul 21, 2021, 3:40:12 AM7/21/21
to
On 2021-07-20 21:17, Branimir Maksimovic wrote:
> On 2021-07-20, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> George Neuner <gneu...@comcast.net> writes:
>>> On Sun, 18 Jul 2021 15:55:24 GMT, an...@mips.complang.tuwien.ac.at
>>> (Anton Ertl) wrote:
>>>

[snip]

>>>> We have similar problems with explicitly managed fast memory, which is
>>>> why we don't see that in general-purpose computers; instead, we see
>>>> caches (a software-crisis-compatible variant of fast memory).
>>>
>>> We have similar problems with programmer managed dynamic allocation.
>>> All the modern languages use GC /because/ repeated studies have shown
>>> that average programmers largely are incapable of writing leak-proof
>>> code without it.
>>
>> Good example. Garbage collection is a good solution to the dynamic
>> memory allocation problem, including for good programmers. Now we
>> need such a solution for the parallelization problem.
>
> C++, Rust , Swift all does not have GC...
> They use RAII and reference counting (shared_ptr and such)
>

I've never liked GC. I find that when languages try to hide things from
the programmer (e.g. when, how and what memory gets freed) it gets
harder for me to reason about the code and its behavior. I also believe
that it generally lures programmers into making poorer SW architecture.

RAII and reference counting are much more controlled methods with
predictable performance / overhead, so me likes.

It is /possible/ to write C++ code that never leaks, /if/ you use the
right constructs. Rust took this to the next level and simply excluded
the bad constructs from the language.

...but I agree that we need language support for parallelization. For
instance order-independent loops and similar constructs should be the
goto solution (no pun intended) rather than explicit iteration logic.
Likewise pure functions should be the norm (object orientation really
screwed that up). And of course good support for async primitives.

Then we can start designing hardware that can spawn lightweight threads
as easily as they can call subroutines. But as long as everyone uses
old-school C and stdc functionality there's little use in trying.

/Marcus

Anton Ertl

unread,
Jul 21, 2021, 5:03:59 AM7/21/21
to
Quadibloc <jsa...@ecn.ab.ca> writes:
>On Tuesday, July 20, 2021 at 1:17:57 PM UTC-6, Branimir Maksimovic wrote:
>
>> Parallel and concurrent is same thing :P
>> Someone told me that depends on definition, but it is dim :P
>
>"Concurrent" refers to tasks which _could_ be done in parallel, but becaue
>you only have one serial processor available, it just does them one at a time,
>switching back and forth between them.

Reference needed.

en.wikipedia.org says:

|Concurrent computing is a form of computing in which several
|computations are executed concurrently -- during overlapping time
|periods -- instead of sequentially -- with one completing before the next
|starts.

My impression is that "concurrent" is used when discussing concurrent
access to a data structure common between the tasks, and the
synchronization mechanisms for that, while "parallel" is used when
discussing parallel execution where such issues play no role or are
outside or at the fringes of the scope of the discussion.

So these are basically complementary concepts.