
There Is Still Hope


Quadibloc

unread,
May 27, 2014, 11:34:57 PM5/27/14
to
I've been trying to make some sense out of the many materials that are being advocated as successors to silicon, from which fast transistors have been made in the lab.

I've learned that the poor hole mobility of Gallium Arsenide was dealt with by Motorola; they came up with a process called Complementary Gallium Arsenide that runs at a low voltage, is rad-hard because of the wider band gap, and has been used on satellites.

Hole mobility matters because the limiting factor in silicon CMOS has historically been that the P-type transistors don't perform as well as the N-type transistors. So when you chose CMOS for its lower power consumption, you accepted poorer PMOS performance instead of enjoying the better NMOS performance.

Strained silicon, though, improves the P-type transistors, but not the N-type ones, and so current chips may be limited by the N-type transistors instead.

But I've learned even more.

One of the exotic fast materials was Indium Gallium Arsenide.

If you tweak the percentage of Indium up a notch, the lattice size matches that of Indium Phosphide. Which is good, because at least you can make wafers out of Indium Phosphide.

If you use just a small amount of Indium, though, you get something with the properties of plain Gallium Arsenide - which matches the lattice spacing of Germanium. That isn't as good, but the N-type transistor performance of GaAs is much better than that of silicon.

But Germanium has the best hole mobility of any known semiconductor!

Germanium's hole mobility, though, doesn't equal the electron mobility of Germanium itself, let alone the electron mobility of Gallium Arsenide.

Still, though, putting Gallium Arsenide N-type transistors on a Germanium substrate which would also supply the P-type transistors would give N-type transistors that are much better than silicon, and the best possible P-type transistors.

And Germanium wafer technology, while not up to that of silicon, at least is well beyond the experimental stage.

Incidentally, using NAND gates whenever possible, and avoiding NOR gates, means that your N-type transistors are in series, while your P-type transistors are in parallel. This technique helps to cope with the inequality in performance characteristics - so there may be a benefit in using the faster Gallium Arsenide N-type transistors instead of just using plain Germanium chips.
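(A toy way to see why: treat each transistor as a resistor whose value goes inversely with carrier mobility, and compare the worst-case pull-up and pull-down of a 2-input NAND against a 2-input NOR. The mobility ratio of 2.5 and the unit device sizes below are illustrative assumptions, not figures for any real process.)

  #include <stdio.h>

  /* Toy comparison of 2-input NAND vs NOR in CMOS, treating each
     transistor as a resistor. R goes inversely with carrier mobility;
     the mobility ratio here is an illustrative assumption. */
  int main(void)
  {
      double mu_ratio = 2.5;        /* assumed electron/hole mobility ratio */
      double r_n = 1.0;             /* unit NMOS on-resistance */
      double r_p = mu_ratio * r_n;  /* PMOS is weaker by the mobility ratio */

      /* NAND: two NMOS in series pull down, two PMOS in parallel pull up */
      double nand_pulldown = 2.0 * r_n;
      double nand_pullup   = r_p;       /* worst case: only one PMOS on */

      /* NOR: two NMOS in parallel pull down, two PMOS in series pull up */
      double nor_pulldown  = r_n;       /* worst case: only one NMOS on */
      double nor_pullup    = 2.0 * r_p;

      printf("NAND worst-case pull-down %.1f, pull-up %.1f\n",
             nand_pulldown, nand_pullup);
      printf("NOR  worst-case pull-down %.1f, pull-up %.1f\n",
             nor_pulldown, nor_pullup);
      return 0;
  }

With the series stack on the strong N side, the NAND's two edges come out roughly balanced (2.0 against 2.5), while the NOR's series PMOS pull-up is the clear laggard (5.0) - which is the whole point of preferring NANDs.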

John Savard

MitchAlsup

unread,
May 28, 2014, 10:08:53 AM5/28/14
to
On Tuesday, May 27, 2014 10:34:57 PM UTC-5, Quadibloc wrote:

Today's CPUs are not limited by the speed of transistors. I was designing 5 GHz stuff in 2003 targeting 90nm, later 65nm.

Today's CPUs are limited by wire delay, which is getting worse as lithography improves.

Seima Rao

unread,
May 28, 2014, 10:20:31 AM5/28/14
to
On Wednesday, May 28, 2014 7:38:53 PM UTC+5:30, MitchAlsup wrote:
> On Tuesday, May 27, 2014 10:34:57 PM UTC-5, Quadibloc wrote:
>Today's CPUs are not limited by the speed of transistors.
>I was designing 5 GHz stuff in 2003 targeting 90nm, later 65nm.

Assuming a "fictitious wire effect", shouldnt freq be 7Gz at 65nm?

Sincerely,
Seima Rao.


Quadibloc

unread,
May 28, 2014, 11:05:57 AM5/28/14
to
On Wednesday, May 28, 2014 8:08:53 AM UTC-6, MitchAlsup wrote:

> Todays CPUs are limited by wire delay, which is getting worse as lithography
> improves.

Well, I'm really confused now. In that case, speeds should continue to increase as transistors get smaller, since they're also closer together.

Of course, finer wires have more resistance, and maybe more inductance and capacitance in proportion to their size. Low-k dielectrics for the insulation between the wires - as opposed to high-k dielectrics, which provide improvement inside the transistors - are what they're using to help with that.

But in any event, _if_ some new material that makes faster transistors will help make faster microprocessors, then apparently it won't be Indium Phosphide, Indium Arsenide, Indium Gallium Arsenide, or Gallium Arsenide. All these things do is make faster n-channel transistors. But the p-channel transistors are actually slower than in silicon.

They would be great for making ECL microprocessors out of fast NPN transistors.

But they can't make ECL microprocessors *out of silicon* that are worth making any more. (They *did* make ECL microprocessors, but with low transistor counts, before increasing density led to heat problems.) So making a faster ECL microprocessor out of an exotic material won't lead to better microprocessors than the silicon ones we have already.

Instead, it's the *weakest* link you have to fix.

And the hole mobility of germanium is the best there is.

If they can make a 10 GHz microprocessor out of germanium, I'm not going to sneeze at it because it happens to have been made on a 90 nm process. My point has been that faster single-thread performance is so valuable that a premium-priced fast chip will find customers, as long as it really is faster.

That doesn't need one billion transistors on the chip, but it *does* need nearly a million transistors, unfortunately. You would need way more than 10 GHz to make a 486 competitive, and more than 100 GHz to make a PDP-8 generally useful. (No doubt there are exotic niche applications that would benefit from such devices.)

However, strained silicon - used in IBM's G5 PowerPC chip, and in today's x86 chips, at least from Intel - has increased hole mobility. Maybe just as good as that of germanium.

Now, in that case, we're in trouble, because then unless somebody figures out how to cool an InGaAs ECL microprocessor with a million-odd transistors, or they make RSFQ work for chips on that scale, we're *not* going to get anything much faster than what we have from changing to a new material.

John Savard

Robert Wessel

unread,
May 28, 2014, 11:25:50 AM5/28/14
to
Wires are largely getting slower faster than process shrinks are
making them shorter. Only technology changes (like Al->Co) have been
keeping the slowdown manageable. For example, the RC delay for a 1mm
wire went from roughly 85ps in 130nm to 350ps in 65nm. Other than the
technology discontinuities, it's getting worse.
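(As a sanity check on those numbers - a crude model that assumes wire width and thickness both scale with the node and that capacitance per mm stays roughly constant, which is only approximately true:)

  #include <stdio.h>

  /* Back-of-the-envelope RC scaling for a 1mm wire. Assumes width and
     thickness both scale with the node and C per mm is roughly constant;
     both are simplifying assumptions. */
  int main(void)
  {
      double rc_130 = 85.0;            /* ps, the 130nm figure above */
      double scale  = 65.0 / 130.0;    /* linear shrink factor */

      /* R per mm ~ 1/(width*thickness) -> grows as 1/scale^2;
         C per mm ~ constant (plates shrink, but so does spacing). */
      double rc_65 = rc_130 / (scale * scale);

      printf("Predicted 65nm RC delay for 1mm: %.0f ps (quoted: ~350 ps)\n",
             rc_65);
      return 0;
  }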

Quadibloc

unread,
May 28, 2014, 1:27:42 PM5/28/14
to
On Wednesday, May 28, 2014 9:25:50 AM UTC-6, robert...@yahoo.com wrote:

> Only technology changes (like Al->Co) have been
> keeping the slowdown manageable.

ITYM Al -> Cu. Cobalt interconnects are an interesting idea.

But that reminds me; in my searching, one thing I looked at was why silver is not used as an interconnect. I found, as I thought, that it is because it diffuses into the chip - but particularly into silicon dioxide.

However, even copper interconnects can poison the chip, and so a barrier layer is used with them - tantalum, titanium, and titanium nitride (not Sn) are commonly used. And so I've seen that some people are working on making copper interconnects practical - apparently there are some potential benefits besides the tiny decrease in resistance.

But you see my concern. Given the low hole mobility of many of the exotic materials, and the fact that thermal issues make CMOS about the only choice, it seems like the only exotic material that could provide a benefit is germanium, and it may be that strained silicon, already in use, is already providing us with nearly the same benefits as germanium would provide.

This means that while germanium plus gallium arsenide with a dash of indium might be an interesting BiCMOS process, there really isn't some new semiconductor material waiting in the wings to make faster chips possible.

Silicon carbide can survive very high temperatures, but chips made with it have such a high defect density that only small chips can be made with it.

So, without hole mobility, what is the use of a material like Indium Gallium Arsenide on Indium Phosphide? It might make very fast small RF chips, which will help cell phones use higher frequencies, so it's still a worthy area of research. But it won't make faster microprocessors possible, unless someone has a way to remove heat from a million-transistor ECL processor. Or even, say, a million-transistor I^2 L processor.

John Savard

Quadibloc

unread,
May 28, 2014, 1:45:16 PM5/28/14
to
On Wednesday, May 28, 2014 11:27:42 AM UTC-6, Quadibloc wrote:

> And so I've seen that some people are working on making copper interconnects practical - apparently there are some potential benefits besides the tiny decrease in resistance.

Oops, I meant *silver* interconnects.

The potential benefits were:

- Copper electromigrates into *titanium* to a greater extent than silver. This makes the barrier layer an alloy, increasing its resistance.

- The thermal expansion coefficient of silver is closer to that of silicon, allowing larger wire traces without running into breakage difficulties. (Antenna rules for chip design, though, are the limiting factor for trace lengths.)

John Savard

Quadibloc

unread,
May 28, 2014, 1:47:52 PM5/28/14
to
On Wednesday, May 28, 2014 11:27:42 AM UTC-6, Quadibloc wrote:
> Or even, say, a million-transistor I^2 L processor.

Integrated Injection Logic uses both NPN and PNP transistors. But transistors of one of the two polarities are used as resistor substitutes, rather than being part of carrying out the logical function, as in CMOS.

However, if they're slower, they'll behave more like plain old resistors in TTL logic, rather than resisting only when needed, and so the power savings of the technology won't be fully realized.

John Savard

Robert Wessel

unread,
May 28, 2014, 5:11:50 PM5/28/14
to
On Wed, 28 May 2014 10:27:42 -0700 (PDT), Quadibloc
<jsa...@ecn.ab.ca> wrote:

>On Wednesday, May 28, 2014 9:25:50 AM UTC-6, robert...@yahoo.com wrote:
>
>> Only technology changes (like Al->Co) have been
>> keeping the slowdown manageable.
>
>ITYM Al -> Cu. Cobalt interconnects are an interesting idea.


Of course. Brainfart on my part.


>But that reminds me; in my searching, one thing I looked at was why silver is not used as an interconnect. I found, as I thought, that it is because it diffuses into the chip - but particularly into silicon dioxide.


You're only looking at a moderate increase in conductivity for Ag. Of
course these are desperate times, so even small advantages are likely
to get looked at. But as I understand it, the main focus is on
graphene.


>However, even copper interconnects can poison the chip, and so a barrier layer is used with them - tantalum, titanium, and titanium nitride (not Sn) are commonly used. And so I've seen that some people are working on making copper interconnects practical - apparently there are some potential benefits besides the tiny decrease in resistance.
>
>But you see my concern. Given the low hole mobility of many of the exotic materials, and the fact that thermal issues make CMOS about the only choice, it seems like the only exotic material that could provide a benefit is germanium, and it may be that strained silicon, already in use, is already providing us with nearly the same benefits as germanium would provide.
>
>This means that while germanium plus gallium arsenide with a dash of indium might be an interesting BiCMOS process, there really isn't some new semiconductor material waiting in the wings to make faster chips possible.
>
>Silicon carbide can survive very high temperatures, but chips made with it have such a high defect density that only small chips can be made with it.
>
>So, without hole mobility, what is the use of a material like Indium Gallium Arsenide on Indium Phosphide? It might make very fast small RF chips, which will help cell phones use higher frequencies, so it's still a worthy area of research. But it won't make faster microprocessors possible, unless someone has a way to remove heat from a million-transistor ECL processor. Or even, say, a million-transistor I^2 L processor.


(Near) abandoning non-local wires even in CMOS would allow a fair bit of
frequency increase, although that would either require many pipeline
stages on the "wires", or very small cores. And it doesn't really
help the thermals all that much - much of the dissipation will still be
related to frequency.

OTOH, there is a fair bit of headroom for exotic cooling. Several kW
off a die should be no real problem, and a several-fold increase over
that should at least be theoretically possible. The problem is that
designing a core for such a device is likely to be utterly
uneconomical given the very small volumes. And it doesn't really
help with your desktop system either.

Quadibloc

unread,
May 28, 2014, 6:56:54 PM5/28/14
to
On Wednesday, May 28, 2014 3:11:50 PM UTC-6, robert...@yahoo.com wrote:

> OTOH, there is a fair bit of headroom for exotic cooling. Several kW
> off a die should be no real problem, and a several-fold increase over
> that should at least be theoretically possible. The problem is that
> designing a core for such a device is likely to be utterly
> uneconomical given the very small volumes. And it doesn't really
> help with your desktop system either.

What would help with ordinary personal computers would be going back to Windows 3.1, so that they could do spreadsheets and word processing and other things very well with processors less powerful than even those used in Android devices.

That would allow inexpensive and compact computers again.

Of course, though, the multimedia capabilities of Windows 98 would be nice too, and that would require more horsepower.

NEC and Cray made custom silicon for their vector supercomputers a few years back. They gave up not because of cost, but because no worthwhile performance advantage was gained. So if exotic cooling *would* allow a Gallium Arsenide ECL processor to be part of a high-performance system, I think the market would be sufficient.

Look at the price of the SX-6r. $180,000. A vector supercomputer - in the small size with one CPU chip instead of many in parallel - for a price, given inflation, probably lower in real terms than the $18,000 price, once upon a time, of a... PDP-8!!!

So you will have some portion of the scientific supercomputer market, the part that does not have embarrassingly parallel problems. And the price of even a low-volume custom chip (admittedly, in an exotic technology, but one that would have RF applications too, so the ability to make that kind of chip wouldn't have to be summoned into being without commercial cross-subsidy - and admittedly with a few bucks for the refrigeration too) won't keep it out of the hands of day traders for long.

The main limit I see is that the electronics plus the cooling need to consume no more than 1000 watts. Not a problem for the supercomputer crowd, but more than that would inhibit residential use. (Turn off your computer when you're using your oven?)

John Savard

Nick Maclaren

unread,
May 29, 2014, 3:25:42 AM5/29/14
to
In article <d3jco9huspu1e5etg...@4ax.com>,
Robert Wessel <robert...@yahoo.com> wrote:
>
>OTOH, there is a fair bit of headroom for exotic cooling. Several kW
>off a die should be no real problem, and a several-fold increase over
>that should at least be theoretically possible. The problem is that
>designing a core for such a device is likely to be utterly
>uneconomical given the very small volumes. And it doesn't really
>help with your desktop system either.

Sorry, but the only appropriate response is "nuts"! There are other
constraints than physically building chips, and there are several
fundamental and critical ones that mean that there is no headroom
in power draw. Even those unspeakable sites doing unspeakable
research (all half dozen of them) are having serious problems with
that, and could not expand by a significant factor in power.

My understanding is that some of them are already wasting enough
heat to affect the local weather patterns and, if they got much
bigger, would start to change the ones over the scale of a state
(and the states are big there!)


Regards,
Nick Maclaren.

Quadibloc

unread,
May 29, 2014, 4:44:48 AM5/29/14
to
On Thursday, May 29, 2014 1:25:42 AM UTC-6, Nick Maclaren wrote:
> Even those unspeakable sites doing unspeakable
> research (all half dozen of them) are having serious problems with
> that, and could not expand by a significant factor in power.

> My understanding is that some of them are already wasting enough
> heat to affect the local weather patterns and, if they got much
> bigger, would start to change the ones over the scale of a state
> (and the states are big there!)

The good news is that at least one of those unspeakable agencies is working on embarrassingly parallel problems only, and has no need of the types of system being discussed.

In general, physical simulations of large systems also are normally handled through highly parallel algorithms. So they too can stay on the current conventional path of seeking the highest throughput for the lowest cost in energy use.

John Savard

Michael S

unread,
May 29, 2014, 5:58:15 AM5/29/14
to
Michael's principle of materials science and engineering: as civilization progresses, it replaces elements with higher atomic numbers with elements with lower atomic numbers. The Iron Age came after the Bronze Age, not before it (well, there was a Stone Age before the Bronze Age, but, in order to keep the principle intact, I consider it pre-civilization).

So, forget InAs, forget GaAs, forget Ge. As a near-term alternative to Si one can consider Aluminium Phosphide (except that Nick does not like toxic compounds). But the future belongs to either Carbon or Boron Nitride.

BTW, Michael's principle of materials science and engineering only deals with bulk materials. It has nothing against increased use of heavier stuff as low percentage doping/alloying elements.

Michael, a man with principles.

Nick Maclaren

unread,
May 29, 2014, 6:16:19 AM5/29/14
to
In article <9d9bb497-7d3f-44de...@googlegroups.com>,
Michael S <already...@yahoo.com> wrote:
>
>So, forget InAs, forget GaAs, forget Ge. As a near-term alternative to Si
>one can consider Aluminium Phosphide (except that Nick does not like toxic
>compounds). But the future belongs to either Carbon or Boron Nitride.

Well, no, but aluminium phosphide is ecologically almost harmless,
so that adequate safety precautions in use are all that is needed.
The fact that those are neglected by the dominant multinationals,
especially when mere peasants are affected, is a separate matter.


Regards,
Nick Maclaren.

Quadibloc

unread,
May 29, 2014, 9:23:01 AM5/29/14
to
On Thursday, May 29, 2014 4:16:19 AM UTC-6, Nick Maclaren wrote:

> Well, no, but aluminium phosphide is ecologically almost harmless,
> so that adequate safety precautions in use are all that is needed.
> The fact that those are neglected by the dominant multinationals,
> especially when mere peasants are affected, is a separate matter.

I don't think it's multinationals that neglect safety precautions in the use of aluminum phosphide:

http://www.wired.com/2014/03/dead-tourists-and-a-dangerous-pesticide/

It's difficult to get lots of _individuals_ to properly take safety precautions.

Which is why DDT was so popular. Although closely related to nerve gas, it happened to be one of the few compounds of that group that is almost non-toxic to humans. So it was widely used, until the insects became immune, and until it had widespread ecological effects.

John Savard

Nick Maclaren

unread,
May 29, 2014, 9:51:53 AM5/29/14
to
In article <aa364512-68b0-4279...@googlegroups.com>,
Quadibloc <jsa...@ecn.ab.ca> wrote:
>
>> Well, no, but aluminium phosphide is ecologically almost harmless,
>> so that adequate safety precautions in use are all that is needed.
>> The fact that those are neglected by the dominant multinationals,
>> especially when mere peasants are affected, is a separate matter.
>
>I don't think it's multinationals that neglect safety precautions in the
>use of aluminum phosphide:

At present, no, but my point stands for commonly 'outsourced'
industries like semiconductor manufacture - and note that I am not
talking primarily about the fabrication plants, but am including
disposal and recycling.


Regards,
Nick Maclaren.

MitchAlsup

unread,
May 30, 2014, 12:00:56 AM5/30/14
to
On Wednesday, May 28, 2014 10:25:50 AM UTC-5, robert...@yahoo.com wrote:
> Wires are largely getting slower faster than process shrinks are
> making them shorter.

And one of the big reasons, in leading edge technology, is that the thickness of the barrier metals is remaining constant while the lithography gets smaller and the wires get thinner. The cross-sectional area of the copper is going down faster than the lithography is shrinking it.

Nick Maclaren

unread,
May 30, 2014, 2:58:10 AM5/30/14
to
In article <58609260-e1d7-4a70...@googlegroups.com>,
In layman's terms, Moore's Law is not yet dead, but it's much worse
than the cracks showing - most of the current effort seems to be in
covering up the cracks that are obstructing progress. And they get
bigger every shrink, which doesn't exactly reduce costs ....

Reverting to the title of the thread, yes, there is still hope;
but there ISN'T any hope of a fairy godmother returning us to
the situation we were in the 1980s but starting at current scales.
We need to try new directions.


Regards,
Nick Maclaren.

Quadibloc

unread,
May 30, 2014, 7:31:09 AM5/30/14
to
On Friday, May 30, 2014 12:58:10 AM UTC-6, Nick Maclaren wrote:

> Reverting to the title of the thread, yes, there is still hope;
> but there ISN'T any hope of a fairy godmother returning us to
> the situation we were in the 1980s but starting at current scales.

> We need to try new directions.

It's true that there are plenty of different things which give hope.

When it comes to the personal computer, today's personal computers already have the power needed for most typical tasks we presently think of as the domain of the computer. That isn't to say that more computer power wouldn't be nice to have; 3-D virtual reality, for example, might be one application of more power.

But most computer *work* got done well enough by 80386 chips running Windows 3.1, and video playback and multimedia was there back in the days of Windows 98 and the Pentium III.

So, while Ken Olsen was proven wrong in thinking people wouldn't want to have pdp-8s - or even 360/195s - at home, I think I can safely say that we don't _desperately_ need more computer power for home use.

But a big increment of computing power _is_ available there, just by getting rid of bloatware.

But in more serious realms, particularly in scientific computing, the sky is the limit. Any amount of computing power, no matter how large, could be put to use, and would be gratefully received if it became available.

If home users, wanting to play more realistic video games, happen to create a mass market that subsidizes this, that's a _good_ thing.

Here, the obvious direction that everyone seems to be taking is to improve compilers and coding practices to make more use of available parallelism. There's nothing wrong with that, and if ISA changes will expose more parallelism, that's useful as well.

Still, *just making things faster*, however difficult that might be, would benefit all problems, including those with an intractable core that doesn't admit of parallelization. There are such problems.

So if someone figures out how, even at great expense, to do something like:

- build germanium CMOS microprocessors that benefit from that material's hole mobility;
- build fast superconducting computers around Josephson Junctions with RSFQ;
- build, and manage to cool, ECL microprocessors made from materials with high electron mobility like gallium arsenide

I think it will be of value, and the premium prices of short-run chips for such devices will be paid by some customers.

This isn't waving a magic wand, since it doesn't bring Moore's Law back to the heady days before the thermal wall was hit by silicon CMOS. Instead, it's a return to an even earlier time, where big and expensive computers could do things that inexpensive desktop computers couldn't do.

In the more distant future, graphene and other novel technologies we don't even know of at present may create a new burst of rapid progress at a higher level. But I expect that to take a while, and for the overall progress of computers to be much slower than Moore's Law over the long run.

Of course, bandwidth limits must be combined with noise limits to determine the information-carrying capacity of a channel. So another frontier that hasn't been investigated much for computing - even if it has been used in, say, telephone modems - would be going to ternary logic instead of binary, for example, or even using four, or eight, or sixteen signal levels in chips.
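(A toy calculation, with an arbitrary 1 GHz bandwidth and 20 dB SNR rather than figures for any real on-chip link, of how the Shannon limit C = B*log2(1 + S/N) compares with the log2(L) bits per symbol that L signal levels provide:)

  #include <stdio.h>
  #include <math.h>

  /* Rough illustration of the bandwidth-vs-levels tradeoff. The
     bandwidth and SNR figures are arbitrary assumptions, not
     measurements of any real on-chip link. Link with -lm. */
  int main(void)
  {
      double bw  = 1e9;     /* assumed channel bandwidth, Hz */
      double snr = 100.0;   /* assumed signal-to-noise ratio (20 dB) */

      /* Shannon limit: the most any signalling scheme can carry. */
      double shannon = bw * log2(1.0 + snr);
      printf("Shannon capacity: %.2f Gbit/s\n", shannon / 1e9);

      /* Bits per symbol for binary, ternary, 4-, 8-, 16-level signalling. */
      int levels[] = {2, 3, 4, 8, 16};
      for (int i = 0; i < 5; i++)
          printf("%2d levels -> %.2f bits per symbol\n",
                 levels[i], log2((double)levels[i]));
      return 0;
  }

More levels only pay off while the noise margin can still separate them, which is exactly the tradeoff the modem designers already fought through.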

John Savard

Mark Thorson

unread,
May 30, 2014, 7:22:00 PM5/30/14
to
Quadibloc wrote:
>
> Which is why DDT was so popular. Although closely
> related to nerve gas, it happened to be one of the
> few compounds of that group that is almost non-toxic
> to humans. So it was widely used, until the insects
> became immune, and until it had widespread ecological
> effects.

Bringing it back (sort of) to comp.arch,
DDT is still used today in making chips.
It is used as a photoacid generator in
chemically amplified photoresists, such
as spin-on wafer photoresists.

See, for example:
http://www.google.com/patents/US7989136

Quadibloc

unread,
May 31, 2014, 11:27:02 AM5/31/14
to
On Friday, May 30, 2014 12:58:10 AM UTC-6, Nick Maclaren wrote:

> Reverting to the title of the thread, yes, there is still hope;
> but there ISN'T any hope of a fairy godmother returning us to
> the situation we were in the 1980s but starting at current scales.

Oh, this I know. While I keep trying to think of a way in which computers could have rows of blinking lights without adding much to their cost, or slowing them down, I don't really expect to find a way.

John Savard

Seima Rao

unread,
Jun 1, 2014, 7:26:19 PM6/1/14
to
On Wednesday, May 28, 2014 8:55:50 PM UTC+5:30, robert...@yahoo.com wrote:
>For example, the RC delay for a 1mm wire went from roughly 85ps in 130nm
>to 350ps in 65nm. Other than the technology discontinuities, it's getting worse.

What was 65nm intended to achieve that 130 nm couldn't?

Sincerely,
Seima Rao.



Quadibloc

unread,
Jun 1, 2014, 7:54:19 PM6/1/14
to
On Sunday, June 1, 2014 5:26:19 PM UTC-6, Seima Rao wrote:

> What was 65nm intended to achieve that 130 nm couldnt?

Although as process shrinks continue, it is no longer true that signals go faster along the shorter distances - due to inductance and capacitance - and it is no longer true that transistors are faster - due to the voltage being reduced because of heat - *one* benefit remains.

As processes shrink, since the chip is smaller, there is less of a chance of a defect that will spoil the chip in the reduced area. This is despite the fact that everything on the chip is smaller, so a smaller defect can ruin a chip.

So shrinks increase yield and make chips *cheaper*.
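(For concreteness, the standard first-order Poisson yield model, Y = exp(-D*A); the defect density and die areas below are made-up numbers chosen only to show the shape of the curve.)

  #include <stdio.h>
  #include <math.h>

  /* First-order Poisson yield model Y = exp(-D * A). The defect
     density and die areas are made-up illustrative numbers. Link
     with -lm. */
  int main(void)
  {
      double d0 = 0.2;                          /* defects per cm^2 (assumed) */
      double areas[] = {0.5, 1.0, 2.0, 4.0};    /* die area in cm^2 */

      for (int i = 0; i < 4; i++) {
          double y = exp(-d0 * areas[i]);
          printf("die %.1f cm^2 -> yield %.0f%%\n", areas[i], 100.0 * y);
      }
      return 0;
  }

Halving the die area halves the exponent, so the gain is large when D*A is large and marginal when it is already small.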

John Savard

Seima Rao

unread,
Jun 1, 2014, 8:08:12 PM6/1/14
to
On Monday, June 2, 2014 5:24:19 AM UTC+5:30, Quadibloc wrote:
>So shrinks increase yield and make chips *cheaper*.

Then I am very sure that Intel & others are laughing all the way
to their plants! They will increase yields under a generous
economy all inside their plants, not inside their design rooms
*for sure*!

Not a good idea then to work on processes.

Sincerely,
Seima Rao.


Robert Wessel

unread,
Jun 2, 2014, 12:54:37 AM6/2/14
to
On Sun, 1 Jun 2014 16:54:19 -0700 (PDT), Quadibloc <jsa...@ecn.ab.ca>
wrote:
Well, getting more transistors on the same size chip is also a benefit
in general. While we're hitting diminishing returns, many structures
can be made wider and shallower by throwing more transistors at them.
In some cases, like SIMD, large quantities of transistors can make
proper parallel implementations of those instructions viable (and then
you only have to find applications for which those are useful). And
then there are obvious effects like being able to put more processors
on a chip, or more memory (or cache). Integration is a benefit as
well: packaging (at both the device and system level) is a big cost,
and having one package can be considerably cheaper (not to mention
smaller) than several.

Yield is not really that big a driver for reducing transistor size. On
mature processes, with reasonable sized chips, defect densities are
low enough that only modest improvements in yield are possible by
reducing die size. OTOH, the early versions of a process tend to have
much higher defect densities than the processes they replace. Of
course for dies pushing the size limit, defect-related yield, even on
fairly mature processes, is a big issue.

Total processing costs, OTOH, do remain fixed on a roughly per-wafer
basis (on a per-process basis), so getting more dies out of a wafer, by
whatever means, is a good thing. But wafer processing costs grow
considerably with each process shrink. Again there are
discontinuities here, with increases in wafer size having compensated
a fair bit for other increasing costs, for example, but that's likely
to be near the end of the process too (wafers larger than 450mm seem
unlikely, at least in the foreseeable future).

And the cost-per-transistor argument may be on the verge of going away
too - the cost per transistor may be about to stop dropping for
smaller process generations, and you'll only have whatever advantages
you can get out of more transistors on a die.

Interesting times.

Robert Wessel

unread,
Jun 2, 2014, 1:20:42 AM6/2/14
to
On Fri, 30 May 2014 04:31:09 -0700 (PDT), Quadibloc
<jsa...@ecn.ab.ca> wrote:

>On Friday, May 30, 2014 12:58:10 AM UTC-6, Nick Maclaren wrote:
>
>> Reverting to the title of the thread, yes, there is still hope;
>> but there ISN'T any hope of a fairy godmother returning us to
>> the situation we were in the 1980s but starting at current scales.
>
>> We need to try new directions.
>
>It's true that there are plenty of different things which give hope.
>
>When it comes to the personal computer, today's personal computers already have the power needed for most typical tasks we presently think of as the domain of the computer. That isn't to say that more computer power wouldn't be nice to have; 3-D virtual reality, for example, might be one application of more power.
>
>But most computer *work* got done well enough by 80386 chips running Windows 3.1, and video playback and multimedia was there back in the days of Windows 98 and the Pentium III.
>
>So, while Ken Olsen was proven wrong in thinking people wouldn't want to have pdp-8s - or even 360/195s - at home, I think I can safely say that we don't _desperately_ need more computer power for home use.
>
>But a big increment of computing power _is_ available there, just by getting rid of bloatware.


The continuous complaint about bloatware has always reminded me of
people's continuous complaints about the lazy, shiftless, whatever,
youth of today. Where "today" is any point in history. Heck,
Aristotle (perhaps apocryphally) quotes Socrates complaining about the
youths of his day, who didn't want to work, wanted only to go to
parties, and were disrespectful of their elders. He may have
complained that they liked to sleep late too.

And yes, most of those productive Windows 3.1 applications you mention
*were* considered bloatware by many people.

And remember when Fortran and Cobol were considered leading causes of bloat by
the assembler guys? Even now there are assembler guys who will argue
that C is slow and inefficient, and C users who complain about C++,
whose users complain about Java...

Which is not to say that I don't agree to a point, but much of the
"bloat" is a tradeoff of development time for (ever cheaper) CPU
cycles, and it's easy to justify the use of very inexpensive CPU
cycles on minor features, or even cosmetics (which, like it or
not, helps sell things).

And yes, back in the day I was in a shop that supported several
hundred users on a 370/148 with a few MB of memory, which is probably
outclassed by my new washing machine. Don't really want to go back.

(above rant not to be taken *too* seriously)

Nick Maclaren

unread,
Jun 3, 2014, 4:03:33 AM6/3/14
to
In article <8t0oo910gnc5ebsbf...@4ax.com>,
Robert Wessel <robert...@yahoo.com> wrote:
>On Fri, 30 May 2014 04:31:09 -0700 (PDT), Quadibloc
><jsa...@ecn.ab.ca> wrote:
>
>>But a big increment of computing power _is_ available there, just by getting rid of bloatware.
>
>The continuous complaint about bloatware has always reminded me of
>people's continuous complaints about he lazy, shiftless, whatever,
>youth of today. Where "today" is any point in history. Heck
>Aristotle (perhaps apocryphally) quotes Socrates complaining about the
>youths of his day, who didn't want to work, wanted only to go to
>parties, and were disrespectful of their elders. He may have
>complained that they liked to sleep late too.

Then perhaps you should read what the informed people say about it
a bit more carefully.

>And yes, most of those productive Windows 3.1 applications you mention
>*were* considered bloatware by many people.
>
>And when Fortran and Cobol were considered leading causes of bloat by
>the assembler guys? Even now there are assembler guys who will argue
>that C is slow and inefficient, and C users who complain about C++,
>whose users complain about Java...

None of that is relevant. While some of those Microsoft programs
could have been counted as bloatware, it has nothing to do with
either featuritis or constant factors (as caused by most choices
of programming language).

>Which is not to say that I don't agree to a point, but much of the
>"bloat" is a tradeoff of development time for (ever cheaper) CPU
>cycles, and it's easy to justify the use of very inexpensive CPU
>cycles on minor features, or even cosmetics (which, like it or
>not, helps sell things).

It is almost never worth paying a potentially exponential cost
to provide enhancements, and that issue is at the heart of the
bloatware debate. If you solve every problem, including internal
restrictions of and interface problems with subcomponents, by
adding new, largely ab initio components, and do that at all
levels, you end up with a superlinear (potentially exponential)
increase in size and complexity relative to functionality.

Not merely does quite a lot of bloatware have no more actual,
usable functionality than less bloated programs a fraction of its
size, my estimate is that quite a lot of bloatware spends 90%+
of its time executing the bloat (as distinct from function).
Even reducing that to 50% would speed the code up by 5 times.
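(Spelling out that arithmetic with a trivial sketch, where the 90/10 split is the illustrative assumption:)

  #include <stdio.h>

  /* Spell out the speedup claim: the useful work stays fixed while
     the "bloat" share of the runtime is reduced. Numbers illustrative. */
  int main(void)
  {
      double work  = 10.0;   /* time spent on actual function */
      double bloat = 90.0;   /* time spent executing bloat (90% of total) */

      double before = work + bloat;
      double after  = work + work;   /* bloat cut until it is 50% of runtime */

      printf("speedup: %.1fx\n", before / after);   /* prints 5.0x */
      return 0;
  }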

Oh, yes, the cost of removing bloat is significant - it's much
easier to just add a new subcomponent and not worry about how
it interacts with the existing scheme. And it provides lots
of opportunities for enhancements in adding yet more bloat to
provide a way of using it in combination with each other
subcomponent separately.

If CPUs were designed like that, they wouldn't work at all.
Yes, they have such aspects, but only as mistakes or warts.


Regards,
Nick Maclaren.

Quadibloc

unread,
Jun 4, 2014, 7:31:29 AM6/4/14
to

Paul A. Clayton

unread,
Jun 5, 2014, 5:14:17 PM6/5/14
to
On Monday, June 2, 2014 12:54:37 AM UTC-4, robert...@yahoo.com wrote:
[snip]
> Well, getting more transistors on the same size chip is also a benefit
> in general. While we're hitting diminishing returns, many structures
> can be made wider and shallower by throwing more transistors at them.

WARNING: Blatant self promotion follows (with a little
"real content" later)

Your post enticed me to share a link to my answer to the
Electrical Engineering Stack Exchange question "Why does
more transistors = more processing power?"
( http://electronics.stackexchange.com/q/5592/15426#answer-45099 ).
Though it is only loosely related to this thread (it did
not have any concern with transistor performance, wire
delay, energy use, etc.), I am somewhat happy with the
answer (though it is poor at giving a sense of scale of
costs in transistors, just lumping many uses into
"more transistors").

(It received the comment "Excellent answer from a new
guy!" and received 10 upvotes. The accepted answer is
probably better in getting at the heart of the matter--
"The only *real* parallelism is what you get when you have
more transistors on the job."--and being concise while
giving general examples. My answer "may be exhausting but
is not exhaustive!" with nearly 5 times as many characters--
5069 vs. 1056--and more than 4 times as many words--747 vs.
181. The accepted answer was also not posted almost two
years after the question!)

[BEGIN "real content"]
[snip]
> And the cost-per-transistor argument may be on the verge of going away
> too - the cost per transistor may be about to stop dropping for
> smaller process generations, and you'll only have whatever advantages
> you can get out of more transistors on a die.

One interesting aspect of this is that even the dark silicon
theory (that performance can increase by using special-purpose
hardware that is not always powered) is impacted. While
dark silicon addresses the power/thermal wall, it does not
address cost-per-transistor issues.

One obvious implication of a slowing of Moore's Law is the
increasing economic justification for more application
specific designs. (Since I like the idea of cooperation and
sharing, Application Specific Integrated Products seem
especially interesting to me. Providing a device which gets
80% or more of the benefits of being application-specific and
several times the volume seems attractive. I also wonder why
things like the multi-mode modem in NVIDIA's Tegra are not
more common.)

> Interesting times.

It's a curse *and* a blessing. While I agree with some of
the sentiment that software could be significantly improved,
this kind of transition also means interesting work for the
hardware folks.

Quadibloc

unread,
Jun 5, 2014, 6:46:03 PM6/5/14
to
On Thursday, June 5, 2014 3:14:17 PM UTC-6, Paul A. Clayton wrote:

> Your post enticed me to share a link to my answer to the
> Electrical Engineering Stack Exchange question "Why does
> more transistors = more processing power?"

Well, there is an *obvious* answer, up to a point.

A PDP-8 has fewer transistors in it than a System/360 Model 195.

In that range, more transistors obviously mean more processing power, because you are switching from doing floating-point calculations in software at a multiple of the machine's native precision to doing them directly, and from using a simple ALU to using more sophisticated methods like Sklansky adders and Wallace trees.

We can call this the Grosch range, the range where Grosch's law applies - computer power is proportional to the square of the cost of the computer.

Past that range, two computers are better than one, but not by nearly as much.
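(Purely to illustrate the shape of Grosch's law rather than any real price list - within the Grosch range, one machine at twice the cost beats two machines at half that cost:)

  #include <stdio.h>
  #include <math.h>

  /* Toy illustration of Grosch's law: power ~ cost^2 inside the
     "Grosch range". The cost figures are arbitrary assumptions.
     Link with -lm. */
  int main(void)
  {
      double k = 1.0;   /* arbitrary constant of proportionality */

      double one_big   = k * pow(2.0, 2.0);       /* one machine, cost 2 */
      double two_small = 2.0 * k * pow(1.0, 2.0); /* two machines, cost 1 each */

      printf("one big machine: %.0f units of power\n", one_big);
      printf("two small ones : %.0f units of power\n", two_small);
      return 0;
  }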

John Savard

Paul A. Clayton

unread,
Jun 7, 2014, 9:59:57 PM6/7/14
to
On Thursday, June 5, 2014 6:46:03 PM UTC-4, Quadibloc wrote:
> On Thursday, June 5, 2014 3:14:17 PM UTC-6, Paul A. Clayton wrote:
>
> > Your post enticed me to share a link to my answer to the
> > Electrical Engineering Stack Exchange question "Why does
> > more transistors = more processing power?"
>
> Well, there is an *obvious* answer, up to a point.

Yeah, I gave the answer that is obvious to anyone familiar
with computer architecture, but it seems that even a fair
number of EEs are not familiar with computer architecture.

> A PDP-8 has fewer transistors in it than a System/360 Model 195.
>
> In that range, more transistors obviously mean more processing
> power, because you are switching from doing floating-point
> calculations in software at a multiple of the machine's native
> precision to doing them directly, and from using a simple ALU to
> using more sophisticated methods like Sklansky adders and
> Wallace trees.
>
> We can call this the Grosch range, the range where Grosch's law
> applies - computer power is proportional to the square of the cost
> of the computer.

The economics of computers is interesting. I suspect that Moore's
Law was reducing the cost of computing by more than just the
transistor doubling. Reducing price/performance can make new uses
economically justifiable, increasing chip volume which reduces
NRE per chip. (Without the increase in volume, following Moore's
Law would presumably not have been practical.)

> Past that range, two computers are better than one, but not by
> nearly as much.

That is somewhat workload dependent; otherwise I doubt the Great
Big Out-of-Order processors would have been as popular, and
chip multiprocessors would have been common much earlier.

Bruce Hoult

unread,
Jun 7, 2014, 10:50:39 PM6/7/14
to
On Sunday, June 8, 2014 1:59:57 PM UTC+12, Paul A. Clayton wrote:
> The economics of computers is interesting. I suspect that Moore's
> Law was reducing the cost of computing by more than just the
> transistor doubling. Reducing price/performance can make new uses
> economically justifiable, increasing chip volume which reduces
> NRE per chip. (Without the increase in volume, following Moore's
> Law would presumably not have been practical.)

I was amazed recently to find this:

http://www.newark.com/nxp/lpc810m021fn8fp/mcu-32bit-cortex-m0-30mhz-dip/dp/55W6726

30 MHz ARM Cortex M0+, 4 KB program space, 1 KB RAM, 8 pin DIP package (6 I/O pins)
$1.40 in quantity 1, $0.53 for 2500+.

That's cheap enough that it wouldn't be totally crazy to use it to implement 2 NAND gates if you had a box of them handy but no NAND gate ICs. Or emulate a 555 timer.

Or, I think you could maybe squeeze AES 256 onto it (code size would be the limiting factor) and encrypt something around a 1 Mbps serial stream with 0.5 ms delay.

(this page says 24 MHz can do 150 KB/sec (1.2 Mbps), so 30 MHz should still do 1+ Mbps even with the overhead of interrupt servicing for serial ports: http://realtimelogic.com/products/sharkssl/Cortex-M0/)
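(Scaling that figure linearly with the clock is a crude assumption - flash wait states and interrupt overhead won't scale the same way - but as a sanity check, with a guessed 20% overhead:)

  #include <stdio.h>

  /* Crude linear scaling of the quoted SharkSSL figure from 24 MHz to
     30 MHz, minus an assumed allowance for serial interrupt overhead.
     The 20% overhead figure is a guess, not a measurement. */
  int main(void)
  {
      double kbytes_at_24mhz = 150.0;                            /* KB/s, quoted */
      double mbps_at_24mhz   = kbytes_at_24mhz * 8.0 / 1000.0;   /* 1.2 Mbps */
      double mbps_at_30mhz   = mbps_at_24mhz * 30.0 / 24.0;      /* 1.5 Mbps */
      double with_overhead   = mbps_at_30mhz * 0.8;              /* assumed 20% lost */

      printf("24 MHz: %.2f Mbps, 30 MHz: %.2f Mbps, with overhead: %.2f Mbps\n",
             mbps_at_24mhz, mbps_at_30mhz, with_overhead);
      return 0;
  }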

Quadibloc

unread,
Jun 9, 2014, 6:41:01 AM6/9/14
to
On Saturday, June 7, 2014 7:59:57 PM UTC-6, Paul A. Clayton wrote:

> That is somewhat workload dependent, otherwise I doubt the Great
> Big Out-of-Order processors would have been as popular and
> chip multiprocessors would have been common much earlier.

It's true that some workloads are highly parallel. And thus that does determine how far one goes into the territory of diminishing returns.

But going from a 486 to a Pentium still provided such obvious gains that I doubt that different workloads would have led to multicore processors in which the cores were like the 486. A computer that does well on any workload is better than one that is limited to doing well on one kind.

John Savard

Paul Wallich

unread,
Jun 9, 2014, 9:22:44 PM6/9/14
to
On 6/7/14 10:50 PM, Bruce Hoult wrote:
> On Sunday, June 8, 2014 1:59:57 PM UTC+12, Paul A. Clayton wrote:
>> The economics of computers is interesting. I suspect that Moore's
>> Law was reducing the cost of computing by more than just the
>> transistor doubling. Reducing price/performance can make new uses
>> economically justifiable, increasing chip volume which reduces
>> NRE per chip. (Without the increase in volume, following Moore's
>> Law would presumably not have been practical.)
>
> I was amazed recently to find this:
>
> http://www.newark.com/nxp/lpc810m021fn8fp/mcu-32bit-cortex-m0-30mhz-dip/dp/55W6726
>
> 30 MHz ARM Cortex M0+, 4 KB program space, 1 KB RAM, 8 pin DIP package (6 I/O pins)
> $1.40 in quantity 1, $0.53 for 2500+.
>
> That's cheap enough that it wouldn't be totally crazy to use it to implement 2 NAND gates if you had a box of them handy but no NAND gate ICs. Or emulate a 555 timer.

Way less power consumption than a 555, less noise, fewer external
components (depending on what you intend to drive). Also digital
filtering of incoming analog signals and low-quality analog output.

It is, in a sense, The Matrix. Only cheaper.

paul

Terje Mathisen

unread,
Jun 10, 2014, 2:49:25 AM6/10/14
to
:-)

At that price point, and with presumably some hw irq capabilities on
those IO ports it could be the basis for an NTP server.

You would need to add an Ethernet port though, preferably one that
supports 1588 TPT with hw timestamping of packets.

At that point the board to mount it on, the gps, the ethernet and the
power supply could all be more expensive than the cpu. :-(

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Jeremy Linton

unread,
Jun 12, 2014, 9:42:18 PM6/12/14
to
On 6/2/2014 12:20 AM, Robert Wessel wrote:
> Which is not to say that I don't agree to a point, but much of the
> "bloat" is a tradeoff of development time for (ever cheaper) CPU
> cycles, and it's easy to justify the use of very inexpensive CPU
> cycles on minor features, or even cosmetics (which, like it or
> not, helps sell things).

There are a couple differences though. For a couple decades the cost of
"bloat" was offloaded to the end users who paid the expense of upgrading
their machines. Today, "modern" network applications costs are more
directly borne by the developer of said applications. In many cases
those costs are not insignificant. Shops which choose to forgo the more
efficient languages are paying 2x-100x performance costs directly.
Instead of having to purchase 1000 machines to host their applications
they have to purchase, power and cool 100,000. Which seems insane if you
consider the disk IOP, throughput and compute power available in a
modern machine compared with the tiny pipes attaching them to the internet.




Quadibloc

unread,
Jun 13, 2014, 5:58:58 AM6/13/14
to
On Thursday, June 12, 2014 7:42:18 PM UTC-6, Jeremy Linton wrote:
> Today, "modern" network applications costs are more
> directly borne by the developer of said applications. In many cases
> those costs are not insignificant. Shops which choose to forgo the more
> efficient languages are paying 2x-100x performance costs directly.

I would assume that SaaS vendors will optimize their code, or the market will take matters into its own (invisible) hands.

John Savard

Bruce Hoult

unread,
Jun 13, 2014, 8:06:01 AM6/13/14
to
On Friday, June 13, 2014 1:42:18 PM UTC+12, Jeremy Linton wrote:
> their machines. Today, "modern" network applications costs are more
> directly borne by the developer of said applications. In many cases
> those costs are not insignificant. Shops which choose to forgo the more
> efficient languages are paying 2x-100x performance costs directly.
> Instead of having to purchase 1000 machines to host their applications
> they have to purchase, power and cool 100,000. Which seems insane if you
> consider the disk IOP, throughput and compute power available in a
> modern machine compared with the tiny pipes attaching them to the internet.

Yes.

Back in the early days of the web, I helped a client (an advertising company) develop a number of sites for their clients. They did the HTML bits. I did anything that needed a database or calculations.

I wrote everything in C++, as libraries that loaded into the address space of the WebStar server (on a PowerPC Mac) and kept all my databases in RAM using arrays and hash tables etc (checkpointed to disk regularly). Had no trouble dealing with hundreds of requests per second and saturating any connection we could get then.

I was *shocked* when I learned that the industry best practice was to spin up a new process for every request. And run a Perl interpreter in it. And connect and disconnect to an SQL database for every request.

Given the power of modern machines, very few sites outside of Google or Facebook should require more than a single machine.

Paul Wallich

unread,
Jun 13, 2014, 1:32:59 PM6/13/14
to
Yep. I did a short project for someone where parsing their logs in
lisp(!) saved hours of time every day over the perl they were using.

But. Consider what an Amazon machine instance costs per hour versus a
good programmer. Throwing hardware at performance problems lets people
put working sites up in internet time instead of
conventional-software-project time. And only sites like Google or
Facebook would see any significant impact to profits (say, bigger than
the tee shirt or bus budget) from recoding to use fewer machines.

Which suggests, though, that there might be a good place for
architectures that have lousy peak performance but run badly-written
code in inefficient languages more effectively than current ones do.

Terje Mathisen

unread,
Jun 13, 2014, 1:41:47 PM6/13/14
to
THAT is not "best practise", it is simply the fastest way to get
something up & running. I use exactly this model for the bespoke system
I wrote for my father-in-law.

3-5 users, 100 Mbit/s network, single (mirrored) server disk, a GB or
two of total DB space.

The only operations which have any (visible) wait time at all are for
global statistics, requiring a full join over the two tables that
contain 99% of the data. Since this is needed maybe once a week or so it
doesn't matter that it takes 10-30 seconds.
>
> Given the power of modern machines, very few sites outside of Google
> or Facebook should require more than a single machine.

Back in the 386 (!) days, NetWare saturated multiple (3 or 4) 100 MBit/s
Ethernet interfaces, streaming random (selected by the clients) video
streams to a bunch of clients.

The AES candidate code we optimized was considered "done" when we could
run full duplex 100 Mbit/s on a single PentiumPro cpu.

I.e. most people don't have multi-Gbit/s server connections, so they
should in fact handle wire speed on a single cpu, running all
connections over https.

Nick Maclaren

unread,
Jun 13, 2014, 1:43:33 PM6/13/14
to
In article <05d8da92-a2e9-452c...@googlegroups.com>,
Bruce Hoult <bruce...@gmail.com> wrote:
>
>Back in the early days of the web, I helped a client (an advertising
>company) develop a number of sites for their clients. They did the HTML
>bits. I did anything that needed a database or calculations.
>
>I wrote everything in C++, as libraries that loaded into the address space
>of the WebStar server (on a PowerPC Mac) and kept all my databases in RAM
>using arrays and hash tables etc (checkpointed to disk regularly). Had no
>trouble dealing with hundreds of requests per second and saturating any
>connection we could get then.
>
>I was *shocked* when I learned that the industry best practice was to spin
>up a new process for every request. And run a Perl interpreter in it. And
>connect and disconnect to an SQL database for every request.

Well, actually, I am shocked that you were shocked! There are very,
very good reasons to use processes rather than threads, in any
application where RAS or security is involved. Almost all threading
interfaces are really, really, bad news in those respects :-(

If, of course, neither RAS nor security actually matters, then you
don't need processes, but they still make a lot of things much
easier.


Regards,
Nick Maclaren.

Quadibloc

unread,
Jun 13, 2014, 4:54:40 PM6/13/14
to
I just came across this article,

http://www.theregister.co.uk/2014/05/22/energy_economics_coal/

which gives very encouraging news. The world is not going to run out of Hafnium any time soon. Or, more importantly for the subject of this thread, Indium either.

John Savard

Bruce Hoult

unread,
Jun 13, 2014, 10:25:51 PM6/13/14
to
On Saturday, June 14, 2014 5:32:59 AM UTC+12, Paul Wallich wrote:
> Yep. I did a short project for someone where parsing their logs in
> lisp(!) saved hours of time every day over the perl they were using.

Nothing wrong with lisp, with any compiler from the last 15 or 20 years (e.g. CMUCL)


> But. Consider what an Amazon machine instance costs per hour versus a
> good programmer. Throwing hardware at performance problems lets people
> put working sites up in internet time instead of
> conventional-software-project time.

That assumes that more productive languages run slower than less productive ones, which does not have to be the case. Certainly that's true if you compare Ruby to C, but there are a number of languages with as good libraries (or at least the ability to create libraries), metaprogramming, and conciseness as Ruby or Perl while running within a factor of 1.5 or 2 (at worst) of C. Lisp is one of them. Objective C is not bad (and Swift will be better).

Dylan is slightly less dynamic than Lisp or ObjC, but my programs built with Gwydion d2c regularly run within 10% of C, while being many times faster to write and more reliable and robust -- see my Dylan Hackers team's record in the ICFP Programming Contest, especially the prizes in 2001, 2003, and 2005, though we were always in the top 10% of entries.

Paul Wallich

unread,
Jun 14, 2014, 12:02:41 AM6/14/14
to
I think we're mostly in violent agreement. There's nothing that
constrains decent-to-code-in languages from running quickly. It's simply
that writing in a faster-to-code, slower-to-run language is still
probably a good tradeoff given the current prices of cycles, RAM and
disk vs the current prices of coders and designers.

If you can find fast-to-code and fast-to-run in the same language and
team then you win even more. But it does make you wonder if there's an
architecture that would execute crap code significantly faster even at
the cost of way lower peak performance on the good stuff.

paul

Tom Gardner

unread,
Jun 14, 2014, 6:07:32 AM6/14/14
to
On 14/06/14 05:02, Paul Wallich wrote:
> There's nothing that constrains decent-to-code-in languages from running quickly. It's simply that writing in a faster-to-code, slower-to-run language is still probably a good tradeoff given the
> current prices of cycles, RAM and disk vs the current prices of coders and designers.

And if you were making this argument in a commercial/professional
context, I'm sure you wouldn't forget to add in "easier to maintain
and enhance" as a significant criteria.

Nick Maclaren

unread,
Jun 14, 2014, 6:47:22 AM6/14/14
to
In article <F9Vmv.312054$b96.1...@fx09.am4>,
And then ignore it :-(


Regards,
Nick Maclaren.

Quadibloc

unread,
Jun 14, 2014, 10:40:33 AM6/14/14
to
On Friday, June 13, 2014 10:02:41 PM UTC-6, Paul Wallich wrote:
> It's simply
> that writing in a faster-to-code, slower-to-run language is still
> probably a good tradeoff given the current prices of cycles, RAM and
> disk vs the current prices of coders and designers.

That may be true, but when you're doing SaaS, and paying for the cycles used by thousands of users, the equation does change.

It's not as if C or Fortran or Pascal are all that much worse than Python or Perl or Lisp or APL either. And it isn't just choice of language. The bloated software for the PC written by Microsoft and others has usually been written in C or C++.

John Savard

MitchAlsup

unread,
Jun 14, 2014, 6:11:55 PM6/14/14
to
On Friday, June 13, 2014 12:43:33 PM UTC-5, Nick Maclaren wrote:
> In article <05d8da92-a2e9-452c...@googlegroups.com>,
>
> Bruce Hoult <bruce...@gmail.com> wrote:

> >I was *shocked* when I learned that the industry best practice was to spin =
> >up a new process for every request.
>
> Well, actually, I am shocked that you were shocked! There are very,
> very good reasons to use processes rather than threads, in any
> application where RAS or security is involved.

Last summer, a Jr programmer and I met at my local bar. He was having performance problems with one of his database applications (hotel management). As he explained the problem it became clear that he was spending more time allocating stuff at the beginning of a "request" and in finding and deallocating all the allocated stuff at the end of the "request" than he was in actually performing the request. He thought he had to do it like this because of the extensive use of threads in the application.

I advised him that he should quit trying to do all this work in the application, but instead spin off a new process, perform the request, then let the task die and have the OS reclaim the allocated resources.

One week later he had that up and running at more than 2X the performance of the previous incarnation!

There are certain things that are better done en masse, even if you have to let the OS do it/them for you.
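(A minimal sketch of the pattern on a POSIX system; handle_request() and its deliberately-leaked allocations are stand-ins for whatever the real application does per request.)

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  /* Minimal sketch of "one process per request": the child allocates
     whatever it needs, never bothers to free it, and exits; the OS
     reclaims everything in one sweep. handle_request() is a stand-in. */
  static void handle_request(int id)
  {
      for (int i = 0; i < 1000; i++)
          (void)malloc(1024);        /* deliberately never freed */
      printf("request %d done\n", id);
  }

  int main(void)
  {
      for (int id = 0; id < 4; id++) {
          pid_t pid = fork();
          if (pid == 0) {            /* child: do the work and die */
              handle_request(id);
              _exit(0);              /* OS reclaims every allocation here */
          }
      }
      while (wait(NULL) > 0)         /* parent: just reap the children */
          ;
      return 0;
  }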

Mitch

Paul Wallich

unread,
Jun 15, 2014, 12:42:37 PM6/15/14
to
These days so few people wait to finish something before going live that
it's just one big never-ending lump of coding.

paul

Rick Jones

unread,
Jun 16, 2014, 2:56:30 PM6/16/14
to
Paul Wallich <p...@panix.com> wrote:
> But. Consider what an Amazon machine instance costs per hour versus
> a good programmer. Throwing hardware at performance problems lets
> people put working sites up in internet time instead of
> conventional-software-project time. And only sites like Google or
> Facebook would see any significant impact to profits (say, bigger
> than the tee shirt or bus budget) from recoding to use fewer
> machines.

Now multiply that by thousands if not millions of people simply
throwing another cloud instance at something. Doesn't that turn the
various cloud providers into something more like what you are
envisioning for Google or Facebook?

rick jones
--
portable adj, code that compiles under more than one compiler
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

Quadibloc

unread,
Jun 16, 2014, 10:05:39 PM6/16/14
to
Incidentally, just saw a news item about Microsoft using FPGAs to massively increase its bang for the buck in handling some stuff in the cloud.

http://www.theregister.co.uk/2014/06/16/microsoft_catapult_fpgas/

John Savard

Stephen Sprunk

unread,
Jun 17, 2014, 1:32:37 AM6/17/14
to
"Massive"? According to that article, they only gained 95% (less than
2x) at a cost of 10% more power, or 77% improvement in perf/Watt. That
doesn't seem to justify the extra development costs.
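
(Working the numbers from the article: 1.95x the throughput at 1.10x the
power is 1.95 / 1.10 ~= 1.77, i.e. about 77% more work per watt.)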

I suspect you could get at least as much by better tuning. For
instance, during routine capacity testing of 2/4/8 CPU systems, one of
the testers accidentally set a VM for 3 CPUs--and it outperformed an 8
CPU system. WTF? We've spent months benchmarking our code to find the
issue, but so far it appears to be somewhere down in the JVM and/or OS.
A properly tuned stack should _never_ exhibit such behavior.

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

Tom Gardner

unread,
Jun 17, 2014, 2:20:57 AM6/17/14
to
On 17/06/14 06:32, Stephen Sprunk wrote:
> On 16-Jun-14 21:05, Quadibloc wrote:
>> Incidentally, just saw a news item about Microsoft using FPGAs to
>> massively increase its bang for the buck in handling some stuff in
>> the cloud.
>>
>> http://www.theregister.co.uk/2014/06/16/microsoft_catapult_fpgas/
>
> "Massive"? According to that article, they only gained 95% (less than
> 2x) at a cost of 10% more power, or 77% improvement in perf/Watt. That
> doesn't seem to justify the extra development costs.

Apparently the financial community casts *financial* algorithms
into FPGAs in order to reduce the latency by a few milli/microseconds.
Apparently that is worth big bucks to the high frequency trading mob.


> I suspect you could get at least as much by better tuning. For
> instance, during routine capacity testing of 2/4/8 CPU systems, one of
> the testers accidentally set a VM for 3 CPUs--and it outperformed an 8
> CPU system. WTF? We've spent months benchmarking our code to find the
> issue, but so far it appears to be somewhere down in the JVM and/or OS.
> A properly tuned stack should _never_ exhibit such behavior.

The high performance computing fraternity has been all too
aware of similar effects for a long time now. The core issues
are
- the stacks are not intolerably appalling for average
workloads and average development/deployment staff
- too few people understand all levels of a stack (i.e.
from L1 through NUMA and networking to multithreading
and application), let alone how they interact

Hell's teeth: in my experience you are lucky to find someone
that realises that you have asynchronous comms protocols
built on top of synchronous comms protocols (multiple times!)

Nick Maclaren

unread,
Jun 17, 2014, 4:18:48 AM6/17/14
to
In article <d7Rnv.386405$B%3.2...@fx23.am4>,
Tom Gardner <spam...@blueyonder.co.uk> wrote:
>
>Hell's teeth: in my experience you are lucky to find someone
>that realises that you have asynchronous comms protocols
>built on top of synchronous comms protocols (multiple times!)

God help us all, yes :-( I have been trying to explain that this
is a significant problem for many decades, often to so-called
experts, and my success rate is low. Because they don't have a
'feel' for asynchronism and pipelining, they can't get their
head around the issues, and keep denying they exist.

The latest lunatic dogma is that you can make any synchronous
mechanism (including communications) asynchronous by simply
running it in a separate thread. Well, yes, in the same sense
that you can add SIMD support to any CPU by providing opcodes
with a SIMD interface and emulating it using serial logic.


Regards,
Nick Maclaren.

Tom Gardner

unread,
Jun 17, 2014, 4:34:43 AM6/17/14
to
On 17/06/14 09:18, Nick Maclaren wrote:
> In article <d7Rnv.386405$B%3.2...@fx23.am4>,
> Tom Gardner <spam...@blueyonder.co.uk> wrote:
>>
>> Hell's teeth: in my experience you are lucky to find someone
>> that realises that you have asynchronous comms protocols
>> built on top of synchronous comms protocols (multiple times!)
>
> God help us all, yes :-( I have been trying to explain that this
> is a significant problem for many decades, often to so-called
> experts, and my success rate is low. Because they don't have a
> 'feel' for asynchronism and pipelining, they can't get their
> head around the issues, and keep denying they exist.

My least favourite example, perpetrated by large companies
and conslutants (sic) was SOAP and Web Services - RPC
over HTTP sanctified by the use of XML. Took 10 years
before people twigged it wasn't making good use (to put
it mildly) of HTTP/web infrastructure.


> The latest lunatic dogma is that you can make any synchronous
> mechanism (including communications) asynchronous by simply
> running it in a separate thread. Well, yes, in the same sense
> that you can add SIMD support to any CPU by providing opcodes
> with a SIMD interface and emulating it using serial logic.

Oh, that's come around /again/, has it? Was it really
only 5 years ago that everybody realised it was A Bad
Thing? Institutional memory seems to be suffering from
something akin to attention deficit disorder. Sigh.

Michael S

unread,
Jun 17, 2014, 5:37:16 AM6/17/14
to
Why is it (emulating a SIMD interface using serial logic) A Bad Thing? I tend to think that it is An Excellent Thing, as long as everything is properly pipelined (the original Atom DP SIMD is an example of failure on that front, but successful pipelining is more common) and the ratio between logical and physical execution width is in a proper range, i.e. 4 to 8. Ratios of 2 and 16 could also work in some situations, but are, IMHO, suboptimal.

According to my understanding, early Cray machines had a logical-to-physical ratio of 64. Maybe, under the limitations of 35 years ago, it was a good decision, but under today's circumstances it sounds crazy.

Nick Maclaren

unread,
Jun 17, 2014, 5:59:17 AM6/17/14
to
In article <D4Tnv.56697$0p4....@fx29.am4>,
I may be out of phase, and it may still be (temporarily) out,
at least in most areas. But it got into C++11 :-(

At my age, I find it difficult to track fads with a period of
less than about a decade - they seem to pass me in a flash!
But there are always some people who hold to debunked dogmas
past the time that almost everyone has abandoned them.


Regards,
Nick Maclaren.

Tom Gardner

unread,
Jun 17, 2014, 6:26:05 AM6/17/14
to
I can tolerate people holding onto cherished ideas; that's
natural and is akin to tenacity - which can be beneficial.

What I find harder to tolerate is the ignorant triumphant
re-invention of elliptical wheels. I taught my daughter
that it is OK to make mistakes, provided they are
/new/ mistakes.

Nick Maclaren

unread,
Jun 17, 2014, 6:33:13 AM6/17/14
to
In article <1JUnv.30243$IC3....@fx35.am4>,
That's the definition of research :-)


Regards,
Nick Maclaren.

Quadibloc

unread,
Jun 17, 2014, 8:27:10 AM6/17/14
to
On Tuesday, June 17, 2014 3:37:16 AM UTC-6, Michael S wrote:

> Why is it (emulating SIMD interface using serial logic) A Bad Thing?

Oh, it isn't bad in itself. But if you use the emulated SIMD to emulate serial floating point of another kind, *then* you've just bought yourself a whopping performance hit. It's SIMD as an intermediate stage that's bad.

John Savard

Anne & Lynn Wheeler

unread,
Jun 17, 2014, 10:26:50 AM6/17/14
to

Tom Gardner <spam...@blueyonder.co.uk> writes:
> My least favourite example, perpetrated by large companies
> and conslutants (sic) was SOAP and Web Services - RPC
> over HTTP sanctified by the use of XML. Took 10 years
> before people twigged it wasn't making good use (to put
> it mildly) of HTTP/web infrastructure.

HTTP is a transaction-oriented protocol using TCP, a session-oriented
protocol ... there was a period in 1995, as server load started to scale up,
when server CPUs were spending 95% of their time in FINWAIT list processing.
Most systems had assumed the FINWAIT list would only have a trivial few
items (the list of recently closed TCP sessions, checked against each
incoming packet to see if it belonged to a session that had already been
closed). HTTP transactions over TCP were exploding the FINWAIT list to tens
of thousands of items. It took six months or so before vendors started
shipping rewritten FINWAIT list handling that improved on the problem.

In the meantime, large web operations were adding machines like mad and
working on gimmicks for load-balancing across the available backend
servers. First came boundary routers sending connects to randomly selected
backend servers; then the early implementations of boundary routers that
kept track of the number of sessions to each of the backend servers ... an
implementation that continued to be used after the improved FINWAIT list
processing was deployed.

--
virtualization experience starting Jan1968, online at home since Mar1970

Michael S

unread,
Jun 17, 2014, 11:07:18 AM6/17/14
to
On Tuesday, June 17, 2014 11:18:48 AM UTC+3, Nick Maclaren wrote:
> In article <d7Rnv.386405$B%3.2...@fx23.am4>,
>
> Tom Gardner <spam...@blueyonder.co.uk> wrote:
> >
>
> >Hell's teeth: in my experience you are lucky to find someone
> >that realises that you have asynchronous comms protocols
> >built on top of synchronous comms protocols (multiple times!)
>
> God help us all, yes :-( I have been trying to explain that this
> is a significant problem for many decades, often to so-called
> experts, and my success rate is low. Because they don't have a
> 'feel' for asynchronism and pipelining, they can't get their
> head around the issues, and keep denying they exist.
>
> The latest lunatic dogma is that you can make any synchronous
> mechanism (including communications) asynchronous by simply
> running it in a separate thread.

Can you define in reasonably precise terms what you mean by "asynchronous comms protocols" and "synchronous comms protocols", or point me to someone else's definition that you consider adequate?
Because quick googling only brought up 963,000 explanations of the difference between a UART and various sync ports. As expected. The 900-page Andrew S. Tanenbaum book was hardly more helpful. I am pretty sure that's *not* what you meant above.

Michael S

unread,
Jun 17, 2014, 11:09:45 AM6/17/14
to
I probably should have asked Tom rather than Nick, sorry.
But since Nick obviously understood what Tom was talking about, I'll leave it as it is.

Nick Maclaren

unread,
Jun 17, 2014, 11:50:04 AM6/17/14
to
In article <01ffc98f-159f-44c2...@googlegroups.com>,
Well, it's not just SIMD. You get a similar hit if you run a SIMD
program on SIMD hardware with an intermediate serial stage! It's
flipping between them that causes the trouble.


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Jun 17, 2014, 12:08:07 PM6/17/14
to
In article <7b734302-3479-4bf7...@googlegroups.com>,
Michael S <already...@yahoo.com> wrote:
>
>Can you define in reasonably precise terms what you mean by
>"asynchronous comms protocols" and "synchronous comms protocols"
>or to point me to someone's else definition that you consider
>adequate?

No, unfortunately :-( But I can describe the properties.

With a synchronous transfer interface, the transfer is completed
by the time the call returns.

With a fully asynchronous one, it is started by one call and then
proceeds in parallel; the caller is notified in some way (e.g. by
a signal or interrupt) when it has completed.

Most asynchronous interfaces have a separate call to either test
for or wait for that signal or interrupt, rather than exposing the
signal or interrupt directly.

A fully asynchronous protocol uses asynchronism at all levels, and
the program is completely decoupled from the transfer. If no
reblocking or reformatting is needed, it needs no intermediate
buffers and can block only if the destination is unavailable.
Very high bandwidths are achievable with negligible memory overhead.

With a synchronous step in an asynchronous pipeline, the transfer
has to take place and be waited for before the next action is
taken. The usual solution to the performance loss is to use
massive buffers, which introduces its own problems.

Asynchronism also handles expedited transfers ('pushes') better,
but that's more subtle, and not strictly due to the asynchronism.
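
A minimal sketch of the two interface styles with POSIX aio, where read()
would be the synchronous case (error handling trimmed; the busy-poll stands
in for whatever completion signal, interrupt or wait call real code would
use, and the file name is arbitrary):

    // aio_read() starts the transfer and returns at once; the caller later
    // tests for (aio_error) or collects (aio_return) the completion.
    // A plain read() on the same fd would not return until the transfer
    // had completed.  Link with -lrt on older glibc.
    #include <aio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstring>
    #include <cstdio>

    int main() {
        char buf[4096];
        int fd = open("/etc/hostname", O_RDONLY);   // any readable file
        if (fd < 0) return 1;

        struct aiocb cb;
        std::memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) return 1;   // transfer started; call has returned

        // ... useful work proceeds here, in parallel with the transfer ...

        while (aio_error(&cb) == EINPROGRESS)
            ;                               // test; aio_suspend() would wait instead

        ssize_t got = aio_return(&cb);      // result of the completed transfer
        std::printf("read %zd bytes\n", got);
        close(fd);
        return 0;
    }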


Regards,
Nick Maclaren.

Rick Jones

unread,
Jun 17, 2014, 12:54:01 PM6/17/14
to
Anne & Lynn Wheeler <ly...@garlic.com> wrote:
> HTTP transaction oriented protocol using TCP, a session oriented
> protocol ... there was period in 1995 as server load started to
> scaleup, server CPUs were spending 95% of the time in FINWAIT list
> processing. Most systems had assumed FINWAIT list would only have a
> trivial few items (list of recently closed TCP sessions to check
> each incoming packet, if it belonged to session that had already
> been closed). HTTP transactions using TCP was exploding items on
> FINWAIT list to tens of thousands. It took six months or so before
> vendors started shipping rewritten FINWAIT list handling that
> improved on the problem.

Which FIN_WAIT? FIN_WAIT_1 or FIN_WAIT_2? I'm assuming FIN_WAIT_2
because FIN_WAIT_1 is an active retransmission state. FIN_WAIT_2 was
an oops/hole in TCP's specs (IIRC) because it ass-u-me-d that there
would be a FIN (or RST) from the remote TCP in a "reasonable" length
of time.

Of course, couple that with clients bogusly using an abortive (RST - in
what may have been their ill-conceived kludge to deal with TIME_WAIT),
non-retransmitted close, plus the inevitable packet losses... and it gets
even worse.

rick jones
--
The computing industry isn't as much a game of "Follow The Leader" as
it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose."
- Rick Jones

Jeremy Linton

unread,
Jun 17, 2014, 6:54:05 PM6/17/14
to
On 6/17/2014 12:32 AM, Stephen Sprunk wrote:
> "Massive"? According to that article, they only gained 95% (less than
> 2x) at a cost of 10% more power, or 77% improvement in perf/Watt. That
> doesn't seem to justify the extra development costs.

The _pilot_ program is 1632 machines. The actual number of machines
running Bing is? (Microsoft says it has over a million machines in
its datacenters.)

Let's say it's 50k machines, so now they only need 25k; make up some
reasonable numbers for the machine/power costs and the savings are easily
in the tens of millions of dollars. That is more than enough to dedicate
a handful of engineers to moving the critical portions of the
computation into an FPGA. Especially if it moves the optimization
bottleneck to an area that can be further optimized.
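
Rough arithmetic with made-up but plausible numbers: if each retired
server would have cost on the order of $2,000 a year in power, space and
amortized hardware, then 25,000 fewer servers is roughly 25,000 x $2,000 =
$50M a year, which is comfortably in the tens of millions even if the real
per-server figure is several times lower.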

Hence my previous comment about how software performance matters again:
instead of pushing the cost of upgrading hundreds of thousands of machines
onto their customers, they are paying it directly. SaaS is much more like
traditional manufacturing; shave a few cents off a big enough production
run and it can really affect the bottom line.








Mike Stump

unread,
Jun 18, 2014, 3:34:03 AM6/18/14
to
In article <lnpp57$oih$1...@needham.csi.cam.ac.uk>,
Nick Maclaren <nm...@cam.ac.uk> wrote:
>In article <7b734302-3479-4bf7...@googlegroups.com>,
>Michael S <already...@yahoo.com> wrote:
>>
>>Can you define in reasonably precise terms what you mean by
>>"asynchronous comms protocols" and "synchronous comms protocols"
>>or to point me to someone's else definition that you consider
>>adequate?
>
>No, unfortunately :-( But I can describe the properties.

So, read up on the languages go (async channel) and erlang. For
general programming, read up on futures or even aio,
http://man7.org/linux/man-pages/man7/aio.7.html to get a feel for the
difference between sync and async.

I liked your description. I'd view it this way, async rules, as you
can build sync with async; but, if you only have sync, you're screwed.
sync just waits for the action to be done, async merely queues the
action up and returns, immediately.

For example, in:

int temp = get_temperature ();
calculate_pi ();
printf("The temp is %d", temp);

imagine that get_temperature hits up a web server with the temperature
in your city. Let's say the server takes 0.5 seconds to get the
temperature and 0.5 seconds to communicate it to you. Let's say the
calculate_pi takes 1 second. This program then runs in 2 seconds.

Now imagine:

future int temp = get_temperature ();
calculate_pi ();
printf("The temp is %d", temp);

The variable temp is a future for the temperature. The first line
runs in 0.0 seconds. The next line runs in 1 second, and the next
line runs in 0 seconds, for a total time of 1 second. This is 2x the
performance for 1 extra word.

Now, how is this so? The request is started asynchronously to get the
temperature and queued in the first line. The communications happens
as pi is being calculated, and by the time pi is finished, we will
have the answer back, and placed into temp for direct use. In:

future int temp = get_temperature ();
printf("The temp is %d", temp);

the first line runs in 0 seconds, and the evaluation of temp requires
1 second, and the rest of the line is 0 seconds, for a net total of 1
second. For completeness:

future int temp = get_temperature ();
printf("The temp is %d", temp);
calculate_pi ();

runs in 2 seconds. Futures are a way to make use of an async service
in a program that was written as a synchronous program, so that one can
quickly and easily transition to an underlying async world and start
seeing the benefits sooner, without massive rewriting of that code.
The programming model is compatible with most people's conceptions of
how programming works, so it makes it easier to transition to. C++ has
futures in it, if you want to see them in the context of a C style
language.
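
A minimal runnable version of the same example with C++11 std::future,
where get_temperature() and calculate_pi() are hypothetical stand-ins
that just sleep for the stated times:

    #include <future>
    #include <chrono>
    #include <thread>
    #include <cstdio>

    // Stand-in for the slow web request (1 second end to end).
    static int get_temperature() {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        return 21;
    }

    // Stand-in for the local computation (1 second).
    static void calculate_pi() {
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    int main() {
        // Start the request asynchronously; 'temp' is the future for its result.
        std::future<int> temp = std::async(std::launch::async, get_temperature);

        calculate_pi();                  // overlaps with the outstanding request

        // get() blocks only if the result has not arrived yet, so the whole
        // program takes about 1 second instead of 2.
        std::printf("The temp is %d\n", temp.get());
        return 0;
    }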

Tom Gardner

unread,
Jun 18, 2014, 4:13:13 AM6/18/14
to
On 18/06/14 08:34, Mike Stump wrote:
> In article <lnpp57$oih$1...@needham.csi.cam.ac.uk>,
> Nick Maclaren <nm...@cam.ac.uk> wrote:
>> In article <7b734302-3479-4bf7...@googlegroups.com>,
>> Michael S <already...@yahoo.com> wrote:
>>>
>>> Can you define in reasonably precise terms what you mean by
>>> "asynchronous comms protocols" and "synchronous comms protocols"
>>> or to point me to someone's else definition that you consider
>>> adequate?
>>
>> No, unfortunately :-( But I can describe the properties.
>
> So, read up on the languages go (async channel) and erlang. For
> general programming, read up on futures or even aio,
> http://man7.org/linux/man-pages/man7/aio.7.html to get a feel for the
> different between sync and async.
>
> I liked your description. I'd view it this way, async rules, as you
> can build sync with async; but, if you only have sync, you're screwed.
> sync just waits for the action to be done, async merely queues the
> action up and returns, immediately.

Unfortunately not, and therein lies one source of inefficiency.

Consider, as merely one example, JMS. This is used by application
programmers to send a message from one server to another, with no
response expected. Typically they regard it as sort of very
vaguely equivalent to a reliable email service, i.e. a low level
fast primitive. But they also rely on it being "reliable" (i.e.
they don't have to - and don't - consider failure/recovery
mechanisms).

Thus at the application level JMS is an asynchronous protocol.

The "reliability" implies guaranteed once and only once
delivery, so JMS is of course built on top of TCP and requires
multiple round-trips between the two servers. TCP is,
of course, a synchronous protocol built on IP which is
an asynchronous protocol built on top of... You get the
idea.

Other examples, arguably with less inefficiency, are any
telecom protocols where one-way asynchronous events such as
"picked up handset" pass between FSMs in servers scattered
across the globe. The events are transmitted on top of TCP.


Michael S

unread,
Jun 18, 2014, 5:01:24 AM6/18/14
to
On Tuesday, June 17, 2014 7:08:07 PM UTC+3, Nick Maclaren wrote:
> In article <7b734302-3479-4bf7...@googlegroups.com>,
>
> Michael S <already...@yahoo.com> wrote:
>
> >
> >Can you define in reasonably precise terms what you mean by
> >"asynchronous comms protocols" and "synchronous comms protocols"
> >or to point me to someone's else definition that you consider
> >adequate?
>
> No, unfortunately :-( But I can describe the properties.
>
> With a synchronous transfer interface, the transfer is completed
> by the time the call returns.
>
> With a fully asynchronous one, it is started by one call and then
> proceeds in parallel; the caller is notified in some way (e.g. by
> a signal or interrupt) when it has completed.
>
> Most asynchronous interfaces have a separate call to either test
> for or wait for that signal or interrupt, rather than exposing the
> signal or interrupt directly.
>

That's asynchronous vs synchronous (a.k.a non-blocking vs blocking) APIs. I am not sure that it is what Tom meant.

> A fully asynchronous protocol uses asynchronism at all levels, and
> the program is completely decoupled from the transfer.

I think consistent asynchronism at all levels is extremely rare in general-purpose (as opposed to embedded) computing. Suppose you are initiating a packet transfer with an AIO OS call. You are initiating an async action, but you are doing it with a synchronous tool (the OS call).
Digging deeper, pretty much nothing can be done without memory reads. And, ignoring exotics for the sake of brevity, memory reads have synchronous semantics.

Michael S

unread,
Jun 18, 2014, 5:08:53 AM6/18/14
to
Does not the presence of a sliding window place TCP into a neither-sync-nor-async class?

> which is
> an asynchronous protocol built on top of... You get the idea.
>

Maybe I got the idea. But I am not sure that I did. It's still too vague in my mind. Still, the gray area is as big as the black (sync) or the white (async).

Nick Maclaren

unread,
Jun 18, 2014, 6:31:37 AM6/18/14
to
In article <74c1ef3b-7076-44af...@googlegroups.com>,
Michael S <already...@yahoo.com> wrote:
>
>> A fully asynchronous protocol uses asynchronism at all levels, and
>> the program is completely decoupled from the transfer.
>
>I think, consistent asynchronism at all levels is extremely rare in
>general-purpose (as opposed to embedded) computing. Suppose, you are
>initiating packet transfer with AIO OS call. You are initiating async
>action, but you are doing it with synchronous tool (OS call).

Not really. Obviously the copying of the arguments has to be synchronous,
but none of the real work does. Asynchronism at all levels is rare
nowadays outside the embedded world, true, but it used not to be and
many of the mainframes had it. That is how they could outperform
modern systems (relative to their 'clock rate') so spectacularly on
so many workloads.

>Digging deeper, pretty much nothing can be done without memory reads.
>And, ignoring exotics for sake of brevity, memory reads have synchronous
>semantic.

Most do, but not all. Quite a few modern architectures have had
asynchronous features even on memory reads.


Regards,
Nick Maclaren.

Ivan Godard

unread,
Jun 18, 2014, 6:49:36 AM6/18/14
to
One can see the Mill's deferred loads as a form of asynchronous memory
access. And likewise for OOO access reordering, and even fetch-ahead
predictors.

Tom Gardner

unread,
Jun 18, 2014, 6:54:07 AM6/18/14
to
On 18/06/14 10:01, Michael S wrote:
> Digging deeper, pretty much nothing can be done without memory reads. And, ignoring exotics for sake of brevity, memory reads have synchronous semantic.

Have you considered the operation of and interaction
between multicore processors, multilevel caches,
cache-memory interconnects, cache invalidation, NUMA?

Most of those have been available on home consumer
machines for a decade so can't be considered "exotic".

Michael S

unread,
Jun 18, 2014, 8:29:50 AM6/18/14
to
On Wednesday, June 18, 2014 1:54:07 PM UTC+3, Tom Gardner wrote:
> On 18/06/14 10:01, Michael S wrote:
>
> > Digging deeper, pretty much nothing can be done without memory reads. And, ignoring exotics for sake of brevity, memory reads have synchronous semantic.
>
>
>
> Have you considered the operation of and interaction
> between multicore processors, multilevel caches,
> cache-memory interconnects, cache invalidation, NUMA?
>

Yes, I did.
Implementation can be asynchronous or a mix of sync/async (blocking/non-blocking), but the semantics of a memory read within a given thread of execution are synchronous (=blocking).

Terje Mathisen

unread,
Jun 18, 2014, 9:00:34 AM6/18/14
to
Nick Maclaren wrote:
> In article <74c1ef3b-7076-44af...@googlegroups.com>,
> Michael S <already...@yahoo.com> wrote:
>>
>>> A fully asynchronous protocol uses asynchronism at all levels, and
>>> the program is completely decoupled from the transfer.
>>
>> I think, consistent asynchronism at all levels is extremely rare in
>> general-purpose (as opposed to embedded) computing. Suppose, you are
>> initiating packet transfer with AIO OS call. You are initiating async
>> action, but you are doing it with synchronous tool (OS call).
>
> Not really. Obviously the copying of the arguments has to be synchronous,
> but none of the real work does. Asynchronism at all levels is rare
> nowadays outside the embedded world, true, but it used not to be and
> many of the mainframes had it. That is how they could outperform
> modern systems (relative to their 'clock rate') so spectacularly on
> so many workloads.

Doing absolutely everything async was how I could fit a shared printer
driver inside 1600 bytes:

Local I/O to serial/parallel printer ports ran on interrupts, any needed
polling from the 18 Hz timer, and all network activity was based on
setting p a couple of NetWare's Async Control Blocks which would do a
realtime callback to my code whenever something interesting, like a
packet arrival, happened.

More important in some ways was the fact that this way of thinking
about the problem made it possible to write everything in a single
5-hour session, and to have the code correct and working on the first
try, after I fixed just a couple of syntax (i.e. typing/spelling)
errors so it would assemble.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Quadibloc

unread,
Jun 18, 2014, 9:34:31 AM6/18/14
to
On Wednesday, June 18, 2014 4:54:07 AM UTC-6, Tom Gardner wrote:

> Have you considered the operation of and interaction
> between multicore processors, multilevel caches,
> cache-memory interconnects, cache invalidation, NUMA?

> Most of those have been available on home consumer
> machines for a decade so can't be considered "exotic".

Most is right. NUMA, at least, hasn't yet made its appearance in consumer home computers. Yet. But maybe in the next generation of game consoles...

John Savard

Michael S

unread,
Jun 18, 2014, 10:43:52 AM6/18/14
to
On Wednesday, June 18, 2014 4:34:31 PM UTC+3, Quadibloc wrote:
> On Wednesday, June 18, 2014 4:54:07 AM UTC-6, Tom Gardner wrote:
>
>
>
> > Have you considered the operation of and interaction
>
> > between multicore processors, multilevel caches,
>
> > cache-memory interconnects, cache invalidation, NUMA?
>
>
>
> > Most of those have been available on home consumer
>
> > machines for a decade so can't be considered "exotic".
>
>
>
> Most is right. NUMA, at least, hasn't yet made its appearance in consumer home computers.

http://en.wikipedia.org/wiki/AMD_Quad_FX_platform
Do you count gaming enthusiasts as consumers?

Quadibloc

unread,
Jun 18, 2014, 4:07:56 PM6/18/14
to
On Wednesday, June 18, 2014 8:43:52 AM UTC-6, Michael S wrote:

> http://en.wikipedia.org/wiki/AMD_Quad_FX_platform
>
> Do you count gaming enthusiasts as consumers?

OK, I'm mystified.

I could see NUMA on a game console. After all, the games and the system software would be written around the console.

AFAIK, Microsoft Windows only supports SMP; it couldn't *run* on NUMA hardware.

Now, there is the Steam OS, and there are games that run on Linux, but the item you refer to appears to predate the Steam OS, and Linux is not a platform for gaming enthusiasts even if some games can run on it well.

Maybe some further Googling for info beyond the Wikipedia article will enlighten me.

John Savard

Mike Stump

unread,
Jun 18, 2014, 4:06:38 PM6/18/14
to
In article <tSbov.589079$aZ2.5...@fx07.am4>,
Tom Gardner <spam...@blueyonder.co.uk> wrote:
>> I liked your description. I'd view it this way, async rules, as you
>> can build sync with async; but, if you only have sync, you're screwed.
>> sync just waits for the action to be done, async merely queues the
>> action up and returns, immediately.
>
>Unfortunately not, and therein lies one source of inefficiency.

>Consider, as merely one example, JMS. This is used by application
>programmers to send a message from one server to another, with no
>response expected. Typically they regard it as sort of very
>vaguely equivalent to a reliable email service, i.e. a low level
>fast primitive. But they also rely on it being "reliable" (i.e.
>they don't have to - and don't - consider failure/recovery
>mechanisms).
>
>Thus at the application level JMS is an asynchronous protocol.
>
>The "reliability" implies guaranteed once and only once
>delivery, so JMS is of course built on top of TCP and requires
>multiple round-trips between the two servers. TCP is,
>of course, a synchronous protocol built on IP which is
>an asynchronous protocol built on top of... You get the
>idea.

Inefficiencies can be treated as bugs and fixed. The client code, and
above, should not have to change to obtain those efficiencies. The
design is solid, even if the implementation sucks. That said, it
isn't wrong to use TCP: it is easy, convenient, easy to code, quick to
implement, reliable, bug free, already written, won't consume yet more
RAM for yet another protocol that is poorly written, and so on.

The world is littered with people that thought they could do better
than TCP, they were wrong.

The conflation of two things doesn't happen, as one needs state to
implement a reliable protocol on top of an unreliable protocol. TCP
provides that state. Further, it can amortize the open across all
async messages, so it is just about as cost free as can be engineered.
Also, if they use a connection for just that, it won't conflate with
anything else going on. If they do reuse that connection for other
things, well, they either speced it that way and it has to be done, or
it is a mere area for optimization improvements, if it proves to be a
problem.

>Other examples, arguably with less inefficiency, are any
>telecom protocol where one-way asynchronous events such as
>"picked up handset" between FSMs in servers scattered
>across the globe. The events are transmitted on top of TCP.

Again, please show us your implementation that is better. Most people
who think this way are just wrong.

Mike Stump

unread,
Jun 18, 2014, 4:15:13 PM6/18/14
to
In article <lnrpq9$v5k$1...@needham.csi.cam.ac.uk>,
Nick Maclaren <nm...@cam.ac.uk> wrote:
>>Digging deeper, pretty much nothing can be done without memory reads.
>>And, ignoring exotics for sake of brevity, memory reads have synchronous
>>semantic.
>
>Most do, but not all. Quite a few modern architectures have had
>asynchronous features even on memory reads.

No, memory reads are usually completely async. Try it sometime. The
destination register is a future: if you don't access it, it doesn't
slow down your code any, and if you do, then it does. The amount it
slows is roughly the latency to the memory that has the item: if DRAM
is 296 cycles away, then the use of that register will add 296 cycles.

If you have ever seen a profile:

add r1,r1,r2

and noticed that 97% of all your execute time is spent on it, that
_is_ this future cost. The hardware folks might call it
scoreboarding.

I suppose some CPUs from the 1980s might not do this, but, I suspect
they are kinda rare now, or just really small processors.
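
A rough C++ sketch of how that shows up at the source level; the
attribution is approximate, since out-of-order hardware smears the cost
around a bit:

    #include <cstddef>

    // Sum a large array that is cold in cache.  The load of a[i] is issued
    // and execution continues; the cycles of a cache miss are charged where
    // the value is first *used*, not where the load instruction sits.
    long sum(const long *a, std::size_t n) {
        long s = 0;
        for (std::size_t i = 0; i < n; ++i) {
            long v = a[i];   // load starts here, does not stall by itself
            s += v;          // profiler shows the miss latency on this add
        }
        return s;
    }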

Stephen Sprunk

unread,
Jun 18, 2014, 4:54:12 PM6/18/14
to
On 18-Jun-14 15:07, Quadibloc wrote:
> AFAIK, Microsoft Windows only supports SMP; it couldn't *run* on NUMA
> hardware.

IIRC, older versions of Windows didn't understand NUMA, so sometimes
they'd schedule threads or store memory pages in the wrong place, but
they would still _run_.

Newer versions do understand NUMA, with interesting optimizations like
putting a copy of certain performance-critical (but read-only) pages on
every node, moving data/code pages to the same node as the thread that
requested them, etc.
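
A minimal sketch of what the application-visible side of that support
looks like, using the documented Win32 NUMA calls (error handling omitted;
the choice of node 0 here is arbitrary):

    #include <windows.h>
    #include <cstdio>

    int main() {
        ULONG highest = 0;
        GetNumaHighestNodeNumber(&highest);     // how many nodes the OS sees
        std::printf("NUMA nodes 0..%lu\n", highest);

        // Ask for memory physically backed on a particular node, regardless
        // of where the calling thread happens to be scheduled.
        void *p = VirtualAllocExNuma(GetCurrentProcess(), nullptr,
                                     1 << 20,                    // 1 MiB
                                     MEM_RESERVE | MEM_COMMIT,
                                     PAGE_READWRITE,
                                     0 /* preferred node */);
        if (p != nullptr)
            VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }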

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking
Message has been deleted

Stephen Sprunk

unread,
Jun 18, 2014, 5:05:05 PM6/18/14
to
Or they could just switch to Linux.

A company I know had a CTO that liked to play golf with Bill Gates, so
they had hundreds of Windows servers--plus 24x7 staff at their data
centers to run around rebooting them when they bluescreened. That CTO
eventually got canned, and his replacement switched to LAMP; they cut
the data center staff to one (very bored) shift, sold 80% of their
servers on eBay, and turned a profit for the first time.

Same story repeated at my last PPOE and my CPOE: as soon as we ditched
Windows, performance went through the roof and our support costs went
way down. The only reason we change hardware now is because vendors
stop making/supporting the old models, not a lack of computing power.

Stephen Sprunk

unread,
Jun 18, 2014, 5:07:26 PM6/18/14
to
On 17-Jun-14 01:20, Tom Gardner wrote:
> On 17/06/14 06:32, Stephen Sprunk wrote:
>> On 16-Jun-14 21:05, Quadibloc wrote:
>>> Incidentally, just saw a news item about Microsoft using FPGAs to
>>> massively increase its bang for the buck in handling some stuff
>>> in the cloud.
>>>
>>> http://www.theregister.co.uk/2014/06/16/microsoft_catapult_fpgas/
>>
>> "Massive"? According to that article, they only gained 95% (less
>> than 2x) at a cost of 10% more power, or 77% improvement in
>> perf/Watt. That doesn't seem to justify the extra development
>> costs.
>
> Apparently the financial community casts *financial* algorithms into
> FPGAs in order to reduce the latency by a few milli/microseconds.
> Apparently that is worth big bucks to the high frequency trading
> mob.

Well, the HFT mob are a bunch of scammers, so I suppose it's karma that
they would fall victim to scammers themselves.

>> I suspect you could get at least as much by better tuning. For
>> instance, during routine capacity testing of 2/4/8 CPU systems,
>> one of the testers accidentally set a VM for 3 CPUs--and it
>> outperformed an 8 CPU system. WTF? We've spent months
>> benchmarking our code to find the issue, but so far it appears to
>> be somewhere down in the JVM and/or OS. A properly tuned stack
>> should _never_ exhibit such behavior.
>
> The high performance computing fraternity has been all too aware of
> similar effects for a long time now.

Yep; however, what they've learned hasn't filtered out to the rest of
the computing world and/or nobody cares enough to fix it.

Robert Wessel

unread,
Jun 18, 2014, 5:14:41 PM6/18/14
to
On Wed, 18 Jun 2014 13:07:56 -0700 (PDT), Quadibloc
<jsa...@ecn.ab.ca> wrote:

>On Wednesday, June 18, 2014 8:43:52 AM UTC-6, Michael S wrote:
>
>> http://en.wikipedia.org/wiki/AMD_Quad_FX_platform
>>
>> Do you count gaming enthusiasts as consumers?
>
>OK, I'm mystified.
>
>I could see NUMA on a game console. After all, the games and the system software would be written around the console.
>
>AFAIK, Microsoft Windows only supports SMP; it couldn't *run* on NUMA hardware.


Sure it does:

http://msdn.microsoft.com/en-us/library/windows/desktop/aa363804(v=vs.85).aspx

Basically since Vista/WS08. And the support is active on pretty much
any multi-socket system.

Of course boxes running Windows with more than the common handful of
sockets, and with large ratios in memory access times, are pretty rare.
Accessing a DIMM on a memory controller in another socket is usually
only a modest factor of two, or so, slower than accessing a local
DIMM. So on the vast majority of systems there's not a huge amount of
NUMA workload management going on, and even when it goes wrong, the
penalty is not that huge, but it is there. It is mostly things like
(trying to) allocate threads and memory for a process all on the same
socket.

Quadibloc

unread,
Jun 18, 2014, 10:49:20 PM6/18/14
to
I've learned that one of the roadblocks to using germanium in chips is that germanium oxide isn't a good insulator. But people are working on ways around that.

As far back as 2009, Toshiba had an announcement about using a layer of strontium germanide over chips to allow the application of lanthanum aluminate, a good high-k insulator.

John Savard

Mike Stump

unread,
Jun 19, 2014, 12:59:12 AM6/19/14
to
In article <35e76cb2-ea88-4af6...@googlegroups.com>,
Michael S <already...@yahoo.com> wrote:
>Implementation can be asynchronous or mix of sync/async (blocking/non-blocking), but semantic of memory read withing given thread of execution is synchronous (=blocking).

No, the semantics of a memory read is 100% async on most systems. Try it.

Mike Stump

unread,
Jun 19, 2014, 2:09:27 AM6/19/14
to
In article <74c1ef3b-7076-44af...@googlegroups.com>,
Michael S <already...@yahoo.com> wrote:
>That's asynchronous vs synchronous (a.k.a non-blocking vs blocking) APIs. I am not sure that it is what Tom meant.

Don't worry, it is. :-)

>> A fully asynchronous protocol uses asynchronism at all levels, and
>> the program is completely decoupled from the transfer.
>
>I think, consistent asynchronism at all levels is extremely rare in general-purpose (as opposed to embedded) computing. Suppose, you are initiating packet transfer with AIO OS
>call. You are initiating async action, but you are doing it with synchronous tool (OS call).

I would view the the OS call as async, cause the functionality
requested is async. If it does not block, that is the very definition
of async.

Michael S

unread,
Jun 19, 2014, 5:45:55 AM6/19/14
to
You don't understand the semantics of the word "semantics" ;)

Implementation can be async (but never 100% async; every conventional CPU has a limit on the number of outstanding L1D cache misses, typically about a dozen) or sync, but the semantics of a memory load are synchronous within a given thread of execution (i.e. a load shall observe the effect of all preceding stores and shall not observe the effect of any succeeding stores), and sometimes even system-wide (although not on x86/Power/ARM; not sure about SPARC and zArch).

Michael S

unread,
Jun 19, 2014, 6:02:28 AM6/19/14
to
On Wednesday, June 18, 2014 11:07:56 PM UTC+3, Quadibloc wrote:
> On Wednesday, June 18, 2014 8:43:52 AM UTC-6, Michael S wrote:
>
> > http://en.wikipedia.org/wiki/AMD_Quad_FX_platform
> > Do you count gaming enthusiasts as consumers?
>
> OK, I'm mystified.
>
> I could see NUMA on a game console. After all, the games and the system software would be written around the console.
>
> AFAIK, Microsoft Windows only supports SMP; it couldn't *run* on NUMA hardware.
>

NUMA, at least in the currently prevalent meaning of the term, *is* SMP, one of the several possible implementations of it.
So, as long as Microsoft Windows supports SMP, it automatically supports NUMA. Everything runs, possibly sub-optimally from a performance perspective, but runs nevertheless.

As others mentioned, during the AMD Quad FX platform timeframe Microsoft's consumer OS (Vista) not only supported NUMA but contained sophisticated NUMA optimizations.
Maybe it's no coincidence that Quad FX was launched in Nov 2006, the same month as Vista RTM?

Quadibloc

unread,
Jun 19, 2014, 10:10:32 AM6/19/14
to
On Thursday, June 19, 2014 4:02:28 AM UTC-6, Michael S wrote:

> As others mentioned, during AMD Quad FX platform timeframe Microsoft's
> consumer OS (Vista) not just supported NUMA, but contained sophisticated NUMA
> optimizations.

I knew there were and are motherboards with more than one processor chip on them.

I also knew that modern microprocessors have the memory controller on the same die with the CPU.

It just did not occur to me to put two and two together to realize the implications of those facts.

Or is the AMD Quad FX the only NUMA motherboard, with the multi-chip server motherboards not being NUMA? (Instead, I presume, it's unique in being "consumer".)

John Savard

Michael S

unread,
Jun 19, 2014, 11:18:51 AM6/19/14
to
Yes, it's unique in being "consumer". Other than that, it is pretty similar to contemporary AMD dual-socket workstation motherboards. Less similar to server motherboards - those tend to have integrated graphics, several x8 PCIe slots and no x16 slots.

As mentioned in the Wikipedia article, the most meaningful difference between Quad FX and dual-socket Opteron workstation boards is the use of [faster, lower capacity] unbuffered DIMMs instead of [slower, higher capacity] registered DIMMs.


Robert Wessel

unread,
Jun 19, 2014, 11:41:52 AM6/19/14
to
Pretty much all the current multi-socket server boards are effectively
NUMA as well. Although the NUMA factor remains fairly small (usually
under 2, as I mentioned).

It's more that it's just the baseline now, and it just doesn't need to
get mentioned constantly.

Mike Stump

unread,
Jun 19, 2014, 3:44:41 PM6/19/14
to
In article <9ce528d2-73f8-4336...@googlegroups.com>,
Michael S <already...@yahoo.com> wrote:
>Implementation can be async (but never 100% async, every conventional CPU has a limit for the number of outgoing L1D cache misses, typically about dozen)

Ah, yes, true. Once you hit the load/store unit queue length or some
other such limit, you then have to wait around until it clears some.

Quadibloc

unread,
Jun 19, 2014, 4:58:04 PM6/19/14
to
On Wednesday, June 18, 2014 8:49:20 PM UTC-6, Quadibloc wrote:
> I've learned that one of the roadblocks to using germanium in chips is that germanium oxide isn't a good insulator.

Now I've learned the _real_ problem is that germanium can't be allowed to get nearly as hot as silicon, which reduces the potential benefits from its increased electron and hole mobility. Which makes strained silicon and silicon-germanium much more attractive.

John Savard

Quadibloc

unread,
Jun 19, 2014, 5:15:59 PM6/19/14
to
And I've found something else of interest. The University of Colorado owns the patents of a firm called Phiar that went under, which had been able to make tunnel diode chips using an insulated metal process.

At one time, the tunnel diode was seen to promise very fast circuits, but it had manufacturing problems. If there's a way to overcome this, an alternative way of making fast logic is offered.

John Savard

Quadibloc

unread,
Jun 19, 2014, 5:30:22 PM6/19/14
to
On Thursday, June 19, 2014 3:15:59 PM UTC-6, Quadibloc wrote:

> At one time, the tunnel diode was seen to promise very fast circuits, but it
> had manufacturing problems. If there's a way to overcome this, an alternative
> way of making fast logic is offered.

But even with that problem solved, there is still the problem that tunnel diode circuits have *resistors* in them.

So they're like TTL or NMOS, rather than like CMOS, which limits them, like the ECL in BiCMOS, to things like RF circuitry that only require a very few very high-speed components on a simple chip.

John Savard

Quadibloc

unread,
Jun 19, 2014, 5:40:46 PM6/19/14
to
And this reminded me of the other way to improve the microprocessor. If germanium's high hole and electron mobility is made less useful by the material not being able to be used at high temperatures, and fast circuit types like ECL are denied us by high temperatures... then maybe what a semiconductor material needs is the ability to run hot.

Thus, what about Silicon Carbide?

In 2004, there was a report that Toyota was able to make three-inch SiC wafers in the lab with a low defect rate.

Just this year, they've announced a practical use for new silicon carbide power ICs which will make their electric cars more efficient. (I remember reading elsewhere that SiC ICs with about 2,000 transistors on them are already in normal use.)

John Savard

Quadibloc

unread,
Jun 19, 2014, 5:46:28 PM6/19/14
to
On Thursday, June 19, 2014 3:40:46 PM UTC-6, Quadibloc wrote:

> In 2004, there was a report that Toyota was able to make three-inch SiC
> wafers in the lab with a low defect rate.

While an article about this in Forbes was not available, I did find this:

http://www.eetimes.com/document.asp?doc_id=1151074

100 defects per square centimeter, while it is much lower than previous defect rates for that material, is not low enough for making a Silicon Carbide equivalent of the Pentium.

John Savard