
Intel's Skylake Prime Number Bug.


Skybuck Flying

Jan 11, 2016, 10:44:52 AM
Hello,

Apparently Intel's Skylake Processors can freeze up when calculating certain
Prime Numbers.

I am investigating this story further, for now here is a link about it:

https://communities.intel.com/mobile/mobile-access.jspa#jive-content?content=%2Fapi%2Fcore%2Fv3%2Fcontents%2F524553

Enjoy ! =D

Bye,
Skybuck =D

Larc

Jan 11, 2016, 11:10:38 AM
Intel is apparently aware of this and is working with its partners to distribute a
fix in the form of a BIOS update.

http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/

Larc

Rick C. Hodgin

Jan 11, 2016, 1:16:41 PM
If you read the thread Skybuck posted, they already have a fix:

4 days ago:

"Intel has identified an issue that potentially
affects the 6th Gen Intel(R) Core(tm) family of
products. This issue only occurs under certain
complex workload conditions, like those that may
be encountered when running applications like
Prime95. In those cases, the processor may hang
or cause unpredictable system behavior. Intel has
identified and released a fix and is working with
external business partners to get the fix deployed
through BIOS."

Best regards,
Rick C. Hodgin

John Larkin

Jan 11, 2016, 1:23:03 PM
On Mon, 11 Jan 2016 11:10:44 -0500, Larc <la...@notmyaddress.com>
wrote:
I wonder how the BIOS can fix an FPU error. Trap exceptions? Change
some firmware?


--

John Larkin Highland Technology, Inc
picosecond timing precision measurement

jlarkin att highlandtechnology dott com
http://www.highlandtechnology.com

Jerry Stuckle

Jan 11, 2016, 1:42:59 PM
On 1/11/2016 1:22 PM, John Larkin wrote:
> On Mon, 11 Jan 2016 11:10:44 -0500, Larc <la...@notmyaddress.com>
> wrote:
>
>> On Mon, 11 Jan 2016 16:44:49 +0100, "Skybuck Flying" <skybu...@hotmail.com> wrote:
>>
>> | Hello,
>> |
>> | Apparently Intel's Skylake Processors can freeze up when calculating certain
>> | Prime Numbers.
>> |
>> | I am investigating this story further, for now here is a link about it:
>> |
>> | https://communities.intel.com/mobile/mobile-access.jspa#jive-content?content=%2Fapi%2Fcore%2Fv3%2Fcontents%2F524553
>>
>> Intel is apparently aware of this and is working with its partners to distribute a
>> fix in the form of a BIOS update.
>>
>> http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/
>>
>> Larc
>
> I wonder how the BIOS can fix an FPU error. Trap exceptions? Change
> some firmware?
>
>

Nowadays processors (from micro to mainframe) are run by microcode. A
problem like this can be in the circuitry (as was with the FPU problem
in the early-mid 90's) or it can be in the microcode. IOW, it can be a
hardware bug or a software bug :)

Obviously if it's a microcode bug, an update should be able to fix it.
Even if it's a hardware bug, there might be a way around the bug in the
microcode.

Being as this is a hang, my guess would be it's a microcode bug. But
obviously I don't know.

--
==================
Remove the "x" from my email address
Jerry Stuckle
jstu...@attglobal.net
==================

John Larkin

Jan 11, 2016, 2:12:48 PM
On Mon, 11 Jan 2016 13:42:57 -0500, Jerry Stuckle
<jstu...@attglobal.net> wrote:

>On 1/11/2016 1:22 PM, John Larkin wrote:
>> On Mon, 11 Jan 2016 11:10:44 -0500, Larc <la...@notmyaddress.com>
>> wrote:
>>
>>> On Mon, 11 Jan 2016 16:44:49 +0100, "Skybuck Flying" <skybu...@hotmail.com> wrote:
>>>
>>> | Hello,
>>> |
>>> | Apparently Intel's Skylake Processors can freeze up when calculating certain
>>> | Prime Numbers.
>>> |
>>> | I am investigating this story further, for now here is a link about it:
>>> |
>>> | https://communities.intel.com/mobile/mobile-access.jspa#jive-content?content=%2Fapi%2Fcore%2Fv3%2Fcontents%2F524553
>>>
>>> Intel is apparently aware of this and is working with its partners to distribute a
>>> fix in the form of a BIOS update.
>>>
>>> http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/
>>>
>>> Larc
>>
>> I wonder how the BIOS can fix an FPU error. Trap exceptions? Change
>> some firmware?
>>
>>
>
>Nowadays processors (from micro to mainframe) are run by microcode.

Not all. Some RISC machines are pure logic. ARM, Coldfire, maybe MIPS?

Intel's are still microcode based.



>A problem like this can be in the circuitry (as was with the FPU problem
>in the early-mid 90's) or it can be in the microcode. IOW, it can be a
>hardware bug or a software bug :)
>
>Obviously if it's a microcode bug, an update should be able to fix it.
>Even if it's a hardware bug, there might be a way around the bug in the
>microcode.
>
>Being as this is a hang, my guess would be it's a microcode bug. But
>obviously I don't know.


EricP

Jan 11, 2016, 2:34:38 PM
John Larkin wrote:
> On Mon, 11 Jan 2016 11:10:44 -0500, Larc <la...@notmyaddress.com>
> wrote:
>
>> On Mon, 11 Jan 2016 16:44:49 +0100, "Skybuck Flying" <skybu...@hotmail.com> wrote:
>>
>> | Hello,
>> |
>> | Apparently Intel's Skylake Processors can freeze up when calculating certain
>> | Prime Numbers.
>> |
>> | I am investigating this story further, for now here is a link about it:
>> |
>> | https://communities.intel.com/mobile/mobile-access.jspa#jive-content?content=%2Fapi%2Fcore%2Fv3%2Fcontents%2F524553
>>
>> Intel is apparently aware of this and is working with its partners to distribute a
>> fix in the form of a BIOS update.
>>
>> http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/
>>
>> Larc
>
> I wonder how the BIOS can fix an FPU error. Trap exceptions? Change
> some firmware?

I wondered that myself. I have seen Intel documentation refer
to a signed microcode patch that is loaded by the BIOS or OS at boot,
but it gave no details.

I think a look-aside associative memory that can match a particular
microcode address and substitute individual new microcode lines
would work, but only for microcode.
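Such a mechanism might look something like the following minimal C sketch
of the idea. The table size, field widths and all the names are invented
here, since the real parts are undocumented:

#include <stdint.h>
#include <stddef.h>

#define NUM_MATCH_REGS 8              /* invented; real count unknown */

struct ucode_patch_entry {
    uint32_t match_addr;              /* microcode ROM address to intercept */
    uint64_t new_line;                /* replacement microcode line */
    int      valid;
};

static struct ucode_patch_entry patch_cam[NUM_MATCH_REGS];

/* Model of one microcode fetch: the match registers are checked first,
   and only on a miss does the fetch fall through to the ROM. */
uint64_t ucode_fetch(uint32_t addr, const uint64_t *ucode_rom)
{
    for (size_t i = 0; i < NUM_MATCH_REGS; i++) {
        if (patch_cam[i].valid && patch_cam[i].match_addr == addr)
            return patch_cam[i].new_line;   /* patched line wins */
    }
    return ucode_rom[addr];                 /* unpatched path */
}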

If the problem was in a hardware sequencer, say the hardware
page table walker, then you're screwed unless maybe they could do
something like trap to a special exception handler in the BIOS ROM
for emulation. I've never seen such a mechanism mentioned in the
normal documentation, and the Intel BIOS Writers Guide is not
publicly available.

The AMD BIOS Guides are available, and do mention a
Microcode Patch Buffer but give no details.

There are many patents for microcode patch mechanisms such as this
one from VIA which appears to function as I suggested above.

Apparatus and method for fast one-to-many microcode patch 2007
https://www.google.com/patents/US20090031090

and older ones from AMD

Microcode patching apparatus and method 1998
https://www.google.com/patents/US5796974

Microcode patch device and method for patching microcode
using match registers and patch routines 1999
https://www.google.com/patents/US6438664

There are also some papers on the subject, for example:

Patching processor design errors with programmable hardware 2007
http://cse.iitd.ac.in/~srsarangi/files/papers/toppicks07.pdf

Security Analysis of x86 Processor Microcode 2014
https://www.dcddcc.com/pubs/paper_microcode.pdf

Eric



Skybuck Flying

Jan 11, 2016, 3:44:35 PM
Compute well this processor does not - Yoda.

Wrong is much with this processor - Yoda.

Breath I would not hold - Yoda.

Bye,
Skybuck ;) =D

Rick C. Hodgin

Jan 11, 2016, 4:06:24 PM
Skybuck Flying wrote:
> Compute well this processor does not - Yoda.
> Wrong is much with this processor - Yoda.

I think that's an incorrect and unfair assessment.
Of the billions of components, in one particular
data case, the hardware locked up. Intel was
able to fix it in existing physical products, and they
did so within days.

How many unfixed software bugs exist in the OS
(Linux or Windows) or user software? Those are
easily fixed, with far fewer concrete limitations
on how they can be fixed.

Intel's track record (in product development and
manufacturing) over the decades is outstanding
given the complexity of the devices they create.
In fact, it's probably the highest of any man-made
device in human history.

My only wish is that Intel would compete openly
and honestly on merit alone, rather than employing
underhanded and monopolistic practices in their
many product releases and business dealings.

Quadibloc

Jan 11, 2016, 4:54:51 PM
On Monday, January 11, 2016 at 11:23:03 AM UTC-7, John Larkin wrote:

> I wonder how the BIOS can fix an FPU error. Trap exceptions? Change
> some firmware?

From what I read, it sounded like they mis-estimated the power requirements of
the floating-point section, so they'll need to throttle its maximum speed.

John Savard

Paul

Jan 11, 2016, 4:55:28 PM
Some bugs on the processor can be fixed by microcode.
A typical retail motherboard can have as many as eight
microcode files, covering the compatible CPU table
on the motherboard maker site, and those are stored
in the BIOS flash chip. Microcode is tiny, and variable
length. The last time I took apart that file, the segments
were in multiples of 2KB or so.

Microcode releases have a revision number, and
a patcher loading a microcode is allowed to
install a patch which has a higher release number
than the one currently in the processor.

The BIOS has its microcode patcher. The microcode must
be good enough to allow the system to boot into the OS.
So no storage bugs can exist with the shipped BIOS
microcode. All it has to do is get the system booted.

Windows and Linux also have microcode patchers.
The Windows one does its job early after boot,
and then the service exits. So you don't really
see it.

The Windows one allows deployment of updates.
It's unclear whether a BIOS update would deploy
a new version any faster than Microsoft could
push a new file via Windows Update.

If you have a copy of the Intel Processor Identification
Utility (PIU), the field "revision" is actually the
release number of the microcode. There was one incident,
where no microcode was getting loaded, and the number
was zero. Most of the time, you will find a small finite
number for that field. In some cases, the utility
mistakenly masks the value read out, and some digits
may not belong there. (Maybe you see F07 instead
of 07.)

Some bugs in processors are fixed by actual code.
When AMD had a TLB bug in the Phenom 9500, they distributed
maybe a 15KB or so code module, to be added to the
BIOS. This code disabled the TLB, or a portion of
it, costing a small amount of performance. A
fixed version of the processor, for the same family,
had "50" added to the lower digits, so if you bought
a 9550 you knew it was fixed, whereas a 9500 wasn't.
So that fix wasn't microcode based, because it
wasn't an actual instruction problem. It was a
problem with virtual to physical address translation
of some sort.

The average processor has 100 errata. Some of the errata
are discovered a year or two after the first batch is
distributed for sale. Testing continues after release.
Many bugs are repaired via microcode updates. Some
are labeled "won't fix", meaning even if a new mask
revision was in the pipe, they had no plan to patch
out the problem. Some issues are innocuous enough they
don't need fixing.

In the case of the Prime95 issue above, the hand-coded FFTs
are perfect material for uncovering bugs. Frequently,
compilers produce "lame" code that doesn't give particularly
good fault coverage. So you don't see bugs, because the
instruction sequences aren't that challenging.
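For context, the arithmetic Prime95 is exercising is the Lucas-Lehmer
test for Mersenne numbers M = 2^p - 1. A toy C version for small
exponents looks like this (it leans on GCC/Clang's unsigned __int128;
Prime95 does the squaring step with its hand-coded FFTs on
million-digit numbers instead):

#include <stdint.h>
#include <stdio.h>

static int lucas_lehmer(unsigned p)       /* p: an odd prime, p < 64 */
{
    uint64_t m = (1ULL << p) - 1;         /* M = 2^p - 1 */
    uint64_t s = 4;
    for (unsigned i = 0; i < p - 2; i++)  /* p - 2 squarings mod M */
        s = (uint64_t)(((unsigned __int128)s * s + m - 2) % m);
    return s == 0;                        /* s == 0  =>  M is prime */
}

int main(void)
{
    unsigned p[] = { 3, 5, 7, 11, 13, 17, 19, 31 };
    for (unsigned i = 0; i < sizeof p / sizeof p[0]; i++)
        printf("2^%u-1 is %s\n", p[i],
               lucas_lehmer(p[i]) ? "prime" : "composite");
    return 0;
}

(The "+ m" before the "- 2" just avoids unsigned underflow in the
degenerate cases s = 0 or s = 1.)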

One AMD processor had an FPU bug caused by actual
electrical noise. It was discovered after release.
It took assembler code to do it. The assembler code
consisted of a nonsensical continuous sequence of
one FPU instruction after another. This drew enough
current to cause a noise problem in the substrate.
Errata like that receive a "will not fix" rating,
because it is not expected that anyone will be
coding with assembler, and using that stupid a sequence
of instructions. Real FPU code needs an occasional
bit test, branch condition, and so isn't solid 100% FPU
instructions one after another. And when a HLL is used,
the compiler/assembler wouldn't even get close to
the required FPU code density to break that processor.
(If I owned such a processor though, I'd be pissed.
For that not being caught in testing, or recognized
as a potential issue during design.)
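A rough C approximation of that kind of density test might look like
the following. It is only an approximation: as noted above, compiled
code won't reach the instruction density of hand-written assembler,
so this illustrates the idea rather than reproducing the fault:

#include <stdio.h>

int main(void)
{
    /* four independent multiply chains, nothing but FP operations
       in the loop body, no bit tests or branches in between */
    double a = 1.0000001, b = 0.9999999, c = 1.0000002, d = 0.9999998;
    for (long i = 0; i < 100000000L; i++) {
        a *= 1.0000001; b *= 0.9999999;
        c *= 1.0000002; d *= 0.9999998;
        a *= 0.9999999; b *= 1.0000001;
        c *= 0.9999998; d *= 1.0000002;
    }
    printf("%g %g %g %g\n", a, b, c, d);  /* defeat dead-code removal */
    return 0;
}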

When it comes to test benches for hardware design, you
run the important ones first (and try to finish them by
design close). The ridiculous tests are saved for later,
after production has begun. And that's when the AMD testers
carried out their artificial 100% density test and discovered
a problem. For our chip designs, some staff were running
simulation test cases a year after we had hardware in hand.
(And ours didn't have microcode to patch with either.
We had another feeble mechanism for emergencies :-) )

The level of bugs is rather constant. I don't recollect ever
looking at an errata sheet for a CPU and seeing zero bugs. It
just doesn't happen. I expect in some cases, staff already know
of multiple errata, even before design close, but the
boss says "ship it". I doubt they would hold up a mask
release, chasing every possible bug and making the CPU
two years late. That just isn't going to happen, especially
when the "good ole microcode" can pull your bacon out of the
fire.

So while it's sad that this "bendable" processor also has
an erratum, it probably has another 99 errata to keep that
one company. Most of those errata are invisible to end
users. The janitorial staff already cleaned up the mess :-)

It's the ones that cost you performance that
make the fanbois crazy. The AMD TLB bug certainly
upset a few happy owners of the affected silicon.
And the Intel FDIV bug certainly cost Intel
(all those Excel spreadsheet jokes). Intel has had
a few wakeup calls, and I like to think that Skylake
is another wakeup call ("staff getting sloppy,
poor decision making"). So far it hasn't cost them
any "big money". I don't know if users are
returning their "bent" processors or not.

Paul

Martin Brown

Jan 11, 2016, 5:22:05 PM
On 11/01/2016 18:22, John Larkin wrote:
> On Mon, 11 Jan 2016 11:10:44 -0500, Larc <la...@notmyaddress.com>
> wrote:
>
>> On Mon, 11 Jan 2016 16:44:49 +0100, "Skybuck Flying" <skybu...@hotmail.com> wrote:
>>
>> | Hello,
>> |
>> | Apparently Intel's Skylake Processors can freeze up when calculating certain
>> | Prime Numbers.
>> |
>> | I am investigating this story further, for now here is a link about it:
>> |
>> | https://communities.intel.com/mobile/mobile-access.jspa#jive-content?content=%2Fapi%2Fcore%2Fv3%2Fcontents%2F524553
>>
>> Intel is apparently aware of this and is working with its partners to distribute a
>> fix in the form of a BIOS update.
>>
>> http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/
>>
>> Larc
>
> I wonder how the BIOS can fix an FPU error. Trap exceptions? Change
> some firmware?

Probably tweaks the microcode as the system boots to fix the problem.

--
Regards,
Martin Brown

krw

Jan 11, 2016, 9:37:07 PM
On Mon, 11 Jan 2016 10:22:55 -0800, John Larkin
<jjla...@highlandtechnology.com> wrote:

>On Mon, 11 Jan 2016 11:10:44 -0500, Larc <la...@notmyaddress.com>
>wrote:
>
>>On Mon, 11 Jan 2016 16:44:49 +0100, "Skybuck Flying" <skybu...@hotmail.com> wrote:
>>
>>| Hello,
>>|
>>| Apparently Intel's Skylake Processors can freeze up when calculating certain
>>| Prime Numbers.
>>|
>>| I am investigating this story further, for now here is a link about it:
>>|
>>| https://communities.intel.com/mobile/mobile-access.jspa#jive-content?content=%2Fapi%2Fcore%2Fv3%2Fcontents%2F524553
>>
>>Intel is apparently aware of this and is working with its partners to distribute a
>>fix in the form of a BIOS update.
>>
>>http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/
>>
>>Larc
>
>I wonder how the BIOS can fix an FPU error. Trap exceptions? Change
>some firmware?

Change microcode?

krw

Jan 11, 2016, 9:38:38 PM
Or trap the failing instruction and emulate it.

B00ze

Jan 11, 2016, 11:48:31 PM
Lol, I read that bug on a few sites; I was SURE Skybuck was going to
push it to the group before I could ;-)

On 2016-01-11 16:55, Paul <nos...@needed.com> wrote:

> Skybuck Flying wrote:
>>
>> Compute well this processor does not - Yoda.
>> Wrong is much with this processor - Yoda.
>> Breath I would not hold - Yoda.

I decided to wait for the next CPU a while ago. Not for the Bit
Manipulation bug you posted on this group a while ago (as far as I know,
all you need is AND, OR and XOR, so what if some esoteric bit instruction
is buggy). I decided to wait because I am not impressed with the "Turbo"
boosts (0/0/0/2 - since Windoze never runs only 1 core, you will never
see 4.2GHz). I decided I'd wait for the Skylake equivalent of Devil's
Canyon...

[big snip, sorry, all good stuff but no need to repeat it]

> It's the ones that cost you performance, that
> make the fanbois crazy. The AMD TLB bug certainly
> upset a few happy owners of the affected silicon.
> And the Intel FDIV bug certainly cost Intel

Ya, I have a tagline or two around that:

- According 2 Intel, 1+1 equals 3, for very large values of 1.
- A bad random number generator: 1, 1, 1, 4.33e+67, 1, 1, 1...
- Hitchhiker's Pentium Ed: The meaning of life is 41.9815...

> (all those Excel spreadsheet jokes). Intel has had
> a few wakeup calls, and I like to think that Skylake
> is another wakeup call ("staff getting sloppy,
> poor decision making"). So far it hasn't cost them
> any "big money". I don't know if users are
> returning their "bent" processors or not.

Well, they managed to make it 5-10% faster than the previous model, even tho
it doesn't turbo as well and even tho they dropped 8-way set associative
L2 cache; it's still pretty impressive...

Best Regards,

--
! _\|/_ Sylvain / B00...@hotmail.com
! (o o) Member-+-David-Suzuki-Fdn/EFF/Red+Cross/Planetary-Society-+-
oO-( )-Oo "What's that?" -Arthur, "Something blue." -Ford

JJ

Jan 12, 2016, 1:11:54 AM
On Mon, 11 Jan 2016 13:42:57 -0500, Jerry Stuckle wrote:
>
> Nowadays processors (from micro to mainframe) are run by microcode. A
> problem like this can be in the circuitry (as was with the FPU problem
> in the early-mid 90's) or it can be in the microcode. IOW, it can be a
> hardware bug or a software bug :)
>
> Obviously if it's a microcode bug, an update should be able to fix it.
> Even if it's a hardware bug, there might be a way around the bug in the
> microcode.
>
> Being as this is a hang, my guess would be it's a microcode bug. But
> obviously I don't know.

Was there any information on whether the CPU froze due to an infinite loop or
stopped executing, internally? E.g. if it's due to an internal infinite loop,
the CPU temperature won't decrease for several minutes after it freezes.

And if the CPU stops executing, could its state be an untrappable exception
(i.e. hardware crash), deadlock, or a component which isn't either 0/false
or 1/true (i.e. undetermined state; a kind of deadlock when the component is
checked)?

Tom Del Rosso

Jan 12, 2016, 2:44:56 AM
Jerry Stuckle wrote:
>
> Obviously if it's a microcode bug, an update should be able to fix it.
> Even if it's a hardware bug, there might be a way around the bug in
> the microcode.

That's not so obvious unless the microcode is in flash. Isn't it a mask
ROM?



Keith Thompson

Jan 12, 2016, 2:46:30 AM
"Skybuck Flying" <skybu...@hotmail.com> writes:
> Apparently Intel's Skylake Processors can freeze up when calculating certain
> Prime Numbers.
>
> I am investigating this story further, for now here is a link about it:
>
> https://communities.intel.com/mobile/mobile-access.jspa#jive-content?content=%2Fapi%2Fcore%2Fv3%2Fcontents%2F524553

Will people replying to this please remove any irrelevant newsgroups,
particularly comp.lang.c? Thanks.

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Paul

Jan 12, 2016, 4:38:12 AM
Try page 431 here, section 9.11.1 "Microcode Update".
Microcode updates are a part of the BIOS flash image, and a
file in the BIOS file system contains a set of updates.
A typical file in the past, might have contained
eight patches. I used to dissect these, so I could
tell people whether there was a patch for their
non-standard processor or not. (Maybe someone stuffs
a Xeon into a regular desktop socket.)

https://web.archive.org/web/20110516145322/http://www.intel.com/Assets/PDF/manual/253668.pdf

Intel doesn't generally give information about the "encrypted
data" within a patch.

On some processors, the patch area is only 2KB in size.
On later processors, the patching format was made variable
length, in multiples of 2KB. I've taken apart a BIOS
module before, and found some of the slightly longer
patches (at least 4KB long, maybe slightly longer).

Some processors have as many as 1000 instructions, and
it's not clear that the patch actually contains an entire
microcode. But without a document describing the encrypted
section, we'll never know. It's encrypted presumably to
prevent some sort of hacking (bypassing security features
such as they are).

Intel provides a file suitable for usage by Linux, and
this is a text file for inclusion in compiled code.
Unpack the tar file, get the .dat out, change the
file extension to text and have a look. You can then
compare the samples in here, to the header format
in the above architecture book.

https://downloadmirror.intel.com/18148/eng/microcode-20090927.tgz

/* Sun Sep 27 10:52:54 CST 2009 */
/* 727-MU168313.inc */
0x00000001, 0x00000013, 0x02062001, 0x00000683,
0x2f0da1b0, 0x00000001, 0x00000001, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000,
0xbf5ad468, 0xc79f5237, 0xbd53889e, 0x896bfd13,
0x7adc0c8f, 0x44e9e0bc, 0x1a331fc9, 0x00b0f479,
0x53e9ceb3, 0xb14131a4, 0x39fc8310, 0x6993ee0d,
0xdb0c59b4, 0x67f24fd0, 0x63e83516, 0x0a4d411d,
0xb86a4294, 0x72c2edc5, 0xc543c5df, 0x7f3dd290,
...

The only vaguely interesting part is the bit at the beginning;
as for the encrypted body, we can't look at that. I'm sure someone
has figured out by now how that is encrypted, and it's even
possible that it is a variant design as well (encryption became
stronger with time). Whatever encryption they used in the P6
era probably isn't strong enough any more.
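The unencrypted 48-byte header, as documented in the SDM section cited
above, lines up with the dump like so (a C rendering; the field names
are mine): header version 1, update revision 0x13, a BCD date, the
CPUID signature 0x683, then checksum, loader revision, platform flags
and sizes, with zero sizes meaning the default 2048-byte update.

#include <stdint.h>

struct intel_ucode_header {
    uint32_t header_version;   /* 0x00000001 in the dump above */
    uint32_t update_revision;  /* 0x13; what PIU reports as "revision" */
    uint32_t date;             /* BCD mmddyyyy, 0x02062001 above */
    uint32_t processor_sig;    /* CPUID family/model/stepping, 0x683 */
    uint32_t checksum;         /* all dwords of the update sum to zero */
    uint32_t loader_revision;  /* 0x00000001 */
    uint32_t processor_flags;  /* platform ID bits */
    uint32_t data_size;        /* 0 => default 2000 bytes of data */
    uint32_t total_size;       /* 0 => default 2048 bytes in total */
    uint32_t reserved[3];
    /* the encrypted update data follows the header */
};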

Both the BIOS and the OS have a patcher. Patches are only
accepted if the version number of the patch is higher than
the current version number loaded. And you can check whether
a patch has "taken" later, by using Intel PIU or similar.
So if I checked the BIOS and the version was 07 for your
processor, Intel PIU should report at least 07, and if the
OS loaded 11, then Intel PIU would report 11 instead of 07.
The BIOS must be able to patch out egregious bugs well enough
to allow OS booting to finish. The OS may have a later
version than the crusty old BIOS file content you were
given.

If Asus provides say, eight different versions of BIOS images
for your downloading pleasure, perhaps four of those contain
no substantive changes, and they have the latest versions of
Intel microcode content. So when it says "supports new processors",
sometimes that includes new microcode with a higher version
number. Part of the BIOS support is "recognition" code,
and some is "new microcode patches". Without recognition
code, the BIOS may declare "you have a P6", when you actually
have a Core2, that sort of thing. I have at least one computer
here, where the recognition is wrong, and the thing still works.
That's because I patched the microcode by hand, so the microcode
is actually up to date, but the recognition section, I didn't
have a way to modify that.

The BIOS also used to include a "microcode cache" function.
At least in the Award BIOS, there is a way to load the
microcode yourself; it is stored in flash in the BIOS.
It's non-volatile until you pull the CPU out of the
socket, and plug in another CPU. Doing so would purge the
cache, and you'd have to do the manual loading operation
again if you put the original processor back. The machine I
manually patched, I haven't pulled the processor out of it
since then. I probably wouldn't remember how to repeat
the procedure now. I was putting a Tualatin in a motherboard
that wasn't intended for Tualatin.

HTH,
Paul

David Brown

Jan 12, 2016, 5:06:48 AM
On 11/01/16 20:12, John Larkin wrote:
> On Mon, 11 Jan 2016 13:42:57 -0500, Jerry Stuckle

>> Nowadays processors (from micro to mainframe) are run by microcode.
>
> Not all. Some RISC machines are pure logic. ARM, Coldfire, maybe MIPS?

(When the microcode based 68K architecture was being redesigned into a
RISC structure with little or no microcode for the Coldfire, the
designers planned to keep the microcoded division instructions. But
then someone noticed that a pure software division routine ran faster
than the microcoded division hardware, so support was dropped.)

One of the driving forces of "RISC" compared to older "CISC" designs was
to get rid of microcode.

And since by far most modern cpus (both in terms of the numbers
produced, and the number of designs) are microcontrollers, which almost
never use microcode, it's fair to say that only a small proportion of
currently active processors have microcode - even though those
processors are rather important.

>
> Intels are still microcode based.
>

Yes, both Intel and AMD's x86 processors have a lot of microcode. But
they are not really "microcode based". Some parts, such as the FPU
(especially multi-cycle functions) are microcoded - but many other parts
are handled directly in hard logic. The distinction between what is
microcoded and what is hard logic is not easy to guess, and details are
considered part of the design secrets.

Microcode lets Intel and AMD do some fixes by microcode patches, either
using the BIOS (needed for Windows) or loaded by the Linux kernel at
startup.

Richard Heathfield

Jan 12, 2016, 5:21:11 AM
On 12/01/16 10:06, David Brown wrote:
> One of the driving forces of "RISC" compared to older "CISC" designs was
> to get rid of microcode.

David, you're posting to five newsgroups whose only common factor is one
of their subscribers. I know it's not your doing, but could you please
trim comp.lang.c from followups in future? Thanks.

<snip>

--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

David Brown

Jan 12, 2016, 5:42:43 AM
On 12/01/16 11:21, Richard Heathfield wrote:
> On 12/01/16 10:06, David Brown wrote:
>> One of the driving forces of "RISC" compared to older "CISC" designs was
>> to get rid of microcode.
>
> David, you're posting to five newsgroups whose only common factor is one
> of their subscribers. I know it's not your doing, but could you please
> trim comp.lang.c from followups in future? Thanks.
>
> <snip>
>

My apologies - I was not paying attention to the newsgroups.

(I'll trim c.l.c and other off-topic groups if I post again in this
thread, but the apologies need to go to the right places.)

MitchAlsup

Jan 12, 2016, 9:33:14 AM
Yes, but there are ways to take a trap at a microcode address and transfer
microcontrol to a BIOS-programmable address.

Skybuck Flying

Jan 12, 2016, 10:35:29 AM
First of all, comparing hardware to software is comparing apples to oranges.

To compare Intel processors with operating systems would require Intel's
processors to be mostly software so they can be fixed ! ;)

Second of all, some success in the past is no guarantee for the future.

Currently Intel's products are a big fat mess.

(Reminds me of my first 80486 which produced weird pixels on the screen...
never really figured out what caused that though ! ;) for now I assume
memory or graphics card related or could be co-processors... yes that rings
a bell... so again here we are...
1. 80486 co-processor problem (fpu)
2. pentium co-processor problem (fpu)
3. skylake co-processor problem ? (fpu)
Seems very much like Intel up to its old bad tricks again ! HA-HA ! :))

Anyway, back to Skylake: very worrying reports are in that this processor
locks up randomly while gaming !

Also Intel is being very secretive about what the exact problem is.

This can't be good.

Either they don't know... but then their fix would be bogus.

Or they do know and it's something serious and they ain't telling ! ;)

Bye,
Skybuck.

John Larkin

Jan 12, 2016, 12:25:24 PM
On Tue, 12 Jan 2016 11:06:45 +0100, David Brown
<david...@hesbynett.no> wrote:

>On 11/01/16 20:12, John Larkin wrote:
>> On Mon, 11 Jan 2016 13:42:57 -0500, Jerry Stuckle
>
>>> Nowadays processors (from micro to mainframe) are run by microcode.
>>
>> Not all. Some RISC machines are pure logic. ARM, Coldfire, maybe MIPS?
>
>(When the microcode based 68K architecture was being redesigned into a
>RISC structure with little or no microcode for the Coldfire, the
>designers planned to keep the microcoded division instructions. But
>then someone noticed that a pure software division routine ran faster
>than the microcoded division hardware, so support was dropped.)

Some CPU, maybe the HPPA Risc or something, didn't even have a
multiply instruction. It cluttered the pipeline so they eliminated it.


>
>One of the driving forces of "RISC" compared to older "CISC" designs was
>to get rid of microcode.
>
>And since by far most modern cpus (both in terms of the numbers
>produced, and the number of designs) are microcontrollers, which almost
>never use microcode, it's fair to say that only a small proportion of
>currently active processors have microcode - even though those
>processors are rather important.

I think the fundamental RISC concept is to design an instruction set
that's compiler friendly and not people friendly. CISC attempted to
make assembly programming look like a programming language; RISC
pretty much assumes that binaries are created by compilers.




>
>>
>> Intels are still microcode based.
>>
>
>Yes, both Intel and AMD's x86 processors have a lot of microcode. But
>they are not really "microcode based". Some parts, such as the FPU
>(especially multi-cycle functions) are microcoded - but many other parts
>are handled directly in hard logic. The distinction between what is
>microcoded and what is hard logic is not easy to guess, and details are
>considered part of the design secrets.

They are stuck with a CISC instruction set that was originally
microcoded. They have to work very hard to butcher that to make it
fast.


>
>Microcode lets Intel and AMD do some fixes by microcode patches, either
>using the BIOS (needed for Windows) or loaded by the Linux kernel at
>startup.

Fewer bugs is another approach.


--

John Larkin Highland Technology, Inc

lunatic fringe electronics

Casper H.S. Dik

Jan 12, 2016, 12:27:35 PM
EricP <ThatWould...@thevillage.com> writes:

>I wondered that myself. I have seen Intel documentation refer
>to a signed microcode patch that is loaded by BIOS or OS at boot,
>but gave no details.

Intel releases new microcode for all of its CPUs every
few months; these can be loaded directly or they can
be inserted in the BIOS and loaded by the BIOS.

>I think a look-aside associative memory that can match a particular
>microcode address and substitute individual new microcode lines
>would work, but only for microcode.

I would assume that on power-on, the original microcode is
loaded and new microcode can be loaded by replacing it
completely. The examples I've seen are only a few KBs but
I think the SoCs have much larger microcode.

>If the problem was in a hardware sequencer, say the hardware
>page table walker, then you're screwed unless maybe they could do
>something like trap to a special exception handler in the BIOS ROM
>for emulation. I've never seen such a mechanism mentioned in the
>normal documentation, and the Intel BIOS Writers Guide is not
>publically available

They certainly have the ability to disable specific instructions as
they have done with the original transactional memory instructions.

Casper

krw

Jan 12, 2016, 12:35:17 PM
Microcode patches are often done in RAM. BIOS loads the patches every
reset.

DecadentLinuxUserNumeroUno

Jan 12, 2016, 12:53:11 PM
On Tue, 12 Jan 2016 16:35:26 +0100, "Skybuck Flying"
<skybu...@hotmail.com> Gave us:

>Currently Intel's products are a big fat mess.

You're a goddaMNED TOTAL RETARD.
>
>(Reminds me of my first 80486 which produces weird pixels on the screen...
>never really figured out what caused that though ! ;)

YOU'RE A GODDAMNED RETARDED IDIOT AS WELL.


And you are a Usenet abusing cross-posting uncivil retarded punk as
well.

Your ISP should shitcan you for that stupid shit alone. ESPECIALLY
since you have been told to STOP the stupid behavior several times.

wolfgang kern

Jan 12, 2016, 1:53:54 PM

Skybuck Flying said:
[restricted to comp.arch]

> To compare intel processors with operating systems would require intel's
> processors to be mostly software so they can be fixed ! ;)

> Second of all some success in the past is no garantee for the future.

> Currently Intel's products are a big fat mess.

> (Reminds me of my first 80486 which produces weird pixels on the screen...
> never really figured out what caused that though ! ;) for now I assume
> memory or graphics card related or could be co-processors... yes that
> rings a bell... so again here we are...
> 1. 80486 co-processor problem (fpu)
> 2. pentium co-processor problem (fpu)
> 3. skylake co-processor problem ? (fpu)
> Seems very much like Intel up to its old bad tricks again ! HA-HA ! :))
>
> Anyway back to skylake, very worrieing reports are in that this processors
> locks up randomly while gaming !

...
while your report this time seems to hit a real problem,
it is nothing next to how M$ sells bugs as features...
If you read the AMD documents you may find more truth on what
any modern CPU will choke on or not. Intel often delays such
'features' info for many months (please don't ask me why).

__
wolfgang
(rare to even reply to flying buckets posts at all).


Mark -

Jan 12, 2016, 3:00:09 PM

DecadentLinuxUserNumeroUno, you are just a warm ray of sunshine.

I am not a Skybuck Flying fan.

You losing control of yourself, feeding him, is helpful to know one, least
you.
Killfile him and be done with it.

My2c.


David Brown

Jan 12, 2016, 4:05:09 PM
On 12/01/16 18:25, John Larkin wrote:
> On Tue, 12 Jan 2016 11:06:45 +0100, David Brown
> <david...@hesbynett.no> wrote:
>
>> On 11/01/16 20:12, John Larkin wrote:
>>> On Mon, 11 Jan 2016 13:42:57 -0500, Jerry Stuckle
>>
>>>> Nowadays processors (from micro to mainframe) are run by microcode.
>>>
>>> Not all. Some RISC machines are pure logic. ARM, Coldfire, maybe MIPS?
>>
>> (When the microcode based 68K architecture was being redesigned into a
>> RISC structure with little or no microcode for the Coldfire, the
>> designers planned to keep the microcoded division instructions. But
>> then someone noticed that a pure software division routine ran faster
>> than the microcoded division hardware, so support was dropped.)
>
> Some CPU, maybe the HPPA Risc or something, didn't even have a
> multiply instruction. It cluttered the pipeline so they eliminated it.

There are plenty of small cpus with no multiply, in order to keep the
design small and simple. (I don't expect that die space was a major
factor when omitting a multiplier from the HPPA Risc, or whichever
processor you meant.) The interesting thing with division on the
Coldfire was that the software-only solution turned out to be faster
than the hardware (at least with the old design).

>
>
>>
>> One of the driving forces of "RISC" compared to older "CISC" designs was
>> to get rid of microcode.
>>
>> And since by far most modern cpus (both in terms of the numbers
>> produced, and the number of designs) are microcontrollers, which almost
>> never use microcode, it's fair to say that only a small proportion of
>> currently active processors have microcode - even though those
>> processors are rather important.
>
> I think the fundamental RISC concept is to design an instruction set
> that's compiler friendly and not people friendly. CISC attempted to
> make assembly programming look like a programming language; RISC
> pretty much assumes that binaries are created by compilers.
>

That's true. But a key point in a RISC design is that the instructions
should each be simple, and do one thing (so you have a "load"
instruction, and an "add" instruction, but not a "load then add"
instruction). Then you make these instructions run as fast as possible
- and that means no microcoding to add delays.

>
>
>
>>
>>>
>>> Intels are still microcode based.
>>>
>>
>> Yes, both Intel and AMD's x86 processors have a lot of microcode. But
>> they are not really "microcode based". Some parts, such as the FPU
>> (especially multi-cycle functions) are microcoded - but many other parts
>> are handled directly in hard logic. The distinction between what is
>> microcoded and what is hard logic is not easy to guess, and details are
>> considered part of the design secrets.
>
> They are stuck with a CISC instruction set that was originally
> microcoded. They have to work very hard to butcher that to make it
> fast.

Yes. So what happens internally is that the CISC instructions are
decoded and turned into mostly RISC instructions of a design-specific
RISC cpu that is invisible to the programmer. In most cases, one CISC
instruction turns into one or more RISC instructions to avoid
microcoding - but sometimes a few CISC instructions map to one RISC
instruction, and for some units microcoding is still used.

>
>
>>
>> Microcode lets Intel and AMD do some fixes by microcode patches, either
>> using the BIOS (needed for Windows) or loaded by the Linux kernel at
>> startup.
>
> Less bugs is another approach.
>

:-)



Larc

Jan 12, 2016, 4:54:04 PM
Good advice. I don't recall his ever once posting anything useful. Or even
intelligible except on rare occasion.

Larc

Ivan Godard

Jan 12, 2016, 5:03:30 PM
Funny; I'd call that "microcoded". Perhaps you are distinguishing
between horizontal and vertical microcode?

Mark -

Jan 12, 2016, 5:14:09 PM

Doh.

> You losing control of yourself, feeding him, is helpful to know one,
> least you.
> Killfile him and be done with it.

Should be no one.

> Good advice. I don't recall his ever once posting anything useful.
> Or even intelligible except on rare occasion.

He (Skybuck Flying), assuming a he and only one person posting under that
nom de plume, is one of a kind.


Rick Jones

Jan 12, 2016, 5:46:43 PM

> > Some CPU, maybe the HPPA Risc or something, didn't even have a
> > multiply instruction. It cluttered the pipeline so they eliminated
> > it.

> There are plenty of small cpus with no multiply, in order to keep the
> design small and simple. (I don't expect that die space was a major
> factor when omitting a multiplier from the HPPA Risc, or whichever
> processor you meant.)

Indeed, per my recollection HP-PA or PA-RISC (two names for the same
architecture) 1.0 chips did not have an integer multiply. And for
those the floating point unit was separate and optional.

My recollection may be a bit fuzzy but by PA 1.1 (or somewhere in the
"middle" of it, and certainly by PA 2.0, all the chips had an
integrated floating point unit.

I probably have a couple things off - wasn't a PA hardware guy.

rick jones
--
A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Nick Maclaren

Jan 12, 2016, 6:04:40 PM
In article <n73ve1$93n$1...@news.hpeswlab.net>,
Rick Jones <rick....@hpe.com> wrote:
>
>> > Some CPU, maybe the HPPA Risc or something, didn't even have a
>> > multiply instruction. It cluttered the pipeline so they eliminated
>> > it.
>
>> There are plenty of small cpus with no multiply, in order to keep the
>> design small and simple. (I don't expect that die space was a major
>> factor when omitting a multiplier from the HPPA Risc, or whichever
>> processor you meant.)
>
>Indeed, per my recollection HP-PA or PA-RISC (two names for the same
>architecture) 1.0 chips did not have an integer multiply. And for
>those the floating point unit was separate and optional.
>
>My recollection may be a bit fuzzy but by PA 1.1 (or somewhere in the
>"middle" of it, and certainly by PA 2.0, all the chips had an
>integrated floating point unit.

Yes, but the integer multiply was part of the floating-point unit,
and not the integer/pointer arithmetic unit, which caused a massive
slowdown on matrix codes where strength reduction was inapplicable.
I saw a factor of 3, on the total application time.

It has been known for half a century that integer multiply is a
critical addressing operation for a significant proportion of
important codes, though 'computer scientists' often refuse to
admit it.


Regards,
Nick Maclaren.

Ivan Godard

Jan 12, 2016, 7:16:47 PM
On 1/12/2016 3:01 PM, Nick Maclaren wrote:

> It has been known for half a century that integer multiply is a
> critical addressing operation for a significant proportion of
> important codes, though 'computer scientists' often refuse to
> admit it.

Nick - mul for scaling by stride, when you can't strength-reduce to an
add, or when there are more disjoint addresses than you have registers
to use for control variables.

However, with modern high-reg-count architectures (or long belt-count)
and a general choice of natural alignment in arrays (leading to the mul
being a manifest power-of-two and hence implementable with a shift) it's
much less an issue. And where neither strength reduction nor shift work,
most compilers are fairly clever about doing immediate multiplies as
shift-and-add combinations.
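For example, the classic transformation looks like this in source
terms (a toy C illustration; compilers apply it automatically when
they can prove it safe):

#include <stddef.h>

/* Naive form: one integer multiply per element to form the address. */
double col_sum_mul(const double *a, size_t n, size_t m, size_t j)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i * m + j];              /* needs i*m every iteration */
    return s;
}

/* Strength-reduced form: the multiply becomes a running pointer add. */
double col_sum_reduced(const double *a, size_t n, size_t m, size_t j)
{
    double s = 0.0;
    const double *p = a + j;
    for (size_t i = 0; i < n; i++, p += m)   /* add the stride instead */
        s += *p;
    return s;
}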

We looked hard at this, and concluded that we needed to support stride
scaling in addressing for all scalar strides. Pending measurement we
have not yet settled on whether to extend that to 2*scalar stride,
primarily for complex. Actually the bits are there to encode strides up
to 128, although the utility of that is questionable :-)

However, an integer multiplier is not that much of a monster these days,
for any but the extreme low end of our range, and so at least one is
configured in to nearly all of the anticipated Mill family members.

krw

Jan 12, 2016, 8:53:08 PM
On Tue, 12 Jan 2016 22:05:06 +0100, David Brown
<david...@hesbynett.no> wrote:

>On 12/01/16 18:25, John Larkin wrote:
>> On Tue, 12 Jan 2016 11:06:45 +0100, David Brown
>> <david...@hesbynett.no> wrote:
>>
>>> On 11/01/16 20:12, John Larkin wrote:
>>>> On Mon, 11 Jan 2016 13:42:57 -0500, Jerry Stuckle
>>>
>>>>> Nowadays processors (from micro to mainframe) are run by microcode.
>>>>
>>>> Not all. Some RISC machines are pure logic. ARM, Coldfire, maybe MIPS?
>>>
>>> (When the microcode based 68K architecture was being redesigned into a
>>> RISC structure with little or no microcode for the Coldfire, the
>>> designers planned to keep the microcoded division instructions. But
>>> then someone noticed that a pure software division routine ran faster
>>> than the microcoded division hardware, so support was dropped.)
>>
>> Some CPU, maybe the HPPA Risc or something, didn't even have a
>> multiply instruction. It cluttered the pipeline so they eliminated it.
>
>There are plenty of small cpus with no multiply, in order to keep the
>design small and simple. (I don't expect that die space was a major
>factor when omitting a multiplier from the HPPA Risc, or whichever
>processor you meant.) The interesting thing with division on the
>Coldfire was that the software-only solution turned out to be faster
>than the hardware (at least with the old design).

Then there was the IBM 1620 (IIRC) that didn't have an add
instruction. It was nicknamed CADET (Can't Add, Doesn't Even Try).
When hardware is expensive, lots of things don't get done.
>
>>
>>
>>>
>>> One of the driving forces of "RISC" compared to older "CISC" designs was
>>> to get rid of microcode.
>>>
>>> And since by far most modern cpus (both in terms of the numbers
>>> produced, and the number of designs) are microcontrollers, which almost
>>> never use microcode, it's fair to say that only a small proportion of
>>> currently active processors have microcode - even though those
>>> processors are rather important.
>>
>> I think the fundamental RISC concept is to design an instruction set
>> that's compiler friendly and not people friendly. CISC attempted to
>> make assembly programming look like a programming language; RISC
>> pretty much assumes that binaries are created by compilers.
>>
>
>That's true. But a key point in a RISC design is that the instructions
>should each be simple, and do one thing (so you have a "load"
>instruction, and an "add" instruction, but not a "load then add"
>instruction). Then you make these instructions run as fast as possible
>- and that means no microcoding to add delays.

As complexity of chips increased more of the CISCy instructions found
their way into RISCy processors. They've more or less met in the
middle.

Robert Baer

Jan 12, 2016, 11:23:53 PM
John Larkin wrote:
> On Mon, 11 Jan 2016 11:10:44 -0500, Larc<la...@notmyaddress.com>
> wrote:
>
>> On Mon, 11 Jan 2016 16:44:49 +0100, "Skybuck Flying"<skybu...@hotmail.com> wrote:
>>
>> | Hello,
>> |
>> | Apparently Intel's Skylake Processors can freeze up when calculating certain
>> | Prime Numbers.
>> |
>> | I am investigating this story further, for now here is a link about it:
>> |
>> | https://communities.intel.com/mobile/mobile-access.jspa#jive-content?content=%2Fapi%2Fcore%2Fv3%2Fcontents%2F524553
>>
>> Intel is apparently aware of this and is working with its partners to distribute a
>> fix in the form of a BIOS update.
>>
>> http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/
>>
>> Larc
>
> I wonder how the BIOS can fix an FPU error. Trap exceptions? Change
> some firmware?
>
>
EXACTLY what i thought.
Maybe the patch is to trap the offending instruction(s) and then
_emulate_ them (correctly?).
A good way to excessively slow down calculations of Pi to umpteen
digits, or do a Fourier on 10^4 digits or more.

John Levine

Jan 13, 2016, 12:19:05 AM
>I wonder how the BIOS can fix an FPU error. Trap exceptions? Change
>some firmware?

Probably adjust timing or voltages to avoid a race condition.

Terje Mathisen

Jan 13, 2016, 12:24:14 AM
David Brown wrote:
> On 12/01/16 18:25, John Larkin wrote:
>> Some CPU, maybe the HPPA Risc or something, didn't even have a
>> multiply instruction. It cluttered the pipeline so they eliminated it.
>
> There are plenty of small cpus with no multiply, in order to keep the
> design small and simple. (I don't expect that die space was a major
> factor when omitting a multiplier from the HPPA Risc, or whichever
> processor you meant.) The interesting thing with division on the
> Coldfire was that the software-only solution turned out to be faster
> than the hardware (at least with the old design).

If you have a fast/hw multiplier then it is relatively easy to use it to
generate a sw DIV algorithm which will at least show better throughput
than a classic 1 bit/cycle microcoded DIV instruction, i.e. similar to
what a 486 used.

The sw advantage increases with wider registers, since you can use
quadratic NR approximations which only require one more iteration to get
from 32 to 64 bits.
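The iteration in question is r' = r*(2 - y*r), which doubles the number
of correct bits of the reciprocal each step. A minimal sketch in double
precision, just to make the convergence visible (a real integer DIV
would run the same recurrence in fixed point and then fix up the final
quotient and remainder):

#include <stdio.h>

int main(void)
{
    double y = 7.0;
    double r = 0.1;                     /* crude initial guess for 1/y */
    for (int i = 0; i < 6; i++) {
        r = r * (2.0 - y * r);          /* quadratic Newton-Raphson step */
        printf("iter %d: r = %.17g (err %.3g)\n", i, r, r - 1.0 / 7.0);
    }
    printf("42/7 ~= %.17g\n", 42.0 * r);  /* divide via multiply */
    return 0;
}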

BTW, it is still somewhat hard to synthesize a 128 DIV 64 -> 64
result/64 remainder, any suggestions?

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Stephen Fuld

Jan 13, 2016, 12:44:16 AM
At first, your response stunned me. It seemed obvious that there was a
huge difference. But the more I thought about it, the harder it became
to explain to myself what the difference was. :-(

After some time, I came up with the following. Traditional microcode is
like an interpreter. When encountering an instruction, the CPU
"branches" to the appropriate microcode to handle that instruction.
There are pre-coded routines in the microcode store for each instruction
to be executed.

On the other hand, "breaking" a CISC instruction into one or more
micro-ops is more like a JIT compiler. When the complex instruction is
encountered, it is "compiled" into a sequence of "native" instructions,
which are then executed. Unlike the microcode approach, the
instructions don't exist before the instruction decode creates them
(though, of course, the decoder may have something like a "template" to
guide its "compilation").

Does this make any sense?
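As a toy C rendering of the analogy (all the names are invented): the
decoder "compiles" a CISC-style ADD reg, [mem] into a load micro-op and
an add micro-op on the fly, template-style, rather than branching to a
stored routine:

#include <stdio.h>

enum uop_kind { UOP_LOAD, UOP_ADD };

struct uop { enum uop_kind kind; int dst, src; };

/* "JIT" step: expand the CISC instruction ADD reg, [mem_reg] into
   micro-ops. Returns the number of micro-ops emitted. */
int decode_add_mem(int reg, int mem_reg, struct uop *out)
{
    int tmp = 99;                                    /* internal temp reg */
    out[0] = (struct uop){ UOP_LOAD, tmp, mem_reg }; /* tmp <- [mem_reg] */
    out[1] = (struct uop){ UOP_ADD,  reg, tmp };     /* reg <- reg + tmp */
    return 2;
}

int main(void)
{
    struct uop seq[4];
    int n = decode_add_mem(0, 5, seq);
    for (int i = 0; i < n; i++)
        printf("uop %d: kind=%d dst=r%d src=r%d\n",
               i, seq[i].kind, seq[i].dst, seq[i].src);
    return 0;
}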



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Terje Mathisen

Jan 13, 2016, 2:41:52 AM
Indeed it does.

In my mental model the big difference between an instruction which is
microcoded and one where the instruction decoder splits it into a fixed
set of actual hw instructions lies in the scheduling:

With a traditional complex microcoded instruction the CPU effectively
branches to a hw function, doing nothing else until it is done, while a
decoder that spits out multiple instructions will allow them to run
independently through the rest of the cpu stages.

Particularly with OoO cpus this difference can be significant.

upsid...@downunder.com

Jan 13, 2016, 3:15:23 AM
On Tue, 12 Jan 2016 09:25:25 -0800, John Larkin
<jjla...@highlandtechnology.com> wrote:

>
>>
>>One of the driving forces of "RISC" compared to older "CISC" designs was
>>to get rid of microcode.
>>
>>And since by far most modern cpus (both in terms of the numbers
>>produced, and the number of designs) are microcontrollers, which almost
>>never use microcode, it's fair to say that only a small proportion of
>>currently active processors have microcode - even though those
>>processors are rather important.
>
>I think the fundamental RISC concept is to design an instruction set
>that's compiler friendly and not people friendly. CISC attempted to
>make assembly programming look like a programming language; RISC
>pretty much assumes that binaries are created by compilers.

The original reason of going to microcoding and CISC was the huge cost
of memory. From CPU hardware design point of view it would be nice if
the ALU function (ADD/SUB/AND/OR) and each data path (data selector)
could be directly controlled by a bit in the instruction word.
Unfortunately, for most instructions, there are a lot of "Don't care"
bits, wasting a lot of expensive core memory bits.

One way to avoid this is to use some compact instruction set in core
memory and then use a complex instruction set decoder built from
random logic to generate all the control point signals needed by the
actual CPU hardware. At some point this became too complex and memory
chips were used to convert the compacted instruction set to generate
the individual data path control signals.

In addition, some sequences were common, so it made sense to generate
multiple long instruction word sequences using this compact to
expanded microcode store. Core main memory was slow (about 1 us) so if
the fast semiconductor microcode control store could generate multiple
hardware sequences during that, this was definitively a win. This also
further reduced the number of instruction needed to be stored in the
main program memory.

With the drop of memory prices and when caches become popular, it
became realistic to use long memory words to (more or less) directly
control each data path in the CPU and skip the microcode control
store.

Regarding CISC/RISC development, one might study the instruction set
of the 16 bit Data General Nova from the 1970's. Some instruction set
bits directly controlled the 74181 ALU chip function bits, some the
Carry_In to that chip and some bits were used to control the data
selectors.

Quadibloc

Jan 13, 2016, 3:52:08 AM
A traditional microcoded computer has a permanent microprogram in some sort of
high-speed storage, and it goes from one micro-instruction to the next as it
operates. In order to work that way, the microprogram would include
microinstructions that fetch instructions and decode them.

My understanding of how modern x86 microprocessors work is this:

Hardwired circuitry fetches and decodes instructions.

After they're decoded, they're turned into "micro-ops", which resemble
microcode in some ways, and RISC instructions in others. Most significantly, an
instruction to add (or perform some other arithmetic or logic operation) a data
item from memory to the contents of a register will be turned into separate
micro-ops, one to do a load and one to do the arithmetic on a register-
to-register basis. (This is the specific thing that leads to the term
"decoupled microarchitecture", since load-store RISC computers are also
referred to as having a decoupled architecture by virtue of separating the load
from the arithmetic.)

So there is no permanent microprogram resident in the processor - rather, a
stream of micro-ops is produced on the fly as a translation of the CISC program.

However, this picture has two exceptions, which lead to references to microcode
in modern x86 processors.

Instructions similar to the IBM 360 MVC, EDMK, and so on are implemented by
permanent microprograms, since they involve long sequences of operations. So
when they're encountered, the decoding and control circuitry calls upon the
microprogrammed mode of operation.

As well, the conversion from CISC instructions to micro-ops isn't really 100%
hardwired - it's perhaps table-driven, so that changes can be made after the
chip is released. I could be mistaken here, and perhaps the operation is closer
to the traditional microcode model than I realize, but my understanding is that
were today's processors... or even the x86 processors going back to the Pentium
or even the 486... microcoded, they would not perform as well as they do.

John Savard

already...@yahoo.com

Jan 13, 2016, 4:55:57 AM
See this response:
https://communities.intel.com/message/362383#362383
Henk says that the problem happens even on underclocked CPUs. That pretty much proves that the reason is NOT a critical timing path, so it cannot be fixed by less aggressive power management.

Nick Maclaren

Jan 13, 2016, 5:38:36 AM
In article <n744s9$lhb$1...@dont-email.me>,
Ivan Godard <iv...@millcomputing.com> wrote:
>
>> It has been known for half a century that integer multiply is a
>> critical addressing operation for a significant proportion of
>> important codes, though 'computer scientists' often refuse to
>> admit it.
>
>Nick - mul for scaling by stride, when you can't strength-reduce to an
>add, or when there are more disjoint addresses than you have registers
>to use for control variables.
>
>However, with modern high-reg-count architectures (or long belt-count)
>and a general choice of natural alignment in arrays (leading to the mul
>being a manifest power-of-two and hence implementable with a shift) it's
>much less an issue. And where neither strength reduction nor shift work,
>most compilers are fairly clever about doing immediate multiplies as
>shift-and-add combinations.

Not in my experience :-( The sort of algorithms that critically need
such code are not typically ones where you can control the sizes or
strides. I would be interested in which compilers, because I have
observed the converse - i.e. that trick is less common than it used
to be. But see below: dynamic sizes and strides are the ones of
interest.

>We looked hard at this, and concluded that we needed to support stride
>scaling in addressing for all scalar strides. Pending measurement we
>have not yet settled on whether to extend that to 2*scalar stride,
>primarily for complex. Actually the bits are there to encode strides up
>to 128, although the utility of that is questionable :-)

Grrk. That's a CATASTROPHIC solution! I am not arguing about whether
to be able to address in terms of simple, fixed object sizes - that's
a separate issue and the one that actually resolves. But, once you
start thinking of array sizes and strides as fixed at compile-time,
you end up with (original) Pascal/C/C++ - ghastly languages for array
algorithms, all.

>However, an integer multiplier is not that much of a monster these days,
>for any but the extreme low end of our range, and so at least one is
>configured in to nearly all of the anticipated Mill family members.

My point is that at least one needs to be part of the addressing
pipeline, in designs which separate that from the floating-point one.
It doesn't matter if it is fairly slow (e.g. 3x addition is fine,
and 5x is OK) and restricted to 'addressing size' results, but the
classic mistake is to allow it to force the addressing pipeline to
be halted and restarted. That's why PA-RISC ran like a drain on
such codes.

If I were doing it, and was inflicted with a two pipeline model, I
would have 'addressing only' operations in the addressing pipeline,
and a complete set of integer ones in the 'floating-point' one.
It would be only the latter that would have single*single=>double
and double/single=>(single,single), of course.


Regards,
Nick Maclaren.

already...@yahoo.com

unread,
Jan 13, 2016, 5:39:35 AM1/13/16
to
On modern Intel and AMD cores there are a few more steps. Specifically, in the case of Load+Op, the two micro-ops that were just separated are immediately fused together again into a "macro-op" or "fused micro-op" (the exact term depends on the manufacturer). They are then kept together in the decoded I-cache (on processors that have one), their operands are renamed together, and they are tracked together by the reorder buffer and retired together after execution. They are separated again only when dispatched to the EUs. So, looking at the whole length of the pipeline, the two parts of a Load+Op behave as a single unit for 80-85% of the time. On the AMD K7/K8 they went one step further and kept Load-Op-Store together in a single macro-op.

Overall it is more like "recoupled microarchitecture" than "decoupled microarchitecture".

Anonymous

unread,
Jan 13, 2016, 9:36:06 AM1/13/16
to
Yup. It is fascinating that the KDF9, delivered 52 years ago, had
auto-increment addressing modes that updated the loop count by 1
and the address/offset by a dynamically-settable stride that could be
as large as the largest address, and either positive or negative.

In conjunction with its FMA operation, and a jump-on-nonzero-count
instruction that looped inside the instruction buffer, it made a superb
scalar-product evaluator of a machine with very modest h/w resources.

I think I might detect the influence of Jim Wilkinson in this.
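
In C terms, the loop the KDF9 hardware made cheap is roughly this (a
sketch with hypothetical names; on the KDF9 the two strided fetches, the
FMA and the loop-closing jump-on-nonzero-count were each single
operations, per the description above):

    double dot(const double *x, long xstride,
               const double *y, long ystride, long n)
    {
        double s = 0.0;
        while (n-- != 0) {      /* jump on nonzero count        */
            s += *x * *y;       /* fused multiply-add           */
            x += xstride;       /* auto-increment by the stride */
            y += ystride;
        }
        return s;
    }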

--
Bill Findlay


Ivan Godard

unread,
Jan 13, 2016, 2:55:03 PM1/13/16
to
You are clearly thinking Fortran, in which the second dimension often
has a runtime stride, while I'm thinking C where there is no second
dimension, just a first dimension done twice :-)

However, I really don't think that Fortran matrix code will be run on
the lowest-end Mills, so lacking a multiplier shouldn't be much of an
issue. They don't have floating point either.

>> However, an integer multiplier is not that much of a monster these days,
>> for any but the extreme low end of our range, and so at least one is
>> configured in to nearly all of the anticipated Mill family members.
>
> My point is that at least one needs to be part of the addressing
> pipeline, in designs which separate that from the floating-point one.
> It doesn't matter if it is fairly slow (e.g. 3x addition is fine,
> and 5x is OK) and restricted to 'addressing size' results, but the
> classic mistake is to allow it to force the addressing pipeline to
> be halted and restarted. That's why PA-RISC ran like a drain on
> such codes.

Not applicable here. Mill is RISC-like in much of its addressing, and
there is no multiplier in the addressing pipes; if you need a stride,
you use an integer mul instruction and then compose your address from
the result. Where we are less RISCy is that we optimize the common case
of static *2/4/8/16 by muxing in the address adders, so those cases
don't need a multiply. And the Mill is statically scheduled/exposed
pipeline, so there's no issue of "stopping the pipeline" for any ops,
addressing included.

> If I were doing it, and was inflicted with a two pipeline model, I
> would have 'addressing only' operations in the addressing pipeline,
> and a complete set of integer ones in the 'floating-point' one.
> It would be only the latter that would have single*single=>double
> and double/single=>(single,single), of course.

By and large that's the way it works, with the division between flow and
exu sides, although all but the smallest Mills have several copies of each
side. The flow-side address adders are *not* ALUs - they are 3-input
mixed-length summers that know about Mill pointers (which are *not*
integers, despite the llvm assumptions that cause us so much
trouble). There is no flow-side mul. The exu side has regular ALU and MUL
FUs (and FPUs if configured). And there's some doubt whether we will
ever have hardware division.

Nick Maclaren

unread,
Jan 13, 2016, 3:44:14 PM1/13/16
to
In article <n769th$n5o$1...@dont-email.me>,
Ivan Godard <iv...@millcomputing.com> wrote:
>>
>> Grrk. That's a CATASTROPHIC solution! I am not arguing about whether
>> to be able to address in terms of simple, fixed object sizes - that's
>> a separate issue and the one that actually resolves. But, once you
>> start thinking of array sizes and strides as fixed at compile-time,
>> you end up with (original) Pascal/C/C++ - ghastly languages for array
>> algorithms, all.
>
>You are clearly thinking Fortran, in which the second dimension often
>has a runtime stride, while I'm thinking C where there is no second
>dimension, just a first dimension done twice :-)

Actually, it applies equally well to a multidimensional array class
in C++. But you are right that C does not support multidimensional
arrays - that was precisely my point! Damn the languages; I am
talking about the algorithms.

>However, I really don't think that Fortran matrix code will be run on
>the lowest-end Mills, so lacking a multiplier shouldn't be much of an
>issue. They don't have floating point either.

Don't bet on it. There are lots of matrix algorithms used in graph
theory, often using Boolean arithmetic, and those are important in
even some embedded codes.

>Not applicable here. Mill is RISC-like in much of its addressing, and
>there is no multiplier in the addressing pipes; if you need a stride,
>you use an integer mul instruction and then compose your address from
>the result. Where we are less RISCy is that we optimize the common case
>of static *2/4/8/16 by muxing in the address adders, so those cases
>don't need a multiply. And the Mill is statically scheduled/exposed
>pipeline, so there's no issue of "stopping the pipeline" for any ops,
>addressing included.

That's fine by me. It doesn't matter HOW it is done, provided that
integer multiplication in the addressing pipeline is only a minor
performance problem. It's only a few codes, but those often depend
on it very heavily.

>And there's some doubt whether we will ever have hardware division.

Again, provided that it performs acceptably, so what?


Regards,
Nick Maclaren.

Dennis

unread,
Jan 13, 2016, 3:56:58 PM1/13/16
to
On 01/12/2016 07:53 PM, krw wrote:

>
> Then there was the IBM 1800 (IIRC) that didn't have an add
> instruction. It was nicknamed CADET (can't add, didn't even try).
> When hardware is expensive, lots of things don't get done.

It was the 1620. It used a table look-up in memory for the add
operations. If you modified the table you could do other (sometimes
useful) things with it - or so I was told by an x-ray crystallographer
who did extensive programming on it. It used core memory that was
heated to above room temperature to minimize thermal effects on the
cores.
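
A toy sketch of table-driven addition in C (the real 1620 tables and
digit representation differ; this just shows the principle that the sum
and carry are looked up, not computed):

    /* Tables live in (modifiable) memory, as on the 1620. */
    static int sum_tbl[10][10], carry_tbl[10][10];

    static void init_tables(void)
    {
        for (int a = 0; a < 10; a++)
            for (int b = 0; b < 10; b++) {
                sum_tbl[a][b]   = (a + b) % 10;
                carry_tbl[a][b] = (a + b) / 10;
            }
    }

    /* Add two little-endian decimal-digit arrays of length n. */
    static void add_decimal(const int *a, const int *b, int *r, int n)
    {
        int carry = 0;
        for (int i = 0; i < n; i++) {
            int s = sum_tbl[a[i]][b[i]] + carry;
            carry = carry_tbl[a[i]][b[i]] + (s >= 10);
            r[i] = s % 10;
        }
    }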

MitchAlsup

unread,
Jan 13, 2016, 8:57:45 PM1/13/16
to
Given 48-bits as the fairly standard virtual address space these days:
adding a multiplier to the address generation path is akin to adding
the better part of a floating point multiplier in area and adding at
least 2 cycles of delay to the minimum data cache hit access. In addition
there is no obvious way to add the multiplicand to the address generation
operand specification at the instruction set level.

I suggest that 3-input addition is the best you are going to see for any
machine with high performance goals or low power goals or low chip area
goals.

krw

unread,
Jan 13, 2016, 9:39:40 PM1/13/16
to
These "stages" need not be in series, either. They can be parallel
execution units. The machine state is then updated when all of the
pieces are complete.

krw

unread,
Jan 13, 2016, 9:43:33 PM1/13/16
to
On Wed, 13 Jan 2016 14:56:55 -0600, Dennis <den...@none.none> wrote:

>On 01/12/2016 07:53 PM, krw wrote:
>
>>
>> Then there was the IBM 1800 (IIRC) that didn't have an add
>> instruction. It was nicknamed CADET (can't add, didn't even try).
>> When hardware is expensive, lots of things don't get done.
>
>It was the 1620.

You're right, of course.

Robert Wessel

unread,
Jan 14, 2016, 3:48:25 AM1/14/16
to
On 13 Jan 2016 14:36:03 GMT, Anonymous <no_e...@invalid.invalid>
wrote:
S/360 BXH/BXLE did something similar - add a value to an index, then
compare it to a limit before branching if the limit is not exceeded.
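
In C, one BXLE iteration is roughly (a sketch; the add, compare and
branch are a single S/360 instruction):

    /* index += increment; branch back while index <= limit. */
    void bxle_style(char *a, long limit, long increment)
    {
        for (long index = 0; index <= limit; index += increment)
            a[index] = 0;
    }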

Nick Maclaren

unread,
Jan 14, 2016, 5:53:03 AM1/14/16
to
In article <l3oe9b9afmqcom4ur...@4ax.com>,
Robert Wessel <robert...@yahoo.com> wrote:
>>>
>>> Grrk. That's a CATASTROPHIC solution! I am not arguing about whether
>>> to be able to address in terms of simple, fixed object sizes - that's
>>> a separate issue and the one that actually resolves. But, once you
>>> start thinking of array sizes and strides as fixed at compile-time,
>>> you end up with (original) Pascal/C/C++ - ghastly languages for array
>>> algorithms, all.
>>
>>Yup. It is fascinating that the KDF9, delivered 52 years ago, had
>>auto-increment addressing modes that updated the loop count by 1
>>and the address/offset by a dynamically-settable stride that could be
>>as large as the largest address, and either positive or negative.
>
>S/360 BXH/BXLE did something similar - add a value to an index, then
>compare it to a limit before branching if the limit is not exceeded.

However, both of those are solutions for array operations where
strength reduction works - indeed, they ARE strength reduction,
implemented in hardware. An addressing multiply is needed for the
cases where it cannot be used and yet multi-dimensional array
access is a critical operation.

There was and is a dogma among most computer scientists that it can
always be avoided, but they demonstrably do not have a clue. And,
as with all religious fanatics, pointing them at the hard facts has
no effect whatsoever. It's not too strong to say that this dogma
was one of the factors that caused the RISC revolution to fail.


Regards,
Nick Maclaren.

Quadibloc

unread,
Jan 14, 2016, 8:25:46 AM1/14/16
to
On Thursday, January 14, 2016 at 3:53:03 AM UTC-7, Nick Maclaren wrote:
> It's not too strong to say that this dogma
> was one of the factors that caused the RISC revolution to fail.

I wasn't aware that the RISC revolution has failed.

Yes, today's "RISC" architectures don't have all the characteristics of the
original RISC designs. But they have the important ones that are relevant to
reducing the need for complicated out-of-order execution hardware - memory
references restricted to loads and stores, a large number of registers.

And, of course, multiplication is used when accessing a multidimensional array
if one is not stepping through it but accessing its contents on some other
basis. If _that_ is not obvious to people lacking your level of expertise,
we're in a lot of trouble.
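
A sketch of the obvious case in C (hypothetical names): with a row
stride known only at run time, every such access needs an integer
multiply in the address computation.

    double fetch(const double *a, long ld, long i, long j)
    {
        /* address = base + (i*ld + j) * sizeof(double) */
        return a[i * ld + j];
    }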

John Savard

Bruce Hoult

unread,
Jan 14, 2016, 10:23:48 AM1/14/16
to
On Thursday, January 14, 2016 at 4:25:46 PM UTC+3, Quadibloc wrote:
> I wasn't aware that the RISC revolution has failed.

Indeed not!

x86, the only proudly CISC survivor, has long since lost the overall volume leadership position, and is losing ground in market after market.

> Yes, today's "RISC" architectures don't have all the characteristics of the
> original RISC designs. But they have the important ones that are relevant to
> reducing the need for complicated out-of-order execution hardware - memory
> references restricted to loads and stores, a large number of registers.

Struggling to think of what has been dropped, other than branch delay slots, and a philosophy (in some, but by no means all) that if you try to use a result before it's ready then you deserve what you get. Though this still exists in parts of DSP land.

One or two RISCy things have decided that it might be ok, after all, to have two instruction lengths, as long as you can tell from a few bits which one you have. Still a far cry from x86's anything from 1-16 bytes and you could construct instructions longer than 16 if you were allowed to.

> And, of course, multiplication is used when accessing a multidimensional array
> if one is not stepping through it but accessing its contents on some other
> basis. If _that_ is not obvious to people lacking your level of expertise,
> we're in a lot of trouble.

Arbitrary-size but compile-time elements (or dimensions) usually need only one or two shift-and-add/sub instructions.

Multi-dimensional arrays done as trees of arrays of pointers don't need multiplication.

Multi-dimensional arrays with runtime sizes, packed tightly contiguously together, need multiply. But if you pad each dimension to the next power of two then you only need shifts. This wastes a bit of address space (but not much for low-dimension arrays), but barely wastes any physical RAM on modern general-purpose machines from Android/iOS phones (or Raspberry Pi etc) and up.
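
A sketch of the padded version in C (hypothetical names): with each row
padded to 1 << log2w elements, the index multiply becomes a shift.

    double fetch_padded(const double *a, int log2w, long i, long j)
    {
        return a[(i << log2w) + j];
    }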

James Van Buskirk

unread,
Jan 14, 2016, 11:17:42 AM1/14/16
to
"Bruce Hoult" wrote in message
news:bcef9dee-48aa-41b9...@googlegroups.com...

> Multi-dimensional arrays done as trees of arrays of pointers don't
> need multiplication.

Sure, but isn't doubling the memory traffic much worse?

> Multi-dimensional arrays with runtime sizes, packed tightly
> contiguously together, need multiply. But if you pad each
> dimension to the next power of two then you only need shifts
> This wastes a bit of address space (but not much for low
> dimension arrays), but barely wastes any physical RAM on
> modern general purpose machines from Android/iOS phones
> (or Raspberry Pi etc) and up.

The problem is that the compiler can't do this because it
violates sequence association rules, and if you really did this
you might find that it causes cache thrashing because all
accesses would be to the same address mod 2**N.
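
A quick illustration of that effect, assuming a hypothetical 32 KB,
8-way, 64-byte-line L1 (so 64 sets, with the set index in address bits
6..11):

    #include <stdio.h>

    int main(void)
    {
        /* A power-of-two row stride of 4096 bytes (64 sets * 64 B)
           maps every element of a column to the SAME cache set. */
        unsigned long base = 0x100000, row_bytes = 4096;
        for (int i = 0; i < 4; i++) {
            unsigned long addr = base + i * row_bytes;
            printf("row %d: set %lu\n", i, (addr >> 6) & 63);
        }
        return 0;   /* prints the same set number every time */
    }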

Quadibloc

unread,
Jan 14, 2016, 11:29:39 AM1/14/16
to
On Thursday, January 14, 2016 at 8:23:48 AM UTC-7, Bruce Hoult wrote:
> On Thursday, January 14, 2016 at 4:25:46 PM UTC+3, Quadibloc wrote:

> > I wasn't aware that the RISC revolution has failed.

> Indeed not!

> x86 is the only proudly CISC survivor, had long since lost the overall volume leadership position, and is losing it in market after market.

What about System z? Now there's a survivor for you!

John Savard

Nick Maclaren

unread,
Jan 14, 2016, 11:31:47 AM1/14/16
to
In article <aa5817a9-662d-49c6...@googlegroups.com>,
Quadibloc <jsa...@ecn.ab.ca> wrote:
>On Thursday, January 14, 2016 at 3:53:03 AM UTC-7, Nick Maclaren wrote:
>> It's not too strong to say that this dogma
>> was one of the factors that caused the RISC revolution to fail.
>
>I wasn't aware that the RISC revolution has failed.

Then you haven't been paying attention. It failed in two ways:

I remember what it was claimed to do, and its claimed objectives,
and virtually all of those have not been delivered or have been
abandoned.

>Yes, today's "RISC" architectures don't have all the characteristics of the
>original RISC designs. But they have the important ones that are relevant to
>reducing the need for complicated out-of-order execution hardware - memory
>references restricted to loads and stores, a large number of registers.

That is revisionism, pure and simple. Those properties predate the
RISC revolution by decades, and they DON'T reduce the need for
complicated out-of-order hardware, though I accept that they do
simplify it, slightly. The original claims and intent were far more
grandiose than such minor tweaks.

No, I am NOT saying that all RISC designs have failed - ARM hasn't,
though the original RISC proponents asserted that it wasn't RISC.
MIPS hasn't, entirely, and its problems were not technical, but it
had to be extensively de-RISCed to make it fly. And, if POWER is a
genuinely simpler hardware solution, what will you offer me for
London Bridge?

My point is that the REVOLUTION failed. Inter alia, it was claimed
to deliver a lot more performance, simplify the task of compiler
writers, and replace CISC systems in the marketplace. Much like the
Itanic. RISC did not fail, but its claimed revolution did.


In article <bcef9dee-48aa-41b9...@googlegroups.com>,
Bruce Hoult <bruce...@gmail.com> wrote:
>
>Multi-dimensional arrays with runtime sizes, packed tightly contiguously
>together, need multiply. But if you pad each dimension to the next power
>of two then you only need shifts ...

Yes, multiplication by a power of two is equivalent to shifting,
but that padding trick is generally counter-productive and does
NOT address all of the requirements. It interacts very badly
with slicing: take a 256x256 matrix A and pass the submatrix A(::3,::3).


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Jan 14, 2016, 11:33:05 AM1/14/16
to
In article <n78hhv$u3l$1...@dont-email.me>,
James Van Buskirk <not_...@comcast.net> wrote:
>"Bruce Hoult" wrote in message
>news:bcef9dee-48aa-41b9...@googlegroups.com...
>
>> Multi-dimensional arrays done as trees of arrays of pointers don't
>> need multiplication.
>
>Sure, but isn't doubling the memory traffic much worse?

And slicing/subsectioning becomes a ridiculously expensive operation,
rather than the cheap one that is needed for the many algorithms
that should use it.


Regards,
Nick Maclaren.

Quadibloc

unread,
Jan 14, 2016, 11:33:56 AM1/14/16
to
On Thursday, January 14, 2016 at 8:23:48 AM UTC-7, Bruce Hoult wrote:
> But if you pad each dimension to the next power of two then you only need
> shifts. This wastes a bit of address space (but not much for low-dimension
> arrays), but barely wastes any physical RAM on modern general purpose machines
> from Android/iOS phones (or Raspberry Pi etc) and up.

This assumes you can afford the time for your indexed addresses to go through a
TLB and/or page table. This is still a level of indirection, even if it's not
explicit like using pointers.

People who are number-crunching at the edge of hardware capability need to do
everything in their power to be efficient.

John Savard

Quadibloc

unread,
Jan 14, 2016, 11:40:05 AM1/14/16
to
On Thursday, January 14, 2016 at 9:31:47 AM UTC-7, Nick Maclaren wrote:
> RISC did not fail, but its claimed revolution did.

I'm happy to agree that _that_ RISC revolution failed. But if I were to say that the RISC revolution failed, I think I would confuse people, since it's obvious RISC won the war with CISC.

I would instead say that "purist RISC" (i.e. no floating point, all
instructions in one cycle) is dead. But I wouldn't necessarily mean the same
thing as you if I did.

And I realize that my ideas concerning the relationship of RISC and what OoO is
used for are apparently controversial. I may indeed be mistaken, or it may just
be I need to qualify my comments to say that Tomasulo OoO, OoO with register
rename, is what RISC (with enough registers) can obviate - with _scoreboard_
OoO (as in the Motorola 88000) being sufficient to handle the remaining issue
of L1 cache misses.

But that could be horribly mistaken for technical reasons too complicated to
attempt to enlighten me about.

John Savard

Nick Maclaren

unread,
Jan 14, 2016, 11:48:34 AM1/14/16
to
In article <2dd6eaea-9c78-4384...@googlegroups.com>,
Quadibloc <jsa...@ecn.ab.ca> wrote:
>On Thursday, January 14, 2016 at 9:31:47 AM UTC-7, Nick Maclaren wrote:
>> RISC did not fail, but its claimed revolution did.
>
>I'm happy to agree that _that_ RISC revolution failed. But if I were to
>say that the RISC revolution failed, I think I would confuse people,
>since it's obvious RISC won the war with CISC.

Yer, whaa? RISC was claimed and intended to replace CISC; so which
has replaced the other on servers and workstations? ARM, which
was claimed to be not really RISC, dominates the low-end embedded
market - but CISC never really had a presence there.

>
>I would instead say that "purist RISC" (i.e. no floating point, all
>instructions in one cycle) is dead. But I wouldn't necessarily mean the same
>thing as you if I did.
>
>And I realize that my ideas concerning the relationship of RISC and what OoO is
>used for are apparently controversial. I may indeed be mistaken, or it may just
>be I need to qualify my comments to say that Tomasulo OoO, OoO with register
>rename, is what RISC (with enough registers) can obviate - with _scoreboard_
>OoO (as in the Motorola 88000) being sufficient to handle the remaining issue
>of L1 cache misses.
>
>But that could be horribly mistaken for technical reasons too complicated to
>attempt to enlighten me about.
>
>John Savard



Regards,
Nick Maclaren.

Anton Ertl

unread,
Jan 14, 2016, 12:06:24 PM1/14/16
to
n...@wheeler.UUCP (Nick Maclaren) writes:
>I remember what it was claimed to do, and its claimed objectives,
>and virtually all of those have not been delivered or have been
>abandoned.

Such as?

>No, I am NOT saying that all RISC designs have failed - ARM hasn't,
>though the original RISC proponents asserted that it wasn't RISC.

Citation needed.

>My point is that the REVOLUTION failed.

We still have the i386 architecture (extended to 64 bits with AMD64),
and still have the S/360 architecture (also extended to 64 bits), but
where are VAX, 68k and the lesser-known architectures of that time
(e.g. Data General Eclipse MV). Ok, so architectures die, both CISC
and RISC, but if the RISC revolution failed, new architectures would
be CISC rather than RISC. Let's see: The only new general-purpose
architecture in the last decade with significant deployment numbers is
Aarch64, and it is RISC (and quite a bit closer to the RISC mainstream
than ARM).

> Inter alia, it was claimed
>to deliver a lot more performance,

It did, for a while.

>simplify the task of compiler
>writers,

Well, at least instruction selection is simpler, but of course
compilers are as complex as the budget allows.

> and replace CISC systems in the marketplace.

HPPA, MIPS and SPARC replaced the 68k in the marketplace. Alpha
replaced the VAX in the marketplace. Admittedly, AMD64 has replaced
much of those since then.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Nick Maclaren

unread,
Jan 14, 2016, 1:08:27 PM1/14/16
to
In article <2016Jan1...@mips.complang.tuwien.ac.at>,
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>n...@wheeler.UUCP (Nick Maclaren) writes:
>>I remember what it was claimed to do, and its claimed objectives,
>>and virtually all of those have not been delivered or have been
>>abandoned.
>
>Such as?

See my other postings. And please pay attention.

>>No, I am NOT saying that all RISC designs have failed - ARM hasn't,
>>though the original RISC proponents asserted that it wasn't RISC.
>
>Citation needed.

Look at most of the early papers. An immediate search gives:

http://home.gwu.edu/~mlancast/CS6461Reference/RISC/patterson85.RISComputers.pd.pdf

>> Inter alia, it was claimed
>>to deliver a lot more performance,
>
>It did, for a while.

Only for carefully selected benchmarks. I helped several people with
hand tuning to get their new, wonderful, five times faster RISC system
to achieve merely faster results than their old 68K ones. That was
a fairly typical experience on real-life workloads.


Regards,
Nick Maclaren.

already...@yahoo.com

unread,
Jan 14, 2016, 2:35:26 PM1/14/16
to
On Thursday, January 14, 2016 at 6:48:34 PM UTC+2, Nick Maclaren wrote:
> In article <2dd6eaea-9c78-4384...@googlegroups.com>,
> Quadibloc <jsa...@ecn.ab.ca> wrote:
> >On Thursday, January 14, 2016 at 9:31:47 AM UTC-7, Nick Maclaren wrote:
> >> RISC did not fail, but its claimed revolution did.
> >
> >I'm happy to agree that _that_ RISC revolution failed. But if I were to
> >say that the RISC revolution failed, I think I would confuse people,
> >since it's obvious RISC won the war with CISC.
>
> Yer, whaa? RISC was claimed and intended to replace CISC; so which
> has replaced the other on servers and workstations? ARM, which
> was claimed to be not really RISC, dominates the low-end embedded
> market - but CISC never really had a presence there.

In recent years, mostly already this decade, ARM started to make inroads into the genuine low end. Yes, the processors it replaces here can't be called CISC with a straight face, but they are not RISC either.

Also ARM absolutely wiped the floor in hand-held computing devices. Yes, it always had a significant presence here, but in the past it was one of many players, some of which were other RISCs, but many were CISCs, like the 68K in Palms or the 386 in several models of Nokia Communicators and early Blackberry phones.
By now ARM owns the market all to itself, with Intel just trying to scratch out a few percent at the 2-in-1 tablet high end.

Bruce Hoult

unread,
Jan 14, 2016, 2:57:57 PM1/14/16
to
On Thursday, January 14, 2016 at 7:17:42 PM UTC+3, James Van Buskirk wrote:
> "Bruce Hoult" wrote in message
> news:bcef9dee-48aa-41b9...@googlegroups.com...
>
> > Multi-dimensional arrays done as trees of arrays of pointers don't
> > need multiplication.
>
> Sure, but isn't doubling the memory traffic much worse?

Depends on the relative latency of a multiply and a (probably) L1 cache hit.

Bruce Hoult

unread,
Jan 14, 2016, 3:02:26 PM1/14/16
to
How so? It's very cheap. The array/slice descriptor has a pointer to the top-level array of pointers, the number of dimensions, and for each dimension the offset to add to pointers and of course the bounds.

Bruce Hoult

unread,
Jan 14, 2016, 3:13:17 PM1/14/16
to
I don't think it's proudly CISC! A good number of registers, but a very constrained number of instruction sizes. Not load/store, but only one memory operand (except strings?). And, if I recall correctly (and more importantly), only one TLB hit per operand.

Stephen Fuld

unread,
Jan 14, 2016, 3:24:55 PM1/14/16
to
On 1/14/2016 12:13 PM, Bruce Hoult wrote:
> On Thursday, January 14, 2016 at 7:29:39 PM UTC+3, Quadibloc wrote:
>> On Thursday, January 14, 2016 at 8:23:48 AM UTC-7, Bruce Hoult wrote:
>>> On Thursday, January 14, 2016 at 4:25:46 PM UTC+3, Quadibloc wrote:
>>
>>>> I wasn't aware that the RISC revolution has failed.
>>
>>> Indeed not!
>>
>>> x86 is the only proudly CISC survivor, had long since lost the overall volume leadership position, and is losing it in market after market.
>>
>> What about System z? Now there's a survivor for you!
>
> I don't think it's proudly CISC! A good number of registers, but a very constrained number of instruction sizes.

> Not load/store, but only one memory operand (except strings?).


And all of the decimal arithmetic instructions. And the "string"
instructions contain some pretty "complex" stuff.


> And, if I recall correctly (and more importantly), only one TLB hit per
> operand.


Not true of the decimal instructions, nor of course, the string
instructions.

Given that you admit that it violates some of the key aspects of what
was called RISC, I disagree with your assessment. ISTM you can't say it
is not CISC if it violates several of the primary precepts of RISC.



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Bruce Hoult

unread,
Jan 14, 2016, 3:48:38 PM1/14/16
to
On Thursday, January 14, 2016 at 11:24:55 PM UTC+3, Stephen Fuld wrote:
> Given that you admit that it violates some of the key aspects of what
> was called RISC, I disagree with your assessment. ISTM you can't say it
> is not CISC if it violates several of the primary precepts of RISC.

I do not claim the S/360 is RISC!

What I said is that it is not *proudly* CISC in the way that x86 is, and VAX and 68020 were. It is more CISC than RISC, but not extremely so in the core instruction set.

Stephen Fuld

unread,
Jan 14, 2016, 4:06:18 PM1/14/16
to
On 1/14/2016 12:48 PM, Bruce Hoult wrote:
> On Thursday, January 14, 2016 at 11:24:55 PM UTC+3, Stephen Fuld wrote:
>> Given that you admit that it violates some of the key aspects of what
>> was called RISC, I disagree with your assessment. ISTM you can't say it
>> is not CISC if it violates several of the primary precepts of RISC.
>
> I do not claim the S/360 is RISC!
>
> What is said is it is not *proudly* CISC in the way that x86 is, and VAX and 68020 were.


OK. I don't think an ISA can be "proud", and since the original design
far predates even the terms RISC and CISC, I think the designers just
tried to do the best they could given their constraints and couldn't
care less about nomenclature.


> It is more CISC than RISC, but not extremely so in the core instruction set.

OK.

Nick Maclaren

unread,
Jan 14, 2016, 4:24:52 PM1/14/16
to
In article <a344fc03-e796-4898...@googlegroups.com>,
Bruce Hoult <bruce...@gmail.com> wrote:
>> >
>> >> Multi-dimensional arrays done as trees of arrays of pointers don't
>> >> need multiplication.
>> >
>> >Sure, but isn't doubling the memory traffic much worse?
>>
>> And slicing/subsectioning becomes a ridiculously expensive operation,
>> rather than the cheap one that is needed for the many algorithms
>> that should use it.
>
>How so? it's very cheap. The array/slice descriptor has a pointer to the
>top level array of pointers, the number of dimensions, and for each
>dimension the offset to add to pointers and of course the bounds.

Now take a slice in the other dimension (i.e. you need to use both
A(:,N) and A(N,:)). There is simply no way to use Iliffe vectors
that make both efficient, as was discovered in the 1960s.
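
A sketch of the asymmetry in C (hypothetical layout): with an Iliffe
vector (an array of row pointers), one direction is a pointer copy and
the other is a gather.

    /* Row slice A(N,:) is free - just reuse the row pointer. */
    double *row_slice(double **a, long n) { return a[n]; }

    /* Column slice A(:,N) has no cheap representation: you must either
       copy it out (O(rows)), as below, or fall back to strided
       addressing, which the pointer tree was supposed to avoid. */
    void col_slice(double **a, long rows, long n, double *out)
    {
        for (long i = 0; i < rows; i++)
            out[i] = a[i][n];
    }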


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Jan 14, 2016, 4:34:55 PM1/14/16
to
In article <cf0c5398-5af0-49e9...@googlegroups.com>,
<already...@yahoo.com> wrote:
>
>Also ARM absolutely wiped the floor in hand-held computing devices. Yes,
>it always had significant presence here, but in the past it was one of
>the many players, some of which were other RISCs, but many were CISCs,
>like 68K in Palms or 386 in several models of Nokia Communicators and
>early Blackberry phones.
>By now ARM owns the market all to itself, with Intel just trying to
>scratch few percents on the 2-in-1 tablets high end.

It is more accurate to say that the CISCs never established a solid
presence in that market. Yes, there were a few devices, and the
vendors were trying to establish a presence, but they failed.

Also, try reversing it. RISC systems used to have a very solid
presence in the server and workstation markets. Where are they
today? Nowhere, except for a few niches in the server market
(mainly POWER, but also SPARC).

I always did prefer RISC as a philosophy (rather than the religion
the fundamentalists proposed), and am sorry that economic politics
caused the current separated duoculture. But that's another matter
from describing what did or did not happen.


Regards,
Nick Maclaren.

MitchAlsup

unread,
Jan 14, 2016, 6:08:40 PM1/14/16
to
On Thursday, January 14, 2016 at 3:34:55 PM UTC-6, Nick Maclaren wrote:

> I always did prefer RISC as a philosophy (rather than the religion
> the fundamentalists proposed), and am sorry that economic politics
> caused the current separated duoculture. But that's another matter
> from describing what did or did not happen.

In almost all large battles, the general with the biggest, best equipped
army wins. ARM being the exception to the rule, preferring to win ground
the others did not want until it was too late.

Terje Mathisen

unread,
Jan 14, 2016, 8:39:06 PM1/14/16
to
Bruce Hoult wrote:
> Multi-dimensional arrays done as trees of arrays of pointers don't
> need multiplication.

No, but they require more memory accesses, which are rapidly becoming
more expensive than one or more MULs.
>
> Multi-dimensional arrays with runtime sizes, packed tightly
> contiguously together, need multiply. But if you pad each dimension
> to the next power of two then you only need shifts This wastes a bit
> of address space (but not much for low dimension arrays), but barely
> wastes any physical RAM on modern general purpose machines from
> Android/iOS phones (or Raspberry Pi etc) and up.
>
Here is where I disagree violently:

Padding to power of two sizes will very often cause pessimally bad cache
behavior!

I.e. as soon as you stride along a column in such a padded array, you
end up accessing the same small number of cache lines on every access,
and a single L1 miss will usually cost you as much as several MULs.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

unread,
Jan 14, 2016, 8:48:45 PM1/14/16
to
We've discussed this multiple times previously, it is pretty clear by
now that arrays should be passed with descriptors, not just base address
pointers.

For C(++) programs there's just a tiny bit of existing code and
interfaces that would hinder a full conversion. :-(

Terje Mathisen

unread,
Jan 14, 2016, 9:06:54 PM1/14/16
to
If you use a flat descriptor instead of arrays of pointers to arrays of
pointers to arrays etc, then slicing is cheap, otherwise it is quite
expensive, i.e. you need to replicate the entire tree structure if the
base access pattern assumes C-style pointers to the start of each array.

Terje Mathisen

unread,
Jan 14, 2016, 9:14:53 PM1/14/16
to
The big step is from x86 to VAX/68020, not from (say) MIPS to x86.

When you remove instruction decoding (which is basically a solved
problem these days), the only real CISCy x86 feature is the fast load-op
instructions.

You still limit the number of TLB accesses per instruction, except for
string ops, but those are architecturally really small subroutine loops
which can be interrupted at any point because all the temporary state is
stored in regular registers.

James Van Buskirk

unread,
Jan 14, 2016, 9:17:14 PM1/14/16
to
"Bruce Hoult" wrote in message
news:5f4a5fce-b51d-4ca3...@googlegroups.com...
I have to admit to being superstitious about extra memory traffic.
I wrote some radix-2 FFT code on a 21164 and it had the problem
that it was load/store limited, even out of L1 cache. But when a
load from L2 cache was necessary, the instruction got replayed.
This sort of put the processor out of whack so that if the next
load came too soon afterward, that instruction would also be
replayed, even if it was from L1. Sort of like a 1-cycle loop on a
Pentium Classic, once the train falls off the rails it never gets
back on track again. I had to switch to a higher radix; needed
radix-64 to be lower in opcount than split radix.

There was also an example with a Core 2 Duo. That processor
only had 2-register instructions so you needed to slop a lot of
MOVAPS instructions in there, but the problem was that the
processor could and did issue them to any port, even the
computational ports, getting in the way of useful work. So
I took a MOVAPS between registers out of the instruction
stream and replaced it with a store and load, and sure enough
the program got a little bit faster. Trouble was, AMD chips
weren't smart enough to figure out that the write to that
memory location wasn't trampling on data that it was always
loading so other loads would stall until that store retired,
making the program a lot slower. I had to revert to the secret
instruction that emulated a MOVAPS but executed outside
the computational ports.

So, through superstition, I am guided to avoid unnecessarily
stressing even L1 cache bandwidth because it can and
sometimes does create problems in ways that are not so
easy to predict from the documentation. Of course, if one
only programs in high level language, the compiled code
that one gets is slow enough that these compound stall
issues don't arise with perceptible frequency.

Quadibloc

unread,
Jan 14, 2016, 11:20:44 PM1/14/16
to
On Thursday, January 14, 2016 at 2:34:55 PM UTC-7, Nick Maclaren wrote:

> Also, try reversing it. RISC systems used to have a very solid
> presence in the server and workstation markets. Where are they
> today? Nowhere, except for a few niches in the server market
> (mainly POWER, but also SPARC).

But, as you point out in your next paragraph, that has nothing to do with the
x86 being architecturally superior, and everything to do with the fact that the
x86 is subsidized by the volume sales generated by its being the processor that
runs Windows.

John Savard

Quadibloc

unread,
Jan 14, 2016, 11:21:23 PM1/14/16
to
On Thursday, January 14, 2016 at 4:08:40 PM UTC-7, MitchAlsup wrote:

> In almost all large battles, the general with the biggest, best equipped
> army wins. ARM being the exception to the rule, preferring to win ground
> the others did not want until it was too late.

Niche markets are (almost) always the way for upstarts to get a foothold.

John Savard

Anton Ertl

unread,
Jan 15, 2016, 5:56:48 AM1/15/16
to
n...@wheeler.UUCP (Nick Maclaren) writes:
>In article <2016Jan1...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>n...@wheeler.UUCP (Nick Maclaren) writes:
>>>I remember what it was claimed to do, and its claimed objectives,
>>>and virtually all of those have not been delivered or have been
>>>abandoned.
>>
>>Such as?
>
>See my other postings. And please pay attention.

I called your bluff and you have nothing to show.

>>>No, I am NOT saying that all RISC designs have failed - ARM hasn't,
>>>though the original RISC proponents asserted that it wasn't RISC.
>>
>>Citation needed.
>
>Look at most of the early papers. An immediate search gives:
>
>http://home.gwu.edu/~mlancast/CS6461Reference/RISC/patterson85.RISComputers.pd.pdf

This paper does not even mention ARM, which is not surprising given
that the paper came out in January 1985, while the first samples of
ARM silicon were only received and tested by Acorn in April 1985.
They look at three CPUs (IBM 801, RISC II, and Stanford MIPS) that
have 4 traits in common, and looking at ARM (or Thumb) it has three of
the same characteristics; the characteristic it does not have is
branch delay slots; Power and Alpha also don't have that
characteristic.

>>> Inter alia, it was claimed
>>>to deliver a lot more performance,
>>
>>It did, for a while.
>
>Only for carefully selected benchmarks. I helped several people with
>hand tuning to get their new, wonderful, five times faster RISC system
>to achieve merely faster results than their old 68K ones. That was
>a fairly typical experience on real-life workloads.

My experience was that on those things I did, HPPA-based HP machines
were a lot faster than 68030-based machines, MIPS-based DecStations
were faster than 68k-based Apollos, the 88K-based Aviions were faster
then the 386 box we also had. The difference became smaller with the
486 and the Pentium; for
<https://www.complang.tuwien.ac.at/franz/latex-bench>, it vanished
already with the Pentium, for SPEC with the Pentium Pro.

From what you write, it seems that your performance problem was with
codes that used lots of integer multiplies, and many early RISCs did
not have multiply instructions, based on the observation that most
code does not do much integer multiplication.

Of course, if your code does not fit the pattern that they optimize
for, their CPUs are probably not a good fit for you. Then you need to
benchmark the machines with your code rather than the standard
benchmarks (at the time Dhrystone and Whetstone). Did you benchmark
MIPS-based machines? MIPS did have a multiply instruction already in
MIPS I.

Anton Ertl

unread,
Jan 15, 2016, 9:48:14 AM1/15/16
to
While CISCs are not as close to each other as RISCs are to each other,
the S/360 is relatively close to the PDP-11, the 386, and somewhat to
the 68000. All except 68000 are register machines (machines that
mostly use general-purpose registers); all are two-address machines
with load-op and read-modify-write instructions. The VAX is further
away, with three addresses that all could reference memory (with
indirect addressing modes), and the 68020 moved a bit in the direction
of the VAX.

Nick Maclaren

unread,
Jan 16, 2016, 4:41:46 AM1/16/16
to
>Bruce Hoult <bruce...@gmail.com> writes:
>>On Thursday, January 14, 2016 at 7:29:39 PM UTC+3, Quadibloc wrote:
>>> What about System z? Now there's a survivor for you!
>>
>>I don't think it's proudly CISC! A good number of registers, but a very
>constrained number of instruction sizes. Not load/store, but only one
>memory operand (except strings?). And, if I recall correctly (and more
>importantly), only one TLB hit per operand.
>
>While CISCs are not as close to each other as RISCs are to each other,
>the S/360 is relatively close to the PDP-11, the 386, and somewhat to
>the 68000. All except 68000 are register machines (machines that
>mostly use general-purpose registers); all are two-address machines
>with load-op and read-modify-write instructions. The VAX is further
>away, with three addresses that all could reference memory (with
>indirect addressing modes), and the 68020 moved a bit in the direction
>of the VAX.

The mind boggles. The System/370 and PDP-11 addressing models are
wildly different, and the addressing model is one of the most
important parts of an architecture.

I didn't respond to your nonsense earlier, because your fanaticism is
clearly making debate impossible, but I happened to notice that even
ARM say that their architecture is only SIMILAR to RISC :-)

http://www.arm.com/products/processors/instruction-set-architectures/

If anyone is still following this, I don't happen to believe that
the CISC/RISC distinction ever was useful - some architectures are
simpler than others, and simplicity is always a benefit. I regard
arguing over what constitutes RISC and which architectures are what
as equivalent to arguing how many angels can dance on the head of
a pin. As with the similar System V / Berkeley Unix 'debate', the
future will have aspects of both.


And, on another matter, no, accessing data beyond the end of an
object is never harmless, not even if the value is never inspected.
The main fundamental risk is that the object is at the end of an
authorisation domain. Page faults can be ignored, but that is
really, really bad news for RDMA and memory-mapped I/O. And
remember that, in C, objects can be of any size.


Regards,
Nick Maclaren.

Megol

unread,
Jan 16, 2016, 9:25:58 AM1/16/16
to
On Friday, January 15, 2016 at 3:48:14 PM UTC+1, Anton Ertl wrote:
> Bruce Hoult <bruce...@gmail.com> writes:
> >On Thursday, January 14, 2016 at 7:29:39 PM UTC+3, Quadibloc wrote:
> >> What about System z? Now there's a survivor for you!
> >
> >I don't think it's proudly CISC! A good number of registers, but a very constrained number of instruction sizes. Not load/store, but only one memory operand (except strings?). And, if I recall correctly (and more importantly), only one TLB hit per operand.
>
> While CISCs are not as close to each other as RISCs are to each other,
> the S/360 is relatively close to the PDP-11, the 386, and somewhat to
> the 68000. All except 68000 are register machines (machines that
> mostly use general-purpose registers); all are two-address machines
> with load-op and read-modify-write instructions. The VAX is further
> away, with three addresses that all could reference memory (with
> indirect addressing modes), and the 68020 moved a bit in the direction
> of the VAX.

X86 (i386) and 68k are very closely related: two-address machines with load-execute and load-execute-store support. Normal instructions have up to one memory reference and a few have two (x86: MOVSx; 68k: MOVEx mem, mem). Both have 8 registers that can be used as general registers (the 68k has an additional 8 address registers). If x86 is a register machine, the 68k is one too.

Nick Maclaren

unread,
Jan 16, 2016, 10:12:19 AM1/16/16
to
In article <a13d6dd0-cd24-49f8...@googlegroups.com>,
Megol <gole...@gmail.com> wrote:
>
>X86 (i386) and 68k are very closely related, two address machines with
>load execute and load execute store support. Normal instructions have up
>to one memory reference and a few have two (x86: MOVSx, 68k: MOVEx mem,
>mem). Both have 8 register that can be used as general (the 68k have an
>additional 8 address registers). If x86 is a register machine the 68k is
>one too.

Again, look at their addressing models. I agree that, if you think
of only the 80386 modes, it could be regarded as fairly close to
the 68000. Remember that architecture is more about the design
concepts than the details.


Regards,
Nick Maclaren.

Terje Mathisen

unread,
Jan 16, 2016, 11:15:45 AM1/16/16
to
Nick Maclaren wrote:
> If anyone is still following this, I don't happen to believe that
> the CISC/RISC distinction ever was useful - some architectures are
> simpler than others, and simplicity is always a benefit. I regard
> arguing over what constitutes RISC and which architectures are what
> as equivalent to arguing how many angels can dance on the head of
> a pin. As with the similar System V / Berkeley Unix 'debate', the
> future will have aspects of both.
>

Right, the future simply belongs to the pragmatists; whatever works,
works. We know that some features, like memory indirection and other
instructions that can generate multiple TLB faults, should be avoided
like the plague.
>
> And, on another matter, no, accessing data beyond the end of an
> object is never harmless, not even if the value is never inspected.
> The main fundamental risk is that the object is at the end of an
> authorisation domain. Page faults can be ignored, but that is
> really, really bad news for RDMA and memory-mapped I/O. And
> remember that, in C, objects can be of any size.

Here is where I sort of disagree: I think the CPU ABI should bless read
only accesses up to the minimum size protection domain, as long as the
access cannot cross such a domain!

This effectively makes it legal to process a block of data using aligned
SIMD loads, even when the last load straddles the allocated object boundary:

Please notice that this is a platform contract, not a general language
feature, and it is perfectly valid (and in many ways a great feature) to
have byte granularity on those protection domains.

What I'm hoping for is a written promise that what we currently do in
high performance code because we know that there is no way for the HW to
detect that we are in fact going past the C(++) allocation, will stay
safe in the future as long as you use CPUID or something similar to
verify what the minimum protection granularity is.

Terje Mathisen

unread,
Jan 16, 2016, 11:21:39 AM1/16/16
to
Right, the original 68000 was OK, while the 68020 added lots of
VAX-style "how many TLB misses can we generate in a single instruction?"
operations.

Anton Ertl

unread,
Jan 16, 2016, 11:45:57 AM1/16/16
to
Megol <gole...@gmail.com> writes:
>X86 (i386) and 68k are very closely related: two-address machines with
>load-execute and load-execute-store support. Normal instructions have up to
>one memory reference and a few have two (x86: MOVSx; 68k: MOVEx mem, mem).

Yes.

>Both have 8 registers that can be used as general registers (the 68k has
>an additional 8 address registers). If x86 is a register machine, the 68k
>is one too.

So you consider the D registers of the 68000 to be general-purpose,
even though one cannot use them for addressing? I don't, but of
course that is a matter of degree. One cannot use all the 386
registers for holding bytes, or as shift counts, for mixed
multiplication or division, so one might argue that the 386 registers
are not general-purpose, either. My experience, though is that I
constantly use addressing (I used more A registers than D registers on
the 68000), while I rarely use byte operations, shifts, mixed
multiplications and divisions. That's why the 386 registers have a
general-purpose feeling to me, while the 68000 D or A registers don't.

Anton Ertl

unread,
Jan 16, 2016, 12:37:27 PM1/16/16
to
n...@wheeler.UUCP (Nick Maclaren) writes:
>In article <2016Jan1...@mips.complang.tuwien.ac.at>,
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>Bruce Hoult <bruce...@gmail.com> writes:
>>>On Thursday, January 14, 2016 at 7:29:39 PM UTC+3, Quadibloc wrote:
>>>> What about System z? Now there's a survivor for you!
>>>
>>>I don't think it's proudly CISC! A good number of registers, but a very
>>constrained number of instruction sizes. Not load/store, but only one
>>memory operand (except strings?). And, if I recall correctly (and more
>>importantly), only one TLB hit per operand.
>>
>>While CISCs are not as close to each other as RISCs are to each other,
>>the S/360 is relatively close to the PDP-11, the 386, and somewhat to
>>the 68000. All except 68000 are register machines (machines that
>>mostly use general-purpose registers); all are two-address machines
>>with load-op and read-modify-write instructions. The VAX is further
>>away, with three addresses that all could reference memory (with
>>indirect addressing modes), and the 68020 moved a bit in the direction
>>of the VAX.
>
>The mind boggles. The System/370 and PDP-11 addressing models are
>wildly different, and the addressing model is one of the most
>important parts of an architecture.

It's so important that I have never heard of it. When I google for
"addressing model computer architecture", it gives me pages about
memory models and about addressing modes, so apparently Google has
never heard about "addressing model", either. Maybe you can enlighten
us about what you mean by that term.

>I didn't respond to your nonsense earlier, because your fanaticism is
>clearly making debate impossible,

Do you really have so few arguments that you need to resort to
insults?

> but I happened to notice that even
>ARM say that their architecture is only SIMILAR to RISC :-)
>
>http://www.arm.com/products/processors/instruction-set-architectures/

In particular, that page says:

|The ARM architecture is similar to a Reduced Instruction Set Computer
|(RISC) architecture, as it incorporates these typical RISC
|architecture features:
|
| A uniform register file load/store architecture, where data
| processing operates only on register contents, not directly on
| memory contents.
|
| Simple addressing modes, with all load/store addresses determined
| from register contents and instruction fields only.
|
|Enhancements to a basic RISC architecture enable ARM processors to
|achieve a good balance of high performance, small code size, low power
|consumption and small silicon area.

Not really a claim that ARM is not RISC. And given that the R in ARM
stands for RISC, the designers of the architecture certainly
considered the ARM to be a RISC.

As for enhancements, one could discuss whether Thumb is a RISC.
Interestingly, among the early research RISCs there were also 16-bit
instructions (IBM 801) or putting two instructions in one 32-bit word
(Stanford MIPS).

>And, on another matter, no, accessing data beyond the end of an
>object is never harmless, not even if the value is never inspected.
>The main fundamental risk is that the object is at the end of an
>authorisation domain.

Yes, I mentioned that.

>Page faults can be ignored, but that is
>really, really bad news for RDMA and memory-mapped I/O.

If by memory-mapped I/O you mean I/O hardware accessed with memory
access instructions, normal user applications do not have this
hardware mapped into their address space. If you mean memory-mapped
files (i.e., the mmap() system call), I don't see any real harm. I am
not familier with RDMA, but from what I read, it emulates a memory
access; might be a performance bug, but is it a correctness bug?

As for ignoring page faults, that's not what I suggest. So if such an
access hits a page fault, ok, the programmer notices the out-of-bounds
access, and can work on fixing it. But "optimizing" a bounded loop
into an endless loop and "optimizing" away everything inside the loop,
including the memory access, is not that helpful.

Quadibloc

unread,
Jan 16, 2016, 1:10:44 PM1/16/16
to
On Saturday, January 16, 2016 at 10:37:27 AM UTC-7, Anton Ertl wrote:
> n...@wheeler.UUCP (Nick Maclaren) writes:

> >The mind boggles. The System/370 and PDP-11 addressing models are
> >wildly different, and the addressing model is one of the most
> >important parts of an architecture.
>
> It's so important that I have never heard of it. When I google for
> "adressing model computer architecture", it gives me pages about
> memeory models and about addressing modes, so apparently Google has
> never heard about "addressing model", either. Maybe you can elucidate
> us about what you mean with that term.

Well, there's the _memory_ model, which is something familiar to any x86
programmer: the x86 has more than one of them.

With a PDP-11, and with a System/360, you can have a register-to-register
instruction that's 16 bits long, and a potentially indexed memory-to-register
instruction that's 32 bits long.

But on the PDP-11, the address is just a 16-bit field that references the whole
64K byte address space of the PDP-11.

On a System/360, the address space is all of 16 megabytes. So the address
consists of a four-bit field identifying a base register, and a 12-bit
displacement. One has to first load a register with the starting address of a
4K byte area in memory, and then use that register as a base register for
instructions that access that portion of memory.
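
A sketch of the two address computations in C (hypothetical names):

    /* PDP-11: the 16-bit field after the opcode IS the address. */
    unsigned pdp11_addr(unsigned short addr_field)
    {
        return addr_field;                /* full 64 KB space */
    }

    /* S/360 RX format: 24-bit address = base register + index
       register + 12-bit displacement; register 0 means "none". */
    unsigned s360_addr(const unsigned regs[16], int b, int x,
                       unsigned d12)
    {
        unsigned a = d12 & 0xFFF;
        if (b) a += regs[b];
        if (x) a += regs[x];
        return a & 0xFFFFFF;              /* wrap to 24 bits */
    }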

That definitely is a significant architectural difference between the
System/360 and the PDP-11.

John Savard


Nick Maclaren

unread,
Jan 16, 2016, 2:26:48 PM1/16/16
to
In article <n7dqbc$4j5$1...@gioia.aioe.org>,
Terje Mathisen <terje.m...@tmsw.no> wrote:
>>
>> And, on another matter, no, accessing data beyond the end of an
>> object is never harmless, not even if the value is never inspected.
>> The main fundamental risk is that the object is at the end of an
>> authorisation domain. Page faults can be ignored, but that is
>> really, really bad news for RDMA and memory-mapped I/O. And
>> remember that, in C, objects can be of any size.
>
>Here is where I sort of disagree: I think the CPU ABI should bless read
>only accesses up to the minimum size protection domain, as long as the
>access cannot cross such a domain!
>
>This effectively makes it legal to process a block of data using aligned
>SIMD loads, even when the last load straddles the allocated object boundary:
>
>Please notice that this is a platform contract, not a general language
>feature, and it is perfectly valid (and in many ways a great feature) to
>have byte granularity on those protection domains.
>
>What I'm hoping for is a written promise that what we currently do in
>high performance code because we know that there is no way for the HW to
>detect that we are in fact going past the C(++) allocation, will stay
>safe in the future as long as you use CPUID or something similar to
>verify what the minimum protection granularity is.

That has certainly got performance advantages. Being an extreme
portability person, I wouldn't bother to test the CPUID and would
simply assume that it is one. The big trouble with that is the
number of people who will assume that all the world's a System/360,
VAX, or 80386 :-(
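
As a concrete sketch of what Terje is describing (illustrative only;
assumes SSE2, GCC/Clang builtins, and a protection granularity of at
least 16 bytes), the final load below may read up to 15 bytes past the
allocation, but because every load is 16-byte aligned it can never
cross a page boundary:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    /* Length of a NUL-terminated string using 16-byte aligned loads. */
    size_t simd_strlen(const char *s)
    {
        const __m128i zero = _mm_setzero_si128();
        /* Align down to 16; bytes before s are masked off below. */
        const __m128i *blk = (const __m128i *)((uintptr_t)s & ~(uintptr_t)15);
        unsigned skip = (unsigned)((uintptr_t)s & 15);

        for (;;) {
            __m128i v = _mm_load_si128(blk);            /* aligned load */
            unsigned m = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
            m &= ~0u << skip;     /* ignore bytes before s in block 0 */
            if (m)
                return (size_t)((const char *)blk + __builtin_ctz(m) - s);
            skip = 0;
            ++blk;
        }
    }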


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Jan 16, 2016, 2:45:40 PM1/16/16
to
In article <2ed39f3a-5fb5-4cf6...@googlegroups.com>,
Quadibloc <jsa...@ecn.ab.ca> wrote:
>On Saturday, January 16, 2016 at 10:37:27 AM UTC-7, Anton Ertl wrote:
>
>> >The mind boggles. The System/370 and PDP-11 addressing models are
>> >wildly different, and the addressing model is one of the most
>> >important parts of an architecture.
>>
>> It's so important that I have never heard of it. When I google for
>> "adressing model computer architecture", it gives me pages about
>> memeory models and about addressing modes, so apparently Google has
>> never heard about "addressing model", either. Maybe you can elucidate
>> us about what you mean with that term.
>
>Well, there's the _memory_ model, which is something familiar to any x86
>programmer: the x86 has more than one of them.

I deliberately used a nonce term, because I was also including whether
the architecture has autoincrement and, to a lesser extent, memory
protection and other mapping mechanisms. Let's skip virtualisation,
on the grounds that it wasn't there originally in either case.

>That definitely is a significant architectural difference between the
>System/360 and the PDP-11.

As was the absence of autoincrement in the former - even in the
System/370, it was there only in MVCL/CLCL and BXH/BXLE. BCT can't
reasonably be called an autoincrement feature.
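
(For concreteness, an illustrative fragment: the inner statement below
maps onto a single PDP-11 instruction, MOVB (R0)+,(R1)+, which loads,
stores, and bumps both pointers at once, whereas on the System/360 the
pointer updates take separate instructions.)

    /* Byte copy; *dst++ = *src++ is autoincrement in C clothing. */
    void copy_bytes(char *dst, const char *src, unsigned n)
    {
        while (n--)
            *dst++ = *src++;
    }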

I can't be bothered to chase up the details of the memory protection
mechanisms at comparable times, but they were wildly different, too.

Oh, and the PDP-11 had memory-mapped devices (which were usable
directly from applications), but the System/360/etc. used an entirely
different channel-based mechanism.


Regards,
Nick Maclaren.

Ivan Godard

unread,
Jan 16, 2016, 3:18:21 PM1/16/16
to
On 1/16/2016 8:15 AM, Terje Mathisen wrote:
> Nick Maclaren wrote:

>> And, on another matter, no, accessing data beyond the end of an
>> object is never harmless, not even if the value is never inspected.
>> The main fundamental risk is that the object is at the end of an
>> authorisation domain. Page faults can be ignored, but that is
>> really, really bad news for RDMA and memory-mapped I/O. And
>> remember that, in C, objects can be of any size.
>
> Here is where I sort of disagree: I think the CPU ABI should bless read
> only accesses up to the minimum size protection domain, as long as the
> access cannot cross such a domain!
>
> This effectively makes it legal to process a block of data using aligned
> SIMD loads, even when the last load straddles the allocated object
> boundary:
>
> Please notice that this is a platform contract, not a general language
> feature, and it is perfectly valid (and in many ways a great feature) to
> have byte granularity on those protection domains.
>
> What I'm hoping for is a written promise that what we currently do in
> high performance code because we know that there is no way for the HW to
> detect that we are in fact going past the C(++) allocation, will stay
> safe in the future as long as you use CPUID or something similar to
> verify what the minimum protection granularity is.

You (and the field) are conflating "address" with "access". For
performance and simple code you want your SIMD load to return something
that SIMD operations like parallel add can deal with. However, you
don't actually care about the data past the end that you happened to
have *addressed*; you do not need to *access* it.

So any system that lets you *address* over the end but fills the part
you don't need to *access* with dummies will do. +1 if the system tells
you loudly if you do in fact *access* past the end.
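
Masked SIMD loads are one hardware shape of exactly that: the whole
vector is *addressed*, but masked-off lanes are never *accessed* --
they come back as dummy zeros and do not fault even if they fall in
unmapped memory. An illustrative sketch, assuming AVX2 and 0 <= n <= 8:

    #include <immintrin.h>   /* AVX/AVX2 intrinsics */

    /* Sum n floats with a single masked load; lanes >= n are
       addressed by the load but never accessed. */
    float sum_tail(const float *p, int n)
    {
        __m256i idx  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), idx);
        __m256  v    = _mm256_maskload_ps(p, mask);  /* off lanes read as 0 */

        /* Horizontal sum of the 8 lanes. */
        __m128 lo = _mm256_castps256_ps128(v);
        __m128 hi = _mm256_extractf128_ps(v, 1);
        __m128 s  = _mm_add_ps(lo, hi);
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));
        s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
        return _mm_cvtss_f32(s);
    }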

Languages and compilers that also conflate address with access are a
problem, especially when they turn the result into nasal demons.

Rob Warnock

unread,
Jan 17, 2016, 3:30:05 AM1/17/16
to
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
+---------------
| n...@wheeler.UUCP (Nick Maclaren) writes:
| >Page faults can be ignored, but that is
| >really, really bad news for RDMA and memory-mapped I/O.
|
| If by memory-mapped I/O you mean I/O hardware accessed with memory
| access instructions, normal user applications do not have this
| hardware mapped into their address space.
+---------------

Actually, there are quite a number of historical examples --
including commercial systems -- of "normal user applications"
having "hardware mapped into their address space". A few
examples I'm personally familiar with:

- The original SGI GL graphics pipeline hardware was mapped
into low address space of the user program which was
currently drawing [via the Geometry Engine] into the
frame buffer. The "Irix" operating system did lazy
switching of that permission between multiple processes
doing graphics at the same time. That is, when another
process ran which didn't touch the magic graphic virtual
address, it remained mapped in the previous process's
address space. But if the switched-to process tried to
access the graphics pipeline, it got a non-fatal page fault,
the O/S flushed the GL pipeline, unmapped the pipeline
from the previous process, mapped it into the process
that just got the fault, and restarted the faulting
process, which now accessed the pipeline without error.

It provided *very*-low-latency access from graphics
programs to the hardware for the many fairly-small
transactions that early GL used, e.g., "<start polygon>
<vertex> <vertex>...<vertex> <end polygon> <fill color>
<fill>", that sort of stuff. Those were stuffed into
the mmap-ed pipeline with ordinary STORE instructions.

[Later versions played games with packing multiple STOREs
into the write buffers of the later MIPS CPUs to lower
the number of hardware bus cycles needed to push a given
set of GL transactions.]

- SGI's "GSN" [Gigabyte System Network, an implementation
of ANSI HIPPI-6400] used the same trick to accelerate
small RPC/MPI/STP/etc. transactions, with some modifications,
the main one being that the hardware was provided with
multiple sets of identical registers in its physical
address space, one set per page. This allowed multiple
user processes to be mapped to the hardware simultaneously,
each with its own private set of hardware registers in
an mmap-ed page. [The O/S still did the same sort of lazy
juggling of the mappings, IIRC, but much less frequently.]
Which hardware page got accessed implicitly told the hardware
which user process was doing the access, so the right
per-process data -- network connections, permissions,
data buffer mappings, etc. -- could be used.

- A number of X server implementations have done mmap-ing of
frame buffers and/or control registers into the user address
space of the X server, Of course, you may well not consider
an X server to be a "normal user application"... ;-}

Anyway, it is quite possible for non-root user programs to be
given direct access to hardware via mmap-ing, and it can even
be made (relatively) "safe" if one pays some careful attention
to the design of the hardware and the O/S interfaces to it.
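
A minimal user-space sketch of the technique, assuming a Linux-style
mmap interface; the device node /dev/gfx0 and the register offsets are
hypothetical:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define DEV_PATH "/dev/gfx0"   /* hypothetical device node */
    #define REG_SIZE 4096u
    #define REG_CMD  0x00          /* byte offsets into the register page */
    #define REG_DATA 0x04

    int main(void)
    {
        int fd = open(DEV_PATH, O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        /* Map one page of device registers into our address space. */
        volatile uint32_t *regs = mmap(NULL, REG_SIZE,
                                       PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }

        /* Ordinary stores now reach the hardware directly -- no
           syscall per transaction, which is where the low latency
           comes from. */
        regs[REG_DATA / 4] = 0xdeadbeef;   /* operand */
        regs[REG_CMD  / 4] = 1;            /* kick off the operation */

        munmap((void *)regs, REG_SIZE);
        close(fd);
        return 0;
    }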

+---------------
| I am not familiar with RDMA, but from what I read, it emulates
| a memory access...
+---------------

Not necessarily. RDMA simply means "Remote DMA": an action at the
requestor end of the link causes DMA access to the responder end of
the link... *without* any software intervention
at the responder end. [Obviously, for security there must have
been an initial exchange of authentication, etc. to set up such
an association, but once that's done, remote DMA reads/writes
can proceed without either local or remote systems software
action.]

What kind of local action triggers the remote DMA can vary
based on type of system, O/S, type of link, type of NIC, etc.
It *can* emulate a memory access on the local end, and in fact
a lot of the initial hype for Infiniband promoted that mode,
but these days it's more on a transaction/block/page/buffer-
at-a-time basis. Think of treating some portion of remote user
address space as a "disk" or a "frame buffer", then apply the
tricks mentioned above to lower latency.

For example, the GSN link mentioned above supported RDMA for
STP [Scheduled Transfer Protocol] connections between hosts.
Once the connection was open, a local user process could
directly poke an RDMA transaction request into the mmap-ed
GSN hardware control registers, and the GSN device would
do a local DMA Read, wrap the data in STP, ship it to the
remote host, unwrap the STP protocol, and (R)DMA Write it
into the remote host's user program buffer that was previously
authenticated for it to write into. [And similarly for
remote Reads.] Lots of careful protocol & O/S work is needed
to make this all both efficient *and* safe, but it can be done...
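
For a modern rendering of the same idea, an illustrative fragment
using the Linux libibverbs API; queue-pair connection, memory
registration, and the authentication exchange are assumed to have
been done already, per the setup described above:

    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Post a one-sided RDMA Write: copy `len` bytes from a locally
       registered buffer into the remote peer's memory, with no
       software action on the remote host.  `raddr`/`rkey` come from
       the initial setup exchange. */
    static int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                          void *buf, size_t len,
                          uint64_t raddr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_WRITE,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr *bad = NULL;

        wr.wr.rdma.remote_addr = raddr;  /* remote virtual address */
        wr.wr.rdma.rkey        = rkey;   /* remote access key */

        return ibv_post_send(qp, &wr, &bad);  /* 0 on success */
    }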


-Rob

-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <http://rpw3.org/>
San Mateo, CA 94403
