Ivan Godard

unread,

Mar 11, 2019, 10:21:49 AM3/11/19

to

Yep, hot swap. Also seen in embedded work. Not hard; you just need to
use transfer vectors and have a way to recognize one. Burroughs uses tag
7 in the metadata attached to every word, as opposed to the tag 3 of
code. Mill uses a Portal bit in the protection tables. You can use
dedicated address ranges, or other ways depending on the architecture.

Terje Mathisen

unread,

Mar 11, 2019, 10:48:58 AM3/11/19

to

This is obviously using a system very like the one I advocate for JIT,
i.e. loading a complete new replacement and a (single preferably) atomic
update of the function pointer(s).

>
> Perhaps we have learned a lot less in 40 years than what we think we did.

Maybe many of us like to redo the same "interesting" mistakes?

"But it is different this time!"

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

anti...@math.uni.wroc.pl

unread,

Mar 11, 2019, 11:29:20 AM3/11/19

to

We have learned that such dynamic swap may have problems. In
particular, problem raised by Anton is incompatibility of libraries.
Dynamic swap means that problem will be much more frequent.
Which may put more pressure on library maintainers to preserve
compatibility. Or it may convince users that dynamic swap
is too risky...

--
Waldek Hebisch

Anton Ertl

unread,

Mar 11, 2019, 12:09:34 PM3/11/19

to

anti...@math.uni.wroc.pl writes:

>MitchAlsup <Mitch...@aol.com> wrote:
>> The Boroughs Systems could install new dynamically linked libraries on
>> the fly. Any call currently going on would continue using the old code,
>> any new calls made would goto the new code.
>>
>> Perhaps we have learned a lot less in 40 years than what we think we did.
>
>We have learned that such dynamic swap may have problems. In
>particular, problem raised by Anton is incompatibility of libraries.

Actually, one problem is that, despite the memcpy in old glibc being
more capable than the memcpy in the new glibc, the dynamic linker does
not link it, but prefers to report an error; this is caused by the
fear of incompatibility of libraries.

>Dynamic swap means that problem will be much more frequent.
>Which may put more pressure on library maintainers to preserve
>compatibility.

Which would be a good thing.

But it seems that the glibc maintainers thought that they could get
away with breaking compatibility by having a new symbol version for
the incompatible variant. Unfortunately, this has lead to the dynamic
linking problem mentioned above.

In any case, it's cool that the Borroughs people could already do
this, but the problem I discussed is something else. And most of the
time, the people involved get it right, otherwise we would not just
have 4 symver lines (memcpy, pow, exp, log), but hundreds. So it
seems that we know how to do it right, most of the time; but not
always; or not all of us.

BGB

unread,

Mar 11, 2019, 2:05:46 PM3/11/19

to

I find it kinda sad that Linux still fails to manage something that
people tend to be able to take for granted on Windows:
The main APIs / ABIs are pretty much frozen, so (within reason, *) one
can compile code on one version of the OS and have it work on another,
with both the older and newer versions.

So, we have a window of ~ 25 years where binaries still typically work,
vs Linux being "maybe a few years, and only on a single distro".

Granted, the main offender here tends to be glibc and friends, so
potentially things would be better if either glibc would freeze its ABI,
or compilers would default to statically linking the C library.

*: Observing Win16 vs Win32 vs Win64, and targeting older Windows may
potentially require using an older version of MSVC (eg: VS2015 seems to
only be able to target Win7 and newer; VS2008 can target WinXP and
newer; ...).

Then again, not like I am one to talk recently:
My (not yet frozen) ISA design changes enough to where binaries wont
generally work after more than a few months.

Though, I expect I may eventually reach a point where I could start
"freezing" parts of the design, at which point "breaking changes" would
no longer be allowed.

Though it doesn't help that, partly because of its incremental
development, this design still has a certain amount of hair (some amount
unavoidable short of a fairly significant redesign; such as the whole F0
block layout doesn't match the F1 and F2 blocks; Or the design needing 3
read ports for GPRs rather than many other RISC's using 2 ports, *2; ...).

*2: Changing this would likely require dropping support for
"(Reg+Reg*Sc)" addressing and similar though, which would be kinda sad.
I guess the point of debate is whether it is "worth it", or better to
save some cost and be happy with "(Reg+Disp)" as the only available
addressing mode and similar (like RISC-V). OTOH, ARM gets fancier here,
and I am pretty sure ARM would also need 3 read ports (absent the use of
microcode or similar; my designs thus far do not use microcode).

In other news:
Though I still don't have a good use-case, I have decided to try doing a
"reboot" effort of my CPU core project, and see if (maybe with luck) I
can get something that passes timing easier and doesn't eat FPGA
resources so badly.

Things like GPRs, ALU, ... will now work considerably different from
before (it is pretty close to a full redesign).

For example, here is an (still untested) new design for an ALU module:
https://pastebin.com/NWXawwwd

But, I have little idea yet if this will be able to pass timing and similar.

Josh Vanderhoof

unread,

Mar 11, 2019, 6:39:58 PM3/11/19

to

Actually, I haven't had many problems running old binaries. I believe
the scenario you describe would actually work the same way on most Linux
systems.

The problem is when you compile on your Fedora 29 system you're making a
binary for a Fedora 29 system that will not run on anything older.
They'll say they guarantee backwards compatibility but not forwards
compatibility. Can't say I like it but I have to deal with the hand
I've been dealt and use the cross compiler.

Josh Vanderhoof

unread,

Mar 11, 2019, 6:53:26 PM3/11/19

to

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> Apart from the symbol versioning stuff (memcpy and a few other
> symbols), I have had trouble with different versions of libltdl: Old
> systems have only libltdl.so.3, new systems only libltdl.so.7. For my
> purposes producing a link from libltdl.so.7 to libltdl.so.3 seems to
> be enough for our usage, but installing a current libtool on the old
> system from source also solved the problem. Static linking of libltdl
> would probably do it, too, and would eliminate the need for the users
> to do something about libtool.so. If libltdl.so.3 was available on
> the newer system (as it should), the way to go would be to link such
> that it uses libltdl.so.3 (but I have to find out how to do that).
>
> Another problem was that starting a binary built on Debian 9 produced
> a floating-point exception on Debian 4. Fortunately, thanks to
> nneonneo's answer to
> <https://stackoverflow.com/questions/12570374/floating-point-exception-sigfpe-on-int-main-return0>,
> this could be fixed by compiling with
>
> -Wl,--hash-style=both
>
> So, there are ways to build the binaries such that they work on older
> systems. It's just that they are neither default, nor at least easy
> to reach. But they should.

I definitely remember using the hash style option before going with the
cross compiler. It just got to the point where there were so many
things to chase that the cross compiler was easier. I was always having
problems before the cross compiler and no trouble at all since. This is
for an OpenGL game so it might not be typical compared to other stuff.

Bruce Hoult

unread,

Mar 11, 2019, 7:34:08 PM3/11/19

to

On Monday, March 11, 2019 at 11:05:46 AM UTC-7, BGB wrote:
> I find it kinda sad that Linux still fails to manage something that
> people tend to be able to take for granted on Windows:
> The main APIs / ABIs are pretty much frozen, so (within reason, *) one
> can compile code on one version of the OS and have it work on another,
> with both the older and newer versions.
>
> So, we have a window of ~ 25 years where binaries still typically work,
> vs Linux being "maybe a few years, and only on a single distro".

Back when I was actively competing in the ICFP programming contest teams I led won prizes in the 2001 (2nd), 2003 (Judges' Prize), and 2005 (2nd *and* Judges' Prize) contests. The contest organisers (different each year) had some flavour of Linux machine and installed packages for compilers or interpreters for popular languages (up to and including Haskell) and teams submitted source code along with a driver script to build (if necessary) and run their source.

The language my team used (Dylan) was not available in packaged form, so for our submissions we included both source code and a compiled binary. I don't remember what the judges used but in 2001 I think I was using Redhat, 2003 Slackware, and 2005 Kubuntu. I just added -static to the link and we never had any problems.

> Though it doesn't help that, partly because of its incremental
> development, this design still has a certain amount of hair (some amount
> unavoidable short of a fairly significant redesign; such as the whole F0
> block layout doesn't match the F1 and F2 blocks; Or the design needing 3
> read ports for GPRs rather than many other RISC's using 2 ports, *2; ...).
>
> *2: Changing this would likely require dropping support for
> "(Reg+Reg*Sc)" addressing and similar though, which would be kinda sad.
> I guess the point of debate is whether it is "worth it", or better to
> save some cost and be happy with "(Reg+Disp)" as the only available
> addressing mode and similar (like RISC-V). OTOH, ARM gets fancier here,
> and I am pretty sure ARM would also need 3 read ports (absent the use of
> microcode or similar; my designs thus far do not use microcode).

Having three read ports certainly has a benefit and can enable you to need fewer instructions per program. The problem is there is at least a minor detrimental effect on area, cost, energy usage, and possibly frequency. If a sufficiently high proportion of instructions in the program take advantage of the 3rd port then it can be worth it -- certainly enough to offset any frequency effect if you only care about ultimate speed and not so much area and energy.

FP programs usually have a high proportion of FMA, so 3 ports are almost certainly worth it there.

Once you have an FPU your core is already big enough and hungry enough that the percentage cost of having three read ports on the integer register file too is probably pretty minor (assuming split register files).

Quadibloc

unread,

Mar 11, 2019, 11:17:58 PM3/11/19

to

On Monday, March 11, 2019 at 9:29:20 AM UTC-6, anti...@math.uni.wroc.pl wrote:

> MitchAlsup <Mitch...@aol.com> wrote:

> > The Boroughs Systems could install new dynamically linked libraries on
> > the fly. Any call currently going on would continue using the old code,
> > any new calls made would goto the new code.

> > Perhaps we have learned a lot less in 40 years than what we think we did.

> We have learned that such dynamic swap may have problems. In
> particular, problem raised by Anton is incompatibility of libraries.
> Dynamic swap means that problem will be much more frequent.
> Which may put more pressure on library maintainers to preserve
> compatibility. Or it may convince users that dynamic swap
> is too risky...

Another thing is that the Burroughs machines of the 1960s could do a *lot* of
things current hardware can't do. That's because they ran at a higher level
than conventional computers. Data was tagged with its type, arrays had
descriptors, all in the hardware.

This had overhead. The microprocessors in the computer on your desktop are,
architecturally, a lot like a 360/195. (Admittedly, the closest match is the
Pentium II; today's machines have OoO integer pipelines too.)

We've decided that despite a contemporary microprocessor running at 2 GHz,
instead of the 16 MHz of the 360/195, and serving only one user, rather than an
entire organization running cutting-edge number-crunching programs... we don't
have enough extra processing power to waste on features like the Burroughs
machines offered.

If anything ever makes us change our minds about this, it might be the
intractability of vulnerabilities like Spectre, Meltdown, and Spoiler. Perhaps
the microprocessor of the future will normally look like a Burroughs machine in
some ways, and in others like the infamous Intel 432, and run an operating
system like the Qubes OS.

But I suspect there will also be a way to lock out access to certain resources
with external hardware, and then switch the chip into a less-secure full-speed
mode for trusted applications, so that when the computer is not braving the
wilds of the Internet, it can get real work done. *Or* to make hardware design
simpler, there just might be two different kinds of chips, the insecure ones
going into supercomputers and game consoles, and the secure but slow ones going
into personal computers.

John Savard

Stephen Fuld

unread,

Mar 12, 2019, 1:02:57 AM3/12/19

to

On 3/11/2019 7:01 AM, MitchAlsup wrote:

Yes, but, I think I remember someone telling me that Burroughs enforced
a scheme of relatively recent recompilations. It did this by putting
the release level into the executable, and the OS not being willing to
execute any program compiled more than one major level back from its
(the OS's) current level. Thus, they essentially forced a
recompilatiopn every two major levels and "eliminated" the problem of
old executables.

I am not sure about this, and I hope someone with more knowledge of
these systems can shed more light on the issue.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)

BGB

unread,

Mar 12, 2019, 1:58:38 AM3/12/19

to

On 3/11/2019 6:34 PM, Bruce Hoult wrote:
> On Monday, March 11, 2019 at 11:05:46 AM UTC-7, BGB wrote:
>> I find it kinda sad that Linux still fails to manage something that
>> people tend to be able to take for granted on Windows:
>> The main APIs / ABIs are pretty much frozen, so (within reason, *) one
>> can compile code on one version of the OS and have it work on another,
>> with both the older and newer versions.
>>
>> So, we have a window of ~ 25 years where binaries still typically work,
>> vs Linux being "maybe a few years, and only on a single distro".
>
> Back when I was actively competing in the ICFP programming contest teams I led won prizes in the 2001 (2nd), 2003 (Judges' Prize), and 2005 (2nd *and* Judges' Prize) contests. The contest organisers (different each year) had some flavour of Linux machine and installed packages for compilers or interpreters for popular languages (up to and including Haskell) and teams submitted source code along with a driver script to build (if necessary) and run their source.
>
> The language my team used (Dylan) was not available in packaged form, so for our submissions we included both source code and a compiled binary. I don't remember what the judges used but in 2001 I think I was using Redhat, 2003 Slackware, and 2005 Kubuntu. I just added -static to the link and we never had any problems.
>

OK.

I guess this can work, given that at least the kernel interface doesn't
really change all that much.

>
>> Though it doesn't help that, partly because of its incremental
>> development, this design still has a certain amount of hair (some amount
>> unavoidable short of a fairly significant redesign; such as the whole F0
>> block layout doesn't match the F1 and F2 blocks; Or the design needing 3
>> read ports for GPRs rather than many other RISC's using 2 ports, *2; ...).
>>
>> *2: Changing this would likely require dropping support for
>> "(Reg+Reg*Sc)" addressing and similar though, which would be kinda sad.
>> I guess the point of debate is whether it is "worth it", or better to
>> save some cost and be happy with "(Reg+Disp)" as the only available
>> addressing mode and similar (like RISC-V). OTOH, ARM gets fancier here,
>> and I am pretty sure ARM would also need 3 read ports (absent the use of
>> microcode or similar; my designs thus far do not use microcode).
>
> Having three read ports certainly has a benefit and can enable you to need fewer instructions per program. The problem is there is at least a minor detrimental effect on area, cost, energy usage, and possibly frequency. If a sufficiently high proportion of instructions in the program take advantage of the 3rd port then it can be worth it -- certainly enough to offset any frequency effect if you only care about ultimate speed and not so much area and energy.
>

Yeah, the main use-case for it is mostly scaled-index memory stores and
similar, which need 3 registers.

Scaled-index load is less of an issue, as it could be made to work with
2 ports, but loads scaled-index load operations without the
corresponding store operations would break symmetry.

Doing it with 2 ports, with minimal ISA level changes (eg, still having
LEA), would likely require transforming it into:
LEA.x (Rn, Ri), R0 //LEA only needs 2 ports
MOV.x Rm, (R0) //Likewise, can use 2 ports

Internally, most other memory ops also use 3 registers, as the address
calculation currently does pretty much all memory accesses as
(Rm+Ri*Sc), with the 3rd port being used for the value to be stored.
Displacement cases currently handled by an 'IMM' register.

Some info I have found is conflicting as to whether ARM is 2R+1W or 3R+1W.

I would presume instructions like:
STR R4, [R1, R2, LSL #2]
To also require 3 read ports.

Addressing modes in ARM get a lot fancier than those I have in BJX2 though.

FWIW, the number of unique addressing modes which exist in BJX2 is
actually fewer than existed in SuperH. Granted, more modes exist than
exist in MIPS or RISC-V.

Well, and also still have some amount of encoding space left in 16-bit
land. Ex:
0znm: Load/Store ops
1znm: ALU ops
2zdd: BRA/BSR and friends, misc
3znz: Various 1R and 0R ops (eg: "POP Rn" and "RTS")
4znm: MOV to/from CRs, and LEA
5znm: Load ops (zero-extending), more ALU ops
6zzz: Still Unused (*1)
7zzz: Still Unused
8dnm: Load/Store, "MOV.L (Reg, Disp3)"
9zzz: FPU Ops (*2)
Ajjj: LDIZ #imm12, R0
Bjjj: LDIN #imm12, R0
Cnjj: ADD #imm8s, Rn
Dzzz: Predicated / Conditional Execution (32/48-bit Ops, *3)
Enii: MOV #imm8, Rn (*3)
Fzzz: 32/48-bit Ops

*1: I did the 16-bit ISA, realized I wasn't out of encoding space yet...
(Unlike SH, where pretty much no encoding space remained).

*2: Unlike SH, I was able to avoid mode-changing in the case of FPU ops.

*3: While aesthetically, it would probably have been better to swap
these, doing so would effectively break all my existing binaries, and
4'b11z1 and 4'b111z are "close enough".

Dzzz had still been unused (until a few days ago), and then I figured I
would use it for predicated ops. Unlike ARM, only True and False
condition-codes exist (and some instructions lack predicated
equivalents, ...).

I had briefly tried using Dzzz for a few other uses (such as "BRA
disp12" or "MOV #imm8s, Rk" and similar), but generally these didn't
save enough space to be worthwhile (so were soon dropped).

So, I wont complain too much about it eventually ending up being used
for predicated ops, which at least add something meaningful to the ISA.

Some number of instructions were dropped relative to SH, and some
encodings use fewer bits. For example, SH's 1nmd and 5nmd blocks were
merged into 8dnm, with 3 rather than 4 bits for the displacement.

Technically, fixed-length subsets are possible with both 16 and 32-bit
instruction forms, though I decided to make the subset with 32-bit
instructions canonical as it gets better performance and better matches
with the C ABI.

> FP programs usually have a high proportion of FMA, so 3 ports are almost certainly worth it there.
>
> Once you have an FPU your core is already big enough and hungry enough that the percentage cost of having three read ports on the integer register file too is probably pretty minor (assuming split register files).
>

Probably true...

I am using an FPU design which has two read ports and doesn't do FMA
(mostly to try to keep costs down).

I suspect it is pretty close to the minimum it can be and "still be
useful for something" (lots of creative corner-cutting in this one).

The goal of the current FPU is to try to be reasonably cheap, as most of
my current use-cases are dominated by integer math (but, programs like
Quake and similar use the FPU enough that FP emulation hurts performance
pretty badly).

Also it has fewer FPRs than GPRs (16 vs 32), but IME, GPR register
pressure tends to be somewhat higher than FPR register pressure.

Bruce Hoult

unread,

Mar 12, 2019, 2:38:49 AM3/12/19

to

On Monday, March 11, 2019 at 10:58:38 PM UTC-7, BGB wrote:
> Also it has fewer FPRs than GPRs (16 vs 32), but IME, GPR register
> pressure tends to be somewhat higher than FPR register pressure.

Ohhh .. I think the opposite. You can almost always get away ok with 16 GPRs, it's FP where you quickly need 32 FP registers (or vector registers) once you start to unroll things to get the FLOPs up.

Anton Ertl

unread,

Mar 12, 2019, 3:06:17 AM3/12/19

to

Bruce Hoult <bruce...@gmail.com> writes:
>I just added -static to the link and we nev=
>er had any problems.

Not a solution for building on a new system and running on an old one:

Build and run on Debian 9:

[~/tmp:49657] gcc -static hello.c
[~/tmp:49658] a.out
hello, world

Attempt to run on Debian 4:

[~/tmp:49595] a.out
FATAL: kernel too old
Segmentation fault

I think that glibc is the culprit here. They apparently don't
maintain fallback paths containing code that does not use the new
kernel features.

Anton Ertl

unread,

Mar 12, 2019, 4:27:12 AM3/12/19

to

BGB <cr8...@hotmail.com> writes:
>I find it kinda sad that Linux still fails to manage something that
>people tend to be able to take for granted on Windows:
>The main APIs / ABIs are pretty much frozen, so (within reason, *) one
>can compile code on one version of the OS and have it work on another,
>with both the older and newer versions.
>
>So, we have a window of ~ 25 years where binaries still typically work,
>vs Linux being "maybe a few years, and only on a single distro".

Running old binaries on new systems works, to some extent:

On a Debian 8 AMD64 system, here are some IA-32 binaries in /usr/local/bin:

-rwxr-xr-x 1 root root 32768 Apr 17 1997 gforth-0.3.0*
-rwxr-xr-x 1 root root 8196 May 9 1997 dspserver*
--ws--s--x 1 root root 98308 May 10 1997 ldefx*
-rwxr-xr-x 1 root root 740372 May 8 1998 xv*
-rws--x--x 1 root root 676482 Jun 23 1998 ssh1*
-rwxr-xr-x 1 root root 9938774 Jun 23 1998 Mosaic-linux-static-2.7b5*
-rwxr-xr-x 1 root root 100156 Jul 3 1998 ghostview-1.5*
-rwxr-xr-x 1 root root 33125 Jul 7 1998 lcc*
-rwxr-xr-x 1 root root 39740 Dec 28 1998 gforth-0.4.0*

[~:105479] gforth-0.3.0
Segmentation fault
[~:105480] dspserver
Segmentation fault
[~:105481] ldefx
ldefx: can't load dynamic linker '/lib/ld.so nor /usr/i486-linux/lib/ld.so'
[~:105482] xv
xv: error while loading shared libraries: libX11.so.6: cannot open shared object file: No such file or directory
[~:105484] ssh1 localhost
Secure connection to localhost refused; reverting to insecure method.
Using rsh. WARNING: Connection will not be encrypted.
ssh: connect to host localhost port 22: Connection refused
[~:105485] Mosaic-linux-static-2.7b5
Warning: translation table syntax error: Unknown keysym name: osfActivate
Warning: ... found while parsing ':<Key>osfActivate: ManagerParentActivate()'
...

but it displayed my hotlist.

[~:105486] ghostview-1.5
ghostview-1.5: error while loading shared libraries: libXaw.so.6: cannot open shared object file: No such file or directory
[~:105487] lcc
lcc [ option | file ]...
except for -l, options are processed left-to-right before files
unrecognized options are taken to be linker options
... more help text ...
[~:105488] gforth-0.4.0
GForth 0.4.0, Copyright (C) 1998 Free Software Foundation, Inc.
GForth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit

Those that just segfault, and ldefx are QMAGIC (a.out) binaries, the
others are ELF. And when I run the QMAGIC binaries as superuser, thy
produce the same result as ldefx (which is suid). Maybe I can get
them to work by making ld.so available.

For those where a library is missing, I have previously usually
succeeded in reactivating them by making the library available in a
place where the dynamic linker looks for it.

I have read that Windows programs solve the library problem by always
supplying the DLLs with the program, which is equivalent to static
linking.

I have also had a case where a program that worked nicely on WXP and
W7 just did not work at all on W8.1, for no obvious reason. And given
that I use very few Windows programs, that's not such a great record
for Windows, either.

>Granted, the main offender here tends to be glibc and friends, so
>potentially things would be better if either glibc would freeze its ABI,
>or compilers would default to statically linking the C library.

glibc does ok for running old binaries:

[~:105491] ldd /usr/local/bin/gforth-0.4.0
linux-gate.so.1 (0xf779a000)
libdl.so.2 => /lib32/libdl.so.2 (0xf7778000)
libm.so.6 => /lib32/libm.so.6 (0xf7732000)
libc.so.6 => /lib32/libc.so.6 (0xf7585000)
/lib/ld-linux.so.2 (0x5655f000)
[~:105492] ls -l /lib32/libc.so.6
lrwxrwxrwx 1 root root 12 Jun 18 2017 /lib32/libc.so.6 -> libc-2.19.so*

So a binary from 1998 uses a glibc released in 2014.

It would be great if the other libraries were as compatible (in one
direction) as glibc. And of course, it would be great if glibc was as
compatible in the other direction, too.

As to why: Windows thrives on independently distributed binaries,
while for Linux the dominant model is software provided in source
code, that is then compiled to binaries by distributors. So if a
library makes a binary-incompatible change (say, changing a structure
layout), the distributors produce a new .so version for the library,
recompile the packages that use the library, and, for them, all is
well. Independently distributed binaries ideally don't exist, and at
least to some extent, the distributors behave as if they did not. And
give security reasons for behaving that way.

Concerning old systems, the Linux distributors as well as Microsoft
wish that they did not exist, and produce incentives for switching to
new systems. Apparently in the Linux world these work so well that
Linux distributors can get away with making it hard to build binaries
that work on old systems.

For Windows, they don't work so well; people are still using WXP years
after its support has ended; companies are paying extra to extend the
support for W7. I wonder what makes it so hard for Microsoft to get
users to switch to new systems. Maybe the binary compatibility is not
so great on Windows, either, and unlike the Linux world, independently
distributed binaries are a much bigger factor in the Windows world.

BGB

unread,

Mar 12, 2019, 4:28:45 AM3/12/19

to

This could depend a bit on workload.

Most of the code I am dealing with is primarily integer code, with some
amount of fixed-point and floating-point thrown in. Most of the "in the
wild" code I have come across, has also been primarily integer code.

Decided to leave out a more detailed description, but in my CNC lathe
project (where stuff runs in an emulator on top of an ARM device), most
of the math is fixed-point with a certain amount of double-precision
thrown in.

For the most part, things are Q12.20 fixed-point, representing an
epsilon of ~ 1 millionth of an inch (with coordinates typically
expressed with a precision of 0.0001", or "tenths"). In G-Code, values
might be expressed as either 'X1.2345' (inch) or 'X12345' (tenths),
which are then converted into fixed-point coords for subsequent work.

Float isn't accurate enough for the use-case, and double is generally
used sparingly (even if, granted, FPU performance on the underlying
ARM11 seems to be pretty solid, vs say, the relative lack of FP
performance an ARM9).

Actually, one subtle feature of running some things in an emulator is
that it can also allow some of the same binaries to be run on both the
ARM device and on my main Windows PC. Some parts of the emulator are
also reused, if for slightly different purposes (such as the color-cell
display buffer being re-purposed as a way to stream a UI over UDP; with
delta messages from the console being used to redraw the UI over a LAN,
...).

As for the ISA design:
I am not currently going to do FPU SIMD, as this would require a
resource budget beyond what is realistic on something like an Arty...

Anton Ertl

unread,

Mar 12, 2019, 4:53:13 AM3/12/19

to

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>On a Debian 8 AMD64 system, here are some IA-32 binaries in /usr/local/bin:
>
>-rwxr-xr-x 1 root root 32768 Apr 17 1997 gforth-0.3.0*

...

>[~:105479] gforth-0.3.0
>Segmentation fault

...

>Those that just segfault, and ldefx are QMAGIC (a.out) binaries, the
>others are ELF. And when I run the QMAGIC binaries as superuser, thy
>produce the same result as ldefx (which is suid). Maybe I can get
>them to work by making ld.so available.

Yes, that worked, at least when running the binary as root:

# gforth-0.3.0
GForth 0.3.0, Copyright (C) 1994-1996 Free Software Foundation, Inc.

GForth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit

# prtgif

Tgif File Name to Print> ^C
# ls -l prtgif
-rwxr-xr-x 1 root root 8192 Sep 4 1995 prtgif

For some other QMAGIC binaries, libraries are missing.

Even older ZMAGIC binaries segfault.

The QMAGIC binaries don't work for non-root:

[~:105467] gforth-0.3.0
Segmentation fault
[~:105468] strace /usr/local/bin/gforth-0.3.0
execve("/usr/local/bin/gforth-0.3.0", ["/usr/local/bin/gforth-0.3.0"], [/* 41 vars */]) = -1 EPERM (Operation not permitted)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
+++ killed by SIGSEGV +++
Segmentation fault

I wonder why execve-ing a QMAGIC binary is not permitted for non-root.

already...@yahoo.com

unread,

Mar 12, 2019, 5:36:07 AM3/12/19

to

Agreed.
Except that on OoO machine with good support for immediate operands in the instruction set 16 GPRs are already in the area of diminishing return.

It seems to me that the main factors of why one need more FPRs than GPRs are less related to difference in sort of work and algorithms that one does with FP vs Integer and more related to two technicalities:
1. On typical implementations common FP ops have much higher latency*throughput product than common integer ops.
2. Poor (often non-existing) support for immediate FP operands.

For vector registers - it depends.
Today's fashion is to have vectors relatively short and evenly matched by vector execution units. In such scenario VRs are very much like FPRs and 32 VRs have non-trivial advantage over 16.
But it's not the only way to do vectors. For old-style vectors, where VRs are wider than execution units (in the past - many times wider, today I would not recommend that, today the correct ratio is either 4 or 8) 16 VRs are sufficient. But 8 VRs are probably insufficient even for old style.

Anton Ertl

unread,

Mar 12, 2019, 9:48:06 AM3/12/19

to

already...@yahoo.com writes:
>Except that on OoO machine with good support for immediate operands in the =

>instruction set 16 GPRs are already in the area of diminishing return.

In Gforth I see:

sieve bubble matrix fib fft
0.240 0.272 0.140 0.356 0.108 1 reg stack cache 1800MHz Cortex-A72
0.204 0.232 0.104 0.212 0.104 3 reg stack cache 1800MHz Cortex-A72
0.448 0.528 0.320 0.580 0.288 1 reg stack cache 1400MHz Cortex-A53
0.376 0.424 0.256 0.564 0.296 3 reg stack cache 1400MHz Cortex-A53

I.e., a speedup from 0.97-1.68 from using 3-register stack caching
over one. The relevance to the question at hand: On none of the
architectures with 16GPRs (neither AMD64 nor ARM), we managed to use
more than one register for stack caching.

>It seems to me that the main factors of why one need more FPRs than GPRs ar=
>e less related to difference in sort of work and algorithms that one does w=

>ith FP vs Integer and more related to two technicalities:

>1. On typical implementations common FP ops have much higher latency*throug=

>hput product than common integer ops.

Latency is irrelevant for logical register pressure in an OoO
implementation; you just schedule the instructions next to each other,
and OoO takes care of the latency.

My impression is that FP architectures have more register names
because they (and the corresponding microarchitectures) are designed
for matrix multiplication: In matrix multiplication, you can reduce
the memop/flop ratio by keeping more stuff in registers. The
microarchitectures are designed with lots of flops (especially fmas,
see the 2 512-bit FMA units in the server Skylakes)), but relatively
few memops (2 512-bit wide loads, 1 512-bit wide store in server
Skylakes). In order to feed the FMA units (with 2*3 operands), 4
operands have to come from registers, and one of the two results has
to stay in a register. That's possible for matrix multiplication, but
you need to keep quite a bit of data in registers. IIRC, 16 registers
is almost, but not quite enough. Consequently, AVX512 got 32 register
names.

already...@yahoo.com

unread,

Mar 12, 2019, 11:06:40 AM3/12/19

to

On Tuesday, March 12, 2019 at 3:48:06 PM UTC+2, Anton Ertl wrote:
> already...@yahoo.com writes:
> >Except that on OoO machine with good support for immediate operands in the =
> >instruction set 16 GPRs are already in the area of diminishing return.
>
> In Gforth I see:
>
> sieve bubble matrix fib fft
> 0.240 0.272 0.140 0.356 0.108 1 reg stack cache 1800MHz Cortex-A72
> 0.204 0.232 0.104 0.212 0.104 3 reg stack cache 1800MHz Cortex-A72
> 0.448 0.528 0.320 0.580 0.288 1 reg stack cache 1400MHz Cortex-A53
> 0.376 0.424 0.256 0.564 0.296 3 reg stack cache 1400MHz Cortex-A53
>
> I.e., a speedup from 0.97-1.68 from using 3-register stack caching
> over one. The relevance to the question at hand: On none of the
> architectures with 16GPRs (neither AMD64 nor ARM), we managed to use
> more than one register for stack caching.
>

I still don't understand what are you trying to say.

> >It seems to me that the main factors of why one need more FPRs than GPRs ar=
> >e less related to difference in sort of work and algorithms that one does w=
> >ith FP vs Integer and more related to two technicalities:
> >1. On typical implementations common FP ops have much higher latency*throug=
> >hput product than common integer ops.
>
> Latency is irrelevant for logical register pressure in an OoO
> implementation; you just schedule the instructions next to each other,
> and OoO takes care of the latency.

For very common patterns of long multiply-accumulate chains you can not.
On something like HSWL, if you want max. throughput in GEMM/long-convolution, you need 10 sw-visible registers just for accumulators.

>
> My impression is that FP architectures have more register names
> because they (and the corresponding microarchitectures) are designed
> for matrix multiplication: In matrix multiplication, you can reduce
> the memop/flop ratio by keeping more stuff in registers. The
> microarchitectures are designed with lots of flops (especially fmas,
> see the 2 512-bit FMA units in the server Skylakes)), but relatively
> few memops (2 512-bit wide loads, 1 512-bit wide store in server
> Skylakes). In order to feed the FMA units (with 2*3 operands), 4
> operands have to come from registers, and one of the two results has
> to stay in a register. That's possible for matrix multiplication, but
> you need to keep quite a bit of data in registers. IIRC, 16 registers
> is almost, but not quite enough. Consequently, AVX512 got 32 register
> names.
>

16 registers are sufficient for GEMM, even on HSWL (SKL is a bit easier), but you need cooperation from compiler.
The trick is to multiply 5 rows by 2 columns in the inner loop.
4 rows by 3 columns is theoretically a little better, but then you'll have to do it in asm, register allocation algorithms in compilers like gcc or clang or VisulalC are not good enough.

I spend a fair amount of time playing with these things couple of years ago.
As usual with my pet projects, completely lost interest thereafter, but results are still accessible on github.
Probably I should have completed it, because for medium-sized matrices my routines were significantly faster than OpenSSL, but when I am losing interest it's like a brick wall, I can't help myself.

Anne & Lynn Wheeler

unread,

Mar 12, 2019, 12:28:29 PM3/12/19

to

Quadibloc <jsa...@ecn.ab.ca> writes:
> Another thing is that the Burroughs machines of the 1960s could do a *lot* of
> things current hardware can't do. That's because they ran at a higher level
> than conventional computers. Data was tagged with its type, arrays had
> descriptors, all in the hardware.
>
> This had overhead. The microprocessors in the computer on your desktop are,
> architecturally, a lot like a 360/195. (Admittedly, the closest match is the
> Pentium II; today's machines have OoO integer pipelines too.)

370/195 trivia: early 70s, I got sucked into effort to hyperthread 195
(which never ships). ... this mentions multithreaded here
https://people.cs.clemson.edu/~mark/acs_end.html

195 didn't have branch prediction and speculative execution ... so
conditional branches drained pipeline ... and unless carefully tuned,
most codes ran at half 195 speed. doing two instruction streams
simulating two processors ... each running at half speed would
get full 195 throughput.

--
virtualization experience starting Jan1968, online at home since Mar1970

Quadibloc

unread,

Mar 12, 2019, 12:59:24 PM3/12/19

to

On Tuesday, March 12, 2019 at 2:28:45 AM UTC-6, BGB wrote:
> On 3/12/2019 1:38 AM, Bruce Hoult wrote:

> > Ohhh .. I think the opposite. You can almost always get away ok with 16 GPRs, it's FP where you quickly need 32 FP registers (or vector registers) once you start to unroll things to get the FLOPs up.

> This could depend a bit on workload.

> Most of the code I am dealing with is primarily integer code, with some
> amount of fixed-point and floating-point thrown in. Most of the "in the
> wild" code I have come across, has also been primarily integer code.

I think the assumption is that while lots of programs don't use floating-
point operations much, or at all, the programs that are highly
computationally-intensive... games and scientific number-crunching... use
them a lot.

So the question isn't are there programs that use mostly integer
arithmetic... but are there programs that would bring an eight-core OoO
chip to its knees that use mostly integer arithmetic.

Other than your typical Microsoft operating system, that is.

John Savard

already...@yahoo.com

Not much to add, no vector registers in this case...

There was a GPR-SIMD mechanism, but this did SIMD using GPRs, and was
closer to MMX than it was to SSE.

already...@yahoo.com

unread,

Mar 12, 2019, 3:51:56 PM3/12/19

to

On Tuesday, March 12, 2019 at 7:52:10 PM UTC+2, Anton Ertl wrote:
> already...@yahoo.com writes:
> >On Tuesday, March 12, 2019 at 3:48:06 PM UTC+2, Anton Ertl wrote:
> >> already...@yahoo.com writes:
> >> >Except that on OoO machine with good support for immediate operands in the =
> >> >instruction set 16 GPRs are already in the area of diminishing return.
> >>
> >> In Gforth I see:
> >>
> >> sieve bubble matrix fib fft
> >> 0.240 0.272 0.140 0.356 0.108 1 reg stack cache 1800MHz Cortex-A72
> >> 0.204 0.232 0.104 0.212 0.104 3 reg stack cache 1800MHz Cortex-A72
> >> 0.448 0.528 0.320 0.580 0.288 1 reg stack cache 1400MHz Cortex-A53
> >> 0.376 0.424 0.256 0.564 0.296 3 reg stack cache 1400MHz Cortex-A53
> >>
> >> I.e., a speedup from 0.97-1.68 from using 3-register stack caching
> >> over one. The relevance to the question at hand: On none of the
> >> architectures with 16GPRs (neither AMD64 nor ARM), we managed to use
> >> more than one register for stack caching.
> >>
> >
> >I still don't understand what are you trying to say.
>
> At a speedup of up to 1.68, having more than 16 GPRs is quite helpful
> for Gforth performance, not just "diminishing returns".
>

I was talking about compiled programs.
But even for interpreter, like yours, I would think that if multiple-register cache stack is so beneficiary for performance then, may be, your register allocation is sub-optimal? May be, you better have 2 or 3 registers dedicated to register stack cache even at cost spilling something else from register to memory? In my experience, distribution of the frequency of access for local integer variables is *always* uneven, so least used of 15 or 16 registers has to be accessed by something like 2-3% of instructions.

> >> Latency is irrelevant for logical register pressure in an OoO
> >> implementation; you just schedule the instructions next to each other,
> >> and OoO takes care of the latency.
> >
> >For very common patterns of long multiply-accumulate chains you can not.
>
> Why not?

Because dependency chains through accumulators are true (a.k.a. RaW) dependencies. And chains are much longer than reordering window or the size of rename registers pool.

>
> >On something like HSWL, if you want max. throughput in GEMM/long-convolution, you need 10 sw-visible registers just for accumulators.
>
> Sure, as I wrote below, register names are useful for matrix
> multiplication. But that has nothing to do with
>
> >16 registers are sufficient for GEMM, even on HSWL (SKL is a bit easier)
>
> That is Haswell, and Skylake, right?
>
> >The trick is to multiply 5 rows by 2 columns in the inner loop.
>
> Ok, then maybe I miscalculated when I did the work. I thought you
> needed at least 4x4 to get 2 FMAs/cycle rather than being
> load/store-limited, and that would be just beyond 16 registers.
>

Loads: 5+2=7.
FMA: 5*2=10.
7 < 10

4x4 is only necessary when you have 2 FMAs and 1 load per cycle.

However the main reason to want 10 accumulators is what I mentioned above - latency*throughput product of FMA.
Otherwise 4x2 (8 loads per 8 FMAs) would have been doing fine.

MitchAlsup

unread,

Mar 12, 2019, 6:05:37 PM3/12/19

to

Yes, but this is something the ARCHITECTURE can greatly ameliorate, at
vanishingly small harm to the microarchitecture:

Take the Inner Product calculation::

sum = a[0]*b[0]
for( i=1, i < MAX; i++ )
sum += a[i]*b[i]

If you have an instruction that performs multiply and accumulate but
does not deliver the result, you code this as::

sum = a[0]*b[0]
for( i=1, i < MAX-1; i++ )
sum + a[i]*b[i]; // notice no "=" in this line
sum += a[i]*b[i] // final multiply and deliver result

By leaving the sum in the FMAC unit each new product is accumulated
significantly wider than any result that can be delivered (168-bits
wide in typical FAMC units.)

You save the register in the encoding space (but eat an OpCode encoding)
But in almost any implementation, the 3-operand subspace are more
crowded than the 2-operand subspace.

You save in register file power and porting, forwarding, ...

And finally, instead of having to wait (4 cycles typically) for the FAMC to
deliver its result, you can lob a new accumulation into the FAMC unit every
cycle (quadrupling perf.)

Can SW get around this, probably, but only by eating lots of registers,
and losing accuracy. It is never clear that:

sum0 = 0;
sum1 = 0;
sum2 = 0;
sum3 = 0;
for( i=0, i < MAX; i+=4 )
{
sum0 += a[i+0]*b[i+0]
sum1 += a[i+1]*b[i+1]
sum2 += a[i+2]*b[i+2]
sum3 += a[i+3]*b[i+3]
}
sum = sum0+sum1+sum2+sum3

Has any accuracy advantage, and often does worse if the data has modulo-4
characteristics. Here we are consuming 12-FP registers whereas before we
were only consuming 3; and potentially losing perf and accuracy at the same
time.

Sometimes it does not even have a perf advantage, especially if memory is
not "doing well".

This is not free, but something ISA architects should think their own way
through and come to a better conclusion that those in the past have done.

BGB

unread,

Mar 12, 2019, 9:35:05 PM3/12/19

to

Possible, but this could introduce some other issues/complexities vs
staying with the use of 3 read ports.

> But, I digress.....
>
> Bit Field Insert requires 3 operands,
> Integer Multiply and accumulate require 3 operands,
> and my Multiplex instruction require 3 operands;
> Along with floating point FMAC.
>
> Now, one can do integer IMUL, UMUL, IDIV, UDIV over in the FMAC unit,
> and one can do floating point compare (along with FMAX, FMIN, FABS) over
> in the Integer Compare unit (along with IMAX, UMAX, IMIN, UMIN)
>

OK.

I don't have dedicated instructions for a lot of these cases.

> Once you have significant portions of the ISA crossing from FPU-land
> into INT land and vice versa the separation of register files makes
> less and less sense MICROarchitecturally.
>
> At the Architecture level, one does not need to replicate memory reference
> instructions for FP, or find silly ways to chop significant bits off the
> end of FP values for various useful purposed. Try doing:
>
> AND R9,R8,~(-1<<27)
> FADD R10,R8,R9
>
> R9 has the top 26-bits of significance, while R10 has the bottom 27-bits of
> significance (used in exact FP arithmetics). Try doing this kind of thing
> with separate register files.
>

While I could merge FPRs and GPRs, as-is this would still be a pretty
drastic design change.

There are ops to MOV values between GPRs and FPRs when needed, and these
are generally sufficient.

The way I had handled FMOV.x before was that some inputs were mirrored
between the FPU and EX unit, and these operations involve both units
operating in parallel. The EX unit behaves mostly as it would for a
normal DWORD or QWORD operation.

> So given all of this, one has 1 register file, and one configures a low
> end 1-wide machine with 3 read ports and 1 write port. The vast majority
> of instructions use 1 or 2 read ports, so there is a very high probability
> that the store pipeline (above) will get its read satisfied quickly. But
> of course, any time there is a store being issued the store pipeline will
> necessarily make forward progress.
>
> At this point one needs to recognize 2 (or 3) things. Branches do not write
> the register file, stores do not write the register file so these can be
> CoIssued most of the time. There are certain idioms that can be made to
> elide a register write when Inst[k+].rd = Inst[k].rd. and many times these
> can also be CoIssued.
>
> These 3 things go a long way to 2-wide SuperScalar perf at minimal additional
> costs over a simple in-order 1-wide machine. About the only thing one needs
> is a small Instruction buffer so that the CoIssue instructions arrive early
> enough to be detected and CoIssued.

My idea for WEX support had involved a 4R+2W design, mostly just with
the restriction that Store operations could not execute in parallel
(they would occupy both lanes).

Also Load/Store would only exist in Lane 1, with Lane 2 limited mostly
to ALU ops.

So, we have two lanes, with ops:
OP1 Rm1, Ro1, Rn1 | OP2 Rm2, Ro2, Rn2
And GPR ports:
Rs=Rm1, Rt=Ro1, Ru=Rm2, Rv=Ro2

But a store op:
MOV.x Rm, (Rn, Ri)
Would use the ports:
Rs=Rn, Rt=Ri, Ru=ZZR, Rv=Rm

One still-unresolved issue is how to cheaply implement support for two
write ports.

It is possible to do all of the registers individually, and do updates like:
regValR3 <= (regIdRp2==REG_R3)?regValRp2:
(regIdRq2==REG_R3)?regValRq2:regValR3;

But, this is fairly expensive.

A lower-cost option for the 1W case is to use an array:
regValArr[regIdRp2] <= regValRp2;

But:
regValArr[regIdRp2] <= regValRp2;
regValArr[regIdRq2] <= regValRq2;

Doesn't work out nearly so well...

Where:
Rp1: Output from Lane 1, EX1
Rp2: Output from Lane 1, EX2
Rq1: Output from Lane 2, EX1
Rq2: Output from Lane 2, EX2

The output from EX2 being what is written back to the register file.

>>
>> Scaled-index load is less of an issue, as it could be made to work with
>> 2 ports, but loads scaled-index load operations without the
>> corresponding store operations would break symmetry.
>>
>>
>> Doing it with 2 ports, with minimal ISA level changes (eg, still having
>> LEA), would likely require transforming it into:
>> LEA.x (Rn, Ri), R0 //LEA only needs 2 ports
>> MOV.x Rm, (R0) //Likewise, can use 2 ports
>>
>> Internally, most other memory ops also use 3 registers, as the address
>
> Note: I make it the habit of only calling software visible store registers.
> So the above sentence should read something like:
>
>> Internally, most other memory ops also use 3 operands, as the address
>

I don't quite understand this distinction...

> But I would also point out that one of those operands is invariably a constant
> (Displacement in particular) and is in no way associated with a register file.
>

In my case:
MOV.x Rm, (Rn, Ri)
and:
MOV.x Rm, (Rn, Disp)

Are different instruction forms, with Ri basically doing an x86-style
scaled-index mode, but with the scale generally hard-wired to the
element size (unlike x86 or ARM, which have a flexible scale), except
when the base register is PC or GBR (which are unscaled).

In my case I sort of ended up routing immediate values and displacements
through the register interface, so they still used a register port.

Generally, the special (internal) IMM register is only meaningful for
the 'Rt' port, with the other ports treating it as ZZR (Zero).

In a 4R+2W design, 'Rt' would return the immediate from Lane 1, and 'Rv'
from Lane 2.

Similarly, there were PC and GBR, but these would only be valid for
'Rs'. PC would give the address of the following instruction.

So: PC, GBR, IMM, and ZZR exist as special "internal" register numbers.

This was mostly so at the 'EX' stage, operations would not need to care
whether the inputs came from registers or immediate values or similar.

>> calculation currently does pretty much all memory accesses as
>> (Rm+Ri*Sc), with the 3rd port being used for the value to be stored.
>> Displacement cases currently handled by an 'IMM' register.
>>
>>
>> Some info I have found is conflicting as to whether ARM is 2R+1W or 3R+1W.
>>
>> I would presume instructions like:
>> STR R4, [R1, R2, LSL #2]
>> To also require 3 read ports.
>
> But one can steel (?borrow?) the data read port later in the pipeline.

OK, dunno then.

Anton Ertl

unread,

Mar 13, 2019, 6:41:31 AM3/13/19

to

already...@yahoo.com writes:
>On Tuesday, March 12, 2019 at 7:52:10 PM UTC+2, Anton Ertl wrote:
>> already...@yahoo.com writes:
>> >On Tuesday, March 12, 2019 at 3:48:06 PM UTC+2, Anton Ertl wrote:
>> >> already...@yahoo.com writes:

>> >> >Except that on OoO machine with good support for immediate operands i=
>n the =3D
>> >> >instruction set 16 GPRs are already in the area of diminishing return=
>.
>> >>=20
>> >> In Gforth I see:
>> >>=20

>> >> sieve bubble matrix fib fft
>> >> 0.240 0.272 0.140 0.356 0.108 1 reg stack cache 1800MHz Cortex-A72
>> >> 0.204 0.232 0.104 0.212 0.104 3 reg stack cache 1800MHz Cortex-A72
>> >> 0.448 0.528 0.320 0.580 0.288 1 reg stack cache 1400MHz Cortex-A53
>> >> 0.376 0.424 0.256 0.564 0.296 3 reg stack cache 1400MHz Cortex-A53

>> >>=20

>> >> I.e., a speedup from 0.97-1.68 from using 3-register stack caching
>> >> over one. The relevance to the question at hand: On none of the
>> >> architectures with 16GPRs (neither AMD64 nor ARM), we managed to use
>> >> more than one register for stack caching.

>> >>=20

>> >
>> >I still don't understand what are you trying to say.

>>=20

>> At a speedup of up to 1.68, having more than 16 GPRs is quite helpful
>> for Gforth performance, not just "diminishing returns".

>>=20

>
>I was talking about compiled programs.

Gforth is a compiled program. The benchmarks run on top of it, but
they don't see the machine code registers.

>But even for interpreter, like yours, I would think that if multiple-regist=
>er cache stack is so beneficiary for performance then, may be, your registe=
>r allocation is sub-optimal? May be, you better have 2 or 3 registers dedic=
>ated to register stack cache even at cost spilling something else from regi=
>ster to memory?

Of course the register allocation is suboptimal: Gforth is a compiled
program. We help the compiler with explicit register variables, but
the compiler limits what is possible (IIRC we can use only
callee-saved registers for explicit register variables that survive
calls). We have played around quite a bit with different explicit
register allocations, and while (looking at
<http://git.savannah.gnu.org/cgit/gforth.git/tree/arch/amd64/machine.h>
I can see a change that would benefit these benchmarks, it would hurt
programs that use locals.

>In my experience, distribution of the frequency of access f=
>or local integer variables is *always* uneven, so least used of 15 or 16 re=

>gisters has to be accessed by something like 2-3% of instructions.

That's possible, but if you replace these register accesses by memory
accesses, a single-cycle add becomes a 6-cycle load-add-store
sequence, and if these 2% instructions are bunched together (or the
alias predictor thinks there is an alias), doing the memory accesses
may cause a several-times slowdown for the pieces of code containing
these instructions.

Looking at the results above, it is interesting to see that, even
though both were running the same code, fib hardly benefited from the
3-register stack cache on the in-order Cortex-A53, but benefitted
greatly on the OoO Cortex-A72.

>> >> Latency is irrelevant for logical register pressure in an OoO
>> >> implementation; you just schedule the instructions next to each other,
>> >> and OoO takes care of the latency.
>> >
>> >For very common patterns of long multiply-accumulate chains you can not.

>>=20
>> Why not?
>
>Because dependency chains through accumulators are true (a.k.a. RaW) depend=
>encies. And chains are much longer than reordering window or the size of re=
>name registers pool.

Yes, for chains that exceed the reordering capacity of the OoO engine,
the latency plays a role. But 1) I don't think that such chains are
"very common", and 2) when they occur, there is no automatic way to
alleviate the problem by using more registers.

I guess, though that what you have in mind is to turn such a long
chain into several parallel long chains by applying the associative
law (which, however, works only approximately for FP operations), and
there the number of chains you use depends on the latency; and n
chains also require n accumulators.

>4x4 is only necessary when you have 2 FMAs and 1 load per cycle.

Yes, that probably was what I was thinking about. I thought that
server Skylake (like client Skylake) has 256-bit wide load-store
units, but according to
<https://en.wikichip.org/wiki/intel/microarchitectures/skylake_%28server%29>,
it is 512 bits wide.

>However the main reason to want 10 accumulators is what I mentioned above -=
> latency*throughput product of FMA.=20

>Otherwise 4x2 (8 loads per 8 FMAs) would have been doing fine.

Can you explain that? In my experiments, I processed the blocks to
avoid long dependency chains. What did you do?

already...@yahoo.com

unread,

Mar 13, 2019, 8:05:08 AM3/13/19

to

For sort of numeric problems that I tackle most, such chains are very common.
Typically, algorithms can be changed in a way that long chains goes away, but only at cost of significant increase in # of memory operations, esp. stores.
The end results of such transformations tend to be worse, speed-wise, then originals.

> and 2) when they occur, there is no automatic way to
> alleviate the problem by using more registers.
>

Automatic in a sense "done by compiler" ?
I don't know. I never really tried. I always do it manually.

Theoretically, "automatic by compiler" should be impossible for standard C, for reasons you mentioned below, but possible for Fortran and for non-standard C, as gcc -fast-math, which essentially changes C semantics into Fortran semantics.

> I guess, though that what you have in mind is to turn such a long
> chain into several parallel long chains by applying the associative
> law (which, however, works only approximately for FP operations), and
> there the number of chains you use depends on the latency; and n
> chains also require n accumulators.

Yes, except it depends on latency*throughput product rather than on latency in isolation.

>
> >4x4 is only necessary when you have 2 FMAs and 1 load per cycle.
>
> Yes, that probably was what I was thinking about. I thought that
> server Skylake (like client Skylake) has 256-bit wide load-store
> units, but according to
> <https://en.wikichip.org/wiki/intel/microarchitectures/skylake_%28server%29>,
> it is 512 bits wide.
>
> >However the main reason to want 10 accumulators is what I mentioned above -=
> > latency*throughput product of FMA.=20
> >Otherwise 4x2 (8 loads per 8 FMAs) would have been doing fine.
>
> Can you explain that? In my experiments, I processed the blocks to
> avoid long dependency chains. What did you do?

It was two years ago, so I don't remember an exact length of the chains in the inner loop. But they were rather long chains.
Let's look at repository... 200. Of course, the number only applies to big matrices. For smaller matrices the length of chain would be equal to the dimension K.
I do remember that [for big matrices] shorter chains gave consistently slower results. I don't remember, how much slower. I vaguely remember that on Haswell (E3-1271 v3) it made bigger difference than on Broadwell-with-Crystall Well (I don't remember for sure what it was. Most likely i7-5775R) or Kaby Lake (i3-7100U).

If you are interested, you can experiment with it yourself.
My github repository is public. It's called mkl_tst1_2017_04_06.
The parameter in question is K_STEP in file fma256_noncblas_sgemm_n5.c

MitchAlsup

unread,

Mar 13, 2019, 12:02:05 PM3/13/19

to

The word "register" is so overloaded, that one reading this word does not
know if it means: a) SW visible store, b) HW pipeline data staging store,
or c) a kind of interface that captures and delivers on a clock edge
(Registered SRAM). So, I only use the word register for a.

>
>
> > But I would also point out that one of those operands is invariably a constant
> > (Displacement in particular) and is in no way associated with a register file.
> >
>
> In my case:
> MOV.x Rm, (Rn, Ri)
> and:
> MOV.x Rm, (Rn, Disp)
>
> Are different instruction forms, with Ri basically doing an x86-style
> scaled-index mode, but with the scale generally hard-wired to the
> element size (unlike x86 or ARM, which have a flexible scale), except
> when the base register is PC or GBR (which are unscaled).

I can use:

LDD R7,[Rbase+Rindex<<size+DISP]

3 operands, 2 registers. 3-input adder with 4-position shifter.

>
>
> In my case I sort of ended up routing immediate values and displacements
> through the register interface, so they still used a register port.

I route mine through the forwarding logic.

>
> Generally, the special (internal) IMM register is only meaningful for
> the 'Rt' port, with the other ports treating it as ZZR (Zero).
>
> In a 4R+2W design, 'Rt' would return the immediate from Lane 1, and 'Rv'
> from Lane 2.

I can do 1.3 IPC <peak> with 3r1w

>
> Similarly, there were PC and GBR, but these would only be valid for
> 'Rs'. PC would give the address of the following instruction.
>
> So: PC, GBR, IMM, and ZZR exist as special "internal" register numbers.
>
>
> This was mostly so at the 'EX' stage, operations would not need to care
> whether the inputs came from registers or immediate values or similar.

Precisely.

a...@littlepinkcloud.invalid

unread,

Mar 13, 2019, 12:29:28 PM3/13/19

to

already...@yahoo.com wrote:

> Except that on OoO machine with good support for immediate operands
> in the instruction set 16 GPRs are already in the area of
> diminishing return.

Here's my take on it: In a RISC (i.e. a strict load/store
architecture) 16 is just about enough as long as you don't have any
fixed registers, but in practice it's useful to have a few things such
as a frame pointer, the base of the heap, the current thread, and so
on. Once you take away those from a 16-entry register set, the
compiler's register allocator is starting to chafe and you end up with
a lot of spills. It's also nice to reserve a few callee-saved
registers so you don't have to spill everything around a call: arm64
has 8 of them.

So, 32 is registers probably slightly more than you need, but more
than 16 is good. x86_64 sort-of gets away with fewer registers by
having some instructions that take memory operands.

Andrew.

Anton Ertl

unread,

Mar 13, 2019, 1:32:39 PM3/13/19

to

a...@littlepinkcloud.invalid writes:
>x86_64 sort-of gets away with fewer registers by
>having some instructions that take memory operands.

That saves only one register.

Having a significant portion of in-order cores (as for Aarch64) costs
several, though.

Stephen Fuld

unread,

Mar 13, 2019, 1:40:49 PM3/13/19

to

On 3/12/2019 12:13 AM, Anton Ertl wrote:

snip

>
> Concerning old systems, the Linux distributors as well as Microsoft
> wish that they did not exist, and produce incentives for switching to
> new systems. Apparently in the Linux world these work so well that
> Linux distributors can get away with making it hard to build binaries
> that work on old systems.

Or the customers (primarily enterprises with mission critical systems)
who want long term support use distributions such as Red Hat Enterprise,
which has fewer updates and longer support cycles.

> For Windows, they don't work so well; people are still using WXP years
> after its support has ended; companies are paying extra to extend the
> support for W7.

Yup.

OK.

I was using it for:
one of the registers (R0..R31 or FR0..FR15 or similar);
An input or output to an instruction going to one of the above.

But, yeah, I guess it gets ambiguous with all the stuff that can be
called a "register" in the context of Verilog code or similar.

>>
>>
>>> But I would also point out that one of those operands is invariably a constant
>>> (Displacement in particular) and is in no way associated with a register file.
>>>
>>
>> In my case:
>> MOV.x Rm, (Rn, Ri)
>> and:
>> MOV.x Rm, (Rn, Disp)
>>
>> Are different instruction forms, with Ri basically doing an x86-style
>> scaled-index mode, but with the scale generally hard-wired to the
>> element size (unlike x86 or ARM, which have a flexible scale), except
>> when the base register is PC or GBR (which are unscaled).
>
> I can use:
>
> LDD R7,[Rbase+Rindex<<size+DISP]
>
> 3 operands, 2 registers. 3-input adder with 4-position shifter.

FWIW: my former BJX1 ISA had an addressing mode more like this early on,
but it was later dropped due to being rarely used and fairly expensive
to support.

This could be handled instead via something like:
LEA.x (Rm, Ri), R2
MOV.x (R2, Disp), R7
Or, alternatively:
LEA.x (Rm, Disp), R2
MOV.x (R2, Ri), R7

In SH, one would have done something more like:
MOV Ri, R0
ADD #disp, R0
SHLL2 R0
MOV.x @(Rm, R0), R7

>>
>>
>> In my case I sort of ended up routing immediate values and displacements
>> through the register interface, so they still used a register port.
>
> I route mine through the forwarding logic.

OK.

>>
>> Generally, the special (internal) IMM register is only meaningful for
>> the 'Rt' port, with the other ports treating it as ZZR (Zero).
>>
>> In a 4R+2W design, 'Rt' would return the immediate from Lane 1, and 'Rv'
>> from Lane 2.
>
> I can do 1.3 IPC <peak> with 3r1w

Hmm...

In my case 3R+1W can give ~ 0.6 .. 0.7 IPC ( a lot of cycles being spent
on cache-misses and branch latency ).

I expect a 4R+2W design with WEX (or superscalar) could maybe get ~ 1.2
IPC or so (estimate). Assuming I can make it work within the resource
budget.

In the former case, 100MHz gives usable speeds for Doom (~ 20-30 fps),
and Quake speeds which are "borderline playable" (pretty laggy but not
quite a slide-show).

1.2 IPC would give performance in Quake which is still pretty laggy, but
at least with fps values mostly staying in double-digits territory.

But, from what I can gather, this should generally be sufficient for
typical real-time motor-control tasks or similar.

Unlike Doom or Quake continually redrawing the screen, the CNC
controller mostly uses it like a character-cell display, and might
occasionally draw wireframe graphics (Bresenham's Algorithm), ...

Other than UI, it also manages things like telling the motors how to
move, polls input from limit switches and an optical encoder, interprets
G-Code programs, ...

But, Doom and Quake are useful sort of like the "canary in the mine", in
that they barf and die if something goes amiss (their failure modes are
more visible and immediate).

My general survey of this area mostly turns up the use of either aging
processor designs (eg: M68EC020) or 32-bit microcontrollers (such as ARM
Cortex-M3 and similar), along with SH and MIPS and ARM9 and similar...

A few products apparently manage to make effective use of an AVR8 or
MSP430 for their CNC controllers (... somehow ...).

Granted... An Arty dev-board costs ~ 20x as much as an ATmega128 and ~
4x as much as a RasPi B+, but in the face of building a machine which
otherwise costs ~ $3k in parts, this part isn't a huge issue.

The FPGA's by themselves are a little cheaper (~ $32 each for an XC7S25,
*1), but sadly are only really available as BGA devices (would need to
have PCBs made and get a reflow oven and similar to be able to use these).

Though, an SDcard interface and similar would also be nice (for storing
G-code programs), ...

*1:
https://www.digikey.com/product-detail/en/xilinx-inc/XC7S25-1CSGA324C/122-2118-ND/8040783

And, as can be noted, raw computational performance isn't really the
limiting factor for CNC (in terms of computational performance, the
RasPi is the heavyweight of this group...).

For a lot of people building home CNC's, late 90's or early 2K's PC's
were a popular option (mostly as they could use the parallel port as a
poor man's GPIO). Limiting factor here is mostly needing something with
a parallel port.

Apparently though, the reliability of going the "re-purposed old desktop
PC" route isn't particularly great though. Apparently even running an
RTOS on the PC, it is hard to get the timing all that reliable (it is
apparently a pretty mixed bag here).

Or such...

a...@littlepinkcloud.invalid

unread,

Mar 14, 2019, 1:45:09 PM3/14/19

to

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> a...@littlepinkcloud.invalid writes:
>>x86_64 sort-of gets away with fewer registers by having some
>>instructions that take memory operands.
>
> That saves only one register.
>
> Having a significant portion of in-order cores (as for Aarch64) costs
> several, though.

I don't understand this comment.

Andrew.

MitchAlsup

unread,

Mar 14, 2019, 2:29:42 PM3/14/19

to

Which when you have a bigger machine, you will recognize as an idiom and
smash it back into a single calculation. Why not start with it properly
expressed?

>
> In SH, one would have done something more like:
> MOV Ri, R0
> ADD #disp, R0
> SHLL2 R0
> MOV.x @(Rm, R0), R7
>

My operand data path is built around 3 busses in order to support FMAC (and
IMAC, Bit-field INSert, and a few others). So the register file and forwarding
ports are already present as are the busses to deliver operand to the AGEN
unit. The AGEN unit is actually shared with the Integer Adder but takes
3-inputs.

The the expense of my 66100 having is minus one 64-bit adder.

Since the 3rd port is not used "all that often" I allow the Store instructions
to read data from the reg file through this port.

>
> >>
> >>
> >> In my case I sort of ended up routing immediate values and displacements
> >> through the register interface, so they still used a register port.
> >
> > I route mine through the forwarding logic.
>
> OK.
>
>
> >>
> >> Generally, the special (internal) IMM register is only meaningful for
> >> the 'Rt' port, with the other ports treating it as ZZR (Zero).
> >>
> >> In a 4R+2W design, 'Rt' would return the immediate from Lane 1, and 'Rv'
> >> from Lane 2.
> >
> > I can do 1.3 IPC <peak> with 3r1w
>
> Hmm...
>
> In my case 3R+1W can give ~ 0.6 .. 0.7 IPC ( a lot of cycles being spent
> on cache-misses and branch latency ).

Let me clarify:: My 66130 has a peak CoIssue rate of 1.3 IPC.

With 64K I cache and 64K D cache a 6 cycle L2 and 100 cycle DRAM and
CoIssue of STs and Branches I get 1.1 CPI at the end of the pipeline.
Whereas without CoIssue I get 1.37 CPI.
Converting into IPC

CoIssue: 0.91
StdRISC: 0.73

And this is done with a 3r1w register file.

Stefan Monnier

unread,

Mar 14, 2019, 4:20:19 PM3/14/19

to

> Concerning old systems, the Linux distributors as well as Microsoft
> wish that they did not exist, and produce incentives for switching to
> new systems.

If by "old systems" you mean "old operating systems", then I agree.
But w.r.t old hardware, GNU/Linux distributions like Debian are pretty
good at making efforts to keep supporting them.

Stefan

BGB

unread,

Mar 15, 2019, 1:30:20 AM3/15/19

to

If I have an instruction for something, and it is actually used (thus
needing to be supported), there is an issue of how to execute it
effectively.

Trying to do two adds in a single clock-cycle poses challenges for
timing. Most other options would add complexity and require it to be a
multi-cycle operation.

In terms of encoding, it would probably need to be a 48-bit op, eg:
FC0e_XXnm_osdd MOV.x (Rm, Ro, Sc, Disp9)

I did recently go and make a few big "breaking changes" to the ISA:
Switched the F0 block from the F0eo_XXnm layout to F0nm_XeoX;
This makes the F0 block consistent with the F1 and F2 blocks.
F1: Fznm_Xedd
F2: Fznm_Xejj
Swapped Dzzz and Ezzz blocks, so now Ezzz is the predicated ops.
Since I was "already breaking stuff anyways".
Mostly aesthetic, but Ezzz and Fzzz are more obviously adjacent.

This did require pretty much recompiling everything, but oh well.
The advantage is that it allows a fair bit of simplification to the
Verilog code for the 32-bit instruction decoder.

A lot of logic for handling two semi-redundant sets of registers could
be collapsed down, and a lot of semi-redundant logic and special cases
could go away (the module in question went from ~2300 lines to ~1500
lines; hopefully it also got cheaper...).

The FC and FD blocks have also been split off into their own decoder module.

So:
DecOp: Multiplexes sub-decoder output, handles predication.
DecOpBz: Handles 16-bit ops.
DecOpFz: Handles 32-bit ops.
DecOpFC: Handles 48-bit ops.

>>
>> In SH, one would have done something more like:
>> MOV Ri, R0
>> ADD #disp, R0
>> SHLL2 R0
>> MOV.x @(Rm, R0), R7
>>
>

FWIW: I suspect this sort of thing is part of the reason for the
comparably lackluster performance of SH2 and SH4.

SH2A added some 32-bit instruction encodings which were pretty helpful
here (such as loads/stores with 12-bit displacements), and the ability
to load slightly larger 20-bit constant values without needing load them
from memory.

> My operand data path is built around 3 busses in order to support FMAC (and
> IMAC, Bit-field INSert, and a few others). So the register file and forwarding
> ports are already present as are the busses to deliver operand to the AGEN
> unit. The AGEN unit is actually shared with the Integer Adder but takes
> 3-inputs.
>
> The the expense of my 66100 having is minus one 64-bit adder.
>
> Since the 3rd port is not used "all that often" I allow the Store instructions
> to read data from the reg file through this port.

Mine has 3 ports mostly for sake of memory store.

The AGU uses the same inputs as the ALU, but differs in terms of behavior.

Most of the units operate in parallel, just the EX stage logic takes the
output from the unit it cares about for a given operation.

>>
>>>>
>>>>
>>>> In my case I sort of ended up routing immediate values and displacements
>>>> through the register interface, so they still used a register port.
>>>
>>> I route mine through the forwarding logic.
>>
>> OK.
>>
>>
>>>>
>>>> Generally, the special (internal) IMM register is only meaningful for
>>>> the 'Rt' port, with the other ports treating it as ZZR (Zero).
>>>>
>>>> In a 4R+2W design, 'Rt' would return the immediate from Lane 1, and 'Rv'
>>>> from Lane 2.
>>>
>>> I can do 1.3 IPC <peak> with 3r1w
>>
>> Hmm...
>>
>> In my case 3R+1W can give ~ 0.6 .. 0.7 IPC ( a lot of cycles being spent
>> on cache-misses and branch latency ).
>
> Let me clarify:: My 66130 has a peak CoIssue rate of 1.3 IPC.
>
> With 64K I cache and 64K D cache a 6 cycle L2 and 100 cycle DRAM and
> CoIssue of STs and Branches I get 1.1 CPI at the end of the pipeline.
> Whereas without CoIssue I get 1.37 CPI.
> Converting into IPC
>
> CoIssue: 0.91
> StdRISC: 0.73
>
> And this is done with a 3r1w register file.

OK.

If scalar, mine wont break 1 IPC.

The WEX feature was hoped to help, but isn't exactly a low-cost feature.
If I can get costs down, maybe it can be made reasonable on an XC7S50
(what is on the board I currently have), but is probably asking a bit
much for an XC7S25 or similar.

My general cache layout (thus far):
1K L1 I$
1K L1 D$
64K L2

On an XC7S50, I can afford 128K of L2, but the XC7S25 doesn't really
have enough BRAM to for this to be reasonable.

It is also possible the L1's could be made a little bigger, such as 4K
or 8K, to reduce the miss rate. Or two big L1's with no L2, ...

Say: 64+8+8 = 80K of CPU cache (still leaving an decent amount for
"everything else").

An XC7S15 is a little cheaper, but I suspect this is a bit too small to
fit a 64-bit BJX2 core onto; I would likely need to use a 32-bit ISA on
this (probably with no FPU or similar). It would also require a
considerably smaller cache, ...

However, there are dev-boards for this chip which are closer to being
price-competitive with a RasPi (or with a TI MSP432); unlike the Arty
boards, which cost significantly more than a RasPi.

I guess it could make sense to evaluate other options as well, like the
Cyclone IV / DE0-Nano or similar (cost is in between that of XC7S15 and
XC7S25 based boards). Not entirely sure how they compare capability wise.

In my new "core redo", I am moving the L2 Cache outside of the main CPU
module, but the L1's will remain inside. The cache system is also
getting a pretty major redesign.

I am still on the fence as to whether it is better to send both L2
requests and MMIO over a single bus interface, or to use two separate
bus interfaces. The bus to the L2 would be half-duplex and move data in
128-bit chunks (each representing a pair of 64-bit QWORDs).

...

Anton Ertl

unread,

Mar 15, 2019, 8:48:47 AM3/15/19

to

Explicit instruction-level parallelism (instruction scheduling,
software pipelining) usually increases register pressure. E.g., take
the stride-1 variant of the daxpy loop

for (i=0; i<n; i++)
dy[i] = dy[i] + da*dx[i];

On an OoO implementation, the loop body can be compiled to

# da in f1, i in r1, n in r2
l1:
f2 = dx[r1]
f3 = dy[r1]
f3 = f3+f2*f1 # fma
dy[r1] = f3
r1 = r1+1
if (r1 < r2) goto l1

# plus loop overhead

And the OoO engine will parallelize this up to one iteration per
cycle, resources permitting.

On an in-order implementation, this code will see all the latencies of
the involved instructions and will be slow. A compiler can
software-pipeline it to:

#assumes 2-cycle load, 3-cycle FMA latency, and enough resources:
#ramp-up code left away, here's the steady-state
l1:
f2 =dx[r1]; f3 =dy[r1]; f11=f11+f10*f1; dy[r1-5]=f5 ; r3=r1+1; if r1>=r2 goto l2
f4 =dx[r3]; f5 =dy[r3]; f13=f13+f12*f1; dy[r3-5]=f7 ; r1=r3+1; if r3>=r2 goto l3
f6 =dx[r1]; f7 =dy[r1]; f3 =f3 +f2 *f1; dy[r1-5]=f9 ; r3=r1+1; if r1>=r2 goto l4
f8 =dx[r3]; f9 =dy[r3]; f5 =f5 +f4 *f1; dy[r3-5]=f11; r1=r3+1; if r3>=r2 goto l5
f10=dx[r1]; f11=dy[r1]; f7 =f7 +f6 *f1; dy[r1-5]=f13; r3=r1+1; if r1>=r2 goto l6
f12=dx[r3]; f13=dy[r3]; f9 =f9 +f8 *f1; dy[r3-5]=f3 ; r1=r3+1; if r3< r2 goto l1

So, instead of 3 FP registers, this needs 13 FP registers. And instead
of 2 integer registers, this needs 3 registers (plus maybe another one
for the address of dy[-5]).

a...@littlepinkcloud.invalid

unread,

Mar 15, 2019, 11:13:18 AM3/15/19

to

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> a...@littlepinkcloud.invalid writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>> a...@littlepinkcloud.invalid writes:
>>>>x86_64 sort-of gets away with fewer registers by having some
>>>>instructions that take memory operands.
>>>
>>> That saves only one register.
>>>
>>> Having a significant portion of in-order cores (as for Aarch64) costs
>>> several, though.
>>
>>I don't understand this comment.
>
> Explicit instruction-level parallelism (instruction scheduling,
> software pipelining) usually increases register pressure. E.g., take
> the stride-1 variant of the daxpy loop
>

> ...

>
> So, instead of 3 FP registers, this needs 13 FP registers. And instead
> of 2 integer registers, this needs 3 registers (plus maybe another one
> for the address of dy[-5]).

I know what Ooo is, and why it helps. But the sentence

>>> Having a significant portion of in-order cores (as for Aarch64)
>>> costs several, though.

doesn't make any sense to me. I haven't used any AArch64 systems that
have a significant portion of in-order cores. I guess you're talking
about BIG.little, but that is a mobile phone thing, nothing to do with
AArch64 as such. There are Intel BIG.little machines too.

Andrew.

already...@yahoo.com

unread,

Mar 15, 2019, 11:26:12 AM3/15/19

Yes. And I think that's also true for Windows, to some extent: If
they want people to upgrade to new OS versions, they need to support
old hardware. Otherwise people will stick with the old OS until they
upgrade the hardware.

a...@littlepinkcloud.invalid

unread,

Mar 15, 2019, 2:51:13 PM3/15/19

to

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> a...@littlepinkcloud.invalid writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>I guess you're talking about BIG.little, but that is a mobile phone
>>thing, nothing to do with AArch64 as such.
>
> Right, compiler instruction scheduling is not about architecture (on
> most recent architectures, at least), but about optimizing for a
> microarchitecture.

OK, I'll buy that. As an aside, though, the standard advice from Arm
architects to compiler writers is to schedule for the in-order cores
and let the out-of-order ones look after themselves.

> But there is an interaction between these two: An architecture with
> few registers will be less appropriate for in-order
> microarchitectures; conversely, if you plan to have significant
> in-order microarchitectures, you will design the architecture with
> more registers.

I see. So, unless I misunderstand, it's your contention that if Arm
had been designing for higher-end cores they might have decided to
have fewer registers. It would have been a tough sell.

It partly depends on how sophisticated a compiler you have. JIT
compilation is always a trade-off between the effort of compiling and
the expected gain.

> And, guess what, after 32-bit ARM with 16 registers, they designed
> Aarch64 with 32 registers. So whatever the reason, they obviously
> don't agree with the claim that 16 registers are enough.

Indeed not, but the original Arm was very much constrained by the
technology of the day. 16 registers seemed luxurious at the time; not
so much now.

Andrew.

MitchAlsup

unread,

Mar 15, 2019, 4:36:08 PM3/15/19

to

On Friday, March 15, 2019 at 1:51:13 PM UTC-5, a...@littlepinkcloud.invalid wrote:
> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> > a...@littlepinkcloud.invalid writes:
> >>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> >>I guess you're talking about BIG.little, but that is a mobile phone
> >>thing, nothing to do with AArch64 as such.
> >
> > Right, compiler instruction scheduling is not about architecture (on
> > most recent architectures, at least), but about optimizing for a
> > microarchitecture.
>
> OK, I'll buy that. As an aside, though, the standard advice from Arm
> architects to compiler writers is to schedule for the in-order cores
> and let the out-of-order ones look after themselves.

Pretty much good advice. The GBOoO machines have enough resources to
reschedule the instruction stream either statically in the thing that
smells like and ICache or dynamically in reservation stations, dispatch
stacks, and the like.

>
> > But there is an interaction between these two: An architecture with
> > few registers will be less appropriate for in-order
> > microarchitectures; conversely, if you plan to have significant
> > in-order microarchitectures, you will design the architecture with
> > more registers.
>
> I see. So, unless I misunderstand, it's your contention that if Arm
> had been designing for higher-end cores they might have decided to
> have fewer registers. It would have been a tough sell.
>
> It partly depends on how sophisticated a compiler you have. JIT
> compilation is always a trade-off between the effort of compiling and
> the expected gain.
>
> > And, guess what, after 32-bit ARM with 16 registers, they designed
> > Aarch64 with 32 registers. So whatever the reason, they obviously
> > don't agree with the claim that 16 registers are enough.
>
> Indeed not, but the original Arm was very much constrained by the
> technology of the day. 16 registers seemed luxurious at the time; not
> so much now.

I question this. ARM followed:: Berkeley SPARC, Stanford MIPS, Sun SPARC,
MIPS MIPS, Mc 88K, and probably HP Snakes: all having 32 registers, more
than one implemented in gate arrays. The only other RISC with 16 registers
was Clipper.

>
> Andrew.

BGB

unread,

Mar 15, 2019, 4:56:38 PM3/15/19

to

On 3/15/2019 1:27 PM, Anton Ertl wrote:
> Stefan Monnier <mon...@iro.umontreal.ca> writes:
>>> Concerning old systems, the Linux distributors as well as Microsoft
>>> wish that they did not exist, and produce incentives for switching to
>>> new systems.
>>
>> If by "old systems" you mean "old operating systems", then I agree.
>
> Yes.
>
>> But w.r.t old hardware, GNU/Linux distributions like Debian are pretty
>> good at making efforts to keep supporting them.
>
> Yes. And I think that's also true for Windows, to some extent: If
> they want people to upgrade to new OS versions, they need to support
> old hardware. Otherwise people will stick with the old OS until they
> upgrade the hardware.
>

In recent years things have been stretched out a bit as well, as often
the "new" hardware isn't that much different than the "old" hardware,
and people are occasionally still selling "new" PCs with hardware that
looks like something from a decade ago (Eg: "Core 2 Duo with 4GB of RAM
and a 1TB HDD" and similar), it is just "new" because they stuck Windows
10 on it...

It is possible OSes dropping support for older hardware would also face
a backlash from OEMs (and/or cause the lower end of the desktop PC
market to fall off).

But, then again I remember having difficulty at one point (some-odd
years ago) trying to get Fedora 17 to work on an older laptop (with a
1.3 GHz Celeron M), as apparently it refused to work on something with
256MB of RAM. Was able to get more RAM for it (was able to replace the
1x 256MB module with 2x 512MB; but also discovered it wouldn't boot with
1GB modules).

Luckily, 1GB of RAM was sufficient to make the Fedora installer happy...

paul wallich

unread,

Mar 15, 2019, 5:20:31 PM3/15/19

to

On 3/15/19 4:36 PM, MitchAlsup wrote:
> On Friday, March 15, 2019 at 1:51:13 PM UTC-5, a...@littlepinkcloud.invalid wrote:
>> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

[...]

>>> And, guess what, after 32-bit ARM with 16 registers, they designed
>>> Aarch64 with 32 registers. So whatever the reason, they obviously
>>> don't agree with the claim that 16 registers are enough.
>>
>> Indeed not, but the original Arm was very much constrained by the
>> technology of the day. 16 registers seemed luxurious at the time; not
>> so much now.
>
> I question this. ARM followed:: Berkeley SPARC, Stanford MIPS, Sun SPARC,
> MIPS MIPS, Mc 88K, and probably HP Snakes: all having 32 registers, more
> than one implemented in gate arrays. The only other RISC with 16 registers
> was Clipper.

I think that there's a distinction here because of the market segments.
the original Acorn RISC Machine was intended for PC use, originally home
computers and reportedly had all of 25K transistors. All of the other
CPUs you're citing were headed for something more like
workstation/server use, and had transistor counts several times that and
up.

So not exactly luxurious compared to "real" CPUs, but luxurious when the
comparison was then-prevalent 80XX6/Z8* or jumped-up 6502's or 6800X.

paul

MitchAlsup

unread,

Mar 15, 2019, 7:58:12 PM3/15/19

to

On Friday, March 15, 2019 at 4:20:31 PM UTC-5, paul wallich wrote:
> On 3/15/19 4:36 PM, MitchAlsup wrote:
> > On Friday, March 15, 2019 at 1:51:13 PM UTC-5, a...@littlepinkcloud.invalid wrote:
> >> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> [...]
> >>> And, guess what, after 32-bit ARM with 16 registers, they designed
> >>> Aarch64 with 32 registers. So whatever the reason, they obviously
> >>> don't agree with the claim that 16 registers are enough.
> >>
> >> Indeed not, but the original Arm was very much constrained by the
> >> technology of the day. 16 registers seemed luxurious at the time; not
> >> so much now.
> >
> > I question this. ARM followed:: Berkeley SPARC, Stanford MIPS, Sun SPARC,
> > MIPS MIPS, Mc 88K, and probably HP Snakes: all having 32 registers, more
> > than one implemented in gate arrays. The only other RISC with 16 registers
> > was Clipper.
>
> I think that there's a distinction here because of the market segments.
> the original Acorn RISC Machine was intended for PC use, originally home
> computers and reportedly had all of 25K transistors.

In 1982 the 68020 was announced and it contained 190,000 transistors.
This was before any of the RISC machines got outside of academia.
Think Mac II.

But Wikipedia says: "The first samples of ARM silicon worked properly when
first received and tested on 26 April 1985."

So this puts ARM well past the point in time where most RISCs had 32
registers. So, you can't argue that technology held them back from 32
registers, you can only argue that they chose 16 registers by some
kind of analysis.

Later when migrating to 64-bits the chose a different path forward.

> All of the other
> CPUs you're citing were headed for something more like
> workstation/server use, and had transistor counts several times that and
> up.

The only thing of import that separated a workstation from a PC is
the software that is loaded on it, and possibly access to a network.

a...@littlepinkcloud.invalid

unread,

Mar 16, 2019, 7:35:05 AM3/16/19

to

Ah, maybe I should have said "seemd very luxurious to me!" It really
did, there's no question about that.

Those machines were for workstations, only bought by institutions or
the wealthy, but almost anyone could buy an Archimedes. The Arm was a
low-budget CPU done by a small company with very few resources, 25k
transistors. If you look at the die, the register array is a
substantial part of the area.

https://en.wikichip.org/wiki/acorn/microarchitectures/arm1

Andrew.

to

Stanford MIPS, Hennessy's project in 1981,
was 32-bit with 16 registers, no byte/half-word load & store,
with 1 branch delay slot (though they also evaluated 2 slots).

The commercial version MIPS R2000 in 1986 had 32 registers
and put in the byte/half-word signed & unsigned load & store
instructions having learned from porting difficulties
on Stanford MIPS.

Berkeley RISC (aka RISC-I), Patterson's project in 1981,
was 32-bit with 32 registers (with register window on r10:r31),
byte & half-word signed & unsigned load & store,
with 1 branch delay slot.

Anton Ertl

unread,

Mar 16, 2019, 12:57:30 PM3/16/19

to

Tom Gardner <spam...@blueyonder.co.uk> writes:
>Delayed branches are one way to avoid pipeline spills, but
>Wilson noted that many inner-loop branches were only a
>small distance. That meant the predicated instructions
>in conjunction with the barrel shifter were pretty effective
>at avoiding spills in many cases.

I assume you mean pipeline stalls. I don't understand what the role
of the barrel shifter is in this sentence.

Tom Gardner

unread,

Mar 16, 2019, 1:11:28 PM3/16/19

to

On 16/03/19 16:55, Anton Ertl wrote:
> Tom Gardner <spam...@blueyonder.co.uk> writes:
>> Delayed branches are one way to avoid pipeline spills, but
>> Wilson noted that many inner-loop branches were only a
>> small distance. That meant the predicated instructions
>> in conjunction with the barrel shifter were pretty effective
>> at avoiding spills in many cases.
>
> I assume you mean pipeline stalls.

Ahem. Yes.

> I don't understand what the role
> of the barrel shifter is in this sentence.

A reasonable question.

From 35yo (gulp) memory... Basically the designers were
taking knowledge of the coding practices they had found
in the 6502 machines. Predicated instructions helped,
barrel shifter helped, and the combination was often
even more effective. They published examples illustrating
that.

Now it is quite possible that their (ahem) "training
set" was limited and skewed. Remember this was being
done in the '83-'84 timeframe, and the RISC concepts
were nowhere as well documented nor understood as
they have become.

Anton Ertl

unread,

Mar 16, 2019, 1:16:20 PM3/16/19

to

paul wallich <p...@panix.com> writes:
>My impression is that the hardware market has stagnated for the most
>part -- machines from 3-5 years ago are "good enough" for most purposes,
>and the newer processors don't have features that make it easy for a new
>OS to be impossible to run on the old machines.

Sure, there should be no obstacle to running a current OS on an Athlon
64 from 2003, except maybe RAM (mine has 1GB).

>(In my experience as a purchaser/donee of superannuated boxes, the
>corporate upgrade cycle may now be partly driven by the lifespan of
>motherboard backup batteries. If you wait for failures before replacing
>them, you get not only the direct cost but also the risk of delaying
>some crucial project.)

My impression is that the CMOS batteries don't get discharged when the
machines are connected to the mains (as they usually are in corporate
environments). Maybe you get them with discharged batteries because
they waited for some time in some warehouse.

And when the boards are not connected, the discharge rate depends on
the individual board (probably on the leakage rate of the components
in the CMOS power circuit, which varies a lot). I have had one that
consumed a battery every 6 months on being off the mains, and another
that was in service for 15 years with a large portion of off-mains
time, and I don't remember having to replace the CMOS battery.

If CMOS battery discharge was an issue, the way to go would be to
replace them on a schedule. That's certainly less work for the IT
staff than to replace the box (including setup).

>> And maybe now that the performance increases have slowed down for
>> mobile phones, too, we will see longer cycles in that market, too.
>
>The cycles in that market are complicated by having multiple policy
>makers with different goals. OS updates may run on old hardware for
>quite a while, but if the carriers who effectively control (almost all)
>the phones refuse to push those updates, that's irrelevant. And the
>carriers currently want people to buy new hardware whenever the old
>hardware's contract runs out.

They want people to sign another two-year contract, and the carrot has
been to get a new mobile phone. Now that mobile phones don't get much
better with each generation anymore, I doubt that will work for much
longer. They may wish that the OS updates will be the new carrot, but
I have my doubts.

BGB

unread,

Mar 16, 2019, 1:17:21 PM3/16/19

to

On 3/16/2019 8:09 AM, Anton Ertl wrote:
> BGB <cr8...@hotmail.com> writes:
>> and people are occasionally still selling "new" PCs with hardware that
>> looks like something from a decade ago (Eg: "Core 2 Duo with 4GB of RAM
>> and a 1TB HDD" and similar), it is just "new" because they stuck Windows
>> 10 on it...
>
> Looking for it, I indeed found one system with that old components,
> but it's only from one dealer, and I doubt that such products are
> really relevant in the market.
>

I occasionally see stuff like this advertised on TV for local computer
stores and similar (in the rare times when it isn't commercials for
"pre-owned vehicle" dealerships or similar; the city in which I live is
kinda overrun with used-car lots).

Walmart PC's are also usually still pretty underwhelming for the prices,
though usually with newer hardware (albeit still low-end).

Parents bought one a few years ago, which came with an AMD APU; thing
was rather underpowered even for the sorts of things parents were using
it for (mostly web-browsing and Netflix).

>> It is possible OSes dropping support for older hardware would also face
>> a backlash from OEMs (and/or cause the lower end of the desktop PC
>> market to fall off).
>
> Unlikely. I think that, on the contrary, OEMs might prefer that
> Microsoft drops support for both old hardware and old Windows
> versions, because that's a reason for people to buy new hardware from
> OEMs. After all, it works for mobile phones.
>

I am mostly thinking of the sub $500 PC market.
This could then leave only the higher-end PCs if support for old/low-end
hardware were dropped.

Or, it could become like what the laptop market became, where pretty
much anything worth buying is fairly expensive, and most of what is
available (under ~ $1k) is glorified netbooks and "ultrabooks" and
similar; vs the older style "more useful but bulkier" laptops, which are
generally above $1k.

> OTOH, Microsoft's enterprise customers want stability, so that's why
> we see longer support periods for Windows than for mobile phones.
>
> And maybe now that the performance increases have slowed down for
> mobile phones, too, we will see longer cycles in that market, too.
>

It could be better. Bigger problem I have found is that one can't
upgrade the OS to newer versions or similar, and the Android development
tools fairly quickly drop support for prior OS versions.

So, not got into mobile development for the main reason that it seems
like too much cost/effort to try to keep up with what the SDK / NDK
still supports (more so than anything to do with the hardware itself).

Robert Wessel

unread,

Mar 16, 2019, 1:59:33 PM3/16/19

to

On Fri, 15 Mar 2019 13:36:05 -0700 (PDT), MitchAlsup
<Mitch...@aol.com> wrote:

>
>I question this. ARM followed:: Berkeley SPARC, Stanford MIPS, Sun SPARC,
>MIPS MIPS, Mc 88K, and probably HP Snakes: all having 32 registers, more
>than one implemented in gate arrays. The only other RISC with 16 registers
>was Clipper.

If memory serves, the 801 had 16 GPRs.

MitchAlsup

unread,

Mar 16, 2019, 2:03:56 PM3/16/19

to

On Saturday, March 16, 2019 at 10:22:56 AM UTC-5, paul wallich wrote:
> On 3/16/19 9:09 AM, Anton Ertl wrote:
> > BGB <cr8...@hotmail.com> writes:
> >> and people are occasionally still selling "new" PCs with hardware that
> >> looks like something from a decade ago (Eg: "Core 2 Duo with 4GB of RAM
> >> and a 1TB HDD" and similar), it is just "new" because they stuck Windows
> >> 10 on it...
> >
> > Looking for it, I indeed found one system with that old components,
> > but it's only from one dealer, and I doubt that such products are
> > really relevant in the market.
> >
> >> It is possible OSes dropping support for older hardware would also face
> >> a backlash from OEMs (and/or cause the lower end of the desktop PC
> >> market to fall off).
> >
> > Unlikely. I think that, on the contrary, OEMs might prefer that
> > Microsoft drops support for both old hardware and old Windows
> > versions, because that's a reason for people to buy new hardware from
> > OEMs. After all, it works for mobile phones.
> >
> > OTOH, Microsoft's enterprise customers want stability, so that's why
> > we see longer support periods for Windows than for mobile phones.
>
> My impression is that the hardware market has stagnated for the most
> part -- machines from 3-5 years ago are "good enough" for most purposes,
> and the newer processors don't have features that make it easy for a new
> OS to be impossible to run on the old machines.

I am currently using a machine from 2011 and it is still fast enough and big
enough for the things I need to do. I do open up the door and blow out all
the dust once a year to help with longevity. It was a $650 machine when
purchased.

>
> (In my experience as a purchaser/donee of superannuated boxes, the
> corporate upgrade cycle may now be partly driven by the lifespan of
> motherboard backup batteries. If you wait for failures before replacing
> them, you get not only the direct cost but also the risk of delaying
> some crucial project.)

At Samsung, they used laptops for everything (except compiler development).
Our group had a budget for laptops that was insufficient compared to the
failure rate of the laptops. Leading some of us waiting for weeks before
a replacement would be AUTHORIZED ! Imagine the cost effectivity of a high
$$$ engineer not being able to work for more than a week.

>
> > And maybe now that the performance increases have slowed down for
> > mobile phones, too, we will see longer cycles in that market, too.

I am still happy with my Galaxy 3 (yes 3) 6.5 years after purchase.
The screen still has no cracks or scratches.

already...@yahoo.com

unread,

Mar 16, 2019, 2:42:27 PM3/16/19

to

On Saturday, March 16, 2019 at 1:58:12 AM UTC+2, MitchAlsup wrote:
> On Friday, March 15, 2019 at 4:20:31 PM UTC-5, paul wallich wrote:
> > On 3/15/19 4:36 PM, MitchAlsup wrote:
> > > On Friday, March 15, 2019 at 1:51:13 PM UTC-5, a...@littlepinkcloud.invalid wrote:
> > >> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> > [...]
> > >>> And, guess what, after 32-bit ARM with 16 registers, they designed
> > >>> Aarch64 with 32 registers. So whatever the reason, they obviously
> > >>> don't agree with the claim that 16 registers are enough.
> > >>
> > >> Indeed not, but the original Arm was very much constrained by the
> > >> technology of the day. 16 registers seemed luxurious at the time; not
> > >> so much now.
> > >
> > > I question this. ARM followed:: Berkeley SPARC, Stanford MIPS, Sun SPARC,
> > > MIPS MIPS, Mc 88K, and probably HP Snakes: all having 32 registers, more
> > > than one implemented in gate arrays. The only other RISC with 16 registers
> > > was Clipper.
> >
> > I think that there's a distinction here because of the market segments.
> > the original Acorn RISC Machine was intended for PC use, originally home
> > computers and reportedly had all of 25K transistors.
>
> In 1982 the 68020 was announced and it contained 190,000 transistors.

That tells more about Motorala's custom to announce early, than about anything else.
'020 was released in 1984, but the first computer based on it that shipped in significant volumes (Sun-3 and HP 9000/320) didn't appear until 2nd half of 1985.

> This was before any of the RISC machines got outside of academia.
> Think Mac II.

1987

paul wallich

unread,

Mar 16, 2019, 9:53:39 PM3/16/19

to

On 3/16/19 12:57 PM, Anton Ertl wrote:
[getting offtopic, except insofar as design for RAS]

> If CMOS battery discharge was an issue, the way to go would be to
> replace them on a schedule. That's certainly less work for the IT
> staff than to replace the box (including setup).

Around here, corporate IT staff have imaging tools that make setting up
a new box pretty much painless. In comparison, properly replacing the
battery in a modern small form-factor PC involves disassembling much of
the machine, including parts that may be bonded by heatsink compound. So
replacing the whole cohort of machines before they start climbing up the
far side of the failure curve is apparently an attractive option. (The
companies get a nice tax deduction as well if they donate those
machines. In the year since I've been taking care of such donated
machines about half have needed fixes.)

John Levine

unread,

Mar 17, 2019, 5:58:47 PM3/17/19

to

In article <f9eq8e1mbdqqc8vrh...@4ax.com>,

Robert Wessel <robert...@yahoo.com> wrote:

>If memory serves, the 801 had 16 GPRs.

It did. For some reason they were 24 bits, not 32.

By the time the 801 evolved into the ROMP used in the RT PC, the
registers grew to 32 bits and the instruction set added halfword
memory accesses.

It is my impression that reduced instruction set designs sometimes add
more registers, and invariably reintroduce back partial word loads and
stores once they realize how much time they're spending shifting and
masking to fake byte accesses.

--
Regards,
John Levine, jo...@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

MitchAlsup

unread,

Mar 17, 2019, 8:09:03 PM3/17/19

to

On Sunday, March 17, 2019 at 4:58:47 PM UTC-5, John Levine wrote:
> In article <f9eq8e1mbdqqc8vrh...@4ax.com>,
> Robert Wessel <robert...@yahoo.com> wrote:
>
> >If memory serves, the 801 had 16 GPRs.
>
> It did. For some reason they were 24 bits, not 32.
>
> By the time the 801 evolved into the ROMP used in the RT PC, the
> registers grew to 32 bits and the instruction set added halfword
> memory accesses.
>
> It is my impression that reduced instruction set designs sometimes add
> more registers, and invariably reintroduce back partial word loads and
> stores once they realize how much time they're spending shifting and
> masking to fake byte accesses.

In addition, most of us no believe that a misaligned Ld/St model is best.
It gets rid of so many things compilers are not good at, at a small cost in HW.

John Levine

unread,

Mar 17, 2019, 8:54:46 PM3/17/19

to

In article <109d0ec4-a301-4d70...@googlegroups.com>,

MitchAlsup <Mitch...@aol.com> wrote:
>> It is my impression that reduced instruction set designs sometimes add
>> more registers, and invariably reintroduce back partial word loads and
>> stores once they realize how much time they're spending shifting and
>> masking to fake byte accesses.
>

>In addition, most of us now believe that a misaligned Ld/St model is best.

>It gets rid of so many things compilers are not good at, at a small cost in HW.

Quite right, should have mentioned that.

Tasks that need predictable kinds of shifting and masking work better
in hardware, since fixed shifting and masking is really easy to do in
hardware.

EricP

unread,

Mar 17, 2019, 10:21:33 PM3/17/19

to

John Levine wrote:
> In article <109d0ec4-a301-4d70...@googlegroups.com>,
> MitchAlsup <Mitch...@aol.com> wrote:
>>> It is my impression that reduced instruction set designs sometimes add
>>> more registers, and invariably reintroduce back partial word loads and
>>> stores once they realize how much time they're spending shifting and
>>> masking to fake byte accesses.
>> In addition, most of us now believe that a misaligned Ld/St model is best.
>> It gets rid of so many things compilers are not good at, at a small cost in HW.
>
> Quite right, should have mentioned that.
>
> Tasks that need predictable kinds of shifting and masking work better
> in hardware, since fixed shifting and masking is really easy to do in
> hardware.

In the case of Alpha they added load unsigned byte and word, and stores.
If you wanted to load a signed byte or word it required an extra
SEXTB or SEXTW instruction to explicitly sign extend it.

LDL loaded 32 bits sign extend to 64 bits, so to load
an unsigned 32-bits required a ZAP to zero the upper bytes.

EricP

unread,

Mar 18, 2019, 2:44:11 AM3/18/19

to

And I always wondered why Alpha did this,
not having symmetric signed and unsigned loads, as others had.

I can see that sign extending a byte/word/long to 64 bits
means the sign bit drives up to 56 loads plus a bunch
of muxes to select the appropriate version,
but they already had LDL sign extend 32->64 so adding
byte and word doesn't seem that much worse.
And 3 more instruction decodes: LDBS, LDWS, LDLU to combine
load with the sign/zero extend doesn't strike me as much.

To this casual external observer it looks overly RISC philosophical.

Terje Mathisen

unread,

Mar 18, 2019, 4:10:07 AM3/18/19

to

MitchAlsup wrote:
> On Sunday, March 17, 2019 at 4:58:47 PM UTC-5, John Levine wrote:
>> It is my impression that reduced instruction set designs sometimes add
>> more registers, and invariably reintroduce back partial word loads and
>> stores once they realize how much time they're spending shifting and
>> masking to fake byte accesses.
>
> In addition, most of us no believe that a misaligned Ld/St model is best.
> It gets rid of so many things compilers are not good at, at a small cost in HW.

At the very minimum, you have to support naturally aligned element
load/store for vectors, i.e. a 128/256/512-bit line of 2/4/8 doubles
must be loadable from any double-aligned address, otherwise you cannot
expect compilers to auto-vectorize.

This approach very quickly (via float and 16-bit to arrays of byte) end
up needing arbitrary alignment for vector load/store, at which point
doing the same for regular load/store becomes a no-brainer.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Bruce Hoult

unread,

Mar 18, 2019, 6:21:13 AM3/18/19

to

On Saturday, March 16, 2019 at 11:03:56 AM UTC-7, MitchAlsup wrote:
> I am currently using a machine from 2011 and it is still fast enough and big
> enough for the things I need to do. I do open up the door and blow out all
> the dust once a year to help with longevity. It was a $650 machine when
> purchased.

Both my big (17" quad core MBP) laptop and travel (11" dual core Air) laptop are i7's from 2011, and fine for most uses. The big one turbos to 3.3 GHz and the small one to 2.9.

I do however like a bit more performance than that for compiling compilers. MHz hasn't increased all that much since then, but IPC has. So I currently do most of my work on a NUC from about this time last year with a 15W CPU that turbos to 4.2 GHz single core and does around 3.4 all cores for a while before slowly throttling back to 2.8. But if I'm doing a serious amount of recompiling then I kick things up to a remote 16 or 32 core machine.

> At Samsung, they used laptops for everything (except compiler development).
> Our group had a budget for laptops that was insufficient compared to the
> failure rate of the laptops. Leading some of us waiting for weeks before
> a replacement would be AUTHORIZED ! Imagine the cost effectivity of a high
> $$$ engineer not being able to work for more than a week.

I was pretty surprised that San Jose uses laptops. In Samsung Moscow I was given an i7-3770 tower when I started in last 2014, and got upgraded to an i7-6700 tower in early 2017. A bit slow considering I had a 6700K machine at home a year earlier. In October 2017 I persuaded my boss that certain things with SGPU compiler development would go a lot faster with more cores and got an 18 core i9-9980XE approved. It finally got delivered December 30, just in time for me to go away for a 5 week vacation -- and then two weeks after getting back I resigned to join a RISC-V startup. Sad, because that thing is a total beast -- it kills an i7 with the same MHz even on 1 to 4 thread tasks because it has so much more cache.

MitchAlsup

unread,

Mar 18, 2019, 11:37:19 AM3/18/19

to

On Monday, March 18, 2019 at 1:44:11 AM UTC-5, EricP wrote:
> EricP wrote:
> > John Levine wrote:
> >> In article <109d0ec4-a301-4d70...@googlegroups.com>,

unread,

Mar 19, 2019, 2:31:43 PM3/19/19

to

On 3/18/2019 6:29 PM, Bruce Hoult wrote:
> On Monday, March 18, 2019 at 12:48:20 PM UTC-7, BGB wrote:
>> I can't compare exactly. The fastest Intel HW I have is still some older
>> rack servers with dual Xeon E5410's. I got them mostly because they were
>> getting thrown out.
>
> Oh, wow, that's pretty old. 2.33 GHz, pre Nehalem. At least eight real cores though.
>

Yeah. The thing was generally a little bit slower than my FX for
single-threaded tasks, but under multi-threaded workloads there was no
real per-thread slowdown, so it could actually go pretty fast at this.

With my FX, much past two or three threads, the per-thread performance
would start to drop off quickly, limiting its max performance.

So, with 8 threads, the E5410's had a pretty massive speed advantage.

With my Ryzen, a single thread is ~ 3x as fast as before, but something
weird is going on with scheduling that I have not figured out:
One thread is pretty fast;
Two threads is slightly faster;
Beyond two threads is detrimental (within a single process).

However, the process remains limited to ~8% CPU load, regardless of the
number of threads created. Usually only one of the cores is active in
Resource Monitor.

If I fire up two processes, each runs at 8%, and each runs at roughly
the full speed of the first process. Each process runs on a different core.

I have yet to figure out what is going on with this, nor a workaround,
but for now I have been living with it, as two threads (at 8% load) on
my Ryzen is faster than running my FX at 100% CPU load...

I haven't found anything that mentions this behavior, nor a cause nor a
workaround (at least, short of spawning multiple processes).

Digging around in the Windows API, it appears:
The process affinity is already 0xFFFF (all processors);
There is only a single Processor Group, which contains all 16 cores;
Thread affinity is already set to use all cores;
...

So, at least according to the Win32 API, a process should already be
able to max out the CPU.

So, something curious is going on, that I can't quite seem to figure
out. In any case, I suspect it is a software issue (probably some
"clever" OS feature designed to try to keep programs from hogging the
CPU or similar).

>
>> I had thought NUC was basically Intel's equivalent of a RasPi (just more
>> expensive)?...
>
> The first NUCs were pretty low end ... atom I think ... but I don't think they were ever quite as slow as a Pi :-)
>

When I first saw them, it was people talking about them as a sort of
Intel equivalent to the RasPi.

There was also some complaint that the GPIO was using a somewhat
non-standard 1.0 volt GPIO interface, which made it problematic to
interface with most peripheral hardware (though, some other info online
says it uses a more common 3.3v signaling, and some other people saying
that it doesn't have any GPIO pins).

Either way, it appears it has a far less GPIO than the 40-pin header on
the RasPi.

Seems to be board specific, but the headers only have 6 pins, usable
either as GPIO, or as I2C and two PWM outputs.

So, yeah, still wouldn't be terribly useful for the kinds of things I am
doing with using RasPi boards (which generally make much more extensive
use of GPIO).

Though, I guess one could use shift-registers to multiplex the IO pins
or similar.

> My NUC has a Core i7-8650U. 8th gen, quad core, hyperthreading, up to 4.2 GHz, 32 GB DDR4-2400, 500 GB Samsung Pro M.2 PCIe SSD (reused from the 6700K tower I left behind in Moscow when I moved). In April last year when I bought it, it was faster than any laptop. A few months later, laptop manufacturers started using the same generation of CPUs, though most top out with the i7-8550U (100 MHz slower). In practice I find the NUC is significantly faster because today's "thin and light" laptops thermal throttle much more and much sooner than the NUC.
>
> There are a couple of NUC models with even beefier processors, but they are also something like twice as large.
>
> https://pbs.twimg.com/media/DlfqLMJUUAE5quy.jpg
>
> It's pretty good for all the travel and living in different places I've been doing in the last year, as long as I have a decent monitor available in each place. (I like 32" 4K monitors)
>

OK.

I have a desktop PC in my case.

General description:
Ryzen 2700X 3.7 GHz
32GB DDR4 2933
4 HDDs, 1 SSD, and a DVD-RW drive
750W PSU
GeForce GTX 970
...

Case is a decent sized full-tower, PC weighing approx 60 lbs.

Annoyingly, the 500VA UPS I have (with a relatively new battery) can't
keep this thing running during a blackout (dumps power very quickly). I
think I need a bigger UPS.

>
>> But, meh, my efforts are sort of a hobby project that has managed to eat
>> a pretty absurd amount of time over the past few years (and has drifted
>> pretty far from how it started out).
>
> Designing a complete instruction set and hardware implementation of it is a BIG task -- almost ridiculous for just one person to do. It's very impressive how much you get done.
>

Yeah.

I have much of my life obsessing on computers and similar as experience.

Decided to omit a personal history, but I had been messing around a bit
with compiler and VM technology over the course of my life (among other
things), and over the past several years, have started branching out
into ISA design.

Though, in some sense, ISA design is sort of like VM design, just with
much more restrictive design constraints (more so when trying to design
something which is reasonably usable both as a VM and as a CPU ISA).

A lot of my past VM designs would be unsuitable for implementation in
hardware; and many hardware ISA's have features which are rather
problematic for use as a VMs.

For example, x86 is difficult to decode, and both x86 and ARM make heavy
use of ALU condition-codes which are difficult to emulate efficiently, ...

>
>> The other parts of my life are:
>> Working IRL in making various parts out of wood and metal and similar;
>> Hoping (in some vain sense), that someone might eventually actually hire
>> me for a job, but no luck here (I otherwise still live with parents,
>> never having managed to achieve any real sense of independence, ...).
>>
>> Granted, I am autistic, so maybe expectations are lower; at least when
>> not being judged by people for being useless, and being dismissed by HR
>> people and staffing agencies as "unemployable", ...
>
> This is awful!
>
> You've clearly got very useful skills in machine code programming, CPU architecture, Verilog for goodness sake. I would say you are easily employable by a company doing that kind of thing -- and there are many opportunities to work remotely at your own pace on a fixed price contract (or just monthly rate with progress evaluated from time to time).
>
> The hardest thing might be maintaining interest in someone else's project.
>

I had looked at a few freelance websites, but was pretty discouraged by
most of what I saw:
People wanting huge/elaborate things, paying little or nothing;
People wanting things which were legally dubious or illegal;
...

Most of the rest were things that didn't really align with my skill-set
(eg: web development stuff). Pretty much everything I can find locally
(for 'developer' jobs) is looking for web developers. I have also
determined that they don't want to hire me either.

Had also tried getting jobs as a machine operator or doing industrial
maintenance/repair (pretty big locally, *1), but they don't want to hire
me for these sorts of things either.

*1: Main local industries:
Oilfield stuff (drilling, refinery stuff, ...);
Mining related stuff (coal, etc);
Manufacturing industries (paper products, aviation parts, ...);
Farming industry (corn fields);
Used car lots (so many used car lots);
...

> Maybe you should contact me privately.
>

Maybe, will see.

>> Well, and similarly no GF, because (in the view of females), ones' value
>> as a person is dependent primarily on having a job and money and other
>> related things (car, house, ...). As much as people say otherwise,
>> having gone through enough interactions in these areas, this seems to be
>> a pretty major determining factor, and otherwise I am pretty much seen
>> as being useless.
>
> Ahhh .. being a young guy sucks, or at least it sucked for me. Young women get all the attention they want and more, but not guys unless they're captain of the football team or something. Things started to turn around for me once I hit 30, and especially 35 to 45 were great years. Living in Moscow from 52 to 55 was the best time I've ever had.
>

I am 35 now. Not really young anymore either...

At my age, I mostly just get looked down on and judged, sort of idea is
that anyone who was actually worth anything would have become
financially successful already, vs just siting around at home
(occasionally applying to jobs but pretty much universally not getting
any response).

Local area is also part of what is called the "Bible Belt", so some
amount of people who are anti-alcohol and very fussy about "King James
Only" and similar. Have determined that I can't really interact
meaningfully with them. I am non-denominational and my views are
generally a bit more lax. My views are also more that individuals are
responsible for their own morality, or lack thereof, rather than it
being good or right to cast shame and judgement on those who don't live
up to ones own standards.

Similarly, can't really get along with people who are overly
liberal/"progressive" either, as their ideology tends to run counter to
my own, and they seem to actively despise anyone who doesn't buy into
their particularly ideology (eg, "doesn't buy into their
ideology"="fascist" and similar).

But, no, I am not exactly a huge fan of DJT either, I am more in the
stance of "well, maybe he was the lesser of two evils", a lot of them
saw HRC as an almost messiah-like figure, and anyone who didn't vote for
HRC as an enemy. My dad was a pretty big supporter of DJT though, and he
would have been pretty upset had I voted otherwise. Will not exactly
mourn DJT leaving office though.

I suspect though that both fundamentalists and liberals seemingly live
in a reality which is somehow pretty different from the one in which I live.

Decided to leave out going into the whole topic of morals/ethics based
on systems of rules and good/bad via accumulation (both fundamentalists
and progressives often seem inclined towards this; differing more in
their particular systems of rules, than in terms of whether or not they
try to follow them and/or impose them on others). Both groups also tend
to have a bit of an "us versus them" thing going on, ...

Contrast is "just trying to do whatever brings the most net benefit for
the parties involved" or similar.

This is a hairy issue, and often a point of conflict.
The end results aren't terribly dramatic though.

Similarly, maximizing net cost/benefit generally still seems to lean in
a relatively conservative direction. (Subject to possible debate over
specifics).

And, as noted, there is also the whole job/money/house/car thing.

...

It was bad enough that I mostly gave up talking to anyone...

The few times I tried talking to anyone IRL in the past few years,
things didn't go well.