
primitives vs cleverness vs readability


DavidM

Aug 3, 2008, 6:45:30 PM

As we all know, VMs, regardless of the language, can impose huge
run-time performance penalties compared to native code in compiled
languages.

An STC-based Forth with optimal hand-coded assembler primitives can evade
this cost to a large degree, so I won't be talking about that here. I'm
thinking more of DTC-, ITC- and TTC-based Forths.

What I am interested in is: when is it better to sweat over the coding of
forth words, squeezing every last ounce of speed out of them at the very
likely cost of readability/maintainability, as opposed to just taking the
time critical stuff and coding it as C and/or assembler primitives?

Cheers
Dave

roger...@gmail.com

Aug 3, 2008, 8:27:24 PM

I'd say ... you shouldn't sweat over Forth code at all. Do what's
necessary and no more. You want to finish your application (if you're
writing one), not torture yourself.

I'd choose readability and ease of coding over speed. At the same
time I like speed too so I use fast native external libraries and an
STC Forth.

Many times I've read about determining the most often executed
routines and optimizing those into assembly. I have a handful in my
project that are candidates for this, but I'm waiting until I need
more speed, not anticipating it. Or when I'm bored and just want to
code SOMETHING. Which I've done, but I usually feel an odd sense of
having wasted my time after.

So it goes, a lot of people say not to optimize prematurely. Oops. :)

On the bright side, algorithm redesign can speed things up. I've
reduced routines to 25% of their original size just through a redesign
of an algorithm that was meant to simplify things, and I wasn't even
looking for speed.

Hope that helps.

Roger

Jonah Thomas

Aug 3, 2008, 10:41:10 PM

DavidM <nos...@nowhere.com> wrote:

> What I am interested in is: when is it better to sweat over the coding
> of forth words, squeezing every last ounce of speed out of them at the
> very likely cost of readability/maintainability, as opposed to just
> taking the time critical stuff and coding it as C and/or assembler
> primitives?

If your code is already fast enough running on your particular Forth on
your particular hardware, then you don't need to do either one, you're
done.

So, you have your Forth code that works but it isn't fast enough. Step
back and notice whether you see some other method that would run
faster. You can try out new methods faster in Forth. See which one looks
like it's actually doing the least work. If you find a better algorithm
and it's fast enough, then you're done.

OK, so your best method is still too slow. Look at it carefully for ways
to speed it up with better Forth. Don't do things that make it
unreadable. They aren't worth it. Something will go wrong and it will be
extra trouble to fix it. But you might as well look in case you've
missed something that would speed it up. You can do some profiling to
see where the slow stuff is, don't spend much effort on the stuff that
can't help. If you see something that lets you speed up a critical inner
loop, maybe it will be fast enough. If so then you're done.
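
(A crude way to get those numbers, sketched in standard Forth: wrap
the word under test in a timer. UTIME is an assumption here -- it is
Gforth's microsecond clock, returning a double; substitute whatever
your system provides.)

\ Sketch: time one execution of XT.  UTIME ( -- ud ) is assumed to
\ return a microsecond timestamp as a double.
: ELAPSED ( xt -- )
   utime 2>r  execute  utime 2r> d-  d. ." us" cr ;

\ usage:  ' MY-HOT-WORD ELAPSED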

If it's still too slow then by this time you know where the slow spots
are. Look for something that it pays to do in C or assembler. If you're
already using a good Forth optimiser that produces native code, you
might only expect to speed it up 3-4 times. Maybe less, depending. If
you need more speedup than you have any right to hope for, now is the
time to either go back and look again for a better algorithm, or else
look for faster hardware. Or you could try doing it in C or assembly
just in case. If it's fast enough at this point then you're done.

If you've already coded one bottleneck and it didn't help enough, you
probably have some idea how much speedup you can hope for from the
second bottleneck. Guess whether you can make it fast enough by assembly
coding. If it doesn't look plausible, your best choices are to find a
faster algorithm or get faster hardware. Look at the things that gobble
up the time and imagine ways to get your result without doing them.
Like, one time the slow part was to compute successive integer square
roots inside an inner loop. The solution was to not compute square
roots, but instead compute successive squares. The square root stays the
same for x iterations until you get to the new square. And if you know n
and n^2, then (n+1)^2 = n^2 + 2n + 1. Very fast.
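
(The same trick in a minimal standard-Forth sketch -- the word name
and stack layout are mine, not the original program's. The loop keeps
the current root and the next square on the stack, so each iteration
costs a comparison and, occasionally, a few additions.)

\ Sketch: integer square roots of 0..limit-1 with no square-root
\ computation in the loop, using (n+1)^2 = n^2 + 2n + 1.
: ISQRT-SCAN ( limit -- )
   0 1  rot 0 ?do              \ root=0, next-square=1
      dup i = if               \ has I reached (root+1)^2 ?
         swap 1+ swap          \ root <- root+1
         over 2* 1+ +          \ next-square <- next-square + 2*root+1
      then
      \ here the second stack item is the integer square root of I
   loop  2drop ;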

Making your code unreadable is a mug's game. You lose the advantages of
Forth for a moderate speed improvement.

Rewriting your code in C or assembler can be a mug's game. You lose the
advantages of Forth for a moderate speed improvement. Do it if you need
to, and if you think it will work well enough.

Forth is good for prototyping alternate methods. That's worth a try. The
problem with that approach is that you can't tell ahead of time what
you'll find, and there's a chance you'll put significant time into it and
not find anything. So alternate with other approaches. If you have a
manager who pays close attention to how you use your time, coding things
in C will look like a valid exercise. So when you spend half your time
doing that, at worst it will look like you're half as fast coding in C
as you really are. If you spend all your time looking for better
methods and you don't find them, then it might appear that you've just
been goofing off.

Writing code that's at the very edge of what your processor can do is
also a mug's game. You put a lot of effort into getting things barely
fast enough, and pretty soon the specs will change and demand more. The
first time you have to be real smart to get your code fast enough is a
big warning that you need a faster processor. They can pay you the big
bucks to do smarter and smarter tricks, and then they'll have to switch
to a faster processor anyway. All the time you spent writing optimised
assembly code for the old processor is wasted. (But the C code can be
salvaged provided the C compiler for the new chip is 100% compatible
with that for the old one.) The Forth code ought to run but the speed
tradeoffs may be different. If you spent a lot of effort writing just
exactly the sequence of Forth that was fastest, and now SWAP is seven
times as fast as it used to be but ROT is only twice as fast, any effort
you spent on fast stack juggling is wasted -- even if speed is still an
issue on the new processor.


[minor rant on] Forth is good for making code size small. You can put
some effort into byte code, you can compress source code and decompress
it a line at a time and interpret it, there are lots of ways to make
your code very small if you don't need it to be fast. It can be pretty
cheap to produce small code.

Forth is one of the best scripting languages for making code fast. Good
Forth programmers can produce relatively fast code cheaper than faster
code written in C.

Forth is one of the best choices for making code that's fast and small
both. But it isn't cheap to do that. This is not a good market niche for
Forth, even though it's a niche that Forth is good for. If somebody is
using an obsolescent processor that needs to do more than it can do, if
they want to cram more functionality in limited space and limited speed
than anybody can reasonably expect, maybe you can do it for them with
Forth. Then they bite the bullet and switch processors, and their costs
for getting you to do all that great stuff must be completely amortized
right then. Very likely they'll decide it wasn't worth it, they should
have switched earlier. Next time, or the time after, they do switch
earlier. You get bragging rights for doing this superhuman work but
repeat sales don't happen as often as you'd like.

I think it would be much better to develop the reputation for delivering
more than expected, if you can do that. If you can give a competitive
bid, and then meet the specs long before deadline, and ask "What else
would you like us to do?".... Getting the obsolete processor to deliver
a little bit longer is not quite beating a dead horse. Delivering code
that's small and reasonably fast and *correct*, before deadline, and
then offering something extra -- there ought to be a big market for
that. If you can deliver.
[rant end]

Elizabeth D Rather

Aug 3, 2008, 11:09:14 PM

A story I've told here before is applicable (sorry if you've already
heard it): FORTH, Inc. was asked to recode a baggage handling system
for American Airlines. The original program was all assembler, and too
expensive to maintain. We were required to reproduce the user interface
and basic bag handling procedures, but could do whatever else seemed
appropriate. Our program was written entirely in polyFORTH (ITC),
running native on an LSI-11 (yeah, it was quite a few years ago). When
it was sufficiently complete to run some timing tests, everyone was
astonished: our system could handle 25% more bags/minute than the
previous one. polyFORTH was obviously not faster than pure assembler;
the point was that our internal design was far more efficient than its
predecessor.

The overall design of an application is a much stronger determinant of
performance than language (any language). Sweating bullets over
language benchmarks is a largely meaningless exercise.

As others have said, the important thing is to get your program running
in the most straightforward way possible. In designing your program, be
cognizant of the potentially time-critical parts, and try to come up
with a clean design implemented in clean, readable code. When your
program is running correctly, you can do timing studies and it should be
clear what sections, if any, need some kind of optimization.

Modern Forths running on modern hardware are fast enough for the vast
majority of applications. It's not worth sweating until you have a
working program and can establish that you have a timing problem (and
where it is). Then you can focus on that bottleneck.

Cheers,
Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

Jerry Avins

Aug 3, 2008, 11:28:45 PM

Elizabeth D Rather wrote:

...

> A story I've told here before is applicable (sorry if you've already
> heard it): FORTH, Inc. was asked to recode a baggage handling system
> for American Airlines. The original program was all assembler, and too
> expensive to maintain. We were required to reproduce the user interface
> and basic bag handling procedures, but could do whatever else seemed
> appropriate. Our program was written entirely in polyFORTH (ITC),
> running native on an LSI-11 (yeah, it was quite a few years ago). When
> it was sufficiently complete to run some timing tests, everyone was
> astonished: our system could handle 25% more bags/minute than the
> previous one. polyFORTH was obviously not faster than pure assembler;
> the point was that our internal design was far more efficient than its
> predecessor.

Which code was running last Wednesday at JFK? What a mess!
http://news.yahoo.com/s/nm/20080730/ts_nm/amr_jfk_dc

Jerry
--
Engineering is the art of making what you want from things you can get.
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

John Passaniti

Aug 4, 2008, 3:11:56 AM

On Aug 3, 6:45 pm, DavidM <nos...@nowhere.com> wrote:
> As we all know, VMs, regardless of the language, can impose huge
> run-time performance penalties compared to native code in compiled
> languages.

The key word is "can." The slowest VM can still outperform the
fastest native code if the algorithms used are superior. What matters
is the performance of the system as a whole, not the VM. That's a
trap that you see endlessly here in comp.lang.forth-- a preoccupation
with speed. But not speed of some concrete real-world application,
but the speed of some small primitive in the system. The theory, I
guess, is that by focusing on speeding up all the primitives, the
performance of the overall system is improved. To which I say,
nonsense-- if I choose a superior algorithm, that is going to give me
a far better pay-off in terms of performance than if I manage to
reduce a routine I hardly ever call by a few cycles.

> What I am interested in is: when is it better to sweat over the coding of
> forth words, squeezing every last ounce of speed out of them at the very
> likely cost of readability/maintainability, as opposed to just taking the
> time critical stuff and coding it as C and/or assembler primitives?

The first step is to consider different algorithms and data
structures. And here, simpler isn't always better. For example,
depending on the size of what you're searching, a simple linear search
(O(n)) will be slower than a more complex binary search (O(log2 n)).
And a more complex binary search is likely to be slower than an even
more complex hashing algorithm (typically O(1)). On the other hand,
if the size of what you're searching is small, then a linear search
may be fastest. It requires thought and sometimes experiment to know
how to choose algorithms and data structures.
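
(To make that concrete, here is a sketch of my own in standard Forth,
using the LOCALS| extension -- it is not from any particular system.
A binary search over a sorted array of cells does O(log n) comparisons
where a linear scan does O(n).)

\ Sketch: index of KEY in a sorted array of U cells, or -1 if absent.
: BSEARCH ( key addr u -- index )
   locals| u addr key |
   0  u 1-                          ( lo hi )
   begin 2dup <= while
      2dup + 2/                     ( lo hi mid )
      dup cells addr + @            ( lo hi mid a[mid] )
      dup key = if  drop nip nip exit then
      key < if  1+ rot drop swap    \ a[mid] < key: lo = mid+1
      else      1- nip              \ a[mid] > key: hi = mid-1
      then
   repeat  2drop -1 ;

For a handful of cells the linear scan can still win on constant
factors, which is why it pays to measure rather than guess.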

If you're talking to hardware that has strict real-time performance
requirements that can't be met in Forth, then that is one time when
coding in a more primitive language makes sense. If your processing
is not keeping up with input data, then coding core routines in a more
primitive language makes sense.

But before one starts down that road, they need to *measure*. That
may mean running a profiler to identify where to focus efforts. That
may mean getting out your oscilloscope or logic analyzer and
establishing a timing baseline. That may mean instrumenting the VM to
collect statistics on things like counts of certain instructions or
how much time some instructions take up. The point is that you need
to have some objective metric by which you can not only verify that
your efforts have paid off, but to understand the run-time behavior of
the system. You need that because your intuition can be wrong. You
need that because your experience can blind you to what is in front of
you.
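
(One cheap source of such a metric, as a sketch of my own -- the word
names are invented: shadow a suspect word with a counting version.
It is crude, since only definitions compiled after the shadow are
counted, but it is objective.)

variable #stars-calls
: stars ( n -- ) 0 ?do [char] * emit loop ;   \ the suspect word
: stars ( n -- ) 1 #stars-calls +!  stars ;   \ shadow: count, then
                                              \ run the previous STARS
\ after a run:  #stars-calls @ .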

Elizabeth D Rather

Aug 4, 2008, 8:59:04 AM

Jerry Avins wrote:
> Elizabeth D Rather wrote:
>
> ...
>
>> A story I've told here before is applicable (sorry if you've already
>> heard it): FORTH, Inc. was asked to recode a baggage handling system
>> for American Airlines. The original program was all assembler, and
>> too expensive to maintain. We were required to reproduce the user
>> interface and basic bag handling procedures, but could do whatever
>> else seemed appropriate. Our program was written entirely in
>> polyFORTH (ITC), running native on an LSI-11 (yeah, it was quite a few
>> years ago). When it was sufficiently complete to run some timing
>> tests, everyone was astonished: our system could handle 25% more
>> bags/minute than the previous one. polyFORTH was obviously not faster
>> than pure assembler; the point was that our internal design was far
>> more efficient than its predecessor.
>
> Which code was running last Wednesday at JFK? What a mess!
> http://news.yahoo.com/s/nm/20080730/ts_nm/amr_jfk_dc
>
> Jerry

Hah, thank you for that! No, our system was at LAX. It operated for
about 10 years before AA corporate decided to standardize on a turnkey
system provided by a company "specializing in baggage handling systems".

ISTR there was a similar snafu when the new Denver terminal opened.

Thomas Pornin

Aug 4, 2008, 9:23:00 AM

According to DavidM <nos...@nowhere.com>:

> As we all know, VMs, regardless of the language, can impose huge
> run-time performance penalties compared to native code in compiled
> languages.

It can but it is not necessarily doomed to. A VM is a "virtual machine":
it emulates a hardware system which does not actually exist, but for
which the code was written. The VM thus converts the code for the
virtual hardware into something that the real processor can process,
through a mixture of static (translation before execution) and dynamic
(during execution) operations. For instance, if using an ITC-based
Forth, then the static part is the conversion of word names (which are
strings) into addresses (pointers to CFAs) cunningly laid out for proper
execution. The dynamic part entails the double-indirections and jumps
(the famous NEXT code snippet).
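
(That dispatch cycle is easy to model in Forth itself, if only as a
toy of my own making: the "threaded code" below is just an array of
execution tokens, and NEXT is fetch, advance, execute. A real ITC
kernel does this in a few machine instructions, with one more
indirection through the code field.)

create thread  ' dup ,  ' * ,  ' . ,  0 ,   \ a thread: square and print
: run-thread ( i*x addr -- j*x )            \ a toy NEXT loop
   >r begin  r@ @ ?dup  while
      r> cell+ >r  execute                  \ advance IP, run the word
   repeat  r> drop ;
5 thread run-thread                         \ prints 25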

Traditional Forth implementations rely on threaded code (they are all
some kind of *TC) in which the translation transforms the virtual
machine code into a sequence of addresses (or subroutine calls or
tokens) which map directly to the code structure; for instance, they
still maintain an actual data stack. With threaded code, the dynamic
part of the code conversion remains, and therein lies the perceived
performance penalty. Note that, as others have pointed out, the
performance penalty not only is quite lower than is usually expected (it
is NOT a 40 times slowdown), but most of the time it is completely
dwarfed out by algorithmics (better algorithms always win) and I/O
consideration (when the code is faster than the network bandwidth /
memory bandwidth / user reactions, then there is little point in trying
to get the code run even faster: it would just make the code wait longer
for the next piece of data to be available).

Even so, there are other VM implementation techniques which can squeeze
out more clock cycles. The code for the virtual machine can be converted
to native code, either as part of the compilation process, or at some
time during execution (this is called JIT compilation -- Just In Time).
Some VMs for some languages do this with good results: even in
unrealistic microbenchmarks, where the performance of a tight loop is
measured (assuming against all evidence that in an actual application
such a tight loop could run at its top speed, with all its input data
always available), some VMs with JIT compilation achieve
close-to-native-code results (I obtained up to 70% of native code speed
with some Java code(*), which relies on a VM and JIT compilation). Such
VM techniques are considerably more complex to implement than simple ITC
translation, but nowadays you usually have much more RAM and CPU power
than you had in the 70s, so alternative VM implementations may be
worthwhile.


> What I am interested in is: when is it better to sweat over the coding of
> forth words, squeezing every last ounce of speed out of them at the very
> likely cost of readability/maintainability, as opposed to just taking the
> time critical stuff and coding it as C and/or assembler primitives?

By definition, making your code unreadable and unmaintainable is the
last thing you can do with your code, since afterwards you cannot read
or maintain it. Hence, whatever you do, your code-sweating activity
occurs last.

Therefore, it is better to do it when everything else is done and
complete, i.e. when the code runs correctly, flawlessly and has been
properly documented. If your code is not fully correct and documented at
the time you begin to frankensteinize it, then it will never be,
although you may obtain, through fiendishly cunning trickeries, some
code which outputs _real fast_ a wrong result.

In practice, 99% of the time, once you have written out your code, and
it is clear and correct and duly commented, and it uses proper and
trimmed algorithms, then it turns out to be quite fast enough. Even
then, simply buying a better VM implementation (e.g. with a JIT
compiler), or waiting two months for a faster CPU to come up on the
market, delivers a cost-effective performance boost. If everything else
fails, you may possibly, at that point, locate the one or two tight
loops which need to squeeze out of the CPU all the clock cycles you can
get, at which point you may entertain the idea of rewriting those into a
dozen hand-coded assembly instructions. Note that having clear,
readable and commented code at that point really helps in finding out
where the actual bottleneck is.


All of that holds, of course, unless you actually _like_ hand-coded
optimization for its own sake, and you seek some external
justification, so that you may indulge in your artistic hobby of
code-sweating while still making a living out of it.


--Thomas Pornin

Bernd Paysan

Aug 4, 2008, 9:21:04 AM

John Passaniti wrote:
> The key word is "can." The slowest VM can still outperform the
> fastest native code if the algorithms used are superior. What matters
> is the performance of the system as a whole, not the VM. That's a
> trap that you see endlessly here in comp.lang.forth-- a preoccupation
> with speed. But not speed of some concrete real-world application,
> but the speed of some small primitive in the system.

You don't see the forest for the trees. When we argue here, we are quite
often concerned about speed. You then step in and say "there's no
requirement to be performant here". That's sloppy thinking - there's no
requirement for performance here and there, so you implement CPU hogs here
and there and then wonder why your application is dog slow.

You forget that a fast, low-memory solution is often the product of
straightforward, clean design, too. It's easy to maintain, as well. That's
the goal. If your performance improvements are a burden on a small, clean
implementation, forget them.

After you have done that - implemented something that's already sane in
terms of space, performance, and lines of code - you can start measuring
things. You might be surprised that something that looked sane isn't, but
in general, it makes things much easier. After all, you only have to hunt
those bottlenecks that are still there despite the preparation, and the
design is lean and clean.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Elizabeth D Rather

Aug 4, 2008, 9:59:01 AM

Well, but the OP was asking where he should put his emphasis in
optimizing low-level code. John and I are both saying, in different
words, that the place to put the primary effort is in application
design, not low-level code optimization. We're not saying performance
doesn't matter, but pointing out where the main focus should be in
achieving it.

In my story about the baggage system, the polyFORTH that we used was
roughly 10x faster than FIGforths of that era, but what made the
difference was not that but the design of polyFORTH's multitasker and
the way we used it in the application (the old system was doing a lot of
polling and flag passing internally). I'm all for fast systems, but
that doesn't address the OP's issue.

Stephen Pelc

Aug 4, 2008, 12:41:38 PM

It depends what the VM is for! When MPE and Forth Inc were working on
the OTA virtual machine, we found that if the high level portion of
the system (I/O, database ...) was sufficiently high-level, the
payment terminal applications spent most of their time in the high
level functions. I do not remember the numbers ... it was a long time
ago.

OTA was a token threaded 32 bit system. Underneath that, depending
on the CPU were DTC and STC Forth kernels for CPUs like 80186/V25,
8051 and 68000.

Stephen


--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads

billy

Aug 4, 2008, 2:33:49 PM

DavidM <nos...@nowhere.com> wrote:
> What I am interested in is: when is it better to sweat over the coding of
> forth words, squeezing every last ounce of speed out of them at the very
> likely cost of readability/maintainability, as opposed to just taking the
> time critical stuff and coding it as C and/or assembler primitives?

Ideally, if you're going to spend that kind of effort, it's better to
sweat about code before you settle on a design. That way your code
will clearly and readably reflect a high-performance design, rather
than being built for one design and then partially refactored into a
different one and then a bunch of changes added without refactoring
and then...

I recall an essay in which Chuck Moore was quoted as saying that he
would write many different versions of each part of his programs from
scratch, finding the one that fit the problem best. I think he was
talking about "Thoughtful Programming", but I can't find the specific
part of any essay that talks about that... I may be wrong.

> Dave

-Wm

John Passaniti

Aug 5, 2008, 6:56:35 PM

Bernd Paysan wrote:
> You don't see the forest for the trees. When we argue here, we are quite
> often concerned about speed. You then step in and say "there's no
> requirement to be performant here". That's sloppy thinking - there's no
> requirement for performance here and there, so you implement CPU hogs here
> and there and then wonder why your application is dog slow.

Speak for yourself. I rarely have any problems related to the
performance of my code because I spend the time up-front to research and
choose algorithms and data structures that are appropriate. And when I
do find something I wrote is slower than expected or desired, I don't
"wonder" anything-- I break out the tools. I measure with a profile. I
count graticule lines on a oscilloscope.

This approach-- up-front design work coupled with objective
measurement-- is superior to the mindless "I must optimize every
primitive" mindset because it focuses on the system. Spending your time
optimizing a primitive that takes up a tiny fraction of a system's
run-time makes no sense. But you only know what matters by thinking and
measuring.

> You forget that a fast, low-memory solution is often the product of
> straightforward, clean design, too. It's easy to maintain, as well. That's
> the goal. If your performance improvements are a burden on a small, clean
> implementation, forget them.

I enjoy your canned knee-jerk response here. It sounds well-practiced.
Too bad it doesn't apply to what I wrote.

I have no problem with the ideal that people should write "fast,
low-memory solutions." My problem is that there are too many
programmers who reflexively seek out those solutions without carefully
considering the requirements. And then they are completely surprised
later when their "fast, low-memory solution" fails to meet performance
expectations.

The choice of using a more-sophisticated algorithm may indeed be because
the programmer doesn't know what they are doing. But it can also be
because the programmer understands the performance requirements and
knows the simpler routine can't meet those performance requirements.

But really, the majority of what I was addressing in my reply was the
notion that systems built on virtual machines are slower than native
code. And sometimes, they are. But as I wrote, the slowest VM can
out-perform the fastest native code if the algorithms are superior.
Those who don't understand how this can be probably also can't
understand how in the early days of Forth, the relatively slow
implementation mechanisms could still beat native code.

> After you have done that - implemented something that's already sane in
> terms of space, performance, and lines of code - you can start measuring
> things. You might be surprised that something that looked sane isn't, but
> in general, it makes things much easier. After all, you only have to hunt
> those bottlenecks that are still there despite the preparation, and the
> design is lean and clean.

Thanks for the generic lecture.

roger...@gmail.com

Aug 6, 2008, 7:00:51 PM

What speed profiling tools do people use?

What's a good one to use with SwiftForth?

jacko

Aug 7, 2008, 6:16:39 PM

Well, as cache operates somewhat faster than memory, there comes a
point where DTC will beat STC even though you'd initially think
otherwise. Cache thrashing can be a major bottleneck. I prefer DTC and
ITC for this reason. Optimizing low-level code can provide benefits,
but often at a space cost (more cache thrash, anyone?).

To this end I decided http://nibz.googlecode.com should use DTC. ITC
seemed slightly too indirect and offered no apparent space reduction.

Optimizing primitives will not gain too much; compiling primitive-pair
sequences as single words holds more potential.

cheers
jacko

Stephen Pelc

Aug 8, 2008, 8:42:42 AM

On Thu, 7 Aug 2008 15:16:39 -0700 (PDT), jacko <jacko...@gmail.com>
wrote:

>Well, as cache operates somewhat faster than memory, there comes a
>point where DTC will beat STC even though you'd initially think
>otherwise. Cache thrashing can be a major bottleneck. I prefer DTC and
>ITC for this reason. Optimizing low-level code can provide benefits,
>but often at a space cost (more cache thrash, anyone?).

On conventional CPUs at least, the benchmark figures say you're
wrong by a factor of 10:1 and more. There are good and bad cache
implementations, and good and bad solutions for cache problems.

STC compilers inline and optimise, so low-level words simply don't
include many CALL/RET pairs. For size comparison, we converted a
256k byte (of binary) app from ITC to STC on the same hardware
and the STC version was about 2% smaller. Others have reported
similar results. Raw STC code may certainly suffer cache issues,
but even simple optimisations remove the problems. Fully optimising
compilers (let's call their output NCC for native compiled code)
produce code that is of similar size to DTC or ITC.

On silicon stack machines, Chuck Moore certainly doesn't agree
with you. Some Forth-machine FPGA implementations have reached
several hundreds of MIPs, but I'm not free to say more.

Jonah Thomas

Aug 8, 2008, 10:23:19 AM

steph...@mpeforth.com (Stephen Pelc) wrote:

> STC compilers inline and optimise, so low-level words simply don't
> include many CALL/RET pairs. For size comparison, we converted a
> 256k byte (of binary) app from ITC to STC on the same hardware
> and the STC version was about 2% smaller. Others have reported
> similar results.

Back in the old days, we used to claim that ITC was smaller, that this
was one of the advantages of Forth.

Were we wrong then, or does your result come because the processors have
changed?

Anton Ertl

Aug 8, 2008, 10:38:10 AM

32-bit ITC is twice as big as 16-bit ITC, and 64-bit ITC adds another
factor of two. If native code for ARM or i386 is smaller than 32-bit
ITC, that does not mean that native code for the 8085 is smaller than
16-bit ITC, even if it was compiled with something like VFX.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2008: http://www.euroforth.org/ef08.html

Bernd Paysan

Aug 8, 2008, 10:53:47 AM

Jonah Thomas wrote:
> Back in the old days, we used to claim that ITC was smaller, that this
> was one of the advantages of Forth.

Back in the old days, we ran ITC on 8 bit processors, where a 16 bit add
was an operation that consisted of at least two loads (two bytes each), two
adds (two bytes each) and two stores (also two bytes each), all going
through the single accumulator. ITC gave you the same operation in 2 bytes
instead of 12; no wonder ITC was smaller.

> Were we wrong then, or does your result come because the processors have
> changed?

Processors have changed, and ITC on 32 bit processors already uses 4 bytes
per instruction - on 64 bits, native code size doesn't change much, but ITC
doubles size again.

Andrew Haley

Aug 8, 2008, 11:35:32 AM

Bernd Paysan <bernd....@gmx.de> wrote:
> Jonah Thomas wrote:
> > Back in the old days, we used to claim that ITC was smaller, that this
> > was one of the advantages of Forth.

> Back in the old days, we ran ITC on 8 bit processors, where a 16 bit add
> was an operation that consisted of at least two loads (two bytes each), two
> adds (two bytes each) and two stores (also two bytes each), all going
> through the single accumulator. ITC gave you the same operation in 2 bytes
> instead of 12; no wonder ITC was smaller.

> > Were we wrong then, or does your result come because the processors have
> > changed?

> Processors have changed, and ITC on 32 bit processors already uses 4
> bytes per instruction - on 64 bits, native code size doesn't change
> much, but ITC doubles size again.

Well, hold on a minute. On a 64-bit processor you can use 32-bit
addressing for code, as long as you have less than 4 gigathings of
code. This is a common optimization used by many programming
languages. It's the default for gcc on AMD-64, for example.

It is not common to use 16-bit code addressing on a 32-bit processor
because no-on wants to be limited to 64 kilothings of code, but the
same reasoning doesn't apply to 64-bit processors. There's no reason
at all to use 64-bit threading on a 64-bit processor.

Andrew.

Elizabeth D Rather

Aug 8, 2008, 12:21:01 PM

Anton Ertl wrote:
> Jonah Thomas <jeth...@gmail.com> writes:
>> steph...@mpeforth.com (Stephen Pelc) wrote:
>>
>>> STC compilers inline and optimise, so low-level words simply don't
>>> include many CALL/RET pairs. For size comparison, we converted a
>>> 256k byte (of binary) app from ITC to STC on the same hardware
>>> and the STC version was about 2% smaller. Others have reported
>>> similar results.
>> Back in the old days, we used to claim that ITC was smaller, that this
>> was one of the advantages of Forth.
>>
>> Were we wrong then, or does your result come because the processors have
>> changed?
>
> 32-bit ITC is twice as big as 16-bit ITC, and 64-bit ITC adds another
> factor of two. If native code for ARM or i386 is smaller than 32-bit
> ITC, that does not mean that native code for the 8085 is smaller than
> 16-bit ITC, even if it was compiled with something like VFX.
>
> - anton

Yes. In the 70's, you could fit a 16-bit ITC xt in one cell, while a
CALL took at least 3 bytes, so it seemed like a win, particularly since
the implementation itself (which was resident) remained small and
simple. With 32-bit processors, you could fit a CALL in a cell for a
very large amount of memory.

When we switched over in the 90's we were surprised at how much the
programs shrank. On a 32-bit processor we got almost 20% in one fairly
complex app. On the small micros, the use of interactive
cross-compilers as opposed to a resident Forth means the targets don't
have to bear the cost of heads, compiler, etc., and you can use fairly
sophisticated compiler strategies to deliver programs both very small
and very fast. That wasn't really an option in the 70's.

Stephen Pelc

Aug 8, 2008, 12:55:49 PM

On Fri, 8 Aug 2008 10:23:19 -0400, Jonah Thomas <jeth...@gmail.com>
wrote:

>Back in the old days, we used to claim that ITC was smaller, that this
>was one of the advantages of Forth.
>
>Were we wrong then, or does your result come because the processors have
>changed?

Both. Modern CPUs tend to be much more compiler friendly. On older
CPUs, e.g. 8051 there were very few 16 bit operations and calls cost
three bytes rather than two. On more recent 8-bit CPUs, there are
more 16 bit operations, e.g. 9S12, and you can probably call an
MSP430 a 16 bit CPU.

However, with a few notable exceptions, compiling Forth to native
code was not a well-known subject. Now, we're applying
what we've learnt from VFX on 32 bit CPUs to smaller CPUs. Even
an 8051 benefits from a carefully set up code generator.

The downside of NCC is that getting a good one right is quite a
big job, whereas getting a DTC or ITC Forth up and running is
a quick and easy job.

Another driver for NCC is that the jobs we're being asked to do
are simply more demanding. Even on a 60MHz ARM, I simply would not
consider writing a USB stack in a DTC Forth, whereas with VFX
Forth it was just another job.

Anton Ertl

Aug 8, 2008, 2:05:46 PM

Andrew Haley <andr...@littlepinkcloud.invalid> writes:

>Bernd Paysan <bernd....@gmx.de> wrote:
>> Processors have changed, and ITC on 32 bit processors already uses 4
>> bytes per instruction - on 64 bits, native code size doesn't change
>> much, but ITC doubles size again.
>
>Well, hold on a minute. On a 64-bit processor you can use 32-bit
>addressing for code, as long as you have less than 4 gigathings of
>code. This is a common optimization used by many programming
>languages. It's the default for gcc on AMD-64, for example.
>
>It is not common to use 16-bit code addressing on a 32-bit processor
>because no one wants to be limited to 64 kilothings of code, but the
>same reasoning doesn't apply to 64-bit processors. There's no reason
>at all to use 64-bit threading on a 64-bit processor.

There are a number of reasons:

* Nothing at all guarantees that the code is all in the lower 4G of
the address space (and on at least one platform it isn't), and the gcc
maintainers and others have a tendency to break the gcc behaviour that
we rely on; e.g., we used to rely on the code being in the lower 32M
on PowerPC, and there is no reason for it not to be there, and it used
to be there, and then one day it just was no longer there. I then did
a linker script to put it there, but that stopped working a little
later (and on reporting this as a bug I learned that one should not
use linker scripts or somesuch).

* It's just simpler to have a uniform cell size that also covers the
threaded code, especially since we need to do this for other 64-bit
platforms anyway.

* It would buy very little to support 32-bit threaded code on 64-bit
platforms. Threaded-code size does not consume much memory there
(compared to what's available), and it also does not cause many cache
misses.

BTW, I just looked at the switch tables generated by gcc, and they use
64-bit entries on AMD64, while according to you they could use 32-bit
entries. If, as you say, there's no reason to use 64-bit entries, why
do they use them?

Andrew Haley

Aug 11, 2008, 6:37:32 AM

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
> >Bernd Paysan <bernd....@gmx.de> wrote:
> >> Processors have changed, and ITC on 32 bit processors already uses 4
> >> bytes per instruction - on 64 bits, native code size doesn't change
> >> much, but ITC doubles size again.
> >
> >Well, hold on a minute. On a 64-bit processor you can use 32-bit
> >addressing for code, as long as you have less than 4 gigathings of
> >code. This is a common optimization used by many programming
> >languages. It's the default for gcc on AMD-64, for example.
> >
> >It is not common to use 16-bit code addressing on a 32-bit
> >processor because no one wants to be limited to 64 kilothings of
> >code, but the same reasoning doesn't apply to 64-bit processors.
> >There's no reason at all to use 64-bit threading on a 64-bit
> >processor.

> There are a number of reasons:

> * Nothing at all guarantees that the code is all in the lower 4G of
> the address space (and on at least one platform it isn't), and the
> gcc maintainers and others
> have a tendency to break the gcc behaviour that we rely on;

Of course, this isn't something gcc maintainers have any control over.
I don't know who "others" may be!

> e.g., we used to rely on the code being in the lower 32M on PowerPC,
> and there is no reason for it not to be there, and it used to be
> there, and then one day it just was no longer there. I then did a
> linker script to put it there, but that stopped working a little
> later (and on reporting this as a bug I learned that one should not
> use linker scripts or somesuch).

> * It's just simpler to have a uniform cell size that also covers the
> threaded code, especially since we need to do this for other 64-bit
> platforms anyway.

> * It would buy very little to support 32-bit threaded code on 64-bit
> platforms. Threaded-code size does not consume much memory there
> (compared to what's available), and it also does not cause many
> cache misses.

Fair enough. The last two, which are more or less "I can't be
bothered to change it, and it doesn't matter anyway" are rather weak,
but OK, there may be some legitimate reasons.

The claim was that ITC doubles size from 32-bit to 64-bit processors.
It doesn't need to be that way: you might choose to do it that way,
and on some operating systems you might even be forced to do it that
way, but it ain't necessarily so.

> BTW, I just looked at the switch tables generated by gcc, and they
> use 64-bit entries on AMD64, while according to you they could use
> 32-bit entries. If, as you say, there's no reason to use 64-bit
> entries, why do they use them?

I don't know, but:

It may be a bug.

The compiler may always generate switch tables as arrays of pointers,
so perhaps it's a side-effect of using generic code to generate them.

Maybe a suitable 32-bit reloc type doesn't exist for AMD-64, but I
doubt that.

Maybe a simple jump indirect instruction is used, and that instruction
always uses a 64-bit pointer in memory.

... etc.

Andrew.

jacko

Aug 11, 2008, 9:42:33 AM

On 8 Aug, 13:42, stephen...@mpeforth.com (Stephen Pelc) wrote:
> On Thu, 7 Aug 2008 15:16:39 -0700 (PDT), jacko <jackokr...@gmail.com>

> wrote:
>
> >Well, as cache operates somewhat faster than memory, there comes a
> >point where DTC will beat STC even though you'd initially think
> >otherwise. Cache thrashing can be a major bottleneck. I prefer DTC and
> >ITC for this reason. Optimizing low-level code can provide benefits,
> >but often at a space cost (more cache thrash, anyone?).
>
> On conventional CPUs at least, the benchmark figures say you're
> wrong by a factor of 10:1 and more. There are good and bad cache
> implementations, and good and bad solutions for cache problems.

Yeah, probably large code pointers mainly full of some constant in
the upper 16 bits.

> STC compilers inline and optimise, so low-level words simply don't
> include many CALL/RET pairs. For size comparison, we converted a
> 256k byte (of binary) app from ITC to STC on the same hardware
> and the STC version was about 2% smaller. Others have reported
> similar results. Raw STC code may certainly suffer cache issues,
> but even simple optimisations remove the problems. Fully optimising
> compilers (let's call their output NCC for native compiled code)
> produce code that is of similar size to DTC or ITC.

Low level words should generate at most 1 CALL. There is very little
reason not to use 16 bit addressing: a simple process of inserting
3*1/2 cells (on 32 bit) for a long jump would reduce most DTC pointers
to half a cell. (This assumes people clever enough to place small
kernels every so often in memory, or use the vocabulary system to
split code into convenient blocks.)

> On silicon stack machines, Chuck Moore certainly doesn't agree
> with you. Some Forth-machine FPGA implementations have reached
> several hundreds of MIPs, but I'm not free to say more.

Yes, and they probably don't use the lowest speed grade and have
pockets bigger than their heads.

> Stephen
>
> --
> Stephen Pelc, stephen...@mpeforth.com

Stephen Pelc

Aug 11, 2008, 11:11:15 AM

On Mon, 11 Aug 2008 06:42:33 -0700 (PDT), jacko <jacko...@gmail.com>
wrote:

>On 8 Aug, 13:42, stephen...@mpeforth.com (Stephen Pelc) wrote:
>Yeah, probably large code pointers mainly full of some constant in
>the upper 16 bits.

>Low level words should generate at most 1 CALL. There is very little
>reason not to use 16 bit addressing

When we started serious NCC development, we very quickly learned to
trust measured results over opinion. That's why we developed and
published a simple set of integer benchmarks
http://www.mpeforth.com/arena/benchmrk.fth
http://www.mpeforth.com/arena/xbench32.fth

You are welcome to publish your results.

>> On silicon stack machines, Chuck Moore certainly doesn't agree
>> with you. Some Forth-machine FPGA implementations have reached
>> several hundreds of MIPs, but I'm not free to say more.
>
>Yes, and they probably don't use the lowest speed grade and have
>pockets bigger than their heads.

If you come to EuroForth 2008 in Vienna, you can talk to people
who have shipped silicon and used silicon stack machines for real
applications.

Stephen

--
Stephen Pelc, steph...@mpeforth.com


MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691

Anton Ertl

Aug 13, 2008, 1:05:20 PM

Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>> >It is not common to use 16-bit code addressing on a 32-bit
>> >processor because no one wants to be limited to 64 kilothings of
>> >code, but the same reasoning doesn't apply to 64-bit processors.
>> >There's no reason at all to use 64-bit threading on a 64-bit
>> >processor.
>
>> There are a number of reasons:
>
>> * Nothing at all guarantees that the code is all in the lower 4G of
>> the address space (and on at least one platform it isn't), and the
>> gcc maintainers and others
>> have a tendency to break the gcc behaviour that we rely on;
>
>Of course, this isn't something gcc maintainers have any control over.

The gcc maintainers do not have any control over gcc behaviour?

>I don't know who "others" may be!

In the example below, I guess it was the binutils maintainers.

>> e.g., we used to rely on the code being in the lower 32M on PowerPC,
>> and there is no reason for it not to be there, and it used to be
>> there, and then one day it just was no longer there. I then did a
>> linker script to put it there, but that stopped working a little
>> later (and on reporting this as a bug I learned that one should not
>> use linker scripts or somesuch).

I now remember the story better, and at first I tried to correct the
new text placement with the linker option -Ttext which did not work as
documented; I reported that as a bug, and was told that this should not
work with ELF files, and I should write a linker script, which I then
did, and for some time it worked. Eventually we switched to hybrid
direct/indirect-threaded code, which made that code placement
unnecessary, so we retired the linker script.

>The claim was that ITC doubles size from 32-bit to 64-bit processors.
>It doesn't need to be that way: you might choose to do it that way,
>and on some operating systems you might even be forced to do it that
>way, but it ain't necessarily so.

Hmm, thinking again about it, with ITC it's necessarily so. With ITC,
the addresses you put in the threaded code are not code addresses, but
code-field addresses, i.e., general dictionary addresses. If you
restrict these to 32 bits, you restrict the dictionary to the lower
4G. What kind of 64-bit system would that be?

>> BTW, I just looked at the switch tables generated by gcc, and they
>> use 64-bit entries on AMD64, while according to you they could use
>> 32-bit entries. If, as you say, there's no reason to use 64-bit
>> entries, why do they use them?

...


>Maybe a simple jump indirect instruction is used, and that instruction
>always uses a 64-bit pointer in memory.

That's a good explanation. Let's see:

jmp *.L11(,%rax,8)

Yes, that's a good reason, but I'm not sure if the benefit of using a
single instruction instead of two is worth the higher number of cache
misses from the larger switch table. Especially since the compiler
then chooses to add another instruction that just slows things down:

ja .L25
mov %eax, %eax
jmp *.L11(,%rax,8)

I guess the MOV is there to get enough distance between the two jumps.
But does it have to use %eax (on which the next instruction depends)?
And if we want an instruction in between, we could split the JMP and
use a 32-bit switch table:

movl .L11(,%rax,4), %eax
jmp *%rax

Thomas Pornin

Aug 13, 2008, 1:47:44 PM

According to Anton Ertl <an...@mips.complang.tuwien.ac.at>:

> The gcc maintainers do not have any control over gcc behaviour?

The addresses where compiled code chunks end up in the final
executable are mostly chosen by the linker, which, in the GNU world,
is not gcc itself, but the binutils.


--Thomas Pornin

Andrew Haley

Aug 14, 2008, 4:56:13 AM

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
> >Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> >> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
> >> >It is not common to use 16-bit code addressing on a 32-bit
> >> >processor because no one wants to be limited to 64 kilothings of
> >> >code, but the same reasoning doesn't apply to 64-bit processors.
> >> >There's no reason at all to use 64-bit threading on a 64-bit
> >> >processor.
> >
> >> There are a number of reasons:
> >

> >> * Nothing at all guarantees that the code is all in the lower 4G
> >> of the address space (and on at least one platform it isn't), and
> >> the gcc maintainers and others have a tendency to break the gcc
> >> behaviour that we rely on;
> >
> >Of course, this isn't something gcc maintainers have any control over.

> The gcc maintainers do not have any control over gcc behaviour?

This isn't gcc behaviour.

> >I don't know who "others" may be!

> In the example below, I guess it was the binutils maintainers.

Perhaps, but the address at which a program is loaded probably isn't
controlled by the binutils maintainers either.

> Hmm, thinking again about it, with ITC it's necessarily so. With ITC,
> the addresses you put in the threaded code are not code addresses, but
> code-field addresses, i.e., general dictionary addresses. If you
> restrict these to 32 bits, you restrict the dictionary to the lower
> 4G. What kind of 64-bit system would that be?

Obviously, that'd be a 64-bit system with code restricted to the lower
4G. The non-code part of the dictionary could be anywhere, obviously.

> >> BTW, I just looked at the switch tables generated by gcc, and they
> >> use 64-bit entries on AMD64, while according to you they could use
> >> 32-bit entries. If, as you say, there's no reason to use 64-bit
> >> entries, why do they use them?
> ...
> >Maybe a simple jump indirect instruction is used, and that instruction
> >always uses a 64-bit pointer in memory.

> That's a good explanation. Let's see:

> jmp *.L11(,%rax,8)

> Yes, that's a good reason, but I'm not sure if the benefit of using
> a single instruction instead of two is worth the higher number of
> cache misses from the larger switch table. Especially since the
> compiler then chooses to add another instruction that just slows
> things down:

> ja .L25
> mov %eax, %eax
> jmp *.L11(,%rax,8)

> I guess the MOV is there to get enough distance between the two
> jumps. But does it have to use %eax (on which the next instruction
> depends)? And if we want an instruction in between, we could split
> the JMP and use a 32-bit switch table:

> movl .L11(,%rax,4), %eax
> jmp *%rax

Yes, it looks like a 32-bit switch table would be better.

Andrew.

Anton Ertl

Aug 14, 2008, 6:00:30 AM

Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>> >Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> >> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>> >> >There's no reason at all to use 64-bit threading on a 64-bit
>> >> >processor.
>> >
>> >> There are a number of reasons:
>> >
>
>> >> * Nothing at all guarantees that the code is all in the lower 4G
>> >> of the address space (and on at least one platform it isn't), and
>> >> the gcc maintainers and others have a tendency to break the gcc
>> >> behaviour that we rely on;
>> >
>> >Of course, this isn't something gcc maintainers have any control over.
>
>> The gcc maintainers do not have any control over gcc behaviour?
>
>This isn't gcc behaviour.

If gcc introduces crossjumps and pointlessly reorders the basic
blocks, that's not gcc behaviour?

>> >I don't know who "others" may be!
>
>> In the example below, I guess it was the binutils maintainers.
>
>Perhaps, but the address at which a program is loaded probably isn't
>controlled by the binutils maintainers either.

So, you are saying that nobody is responsible for code placement,
nobody gives us a guarantee for it, but we should rely on it being in
the lower 4G. Hmm, OTOH, would these unknown people be worse than the
gcc maintainers? Probably not.

>> Hmm, thinking again about it, with ITC it's necessarily so. With ITC,
>> the addresses you put in the threaded code are not code addresses, but
>> code-field addresses, i.e., general dictionary addresses. If you
>> restrict these to 32 bits, you restrict the dictionary to the lower
>> 4G. What kind of 64-bit system would that be?
>
>Obviously, that'd be a 64-bit system with code restricted to the lower
>4G. The non-code part of the dictionary could be anywhere, obviously.

With the unusual definition of "code" that includes every word header,
including those of e.g., constants and CREATEd words.

Andrew Haley

Aug 14, 2008, 6:52:09 AM

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
> >Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> >> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
> >> >Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> >> >> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
> >> >> >There's no reason at all to use 64-bit threading on a 64-bit
> >> >> >processor.
> >> >
> >> >> There are a number of reasons:
> >> >
> >
> >> >> * Nothing at all guarantees that the code is all in the lower 4G
> >> >> of the address space (and on at least one platform it isn't), and
> >> >> the gcc maintainers and others have a tendency to break the gcc
> >> >> behaviour that we rely on;
> >> >
> >> >Of course, this isn't something gcc maintainers have any control over.
> >
> >> The gcc maintainers do not have any control over gcc behaviour?
> >
> >This isn't gcc behaviour.

> If gcc introduces crossjumps and pointlessly reorders the basic
> blocks, that's not gcc behaviour?

What on Earth does reordering basic blocks have to do with any of
this?

> >> >I don't know who "others" may be!
> >
> >> In the example below, I guess it was the binutils maintainers.
> >
> >Perhaps, but the address at which a program is loaded probably isn't
> >controlled by the binutils maintainers either.

> So, you are saying that nobody is responsible for code placement,

I don't know. Maybe the kernel or libc, maybe both. This sort of
thing is usually worked out by negotiation between the teams.

> nobody gives us a guarantee for it, but we should rely on it being
> in the lower 4G.

This one is guaranteed by the ABI. The "small" x86_64 model depends
on it.

> Hmm, OTOH, would these unknown people be worse than the gcc
> maintainers? Probably not.

> >> Hmm, thinking again about it, with ITC it's necessarily so. With ITC,
> >> the addresses you put in the threaded code are not code addresses, but
> >> code-field addresses, i.e., general dictionary addresses. If you
> >> restrict these to 32 bits, you restrict the dictionary to the lower
> >> 4G. What kind of 64-bit system would that be?
> >
> >Obviously, that'd be a 64-bit system with code restricted to the lower
> >4G. The non-code part of the dictionary could be anywhere, obviously.

> With the unusual definition of "code" that includes every word header,
> including those of e.g., constants and CREATEd words.

I don't see why the whole header must be there. Just the code field,
surely.

Andrew.

Anton Ertl

Aug 14, 2008, 4:48:48 PM

Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>> >Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> >> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>> >> >Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> >> >> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>> >> >> >There's no reason at all to use 64-bit threading on a 64-bit
>> >> >> >processor.
>> >> >
>> >> >> There are a number of reasons:
>> >> >
>> >
>> >> >> * Nothing at all guarantees that the code is all in the lower 4G
>> >> >> of the address space (and on at least one platform it isn't), and
>> >> >> the gcc maintainers and others have a tendency to break the gcc
>> >> >> behaviour that we rely on;
>> >> >
>> >> >Of course, this isn't something gcc maintainers have any control over.
>> >
>> >> The gcc maintainers do not have any control over gcc behaviour?
>> >
>> >This isn't gcc behaviour.
>
>> If gcc introduces crossjumps and pointlessly reorders the basic
>> blocks, that's not gcc behaviour?
>
>What on Earth does reordering basic blocks have to do with any of
>this?

It's one of the gcc behaviours that we relied on and that was broken by
the gcc maintainers.
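
For context, the relied-on behaviour concerns GNU C's labels-as-values
dispatch. A rough sketch follows (illustrative, not Gforth's actual
source): a system that measures or copies the machine code between
such labels, e.g. to build dynamic superinstructions, needs the
compiler to keep those blocks separate and in their written order.
Cross-jumping (merging identical tails) and basic-block reordering
can violate that without making the C program incorrect.

#include <stdio.h>

/* GNU C "labels as values": &&label takes a label's address and
   goto *p jumps to it; gcc supports static tables like "prog".
   Copying the code between labels only works if the compiler leaves
   those blocks intact and in order. */
int main(void) {
    static void *prog[] = { &&lit, &&lit, &&add, &&done };
    long stack[16], *sp = stack;
    long lits[] = { 40, 2 }, *lp = lits;
    void **ip = prog;

    goto **ip++;                 /* NEXT */
lit:
    *sp++ = *lp++;               /* push the next literal */
    goto **ip++;
add:
    sp[-2] += sp[-1]; sp--;      /* 40 2 + */
    goto **ip++;
done:
    printf("%ld\n", sp[-1]);     /* prints 42 */
    return 0;
}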

>> So, you are saying that nobody is responsible for code placement,
>
>I don't know. Maybe the kernel or libc, maybe both.

In my experience the code is placed where the binary says it should be
placed, and the binary is produced by the linker.

>> nobody gives us a guarantee for it, but we should rely on it being
>> in the lower 4G.
>
>This one is guaranteed by the ABI. The "small" x86_64 model depends
>on it.

And if we decided to rely on it, the next version of gcc would no
longer support the small model, and would place the code beyond 4GB in
order to show the moral superiority of the standards-loving gcc
maintainers.

>> >Obviously, that'd be a 64-bit system with code restricted to the lower
>> >4G. The non-code part of the dictionary could be anywhere. obviously.
>
>> With the unusual definition of "code" that includes every word header,
>> including those of, e.g., constants and CREATEd words.
>
>I don't see why the whole header must be there. Just the code field,
>surely.

Sure. In traditional Forth systems for large systems the code field
is very close to the other header fields, and that's right before the
parameter field. In such a system, 32-bit threaded code just means
limiting the dictionary to 4G.
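
One way to get 32-bit cells without betting on absolute code
placement is to store dictionary-relative offsets in the threaded
code and add a 64-bit base in NEXT: the dictionary is then capped at
4G in size, but can sit anywhere in the address space. A sketch with
made-up names (dict, ip32); no real system is being quoted here:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef void (*codefield_t)(void *pfa);

static unsigned char dict[64];   /* stand-in dictionary              */
static uint32_t *ip32;           /* 32-bit cells: offsets, not addrs */

static void do_hello(void *pfa) { (void)pfa; puts("hello"); }

int main(void) {
    /* Lay down one word at offset 0: just a code field for the demo. */
    codefield_t cf = do_hello;
    memcpy(dict, &cf, sizeof cf);

    uint32_t thread[] = { 0 };   /* threaded code holds CFA offsets */
    ip32 = thread;

    /* NEXT with a base register: 32-bit offset -> absolute CFA. */
    unsigned char *cfa = dict + *ip32++;
    codefield_t code;
    memcpy(&code, cfa, sizeof code);
    code(cfa + sizeof code);     /* PFA sits just past the code field */
    return 0;
}

The price is one extra add per NEXT; whether the halved cell size
pays that back in cache terms is the kind of question the benchmarks
mentioned elsewhere in the thread would have to settle.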

jacko

unread,
Aug 15, 2008, 10:34:53 AM8/15/08
to
On 11 Aug, 16:11, stephen...@mpeforth.com (Stephen Pelc) wrote:
> On Mon, 11 Aug 2008 06:42:33 -0700 (PDT), jacko <jackokr...@gmail.com>

> wrote:
>
> >On 8 Aug, 13:42, stephen...@mpeforth.com (Stephen Pelc) wrote:
> >Yeah, probably large code pointers, used mainly full of some constant
> >in the upper 16 bits.
> >Low-level words should generate at most 1 CALL. There is very little
> >reason not to use 16-bit addressing
>
> When we started serious NCC development, we very quickly learned to
> trust measured results over opinion. That's why we developed and
> published a simple set of integer benchmarks
>  http://www.mpeforth.com/arena/benchmrk.fth
>  http://www.mpeforth.com/arena/xbench32.fth
>
> You are welcome to publish your results.

I will if I implement all the word set required for the benchmark, but
a result of 332 LEs for the processor testifies to its compactness.

> >> On silicon stack machines, Chuck Moore certainly doesn't agree
> >> with you. Some Forth-machine FPGA implementations have reached
> >> several hundreds of MIPs, but I'm not free to say more.
>
>Yes, and they probably don't use the lowest speed grade and have
>pockets bigger than their heads.
>
> If you come to EuroForth 2008 in Vienna, you can talk to people
> who have shipped silicon and used silicon stack machines for real
> applications.

Yes, if I developed a hole in my pocket that big, maybe I'd go. Your
condescending tone makes me somewhat disinterested. What's your personal
4-LUT benchmark?

> Stephen
>
> --
> Stephen Pelc, stephen...@mpeforth.com

> MicroProcessor Engineering Ltd - More Real, Less Time
> 133 Hill Lane, Southampton SO15 5AF, England
> tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691

> web: http://www.mpeforth.com - free VFX Forth downloads

Jacko

-- Simon Jackson, BEng. jacko...@gmail.com
K Ring Technologies Semiconductor - Smaller, Less Baggage
NFA
Tel: +44 (0)7967973001 Fax: F**K OFF and use E-mail
web: http://nibz.googlecode.com - Free IP Core (BSD (C) Exemption Cash
Licence Available)

jacko

unread,
Aug 15, 2008, 10:45:34 AM8/15/08
to
On 14 Aug, 11:52, Andrew Haley <andre...@littlepinkcloud.invalid>
wrote:
> Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

> > Andrew Haley <andre...@littlepinkcloud.invalid> writes:
> > >Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> > >> Andrew Haley <andre...@littlepinkcloud.invalid> writes:
> > >> >Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

The ROMable fields?

> Andrew.

Stephen Pelc

unread,
Aug 18, 2008, 5:08:39 AM8/18/08
to
On Fri, 15 Aug 2008 07:34:53 -0700 (PDT), jacko <jacko...@gmail.com>
wrote:

>Yes, if I developed a hole in my pocket that big, maybe I'd go. Your
>condescending tone makes me somewhat disinterested. What's your personal
>4-LUT benchmark?

I did not mean to be condescending. If the objective is to use the
lowest number of FPGA blocks, why? What else will be in the FPGA?

Stephen


--
Stephen Pelc, steph...@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691

jacko

unread,
Aug 18, 2008, 12:04:23 PM8/18/08
to
On 18 Aug, 10:08, stephen...@mpeforth.com (Stephen Pelc) wrote:
> On Fri, 15 Aug 2008 07:34:53 -0700 (PDT), jacko <jackokr...@gmail.com>

> wrote:
>
> >Yes, if I developed a hole in my pocket that big, maybe I'd go. Your
> >condescending tone makes me somewhat disinterested. What's your personal
> >4-LUT benchmark?
>
> I did not mean to be condescending. If the objective is to use the
> lowest number of FPGA blocks, why? What else will be in the FPGA?
>
> Stephen
>
> --
> Stephen Pelc, stephen...@mpeforth.com

> MicroProcessor Engineering Ltd - More Real, Less Time
> 133 Hill Lane, Southampton SO15 5AF, England
> tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
> web: http://www.mpeforth.com - free VFX Forth downloads

Well, in the reality of budget constraints it's a CPLD with 1270 LEs,
not a mega-dollar FPGA.
The http://nibz.googlecode.com processor, and also a 2*Phase Ultrasonic
DAC (FREE for nibz site, Not BSD) with DMA sequencer, quite likely a
keyboard of some kind, and maybe even a PIO Mode 4 hard disk
controller. Oh yes, and some kind of display driver. It will be a tight
fit.

cheers
jacko

jacko

unread,
Aug 18, 2008, 12:25:20 PM8/18/08
to
Hi

Look at it like this: a plastic-cased FPGA synth, versus a simpler
CPLD synth with differing tonal character in a quality stainless case
with ALPS treacle pots. Budget is to do with feel. Electronic
complexity is not the be-all and end-all.

cheers
jacko

jacko

unread,
Aug 18, 2008, 12:36:56 PM8/18/08
to
Recessed ALPS treacle pots, "better knobs in than off". ;-)

Brad Eckert

unread,
Aug 18, 2008, 1:18:50 PM8/18/08
to
On Aug 18, 9:04 am, jacko <jackokr...@gmail.com> wrote:
>
> Well, in the reality of budget constraints it's a CPLD with 1270 LEs,
> not a mega-dollar FPGA.

Altera's online store lists this CPLD at $22.70 for the cheapest one.
Compare with $12.80 for the cheapest Cyclone III.

Okay, you probably want a secure bitstream. Spartan3A has that covered
with their "Device DNA" trick. Digikey lists the XC3S50A-4TQ144C at
$10.02 and it boots from a standard serial SPI flash. All pricing is
in onesies.

This is getting pretty far off-topic but it does illustrate that
putting Forth in FPGAs is getting much more cost effective these days.
Limited internal memory will favor Forth's shrewd use of resources for
the next 10 years at least.

-Brad

jacko

unread,
Aug 18, 2008, 5:37:26 PM8/18/08
to
On 18 Aug, 18:18, Brad Eckert <nospaambr...@tinyboot.com> wrote:
> On Aug 18, 9:04 am, jacko <jackokr...@gmail.com> wrote:
>
>
>
> > Well, in the reality of budget constraints it's a CPLD with 1270 LEs,
> > not a mega-dollar FPGA.
>
> Altera's online store lists this CPLD at $22.70 for the cheapest one.
> Compare with $12.80 for the cheapest Cyclone III.

MAX II has 512*16-bit flash on board (good enough for a boot ROM);
Cyclone requires external flash. My devkit is MAX II ($100 full board
inc LCD unit).

> Okay, you probably want a secure bitstream. Spartan3A has that covered
> with their "Device DNA" trick. Digikey lists the XC3S50A-4TQ144C at
> $10.02 and it boots from a standard serial SPI flash. All pricing is
> in onesies.

Nah, a big feature will be JTAG reprogramming, and open VHDL (not
necessarily free). Custom reflashing is just part of the culture
possibilities. Cheap price, but the flash is external again, although
this may be offset by external RAM cost. Low power is another MAX II
feature, getting battery-powered performance. Not that it's a feature
needed by all, but headphone composition? An old SIM card's SMS space
for preset saving? (Leave the phone numbers in, as they must not be
lost; good for sending presets to others too.) The in-system
reprogramming feature is also good here, for self-JTAGing between
song setups.

> This is getting pretty far off-topic but it does illustrate that
> putting Forth in FPGAs is getting much more cost effective these days.
> Limited internal memory will favor Forth's shrewd use of resources for
> the next 10 years at least.

Yeah, it's the main reason to use Forth over C.

cheers
jacko
