neg in, temp
xnor in, temp, temp
popc temp, result
When I replace v8 hand-written code for find-first-bit with this
instruction sequence on an ultrasparc I, performance drops severely on
this section of code. Based on this, I suspect "popc" is not really
supported in hardware. I couldn't find any on-line documentation from
Sun one way or the other. Sun provides ffs() as a library, but
disassembly shows this to not be a clever implementation, nor does it
use "popc".
This leads to my questions:
1. Which popular ISAs support find-first-bit or pop count?
2. Which implementations do it in hardware? What does ultra-I do?
3. Can any common applications make use of it?
4. Are there clever hacks for fast find-first-bit? The pop count hack:
for (popc = 0; x != 0; x &= x - 1) popc++;
won't help because it only works for the sparse case, and the
find-first-bit trick above guarantees the population count won't
be sparse. It can be done in log steps using masks, but I'm
looking for something more clever (i.e., using the carry chain, or
at least suitable for superscalar execution).
Thanks,
- Dave
--
David Stoutamire
http://www.icsi.berkeley.edu/~davids
Intel x86 supports find-first-set-bit.
> or pop count?
Intel MMX supports POP COUNT.
> 2. Which implementations do it in hardware?
Intel's Pentium Pro processor supports FFS in hardware.
--
Andy "Krazy" Glew, gl...@ichips.intel.com, Intel,
M/S JF1-19, 5200 NE Elam Young Pkwy, Hillsboro, Oregon 97124-6497.
Place URGENT in email subject line for mail filter prioritization.
DISCLAIMER: private posting, not representative of employer.
We're looking for a few good microarchitects for our research labs.
--- Semi-political statement ---
Medical abortion, should you find it necessary, may be legally obtained
at a professional family planning clinic. Consult your phone book.
This .sig file, which explains how you can obtain an abortion, is in
direct violation of Section 1462 of Title 18, amended, of Section 230(e)2
of the Communications Act of 1934, amended in 1996. Consult your local
representative or senator.
(By the way, if anyone cares, I am personally against abortion,
although I suppose that politically I am marginally
pro-Choice. Actually, abortion is an issue I do not care much about
either way. But I care one hell of a lot about censorship!)
> An application I am working on needs to quickly identify the first set
> bit in a word, starting from LSB. Sparc v9 has a population count
> instruction from which find-first-bit can be synthesized:
> neg in, temp
> xnor in, temp, temp
> popc temp, result
> When I replace v8 hand-written code for find-first-bit with this
> instruction sequence on an ultrasparc I, performance drops severely on
> this section of code. Based on this, I suspect "popc" is not really
> supported in hardware. [...] This leads to my questions:
> 1. Which popular ISAs support find-first-bit or pop count?
PPC has "count leading zeros" - cntlz
> 2. Which implementations do it in hardware? What does ultra-I do?
All PPC do it in hardware.
> 3. Can any common applications make use of it?
PPC gets to test vs zero and generate 0/1 result (handy for C) in 2
instructions: cntlz and rotate/mask. We get a lot of flow-free
conditional constant assignment cheap from these ops.
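A minimal C sketch of that idiom, with cntlzw() standing in for the
count-leading-zeros instruction (the rotate/mask plays the role of the
shift): cntlzw yields 32 only for a zero word, so a single shift
extracts the 0/1 result.

int is_zero(unsigned int x)
{
    return cntlzw(x) >> 5;   /* cntlzw(x) == 32 iff x == 0; 32 >> 5 == 1 */
}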
Cliff
--
Cliff Click, Ph.D. Compiler Researcher & Designer
RISC Software, Motorola PowerPC Compilers
cli...@risc.sps.mot.com (512) 891-7240
http://members.aol.com/mjclick1
>An application I am working on needs to quickly identify the first set
>bit in a word, starting from LSB. Sparc v9 has a population count
>instruction from which find-first-bit can be synthesized:
>
> neg in, temp
> xnor in, temp, temp
> popc temp, result
Be careful, since in = 0 and in = 2^31 yield the
same values for in ^ (-in) using 32-bit operations.
It's safer to use (in - 1) & ~in or in & (-in).
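A sketch of the safer synthesis, with popc() standing in for the v9
instruction: popc((in - 1) & ~in) counts the trailing zeros directly,
returning the 0-based index of the lowest set bit, or 32 when in = 0.

int trailing_zero_count(unsigned int in)
{
    return popc((in - 1) & ~in);   /* ones exactly below the lowest set bit */
}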
>When I replace v8 hand-written code for find-first-bit with this
>instruction sequence on an ultrasparc I, performance drops severely on
>this section of code. Based on this, I suspect "popc" is not really
>supported in hardware. I couldn't find any on-line documentation from
>Sun one way or the other. Sun provides ffs() as a library, but
>disassembly shows this to not be a clever implementation, nor does it
>use "popc".
>
>This leads to my questions:
>
> 1. Which popular ISAs support find-first-bit or pop count?
Cray (leading zero count, population count).
Intel x86 (find first bit starting from right or left).
CDC Cyber series (population count).
I think the Alliant also had these operations.
Of these, only x86 is `popular' today.
> 3. Can any common applications make use of it?
It occurs in the binary GCD algorithm (ulong = unsigned long):
ulong binary_gcd (ulong x, ulong y)
{
    if (x == 0 || y == 0) {
        return x | y;
    } else {
        ulong n1 = x >> trailing_zero_count(x);
        ulong n2 = y >> trailing_zero_count(y);
        while (n1 != n2) {              /* Both are odd */
            if (n1 > n2) {
                n1 -= n2;
                n1 >>= trailing_zero_count(n1);
            } else {
                n2 -= n1;
                n2 >>= trailing_zero_count(n2);
            }
        }
        return n1 << trailing_zero_count(x | y);
    }
}
Yes, we can replace
n1 >>= trailing_zero_count(n1);
by
do {n1 >>= 1;} while (!(n1 & 1));
since n1 is known to be even and nonzero at that point.
But we desire an algorithm with predictable branches.
The IF-ELSE can be replaced by a branchless sequence which first
finds MIN(x, y) and MAX(x, y), on architectures with
conditional moves. It's harder to optimize
trailing_zero_count without hardware support such as
the population count function.
In a linear algebra application over the field GF(2),
I need the dot product of two binary vectors v1 and v2.
It is population_count(v1 & v2) & 1.
Using the binary method of exponentiation, computing
a power b**e takes truncated_base2_logarithm(e) squarings
and population_count(e)-1 multiplications, for integer e > 0.
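To make that operation count visible, here is a minimal sketch of the
binary method, reusing the truncated_base2_logarithm name above for
floor(log2), with ulong = unsigned long and e > 0 assumed.

ulong power(ulong b, ulong e)
{
    ulong result = b;
    int k;

    /* walk the bits of e below the leading one, high to low */
    for (k = truncated_base2_logarithm(e) - 1; k >= 0; k--) {
        result *= result;             /* one squaring per bit */
        if ((e >> k) & 1)
            result *= b;              /* one multiply per one bit after the first */
    }
    return result;
}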
In 1970, after I had taken a statistics course, and was
vending machine manager at my dormitory, I took a
survey in which everyone marked the products he liked.
Let V[v, p] = 1 if voter v likes product p,
and V[v, p] = 0 if voter v dislikes product p.
I computed sum_v V[v, p1] * V[v, p2] for all pairs (p1, p2),
as part of the formula for the correlation between p1 and p2.
A few weeks later, in an assembly language course, I learned
that the machine (a CDC 6400) lacked an integer multiply
instruction even though this was a primitive in the programming
language (Fortran) I had learned earlier.
Had I used binary AND rather than multiplication,
the program would have run faster (recall all values are 0 or 1).
This was in the days when CPU usage was expensive.
By the end of the term, I realized I could have
packed (for fixed p) all V[v, p] values into a few 60-bit words
(3 words since we had 150 voters). A few bitwise ANDs
and population counts would produce the sums I needed.
The operations of population count, finding the leftmost
nonzero bit (i.e., base 2 logarithm), and finding the rightmost
nonzero bit (i.e., finding the exponent of 2 dividing an integer)
occur frequently enough in applications that programming languages
should have these primitives and compilers should inline them.
At this time, however, their performance is not critical
for my applications.
> 4. Are there clever hacks for fast find-first-bit? The pop count hack:
One can compute
temp = in & (-in);
Then temp = 0 if in = 0; otherwise temp is a power of 2.
If p > 32 (or p > 64) is a prime for which 2 is a primitive root,
then you can compute (temp % p) and look up the answer
using a memory reference.
Alas, integer division is slow. If the high half of the
integer product is available, you can get temp/p
using one multiplication and a few simple operations.
I recall that SPARC v9 has the high half of a 32 x 32
product, but not of a 64 x 64 product.
If you are willing to use floating point, try
float f = (float) (signed long) (in | (-in));
Now look at the exponent of f (f itself is the negative of a
power of 2 unless in = 0).
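A sketch of that trick under IEEE single-precision assumptions, for
nonzero in: since in | (-in) equals -(in & -in), the float is minus a
power of 2 and its biased exponent field holds the bit index.
Inspecting the bits through a union is implementation-defined but
customary.

int first_set_bit(unsigned int in)   /* assumes in != 0 */
{
    union { float f; unsigned int u; } v;

    v.f = (float) (signed int) (in | (-in));   /* minus a power of 2, exact */
    return (int) ((v.u >> 23) & 0xff) - 127;   /* unbias the exponent */
}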
--
Peter L. Montgomery pmon...@cwi.nl San Rafael, California
My sex is male. I am ineligible for an abortion.
> 4. Are there clever hacks for fast find-first-bit?
Look at the behavior of some of the floating-point instructions:
a colleague and I speeded up a friend's chess program by reimplementing
his (vax) hack on a Honeybun: load the bitvector into the
mantissa part of a floating point register, carefully zeroing
the exponent. Then fire a floating point normalize at it, take
the mantissa and decrement by one.
I suspect you can get a shift count to the first zero via some
similar normalization/denormalization operation.
--dave
--
David Collier-Brown, | Always do right. This will gratify some people
185 Ellerslie Ave., | and astonish the rest. -- Mark Twain
Willowdale, Ontario | dav...@hobbes.ss.org, unicaat.yorku.ca
N2M 1Y3. 416-223-8968 | http://java.science.yorku.ca/~davecb
| An application I am working on needs to quickly identify the first set
| bit in a word, starting from LSB.
| This leads to my questions:
|
| 1. Which popular ISAs support find-first-bit or pop count?
The Power and PowerPC ISAs have a cntlzw instruction which counts the number
of leading zeros (and a cntlzd instruction when the 64-bit versions of the
PowerPC are available). This instruction is 1 cycle for all PowerPCs I'm
familiar with. The PowerPC Compiler Writer's Guide gives the following code
sequence to count the number of trailing 0's:
addi r4,r3,-1
andc r4,r4,r3
cntlzw r4,r4
subfic r4,r4,32
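In C, that sequence reads as follows (a sketch only, with cntlzw()
standing in for the instruction): (x - 1) & ~x turns the trailing
zeros into a block of ones, and subtracting its leading-zero count
from 32 gives the trailing-zero count.

int trailing_zero_count(unsigned int x)
{
    return 32 - cntlzw(~x & (x - 1));
}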
| 2. Which implementations do it in hardware? What does ultra-I do?
|
| 3. Can any common applications make use of it?
|
| 4. Are there clever hacks for fast find-first-bit? The pop count hack:
|
| for (popc = 0; x != 0; x &= x - 1) popc++;
|
| won't help because it only works for the sparse case, and the
| find-first-bit trick above guarantees the population count won't
| be sparse. It can be done in log steps using masks, but I'm
| looking for something more clever (i.e., using the carry chain, or
| at least suitable for superscalar execution).
Well, there are two general approaches (again from the above book), when you
know the value is 32 bits. The first involves a lookup table of 256 bytes that
gives the popcount for each byte, and then using shifts and masks to get each
value:
int nbits(unsigned int x)
{
    static unsigned char popcount[256] = { ... };
    int count1, count2, count3, count4;

    count1 = popcount[ (x>>24) & 0xff ];
    count2 = popcount[ (x>>16) & 0xff ];
    count3 = popcount[ (x>>8) & 0xff ];
    count4 = popcount[ x & 0xff ];
    return count1 + count2 + count3 + count4;
}
The other uses no branches:
int nbits(unsigned int x)
{
    unsigned long t;

    x = x - ((x >> 1) & 0x55555555);
    t = ((x >> 2) & 0x33333333);
    x = (x & 0x33333333) + t;
    x = (x + (x >> 4)) & 0x0f0f0f0f;
    x = x + (x << 8);
    x = x + (x << 16);
    return (x >> 24);
}
--
Michael Meissner, Cygnus Support (East Coast)
Suite 105, 48 Grove Street, Somerville, MA 02144, USA
meis...@cygnus.com, 617-629-3016 (office), 617-629-3010 (fax)
I believe the decryption spooks at the National Security Agency are
very fond of the population-count operation. It seems to be so useful
to them that every US-made govt-subsidized supercomputer has included
this instruction even though there is little other use for it.
Find-first-bit is useful in the post-subtraction normalization step of
software-implemented floating point; in some data compression schemes;
and in disk or memory allocators that scan bitmaps for available slots.
magic_table[ (X ^ (X-1)) % 37 ] (on a 32-bit machine, assuming a nonzero
word)
This relies on 2 being a primitive root modulo 37. Use %67 on a 64-bitter.
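A sketch of that trick in C -- the table name and initializer are
hypothetical, not anyone's shipping code. X ^ (X-1) equals
2^(k+1) - 1 for k trailing zeros, and those 32 values are distinct
mod 37 precisely because 2 is a primitive root there.

static int magic_table[37];

void init_magic_table(void)
{
    int k;

    for (k = 0; k < 32; k++)          /* one slot per trailing-zero count k */
        magic_table[(0xffffffffU >> (31 - k)) % 37] = k;
}

int first_set_bit(unsigned int x)     /* assumes x != 0 */
{
    return magic_table[(x ^ (x - 1)) % 37];
}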
--
The opinions expressed in this message are my own personal views
and do not reflect the official views of Microsoft Corporation.
In how many implementations of those ISAs does find-first-bit take fewer
than "number of leading 0 bits" cycles, or does pop count take fewer
than "number of 1 bits" cycles?
> 1. Which popular ISAs support find-first-bit or pop count?
The Tera (certainly the most popular architecture around here) has
instructions for pop-count, find-first-1, and find-first-0 (from left
or right).
> 3. Can any common applications make use of it?
Pop count might be used for cardinality of sets represented as bit
vectors. We use find-leftmost-1 (along with bitwise-matrix-multiply)
to support some of the common C string manipulation functions. Lets
us do things like copy or compare NULL-terminated strings at a rate of
8 bytes per 3 instructions.
Preston Briggs
The VAX architecture has FFC (Find First Clear bit) and FFS (Find First Set
bit). You can use it to find a bit in an arbitrarily long array.
CLRL R0 ; start at bit 0
10$: FFS R0,#32,ARRAY,R0 ; R0 is incremented by 32 for each
; longword searched
BEQL 10$ ; search next longword
; R0 is match bit index
> 2. Which implementations do it in hardware? What does ultra-I do?
The VAX 11/780 did everything in hardware, didn't it? 8-)
> 3. Can any common applications make use of it?
If you count VMS as a common application...
VMS used it (and may still use it for all I know) to find the next schedulable
process. VMS has 32 priorities with a queue for each priority, and a bitmask
indicating if each queue is non-empty. To find the highest priority non-empty
queue:
SCH$SCHED::
SETIPL #IPL$_SYNCH
FFS #0,#32,W^SCH$GL_COMQS,R2 ; find first full state
BEQ SCH$IDLE ; no executable process
MOVAQ W^SCH$AQ_COMH[R2],R3 ; compute queue head address
--
PJDM
Digital Equipment Corporation (Australia), Canberra, ACT
--------------------------------------------------------------------------------
These are my opinions, and have nothing to do with Digital.
I design Internet Service Providers. It's a job.
Andy, I'd beg to differ:
Yes, the 386, 486 and Pentium do have BSF and BSR (Bit Scan Forward/Reverse)
opcodes, but they are so terribly slow as to be totally unusable in the sorts of
time-critical code where you need this kind of bit twiddling.
A regular binary search or loop with a final lookup table will be much faster than
the native opcode.
>[snip]
> Intel's Pentium Pro processor supports FFS in hardware.
Finally! But unfortunately too late, I'm afraid. :-(
--
- <Terje.M...@hda.hydro.com>
Using self-discipline, see http://www.eiffel.com/discipline
"almost all programming can be viewed as an exercise in caching"
Is this a trick question?
-eric
The fact that this opcode is slow, when it could be a very useful instruction,
is another case of processor vendors optimizing to a high level language, "C".
If C had a construct such as allowing a multi-way branch on a bit field, then
this instruction would be very useful to compiler writers and it would have
been supported in hardware. If the PDP-16 had this instruction all processors
would support it today, because K&R would have used it in C.
>
There really should be some effort to extend our high level languages to
support those constructs that are well supportable in hardware. For
instance the "multi-media" instructions that everyone is coming up
with could have existed long ago if we had a high level language that
supported the notion of more than one type of integer arithmetic.
Why can't C++, Ada, and other languages add native types for saturated
arithmetic? Another useful language extension would be to incorporate
SIMD; this may be implicitly captured in languages that support array
arithmetic.
emil
> ...[stuff deleted]....
>Why can't C++, Ada, and other languages add native types for saturated
>arithmetic? Another useful language extension would be to incorporate
>SIMD; this may be implicitly captured in languages that support array
>arithmetic.
>
Well, APL's been around for donkey's years, and has a dedicated following
within some niches, but only one CPU was ever designed around its array
operations (to the best of my out-of-date knowledge). Lisp has been around
even longer and has far more users, but custom Lisp CPUs seem a bit thin on
the ground too.
It seems that having a saleable language that takes a certain view of the
world is not enough to move the ISA designers....
============================================================================
Ian Kemmish 18 Durham Close, Biggleswade, Beds SG18 8HZ
i...@five-d.com Tel: +44 1767 601 361 Fax: +44 1767 312 006
Info on Jaws and 5D's other products on http://www.five-d.com/5d
============================================================================
`The customer is King, but the proprietor is God'
Define "trick question".
It's a question asking which of those ISAs "support" it by doing a
simple loop, checking each bit - which, on many modern machines, is
probably not much better than doing the same loop in software - and
which support it in some other fashion, e.g. executing it in constant
time if that's possible.
I guess it counts as a "trick question" in that it's intended to dig out
which machines "support" it in a fashion that probably isn't of much
interest to those who want the operation (i.e., that do it the slow way)
and which support it in a more "real" fashion.
I suppose that if:
1) there are a series of implementations of the ISA;
2) some of them are significantly cheaper by implementing it the
slow way (either in hardware or by a sufficiently-fast trap
to a software implementation);
the support even in the slow-implementation machines might be a win, in
that you can have a binary that uses the fast hardware implementation on
those machines that have it but still have it run OK on machines that
don't, although that might also be done by an indirect call to a library
routine (albeit with procedure-call overhead).
| This leads to my questions:
|
| 1. Which popular ISAs support find-first-bit or pop count?
|
| 2. Which implementations do it in hardware? What does ultra-I do?
|
| 3. Can any common applications make use of it?
|
| 4. Are there clever hacks for fast find-first-bit? The pop count hack:
Some decades ago we used to drop something in the mantissa of a FP
number and use the normalize instruction. That's probably a lot
slower than something else by now.
I think the first bit was done by N & (-N) on two's complement
machines, but watch out for the sign bit only case on some hardware.
--
-bill davidsen (davi...@tmr.com)
"As a software development model, Anarchy does not scale well."
-Dave Welch
> I believe the decryption spooks at the National Security Agency are
> very fond of the population-count operation. It seems to be so useful
> to them that every US-made govt-subsidized supercomputer has included
> this instruction even though there is little other use for it.
Popcount is also very useful in image-processing applications. Given
the recent popularity of scanners and OCR software, I wouldn't be surprised
to see a resurgence of its popularity.
> Find-first-bit is useful in the post-subtraction normalization step of
> software-implemented floating point; in some data compression schemes;
> and in disk or memory allocators that scan bitmaps for available slots.
Also, for cheap approximations to log2(n) and image-processing.
--
www/ftp directory:
ftp://ftp.netcom.com/pub/hb/hbaker/home.html
> If C had a construct such as allowing a multi-way branch on a bit field, then
> this instruction would be very useful to compiler writers and it would have
> been supported in hardware. If the PDP-16 had this instruction all processors
> would support it today, because K&R would have used it in C.
I never heard of a PDP-16, what is it? Why would all processors support an
instruction today, just because it was included in the "PDP-16" whatever
that is, or was? How did this "PDP-16" come to have so much influence over
today's computer instruction sets?
Regards,
John Byrns
Ah. Well, it is certainly possible to do constant-time hardware implementations.
A purely combinational 64-bit leading_zero should be quite doable in under 10ns
using typical 0.5um standard cell CMOS. Probably a fully combinational Pop_count
would take somewhat longer: perhaps 15ns. The real estate is not particularly
pricey either. For example, each combinational function should only require
perhaps 1K "gates" out of a typical 1M equivalent gate VLSI technology.
A straightforward combinational LZ implementation would:
1. propagate one bits to the right (from MSB to LSB) - the result is a
64-bit "mask" with zeroes to the left of the LZ-bit position and ones
to the right.
2. evaluate adjacent bit-pairs for the zero-to-one transition (i.e., the LZ
position) in the mask - of course only at most one of the 63 results will
be non-zero.
3. encode the result bits into a 7-bit position number - the LZ position
relative to the MSB is equivalent to the LZ count.
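A software analogue of this smear-and-count scheme, as a hedged C
sketch (not the circuit above), assuming a 64-bit unsigned long long
and some popcount64() such as a widened version of the popcount
routines elsewhere in this thread:

int leading_zero_count(unsigned long long x)
{
    x |= x >> 1;                 /* smear the leading one rightward */
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    return popcount64(~x);       /* the surviving zeros are the leading zeros */
}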
>
> I guess it counts as a "trick question" in that it's intended to dig out
> which machines "support" it in a fashion that probably isn't of much
> interest to those who want the operation (i.e., that do it the slow way)
> and which support it in a more "real" fashion.
>
> I suppose that if:
>
> 1) there are a series of implementations of the ISA;
>
> 2) some of them are significantly cheaper by implementing it the
> slow way (either in hardware or by a sufficiently-fast trap
> to a software implementation);
>
> the support even in the slow-implementation machines might be a win, in
> that you can have a binary that uses the fast hardware implementation on
> those machines that have it but still have it run OK on machines that
> don't, although that might also be done by an indirect call to a library
> routine (albeit with procedure-call overhead).
I guess "cheaper" is relative. Since these are less utilized functions, even
the 1K gates for each may be too much to spend. For CRAY, it doesn't seem to
have been.
-eric
[not speaking for CRI, SGI, or even the gas station down the street...]
I just tried to implement a few of these in Pentium asm, and it seems like this
approach, even when I implement the division (modulo) operation with a MUL, is
almost exactly the same speed as the classic unrolled mask/shift/add algorithm,
at about 19-20 cycles.
A cpu with a fast 32x32->(upper 32 bits of the result) multiply could be about
twice as fast however, since MUL on 386/486/Pentium all use about 10-11 cycles.
On a superscalar cpu, the 256-byte lookup table is a lot easier:
The same Pentium cpu can do this in less than 10 cycles, as long as the four
table lookups all hit the L1 cache.
--
- <Terje.M...@hda.hydro.com>
> On a superscalar cpu, the 256-byte lookup table is a lot easier:
> The same Pentium cpu can do this in less than 10 cycles, as long as
> the four table lookups all hit the L1 cache.
The Achilles heel of lookup tables: they eat into your precious L1
cache. 20 clocks looks pretty cheap if each lookup *misses* to
main memory. ('course 256-byte table is pretty small & tight loops
will keep the table local).
The CDC 6600 (Cray's first "big" machine) implemented POP count in hardware;
if I remember correctly, the latency was typically 2 cycles (100 ns
clock). The cheap version (CDC 6400 & 6500) implemented it with a loop
counter (still in hardware, Cray never built a microcoded machine). The
cycle count stayed the same on later machines, although the clock frequency was
increased to 40 MHz. The functional unit was also pipelined on later
machines, allowing 1 POP instruction per cycle.
It's interesting to note that the parity flag in the x86 ISA is the inverted
LSB of a population count of the result's low byte, although I can't think of
any way to use that info to
count bits faster.
Tim McCaffrey
?? Which CPU?
I worked on the Analogic APL Machine at one time; a computer system built
around matching APL to an array processor. We loaded the array processor's
control stores with microcode implementing APL primitives, and handled
non-arithmetic operations with a couple of MC68000s... It was a pretty
slick system, with APL keycaps and character-generators. Probably fewer
than 100 were built...
--
Karl Zimmerman zim...@bostech.com
Contracting Software Engineer
My opinions are not necessarily those of my employers.
They can be useful for scheduling algorithms.
Mike
--
----
char *p="char *p=%c%s%c;main(){printf(p,34,p,34);}";main(){printf(p,34,p,34);}
I don't speak for DSC. <- They make me say that.
>In article <4qp9vm$p...@newsgate.dircon.co.uk> i...@five-d.com (Ian Kemmish) writes:
>>Well, APL's been around for donkey's years, and has a dedicated following
>>within some niches, but only one CPU was ever designed around its array
>>operations (to the best of my out-of-date knowledge).
[snip]
APL is an array processing language. APL programmers tend to solve
problems from an array approach. This makes APL programs well suited
to parallel processing as they inherently have a lot of parallelism in
them. I once heard a talk given by Jim Brown of IBM where he described
porting APL2 to IBM's vector processor. He discovered that "good APL",
i.e. concise programs that exploit APL's powerful operators, was
surprisingly fast compared to programs written in other languages,
when run on the vector processor.
Perhaps APL would be particularly good at exploiting the potential of
superscalar processors.
: I worked on the Analogic APL Machine at one time; a computer system built
: around matching APL to an array processor. We loaded the array processor's
: control stores with microcode implementing APL primitives, and handled
: non-arithmetic operations with a couple of MC68000s... It was a pretty
: slick system, with APL keycaps and character-generators. Probably fewer
: than 100 were built...
I've always wondered if any APL machines are still running at the NSA.
Probably a bit hard to get parts for after all this time. Perhaps they
have been converted into planters or coffee tables. :)
--
Mike Duvos $ PGP 2.6 Public Key available $
m...@netcom.com $ via Finger. $
dav...@ICSI.Berkeley.EDU (David Petrie Stoutamire) writes:
> 3. Can any common applications make use of it?
> I believe the decryption spooks at the National Security Agency are
> very fond of the population-count operation. It seems to be so useful
> to them that every US-made govt-subsidized supercomputer has included
> this instruction even though there is little other use for it.
Actually, speech recognition people (and probably any pattern
recognition people) have a use for POP count:
POP( A AND B ) = the inner product (dot product) of one bit precision feature
vectors
POP( A XOR B ) = Hamming distance metric
A preliminary stage in many pattern recognition algorithms is to
compute a quick distance metric between the input pattern and some
subset of the reference patterns in the database. You can do this via
Hamming distance, or you can interpret it as a projection.
When projecting, you can arbitrarily tradeoff precision for speed
- i.e. sometimes you might do the dot product in 64 bit floating
point, sometimes in the 8 bit precision that Intel's MMX and similar
"multimedia" instruction sets provide. 1 bit precision is just the
extrapolation of this.
Thus, you can use POP count to take a quick check of many simple
reference vectors, afterwards zooming in on a more full precision
check.
Come to think of it, this is probably the same reason that the NSA
uses.
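A minimal sketch of that screening step, with popcount() standing in
for a hardware POP instruction; swap XOR for AND to get the one-bit
dot product instead of the Hamming distance.

int hamming_distance(const unsigned int *a, const unsigned int *b,
                     int nwords)
{
    int d = 0, i;

    for (i = 0; i < nwords; i++)
        d += popcount(a[i] ^ b[i]);   /* POP(A XOR B), word by word */
    return d;
}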
--
---
Andy "Krazy" Glew, gl...@ichips.intel.com, Intel,
M/S JF3-206, 2111 NE 25th Ave., Hillsboro, OR 97124
Place URGENT in email subject line for mail filter prioritization.
DISCLAIMER: private posting, not representative of employer.
We're looking for a few good microarchitects for our research labs.
{{ VLIW: the leading edge of the last generation of computer architecture }}
The earliest example of an APL-like processor is probably the CDC
STAR-100. This was a storage-storage vector machine that looked quite
a bit like APL. It had, for example, the ability to perform operations
under control of a Boolean mask. (Now in most new vector designs)
I never worked on it, so I can't be more specific.
The TMC CM-2 looked even more like APL: SIMD array operations,
scans and reductions (Really old APL ideas), and masked operations.
The Analogic machine (falling between the above two in history)
was interesting, but it failed for (at least) the reason that
it was dependent on a PC to run the interpreter. My understanding
of the machine, from talking with Jim Ryan, was that the PC ran the
interpreter and treated the Analogic box (a VERY fast vector
processor, built for catscan/signal processing work) as a backend
attached processor. It was, therefore, doomed by Amdahl's Law for
all but a few applications. Interpretive overhead on top of a slow
machine simply couldn't hack it.
That's why some of us (yours truly) are writing parallel APL compilers.
They give us the array and parallel semantics of APL, while removing
the interpretive overheads that have hamstrung APL performance for
the past decades.
Bob
ps: As for a machine built around APL -- I think you'll find that
it works the other way: APL was built around the machine; the machine
being the human brain, and APL being a generalization and cleanup
of mathematical notation.
Robert Bernecky
bern...@eecg.toronto.edu
Snake Island Research Inc
18 Fifth Street, Ward's Island
Toronto, Ontario M5J 2B9
Canada
tel: (416) 203-0854
fax: (416) 203-6999
If I'm not in the eecg lab, I'm home. Or somewhere in between.
Lab: SF2002
Timings on my compiler (APEX -- the APL Parallel Executor) show that
its performance relative to an interpreter is substantially higher on
RISC machines such as RS/6000 and UltraSparc than on a 486. I do NOT
know if this is inherent or merely reflects different interpreters
on the different platforms.
Bob
(pop count)
APL and J make heavy use of Boolean arrays to perform parallel operations
that might otherwise be handled by serial conditional branch constructs.
PopCount and Index-of-first-One(zero) are both APL primitives expressed
as follows on Boolean arrays:
PopCount: +/x NB. x is the vector or array to be summed.
NB. If x is an array, this sums each row.
Index-of-First-One: x iota 1
Index-of-First-Zero: x iota 0
Applications:
PopCount: How many customer accounts are owing more than 10000?
+/accounts>10000
Index-of-First-One: Who is the first such customer?
(accounts>10000)iota 1
Admittedly trivial, but you get the idea.
dav...@ICSI.Berkeley.EDU (David Petrie Stoutamire) writes:
> 1. Which popular ISAs support find-first-bit or pop count?
> 3. Can any common applications make use of it?
> Pop count might be used for cardinality of sets represented as bit
> vectors. We use find-leftmost-1 (along with bitwise-matrix-multiply)
> to support some of the common C string manipulation functions. Lets
> us do things like copy or compare NULL-terminated strings at a rate of
> 8 bytes per 3 instructions.
> Preston Briggs
Ok, I give. How does find-leftmost-1 help in C string functions?
I have somewhere in a box a book by Rodney Zaks (SYBEX) on a microcoded APL
implementation designed for the Nanodata QM-1. Not clear whether it was ever
used..
I've also heard of IBM 360 microcode for APL.
--
Alex Colvin
alex....@dartmouth.edu
And Cliff Click wrote:
>Ok, I give. How does find-leftmost-1 help in C string functions?
Imagine we're doing the easy one -- strlen.
We want to do it in word-sized chunks, 8 bytes per load.
We load a word, then do a bitwise matrix multiply with -1,
then find-leftmost-0 to see if there are any 0 bits in the result.
If so, we've found the end of the string and the position of the bit,
divided by 8, should be added to the accumulated length.
Obviously the trickiness is in the bitwise matrix multiply. This
operation treats a pair of 64-bit words as 8x8 bit matrices and
multiplies them using the boolean operations of AND for multiplying
two bits and either OR or XOR for summing. For this application, we
need the OR variant (the XOR version is useful for computing over GF(2)).
Imagine we have 8 bytes of a string that look like
abcdefg\0
organized as an 8x8 bit matrix, we'd have
01100001 11111111 11111111
01100010 11111111 11111111
01100011 11111111 11111111
01100100 x 11111111 = 11111111
01100101 11111111 11111111
01100110 11111111 11111111
01100111 11111111 11111111
00000000 11111111 00000000
If you're counting cycles, recall that we issue 3 operations per tick,
and all operations require the same time. The operations used in the
inner loop are
load
inc
bit-mat-mul
find-leftmost-0
conditional branch
which we software pipeline into 3 instructions.
Preston Briggs
> The Analogic machine (falling between the above two in
> history) was interesting, but it failed for (at least) the
> reason that it was dependent on a PC to run the interpreter.
> My understanding of the machine, from talking with Jim Ryan,
> was that the PC ran the interpreter and treated the Analogic
> box (a VERY fast vector processor, built for
> catscan/signal processing work) as a backend attached
> processor. It was, therefore, doomed by Amdahl's Law for all
> but a few applications. Interpretive overhead on top of a
> slow machine simply couldn't hack it.
The array processor got a decent amount of work done with a
maximum throughput of 10 MFLOPS. Its work was set up for it by a
couple of Motorola 68k processors which transformed the APL
codestring into sequences of array processor operations and
controlled I/O to mass storage.
The scalar performance of the system was of course significantly
less than its vector performance, but this was pretty much
obvious to everyone going into the project and not some sudden
surprise.
One purpose of the APL Machine project was to create a
"fingertip" array processor, which people could use to run large
vector intensive problems without a lot of programming.
Prior to the APL Machine, the mechanism for programming this and
other similar array processors was to hang them off a VAX Unibus,
and to write a VAX FORTRAN program containing lots of calls to
subroutines which allocated buffers, moved data to and from the
host machine, and activated chains of individual microcoded array
processor programs to perform the actual data munching.
Contrast this approach to programming with being able to type in
a string of APL, hit carriage return, and have an instant answer,
and the improvement in functionality is apparent.
One should view the success of the APL Machine in terms of the
degree to which it facilitated the use of an array processor for
applications appropriate for an array processor, and not in terms
of some imagined competition with existing mainframe APLs of the
era.
There are few cheap machines available even today where one can
reshape a random vector of 10,000 elements into a 100x100 matrix,
symmetrize it, compute a sorted list of all its eigenvalues and
eigenvectors, and have the answer back in slightly over one
second of elapsed time, with no more programming effort than
typing a trivial APL expression.
There were also major hardware reasons why the APL Machine was
suited only to array processor applications, and was not intended
as a general purpose APL platform. The CPU lacked the internal
consistency checks one would find in a mainframe or supermini and
was capable of running along merrily with missing chips and badly
seated boards. There was no microverification or parity on
internal data paths, and no attempt to engineer a "run correctly
or halt" aspect to the box. This resulted in what the
engineers referred to as "sustained but not instantaneous
accuracy", which, while perfectly adequate for signal and image
processing and certain other real time applications, was
inappropriate for other common uses of APL, like accounting and
financial modeling.
The APL machine also contained quite a bit of semiconductor
memory, prior to the solution of the alpha particle problem in
the manufacture of memory chips. So it was statistically likely
that an occasional bit would change state every so many hours.
Again, this was not a problem for typical signal processing
applications, which could accept an occasional hiccup, but would
have again been unacceptable for a mainframe, where good
engineering would have required that memory be sniffed and single
bit errors be repaired during the refresh. Not having error
correcting memory was a perfectly fine thing to do for a low cost
high performance array processor, which was the arena in which
the machine was designed to compete.
There were a number of numerical idiosyncrasies as well, such as
the use of a non-conventional rounding algorithm which did not
require a ripple carry, thus simplifying the design of the
floating point hardware. Again, perfectly OK in the world of
array processor applications, but a complete non sequitur to
someone used to floating-point arithmetic as it is conventionally
implemented.
So, Amdahl's law aside, we did succeed in creating the world's
most easily programmed array processor, which hummed along
merrily doing interesting problems in image processing, neural
nets, and GodKnowsWhat for the DOD types. This was, regrettably,
a small chunk of a not particularly large overall market for
array processors, and thus the APL Machine did not make us all
filthy rich beyond our wildest dreams. Whether this constitutes
"failure" or "success" of course depends upon ones point of view.
> That's why some of us (yours truly) are writing parallel APL
> compilers. They give us the array and parallel semantics of
> APL, while removing the interpretive overheads that have
> hamstrung APL performance for the past decades.
Compiling APL to native code, while it is certainly something we
know how to do, has not demonstrated itself to be the perfect
solution to the scalar overhead problem. Compiling to threaded
code, containing context validation instructions, calls to
parallel-optimized primitives, and a few well-chosen native mode
instructions when absolutely necessary seems the way to go here.
As an example, we did all the structural mixed functions on the
APL machine by compositing scatter/gather routines having various
granularities together with access pattern generators, set up as
co-routines coupled by small fixed-sized buffers. The individual
routines were small in number, and deeply pipelined. I'm not
sure a compiler could really have output faster code if the
arguments consisted of more than a few elements.
Rather than breaking APL down completely into native mode code,
you can buy yourself a lot by other tricks, like having a stack
frame wide enough to store scalars and small vectors inline, and
recognizing frequently occurring contexts involving scalars and
emitting native code specifically for those.
Personally, I think the "interpretive overhead" issue tends to be
a bit oversold. I have rarely encountered an application which
couldn't be tweaked to run acceptably fast on a well-written
production APL system, and even in the worst case, seeding the
APL environment with a few extra assembly language functions is
generally a faster solution than employing a tool as innately
complex as an APL compiler.
The Zaks implementation, like Budd's APL compiler, was
"APL-like", in that both changed a fair bit of the language semantics,
such as name scope rules. Thus, they could not attract the dusty deck crowd.
>I've also heard of IBM 360 microcode for APL.
This did a fair job of speeding up conformability checking and syntax
analysis, if my fading brain cells recall properly. It did not
help out much on array operation speed.
The performance of this on Booleans was harmed by the dumb, Intel-like,
storage scheme for Booleans, which was intended to exploit an odd
performance quirk of the machine model (/40 , /145?) for which it was built.
This ended up making all other uses of Booleans on that system slower.
Lesson, of course, is: Don't compromise data structures in the name
of machine X hardware characteristics.
A good compiler can do a much better job on performance than
any hardware-assisted interpreter, as the compiler can do nifty things
like exploit code motion, loop fusion, etc. These things are too expensive
to find on the fly in an interpreter. A hybrid compiler/interpreter
might be able to do both, but I'm not about to start writing one today.
Bob
Sorry if I gave the impression that the project was a failure
from the technical standpoint. I meant that it didn't take over
the world of APL or array computation. It was certainly a VERY
impressive box that could be carried on a car seat, as opposed
to taking up an entire room.
>Compiling APL to native code, while it is certainly something we
>know how to do, has not demonstrated itself to be the perfect
>solution to the scalar overhead problem. Compiling to threaded
>code, containing context validation instructions, calls to
>parallel-optimized primitives, and a few well-chosen native mode
>instructions when absolutely necessary seems the way to go here.
I disagree. My compiler, APEX (the APL Parallel Executor)
is able to beat FORTRAN on a number of iterative benchmarks, due
precisely to removal of setup costs and conformability checking overheads.
Threaded code and the other things you mention above are not doing
Real Work, cause pipeline stalls in RISC machines, etc. Parallel-optimized
primitives may not help as much as loop fusion combined with
larger-grain parallelism, e.g., at the level of an FFT. Fine-grain
parallelism will cause implicit synchronization delays at the end
of each primitive. You want to avoid these and get the effect of
chaining on the CRAY.
Loop fusion on a serial processor can provide significant speedup
even for non-iterative (in the APL code sense) applications on
large (half-million elements) arrays. I get a speedup of 13.44 over
an interpreter due to this, on an application with these characteristics,
which an APL programmer would claim was an excellent APL application.
A compiler has the luxury of time to find special cases, remove
conformability checks as a result of data flow analysis,
replace n^2 algorithms with linear-time ones, etc. Its code
can be trivially scheduled for RISC machines.
>
>As an example, we did all the structural mixed functions on the
>APL machine by compositing scatter/gather routines having various
>granularities together with access pattern generators, set up as
>co-routines coupled by small fixed-sized buffers. The individual
>routines were small in number, and deeply pipelined. I'm not
>sure a compiler could really have output faster code if the
>arguments consisted of more than a few elements.
True, but it presumes your application is running 99% in the
vector box. This may be true for some specialized applications,
but by and large, you end up coming back to the interpreter a lot.
When it did work in the vector box, the Analogic machine was
quite breathtaking to watch, I admit.
>Rather than breaking APL down completely into native mode code,
>you can buy yourself a lot by other tricks, like having a stack
>frame wide enough to store scalars and small vectors inline, and
>recognizing frequently occurring contexts involving scalars and
>emitting native code specifically for those.
We call the things that do this compilers.
(For those ignorant of APL, the language has no declarations.
Array type and rank (dimensionality) are determined entirely
by the execution context.) It is slow and plenty hard to do in
an interpretive environment. A hybrid environment might do the
trick, sort of like the HP/3000 approach.
Let the compiler analysis work in the background while you think
of what to type next.
>Personally, I think the "interpretive overhead" issue tends to be
>a bit oversold. I have rarely encountered an application which
>couldn't be tweaked to run acceptably fast on a well-written
>production APL system, and even in the worst case, seeding the
I'll show you some data base applications that eat 6 processors
on a 3090 for just this reason. Real-life APL applications
execute about half of ALL ops on 0 or 1-element arrays, and more than
85% of all ops are on arrays of about 16 elements or less.
(These numbers come from 3 different sources. I instrumented
a mainframe APL interpreter and captured data from all users for
several weeks for my version. Rob Willhoft of IBM obtained similar
data (IBM Systems Journal). An earlier Quote Quad paper by Jordan
provides similar data.)
I do not deny that you can find specific number-crunching applications
that will exploit a vector back end or in which interpreted APL will
perform quite well. What I DO deny is that such applications are
typical.
>APL environment with a few extra assembly language functions is
>generally a faster solution than employing a tool as innately
>complex as an APL compiler.
I don't know about you, but I certainly trust the correctness of
APL compiler-generated code over my hand-crafted assembler code,
and I've been writing that stuff since the dark ages.
(Maybe that's why. 8^})
And, the code is portable and parallel, which ain't the case with
assembler code.
The nice thing about the complexity of an APL compiler is that
the user need not know about it.
The expression...
~n & (n - 0x01010101) & 0x80808080
... gives zero if and only if none of the bytes in the word are zero.
If the result is nonzero then the position of the first 1 tells you which
byte was zero.
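A hedged sketch of how that expression drives a word-at-a-time strlen,
assuming 32-bit words and a word-aligned pointer; the trailing byte
scan sidesteps endianness concerns.

#include <stddef.h>

size_t wordwise_strlen(const char *s)
{
    const unsigned int *w = (const unsigned int *) s;
    const char *p;
    unsigned int n;

    do
        n = *w++;
    while (!(~n & (n - 0x01010101U) & 0x80808080U));   /* no zero byte yet */

    p = (const char *) (w - 1);       /* a zero byte lies in this word */
    while (*p)
        p++;
    return (size_t) (p - s);
}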
You might want to disassemble your own company's "libmoto", where the C
string functions make use of this property. :-)
-- Bruce
> I disagree. My compiler, APEX (the APL Parallel Executor) is
> able to beat FORTRAN on a number of iterative benchmarks,
> due precisely to removal of setup costs and conformability
> checking overheads.
> Threaded code and the other things you mention above are
> not doing Real Work, cause pipeline stalls in RISC machines,
> etc. Parallel-optimized primitives may not help as much as
> loop fusion combined with larger-grain parallelism, e.g., at
> the level of an FFT. Fine-grain parallelism will cause
> implicit synchronization delays at the end of each
> primitive. You want to avoid these and get the effect of
> chaining on the CRAY.
> Loop fusion on a serial processor can provide significant
> speedup even for non-iterative (in the APL code sense)
> applications on large (half-million elements) arrays. I get
> a speedup of 13.44 over an interpreter due to this, on an
> application with these characteristics, which an APL
> programmer would claim was an excellent APL application.
I think the most basic requirement for an APL compiler is that
the user not realize it is there. Implementors of APL executors
are perfectly free to employ compiler techniques if they wish,
including generating object code for everything and branching to
it if that is their choice. It is certainly also reasonable for
the purposes of software distribution to permit applications
translated in such a manner to be written, together with
necessary runtime support, into separate modules that can be
distributed to end users.
Where I think the "APL Compiler" approach has failed
historically, is that such efforts in the past have not been
seamlessly integrated into development environments. There has
been some external chunk of code known as "The APL Compiler",
with separate documentation, and it has been necessary to export
code, often with source changes, and run it through this extra
monolithic module in order to make use of its functionality. This
is generally not something lots of users choose to do, and past
APL Compiler efforts have not been resounding successes, outside
of producing interesting papers at the yearly convention.
For people who write APL that looks like FORTRAN, a compiler can
of course toss out all conformability checking, consolidate loop
iteration to cover multiple primitive operations, and wind up
with code that runs as fast as FORTRAN. However, when the code
involved utilizes the full dynamic typing and scoping rules of
the APL language, you are stuck with either conventional
interpretation, or context-sensitive compilation which faults and
recompiles to more general code for operations on data whose
attributes fluctuate. In the fully general case, you pretty much
need to have the APL system in memory with you, if you are going
to support the entire language.
> A compiler has the luxury of time to find special cases,
> remove conformability checks as a result of data flow
> analysis, replace n^2 algorithms with linear-time ones, etc.
> Its code can be trivially scheduled for RISC machines.
Of course.
>> Rather than breaking APL down completely into native mode
>> code, you can buy yourself a lot by other tricks, like
>> having a stack frame wide enough to store scalars and small
>> vectors inline, and recognizing frequently occurring
>> contexts involving scalars and emitting native code
>> specifically for those.
> We call the things that do this compilers.
I think the distinction between compilation and interpretation
begins to blur at this point. Having some sort of immediate mode
for object references which contains the data for small objects
is certainly independent of whether one is compiling or
interpreting. Again, an interpreter/compiler might choose to
translate incrementing the variable "I" into an assertion that
"I" is an integer scalar, an integer add instruction, and a
branch to check for overflow. It's certainly possible to
consider a context in which the first and last of these could be
redundant and deleted, or a context in which the integer
assertion would later fault and the data would be promoted to
floating point.
While I think that compiler techniques have a valuable place in
advanced APL systems, I don't think it is possible to create
separate entities called "APL Compilers" without restricting the
APL they compile to a FORTRAN-esque subset, unless one has most
of the innards of the APL system itself present in the runtime
support for the compiled modules. At this point, they really
cease to be separate entities, in any real sense of the word.
> I'll show you some data base applications that eat 6
> processors on a 3090 for just this reason. Real-life APL
> applications execute about half of ALL ops on 0 or 1-element
> arrays, and more than 85% of all ops are on arrays of about
> 16 elements or less. (These numbers come from 3 different
> sources. I instrumented a mainframe APL interpreter and
> captured data from all users for several weeks for my
> version. Rob WIllhoft of IBM obtained similar data (IBM
> Systems Journal). An earlier Quote Quad paper by Jordan
> provides similar data.)
Such statistics are widely known and quoted. I think most
implementors toy with the idea of generating actual machine
instructions for the most common special cases, and then wimp out
and do a straightforward interpreter, which is easier to code.
> I do not deny that you can find specific number-crunching
> applications that will exploit a vector back end or in which
> interpreted APL will perform quite well. What I DO deny is
> that such applications are typical.
While I understand what you are saying, I am not yet ready to
completely embrace the notion that APL applications that run
acceptably well under a tightly coded conventional interpreter
are highly unusual. I use APL for lots of numerical analysis
stuff, and I rarely hit the kind of performance bottlenecks you
describe. Sure the stuff would run faster if I converted it to
hand-tuned assembler, but cycles are cheap and getting cheaper.
Sure I'd like it if the APL on my PC generated real machine
instructions for all the obvious idioms, but I don't think most
of the things I run would really go from impulse to warp speed
even if that were the case.
> I don't know about you, but I certainly trust the
> correctness of APL compiler-generated code over my
> hand-crafted assembler code, and I've been writing that
> stuff since the dark ages. (Maybe that's why. 8^}) And, the
> code is portable and parallel, which ain't the case with
> assembler code.
> The nice thing about the complexity of an APL compiler is
> that the user need not know about it.
Well - that's kind of what I've been saying all along. Compiler
techniques are fine, if the implementor is willing to take the
time to do them, but they should forever remain invisible to the
user of the system.
Neither do I.
> But there was/were
> a/some 370 architecture machines which had "APL assist" in microcode.
> Someone from IBM might know more about it.
Yup. The 370/45 had a writable microcode store, and someone used it to
add an instruction: "interpret APL program". Talk about CISC... I used
that system in the mid-70s. On APL, it flew! (for the time) Except for
floating-point; there wasn't much the microcode could do about the '45's
lack of good hardware for that.
There was also an early-80s desktop thing for APL, but it wasn't
microcoded or special hardware, just an S/370 interpreter running the
S/370 APL interpreter. No speed demon.
Greg
--
________________________________________________________________________
Greg Pfister | My Opinion Only | Phone: (512) 838-8338
pfi...@austin.ibm.com | Sic Crustulum Frangitur | Fax: (512) 838-5989
> cli...@ami.sps.mot.com (Cliff Click) writes:
> > Ok, I give. How does find-leftmost-1 help in C string functions?
>
> The expression...
>
> ~n & (n - 0x01010101) & 0x80808080
>
> ... gives zero if and only if none of the bytes in the word are zero.
> If the result is nonzero then the position of the first 1 tells you which
> byte was zero.
>
> You might want to disassemble your own company's "libmoto", where the C
> string functions make use of this property. :-)
The source code is sitting in front of me. I feel silly.
Thanks,
This arose in the very early eighties before PCs could be
considered for use as a controller, I think. The plan
I worked on was supposed to be channel attached to the
S/370 running Sharp APL, making its likelihood of
survival and proliferation even smaller.
I disagree, for two reasons:
a. Users of C interpreters(debuggers) are quite happy with
the bimodal approach to product construction. APLers
who are building products are no different.
b. There are other reasons for having an APL compiler
besides raw performance. Of course, performance is a
most important reason.
- Packaging a product for distribution to clients
is made nastier if the client has to distribute an all-singing
all-dancing interpreter with the product. Ditto
error handling in the product (for example, does it
drop the user into the interpreter, when it dies while
the user is pricing an insurance policy for a customer?).
- Intellectual property issues: Interpretive systems,
at least of the APL nature, effectively make you ship
source code with your product. APL locked functions
are not, by their nature, highly effective means of
encryption. If they were, execution speed would be
limited by decryption speed within the interpreter.
Now, I do not disagree that having an APL executor that gets
great performance by compiling on the fly is a swell
idea. But, I think other issues are involved.
>Where I think the "APL Compiler" approach has failed
>historically, is that such efforts in the past have not been
>seamlessly integrated into development environments. There has
>been some external chunk of code known as "The APL Compiler",
>with separate documentation, and it has been necessary to export
>code, often with source changes, and run it through this extra
>monolithic module in order to make use of its functionality. This
>is generally not something lots of users choose to do, and past
>APL Compiler efforts have not been resounding successes, outside
>of producing interesting papers at the yearly convention.
This is certainly true of APL and of C and of FORTRAN, for
debug vs optimized build reasons. Everyone would love to
see the situation you describe be the way to go. BUT, I think
you have to be willing to at least push the "please compile this
here thingy" button, and you have to put SOME constraints on
the language if you want excellent performance.
The key is to make these as unobtrusive as possible.
>
>For people who write APL that looks like FORTRAN, a compiler can
>of course toss out all conformability checking, consolidate loop
>iteration to cover multiple primitive operations, and wind up
>with code that runs as fast as FORTRAN. However, when the code
>involved utilizes the full dynamic typing and scoping rules of
>the APL language, you are stuck with either conventional
>interpretation, or context-sensitive compilation which faults and
>recompiles to more general code for operations on data whose
>attributes fluctuate. In the fully general case, you pretty much
>need to have the APL system in memory with you, if you are going
>to support the entire language.
This is the accepted wisdom of the ages that has inhibited APL's
growth by keeping it running at a crawl (did I really write that?).
However, several parts of the above deserve clarification:
a. "toss out all conformability checking": This sounds like a
leave-the-parachute-behind switch. APEX does not do this.
It analyzes the source code and discards conformability checks
only when data flow analysis guarantees that you can NOT
get a length error. The code to do this is relatively straightforward,
but could do a much better job if I introduced shape algebra
code into the data flow analyzer. Even so, it does a good
job at present. For my current set of about 40 standard
benchmarks, APEX is able to discard 80% of ALL scalar function
conformability checks.
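As a rough illustration of what discarding such a check means in the
generated code -- a sketch in C, not APEX's actual output, with
length_error() invented for the example:

extern void length_error(void);

/* shapes not provably equal: the check must be emitted */
void add_checked(int *r, const int *a, int na, const int *b, int nb)
{
    int i;
    if (na != nb) length_error();   /* conformability check */
    for (i = 0; i < na; i++) r[i] = a[i] + b[i];
}
/* when dataflow proves na == nb, the same loop minus the check */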
b. " When the code involved uses full dynamic typing and scoping rules...
you are stuck with interpretation..."
Not so, for several reasons.
First of all, APEX uses static single assignment to map
all that typing into purely functional form. This makes
the WHOLE JOB of compiling APL much, much easier. Without SSA,
the analysis is daunting, and I would probably have given
up for the reasons you suggest.
So, that problem is solved.
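SSA renames each assignment to a fresh variable, so every name gets
exactly one definition and hence one statically known type -- a
sketch of the idea in C:

/* APL-style retyping: X <- 5, then X <- X + 0.5.         */
/* After SSA renaming, two names, each with one type:     */
int    x1 = 5;
double x2 = x1 + 0.5;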
More importantly, though, is what you get from an analysis
of realistic APL applications: They do NOT use "full dynamic
typing..." because applications tend to be things that people
want to maintain. Dynamic scoping is an evil thing if you
ever want to change a program written by another, and large
APL site programmers simply shun these coding styles if
there is any way to avoid them.
>While I think that compiler techniques have a valuable place in
>advanced APL systems, I don't think it is possible to create
>separate entities called "APL Compilers" without restricting the
>APL they compile to a FORTRAN-esque subset, unless one has most
>of the innards of the APL system itself present in the runtime
>support for the compiled modules. At this point, they really
>cease to be separate entities, in any real sense of the word.
This is not true. The things you have to leave out are
things that manipulate the symbol table (quadFX), and
if you want reasonable performance it is advisable to adopt
a control-structured mapping of APL's devil-may-care branching.
>
>
>Such statistics are widely known and quoted. I think most
>implementors toy with the idea of generating actual machine
>instructions for the most common special cases, and then wimp out
>and do a straightforward interpreter, which is easier to code.
Actually, MOST implementors (by which I mean those of us who
been in the biz of creating APL systems that are actually used
by paying customers) write a straightforward interpreter,
then go nuts trying to speed it up by "generating actual
machine code for the most common special cases". This, of course,
does not work very well except for a few n^2 or n^3 complexity
primitives, as you still get nailed by syntax analysis, setup,
and storage management on the rest of the code.
Trust me on this one. I been dere. I did dat. After about
20 years, I gave up.
>While I understand what you are saying, I am not yet ready to
>completely embrace the notion that APL applications that run
>acceptably well under a tightly coded conventional interpreter
>are highly unusual. I use APL for lots of numerical analysis
>stuff, and I rarely hit the kind of performance bottlenecks you
Numerical analysis in APL is, believe it or not, fairly rare.
Most uses are commercial in nature. Your stuff is probably
lying in the high end of the interpreter performance envelope,
with big arrays and not much iteration.
A compiler would get speed on that from loop fusion, to the tune
of about a factor of 1 for each function fused in the expression
(fusing k primitives buys roughly a k-fold speedup). Startup costs
on such applications are small, so the speedups of n-hundred are
not going to happen.
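A sketch of what fusion buys for an expression like A+B*C -- one
pass over the data instead of one pass per primitive (all names here
are invented for the example):

/* interpreter-style: one loop per primitive, plus a temporary t[] */
void add_times_unfused(int *r, int *t, const int *a, const int *b,
                       const int *c, int n)
{
    int i;
    for (i = 0; i < n; i++) t[i] = b[i] * c[i];
    for (i = 0; i < n; i++) r[i] = a[i] + t[i];
}

/* fused: a single loop, no temporary */
void add_times_fused(int *r, const int *a, const int *b,
                     const int *c, int n)
{
    int i;
    for (i = 0; i < n; i++) r[i] = a[i] + b[i] * c[i];
}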
>Well - that's kind of what I've been saying all along. Compiler
>techniques are fine, if the implementor is willing to take the
>time to do them, but they should forever remain invisible to the
>user of the system.
This is, of course, a peachy keen idea, but reality has this
bad habit of intruding...
You might want to test your code before you post it. :-)
I would suggest that you try n = 0x01000000, and n = 0x00010000.
Since these both produce 0x80808080, I don't see how this trick
is going to help you find the LEFTMOST zero byte.
On the other hand, if you are looking for the RIGHTMOST zero byte, ...
--
den...@netcom.com (Dennis Yelle)
"You must do the thing you think you cannot do." -- Eleanor Roosevelt
> I disagree, for two reasons:
> a. Users of C interpreters (debuggers) are quite happy with
> the bimodal approach to product construction. APLers who are
> building products are no different.
Users of languages like C and FORTRAN have worked with painful
development environments for so long that any step beyond the
typical compiler/linker/source_level_debug tools appears to be a
giant one. An interpreter/debugger looks like a gift from God to
these folks.
Users of languages like APL, LISP, and even the better
incrementally compiled BASICs are already used to being able to
demand immediate execution of any expression in the suspended
program context, and to having source code edits instantly
manifest themselves as soon as they exit the editor. Moving
these folks to the new high performance "bimodal" approach to
product construction is going to be a hard sell.
> b. There are other reasons for having an APL compiler
> besides raw performance. Of course, performance is a most
> important reason. - Packaging a product for distribution to
> clients is made nastier if the client has to distribute an
> all-singing all-dancing interpreter with the product.
If the interpreter cannot sing and dance, then what you are
distributing is a subset of APL. :)
> Ditto error handling in the product (for example, does it
> drop the user into the interpreter, when it dies while the
> user is pricing an insurance policy for a customer?). -
This of course depends upon what the product's designer has done
in terms of trapping errors and processing them. If there is no
path back to immediate execution mode, and nothing which requires
expression evaluation in the code, then of course these modules
need not be accessible from the distributed executable. Again,
not really a compiler versus interpreter issue at all.
> Intellectual property issues: Interpretive systems, at
> least of the APL nature, effectively make you ship source
> code with your product. APL locked functions are not, by
> their nature, highly effective means of encryption. If they
> were, execution speed would be limited by decryption speed
> within the interpreter.
Not true. Once the translation of the source into the internal
form becomes sufficiently complex to make its inverse
impractical, one has to keep function source around anyway. I
tend to prefer this, as I don't like my whitespace or numeric
constants synthetically regenerated every time I open a function
for editing. Locking then simply consists of deleting the
source, and the pseudocode from which the interpreter works is
usually not particularly illuminating to someone bent on source
reconstruction.
> This is certainly true of APL and of C and of FORTRAN, for
> debug vs optimized build reasons. Everyone would love to see
> the situation you describe be the way to go. BUT, I think
> you have to be willing to at least push the "please compile
> this here thingy" button, and you have to put SOME
> constraints on the language if you want excellent
> performance. The key is to make these as unobtrusive as
> possible.
Again, I think it is misleading to draw conclusions from what C
and FORTRAN users might be persuaded to tolerate. A better model
might be LISP, where high performance optimizing compilers
seamlessly integrated into the system have been the standard for
years. Yes, one has a "compile button" and one can write out
executables, but there are not a whole lot of language
constraints put on programmers, although people are of course
perfectly free to confine their coding efforts to things that can
be made to run fast.
> This is the accepted wisdom of the ages that has inhibited
> APL's growth by keeping it running at a crawl (did I really
> write that?).
Yes you did. I might suggest that the conventional wisdom as to
how slow APL supposedly runs bears no relationship to the real
behavior of most production APL systems. I have seen lots of
FORTRAN programmers greatly impressed when they were finally
shown what APL (albeit interpreted) can do.
> More importantly, though, is what you get from an analysis
> of realistic APL applications: They do NOT use "full dynamic
> typing..." because applications tend to be things that
> people want to maintain. Dynamic scoping is an evil thing if
> you ever want to change a program written by another, and
> large APL site programmers simply shun these coding styles
> if there is any way to avoid them.
I disagree. Breaking something complicated into reasonably sized
easily-understood chunks is often best done by having smaller
"helper" functions do work on variables declared local in the
procedures that call them. There is nothing evil about this, and
it sure beats having a workspace filled with lots of unidentified
global variables, or functions of several hundred lines with 20
line headers.
There is a huge amount of code out there that makes use of
dynamic scoping in APL, and people are not going to be happy if
the official "APL Compiler" chokes on it.
Again, dynamic scoping is a "feature" unless you are a FORTRAN
or C programmer. The solution to dynamic scoping is shallow
binding, not an ideological purge.
> Actually, MOST implementors (by which I mean those of us who
> been in the biz of creating APL systems that are actually
> used by paying customers) write a straightforward
> interpreter, then go nuts trying to speed it up by
> "generating actual machine code for the most common special
> cases". This, of course, does not work very well except for
> a few n^2 or n^3 complexity primitives, as you still get
> nailed by syntax analysis, setup, and storage management on
> the rest of the code. Trust me on this one. I been dere. I
> did dat. After about 20 years, I gave up.
I'm not sure I see this as an unsolvable problem. Again, I've
seen proposals for APLs (unfortunately ones that didn't get
implemented) which almost completely eliminated the overhead of
memory management, setup, syntax analysis and the like, and in
which control of execution resided in sequences of machine
instructions produced by translation of the APL source. Again,
if you generate code which anticipates the most common case, and
adjust it when problems occur, most of the time you can get the
same speed as compiled code.
This has been done with interactive BASIC for over a decade now,
and although I am certainly not suggesting that BASIC is as
complex as APL, dynamic string management and garbage collection
in a large virtual address space does present some similar design
problems.
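The "anticipate the most common case, adjust when problems occur"
scheme amounts to guarded fast paths -- a sketch with invented names,
not any particular system's code:

enum tag { INT_VEC, CHAR_VEC, FLOAT_VEC };
struct array { enum tag tag; int n; void *data; };

extern struct array add_int_vec(struct array a, struct array b);
extern struct array general_eval(int op, struct array a, struct array b);

struct array add(struct array a, struct array b)
{
    /* fast path compiled for the case seen so far */
    if (a.tag == INT_VEC && b.tag == INT_VEC && a.n == b.n)
        return add_int_vec(a, b);
    /* assumption failed: fall back (or recompile more generally) */
    return general_eval('+', a, b);
}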
> Numerical analysis in APL is, believe it or not, fairly
> rare.
> Most uses are commercial in nature. Your stuff is probably
> lying in the high end of the interpreter performance
> envelope, with big arrays and not much iteration.
Most data processed is commercial. Most computers are not
ultrafast numeric engines. Most programmers are not research
scientists.
For all I know there may very well be a multi-million dollar
market for compiling APL pension plan software into code that
runs with the speed of FORTRAN. Then again...
> This is, of course, a peachy keen idea, but reality has this
> bad habit of intruding...
Time will tell.
True, but people doing things such as dynamic programming
or simulations might be willing to tolerate subsetting in
order to get a factor of several hundred speedup. People are
strange that way -- wanting their problem to run to completion
before they die.
> > Intellectual property issues: Interpretive systems, at
> > least of the APL nature, effectively make you ship source
> > code with your product. APL locked functions are not, by
>
>source, and the pseudocode from which the interpreter works is
>usually not particularly illuminating to someone bent on source
>reconstruction.
I disagree. Unlocking locked functions is a common activity,
and if the underlying code is doing something of significant
value, you may be willing to go to some lengths to examine
it in detail. Comment removal does not usually mask algorithms.
>Again, I think it is misleading to draw conclusions from what C
>and FORTRAN users might be persuaded to tolerate. A better model
>might be LISP, where high performance optimizing compilers
Good point.
>
>Yes you did. I might suggest that the conventional wisdom as to
>how slow APL supposedly runs bears no relationship to the real
>behavior of most production APL systems. I have seen lots of
Perhaps you have not lived around production APL systems very much.
I have seen Great Mainframes brought to their knees
by fairly trivial APL applications performing tasks such
as finite state machines tripping over data coming from
outside sources, such as stock tickers.
>I disagree. Breaking something complicated into reasonably sized
>easily-understood chunks is often best done by having smaller
>"helper" functions do work on variables declared local in the
>procedures that call them. There is nothing evil about this, and
>it sure beats having a workspace filled with lots of unidentified
>global variables, or functions of several hundred lines with 20
Sorry. I think we're confused on terminology here.
I was trying to suggest that use of semi-globals (or globals)
as means of writing large systems is a way to make people
old before their time. Which I think was your point as well.
>There is a huge amount of code out there that makes use of
>dynamic scoping in APL, and people are not going to be happy if
>the official "APL Compiler" chokes on it.
I don't know if the official one chokes on it or not, but my
APEX compiler handles it just fine. I was trying to suggest
that some language cleanup in APL would go a long way toward
making applications easier to write and more maintainable.
>Again, dynamic scoping is a "feature" unless you are a FORTRAN
>or C programmer. The solution to dynamic scoping is shallow
>binding, not an ideological purge.
Agreed.
>I'm not sure I see this as an unsolvable problem. Again, I've
>seen proposals for APLs (unfortunately ones that didn't get
^^^^^^^^^^^^^^^^^^
>implemented) which almost completely eliminated the overhead of
>memory management, setup, syntax analysis and the like, and in
>which control of execution resided in sequences of machine
>instructions produced by translation of the APL source. Again,
>if you generate code which anticipates the most common case, and
>adjust it when problems occur, most of the time you can get the
>same speed as compiled code.
I will believe this when I see one that is implemented. Paper
designs have this funny property of slowing down when you
actually have to implement them.
This approach goes back to the HP/3000 APL circa late 1970s.
I think it has promise if you are willing to clone functions
so as to have several distinct copies of code lying around.
Otherwise, you can end up with lousy performance because you are
constantly compiling and invalidating code. As an example
of this, consider a NUB function (gives unique elements of a
vector) or similar utility function that is called from two
loci, once with integer data and once with character data.
If you don't clone the NUB and make two copies,
one for each data type it's called with, you end up
with performance that is worse than that of a naive interpreter.
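A sketch of what cloning buys here (invented code, one clone shown;
the character-data clone would be identical but for the element type):

/* NUB for integer vectors: unique elements, order preserved */
int nub_int(const int *v, int n, int *out)
{
    int i, j, m = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++)
            if (out[j] == v[i]) break;
        if (j == m) out[m++] = v[i];  /* first occurrence: keep it */
    }
    return m;   /* count of unique elements */
}

With one compiled copy per element type, neither call site ever
invalidates the other's code.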
Bob
This is a ticklish point, probably with more discussion in it than merit. Is a
microcoded machine less a natural CPU than a hardwired one?
All of the APL machines that I have encountered were microcoded at one level or another.
The best might have been the Burroughs 700 machine that they announced and
demonstrated at APL 76. At the same convention, the MCM-80 APL machine was
demonstrated as a desktop. It had a one-line LED dot matrix display, more of a
desktop calculator than a computer. Both of these predate the IBM 51x0, which were
(in all of their documentation) microcoded, but the architecture was very vertical,
looking very much like an embedded controller design.
Lump in the others and which is microcoded, which is emulated, which-'n-what's
compiled?
Mike.
Oops, you're correct that the find-leftmost-1 test is fooled by one or more
^A characters immediately before the first zero byte. I guess I'm lucky that
text strings seldom do this, but I'm off to fix my own routine right now.
However, the main claim -- that the result is zero if and only if none
of the bytes in the word are zero -- still stands. It's just that
tidying up the odd remainder bytes isn't quite as tidy as I'd thought :-(
-- Bruce
Could you expand on why semiglobals are such an evil? The maintenance
issue does not seem compelling, since you can write unmaintainable
code with almost anything. On the other hand, this programmer, at
least, found them both natural and useful.
Thanks
Olivier Lefevre
Is there a FAQ or web site or book somewhere that has a really
thorough list of all of these sorts of bizarre optimizations?
I know some showed up in HAKMEM, but certainly not all. Some in
Graphics Gems, too...where else?
Thanks,
Doug
--
Doug Merritt do...@netcom.com
Professional Wild-eyed Visionary Member, Crusaders for a Better Tomorrow
Unicode Novis Cypherpunks Gutenberg Wavelets Conlang Logli Alife Anthro
Computational linguistics Fundamental physics Cogsci Egyptology GA TLAs
In article <31DEE3...@webworldinc.net>,
Mike Housky <mi...@webworldinc.net> wrote:
>All of the APL machines that I have encountered were microcoded at one level or another.
> The best might have been the Burroughs 700 machine that they announced and
>demonstrated at APL 76. At the same convention, the MCM-80 APL machine was
>demonstrated as a desktop. It had a one-line LED dot matrix display, more of a
>desktop calculator than a computer. Both of these predate the IBM 51x0, which were
>(in all of their documentation) microcoded, but the architecture was very vertical,
>looking very much like an embedded controller design.
The early MCM-80 contained an 8008 microprocessor. The APL interpreter
was written in assembler. They later converted to bit slice hardware,
and emulated an 8008, to run the interpreter faster. Yes, it was a
poor choice, but the machine was actually usable by the standards
of the day.
The IBM 51x0 had microcode which emulated the IBM 360 ISA. On top of that
was the old original IBM 360 APL interpreter.
--
Don D.C.Lindsay http://www.cs.colorado.edu/~lindsay/Home.html
Greg Pfister <pfi...@austin.ibm.com>:
| Neither do I.
As an undergraduate I worked on Richard Hobson's "SAM" (Structured
Architecture Machine) project in the Simon Fraser University
(Vancouver, BC, Canada) CS dept, around 1980-81. This was a true
APL machine, in the same manner as the Symbolics machines were true
Lisp machines.
At the time I worked on it, SAM was built around dual (Micron?) MK-16
16-bit microprocessors, programmed in vertical microcode. (You could
look on the MK-16 as a crude RISC machine, but it's probably more
accurate to think of it as a 16-bit bit-slice chip. As I recall,
microinstructions were around 30 bits long.)
The basic design looked like this:
[[Note to the non-APL-cognoscenti: in APL, *all* data objects are arrays.
"Scalars" are just the special case of 0-dimensional array.]]
+-------------+
| IO | PMU = Program Management Unit
| processor | (APL interpreter)
| (Z-80) |
| (assembler) | DMU = Data Management Unit
+-------------+ (all array operations)
/|\
|
|
\|/
+-------------+ --------+ +-------------+
| | -------->::::|-----> | |
| PMU | --------+ | DMU |
| (MK-16) | | (MK-16) |
| (microcode) | +-------- | (microcode) |
| | <-----|::::<-------- | |
+-------------+ +-------- +-------------+
/|\ /|\
| |
| |
\|/ \|/
+-------------+ +-------------+
| symbol | | symbol |
| table | | table |
| | | |
| APL fns | | APL arrays |
| | | |
+-------------+ +-------------+
The PMU handled the main logic for the APL interpreter. This was all
written in MK-16 microcode. (Part of) the symbol table was stored in
the PMU's off-chip (segmented virtual) memory. User APL functions were
also stored there.
(Except for temporary editing buffers, this was the
*only* in-memory storage for user APL functions; the
APL-source-text to internal-form "assembly" transformation
was reversible to recover the source for display, editing,
or other external IO. I never did figure out why we
didn't just keep the source on disk, as that would
have allowed some streamlining of the PMU's internal
representation, not to mention allowing comment text
not to tie up PMU memory.)
The internal form of user APL functions was sort of a hybrid of RPN
and 3-operand assembler-type operations, with "APL array" being the
basic data object. The PMU interpreted this, and sent low-level data
manipulation instructions (referring to APL arrays by an integer symbol
table index) to the DMU via a queue.
The DMU maintained a parallel symbol table in its off-chip (segmented
virtual) memory, so it could understand symbol table indices coming
from the PMU. The DMU did all the actual manipulation of APL arrays,
again programmed entirely in MK-16 microcode. There was no hardware
assist for big-array operations. Indeed, the MK-16 hardware provided
(only) 16-bit integer arithmetic, so bigger integers and all floating
point had to be done in (microcoded) software. For conditional branches
and the APL execute function (which takes a user-level APL string and
executes it as APL source code), the PMU sent the DMU instructions on
what to do, and the DMU sent status codes back to the PMU via another
queue.
The whole architecture was designed, and the internal representations
were optimized, to keep interpretive overheads low and make common cases
(scalars and small arrays) run fast. As I recall, the function-call
overhead was well under a dozen cycles, and the fast-path scalar integer
I <- I+1 was 2 or 3 cycles. For Fortran-like loops, our fast-path
code was basically as fast as anything a Fortran compiler could have
generated, subject to the constraint that we treated each APL primitive
independently.
Another interesting aspect to this project was that all the microcode
was initially written and debugged in APL itself, using cover functions
simulating the machine's operation. APL is in fact quite a nice
environment for this kind of thing, much as LISP is a nice environment for
building new AI languages.
Alas, with tight funding and at most 3-4 people working on the project,
it took quite a few years to get a working system, and by that time
the state of the art in hardware had moved a long ways past the MK-16.
When I worked on the project the hardware wasn't ready yet. However,
I wrote enough of the architecture simulator and enough of the PMU's
APL-interpreter microcode so I could run Ackermann's function (hand-
compiled). When I left SFU in 1983 I believe Dr. Hobson had a
wire-wrapped prototype working, but I don't think it was ever even
close to being ready for commercial product production engineering.
A project like this really needs at least an order of magnitude more
resources than we had.
- Jonathan Thornburg <bk...@island.net> (personal E-mail)
U of British Columbia / Physics Dept / <thor...@theory.physics.ubc.ca>
"Washing one's hands of the conflict between the powerful and the powerless
means to side with the powerful, not to be neutral." - Freire / OXFAM
BESM-6 had (at least) two more bizarre instructions: bit packing and
unpacking by mask. They allowed conversion of octal numbers to text
format and back in a couple of instructions, and many more neat things,
like encryption, bitmapped graphics, etc.
Leo
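In modern terms these are bit scatter/gather (deposit/extract)
operations. A minimal C sketch, with invented names and 32-bit words
standing in for the BESM-6's 48:

/* gather: pack the bits of x selected by mask into the low end */
unsigned pack_by_mask(unsigned x, unsigned mask)
{
    unsigned r = 0, bit = 1;
    for (; mask; mask &= mask - 1, bit <<= 1)
        if (x & mask & -mask)    /* mask & -mask = lowest set bit */
            r |= bit;
    return r;
}

/* scatter: spread the low bits of x to the positions set in mask */
unsigned unpack_by_mask(unsigned x, unsigned mask)
{
    unsigned r = 0;
    for (; mask; mask &= mask - 1, x >>= 1)
        if (x & 1)
            r |= mask & -mask;
    return r;
}

Octal-to-text is then unpack_by_mask(n, 0x07070707) to spread four
3-bit digits into four bytes, plus an add of 0x30303030 to make them
ASCII.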
Cool! That sounds kind of like an Intercal machine. Mmmm, Intercal....
---
chris
The binary search method is pretty good. On some chips the following
way is faster.
Suppose you have a 32-bit word of the form 00...0011...11 with ones in the
n lowest positions (0 <= n <= 32). If you multiply mod 2^32 by the magic
number 7*255*255*255 then the top 6 bits of the product uniquely identify n.
So declare a table:
static char tab[64] =
{0,1,0,16,0,2,29,0,17,0,0,0,3,22,30,0,0,0,20,18,11,0,13,0,0,4,0,7,0,23,31,0,
15,0,28,0,0,0,21,0,19,10,12,0,6,0,0,14,27,0,0,9,0,5,0,26,0,8,25,0,24,0,32,0};
Now for the index of the lowest set bit (i.e., the ffs() function):
int ffs32(unsigned n)        /* n assumed 32 bits wide */
{
    if (n) {
        n ^= n-1UL;          /* ones up through the lowest set bit */
        n = (n<<3)-n;        /* multiply by 7... */
        n = (n<<8)-n;        /* ...then by 255, three times: */
        n = (n<<8)-n;        /*    7*255*255*255 in all */
        n = (n<<8)-n;
        return tab[n>>26];   /* top 6 bits of the product index tab */
    } else return 0;
}
On ARM this comes out as:
; input and output in n, workspace t
TEq n, #0 ; test equality with 0
SubNE t, n, #1 ; subtract if not equal
EOrNE n, n, t ; exclusive or if not equal
RSbNE n, n, n, LSL #3 ; reverse subtract if not equal
RSbNE n, n, n, LSL #8
RSbNE n, n, n, LSL #8
RSbNE n, n, n, LSL #8
LdRNEB n, [tab, n, LSR #26] ; load register, if not equal, with byte
Pretty fast, n'est-ce pas?
If you want the index of the highest set bit do this instead:
int fls32(unsigned n)        /* highest set bit; same tab[] as above */
{
    n |= n>>1;               /* smear the highest set bit down... */
    n |= n>>2;
    n |= n>>4;
    n |= n>>8;
    n |= n>>16;              /* ...ones up through the highest bit */
    n = (n<<3)-n;            /* same 7*255*255*255 multiply */
    n = (n<<8)-n;
    n = (n<<8)-n;
    n = (n<<8)-n;
    return tab[n>>26];       /* n == 0 yields tab[0] == 0 */
}
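For comparison, the binary search method mentioned at the top comes
out roughly like this (a sketch; 32-bit unsigned, 1-based like ffs):

int ffs_binary(unsigned n)
{
    int r = 1;
    if (n == 0) return 0;
    if ((n & 0xFFFF) == 0) { n >>= 16; r += 16; }
    if ((n & 0x00FF) == 0) { n >>=  8; r +=  8; }
    if ((n & 0x000F) == 0) { n >>=  4; r +=  4; }
    if ((n & 0x0003) == 0) { n >>=  2; r +=  2; }
    if ((n & 0x0001) == 0) { r += 1; }
    return r;
}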
Bye,
Dances with bits.
.-. Robert...@inria.fr .-.
/ \ .-. .-. / \
/ \ / \ .-. _ .-. / \ / \
/ \ / \ / \ / \ / \ / \ / \
/ \ / \ / `-' `-' \ / \ / \
\ / `-' `-' \ /
`-' Hit me with those laser beams. `-'
What ever happened to Denny Roberson anyway?
--
Del Cecchi
cecchi@rchland
Rob
--
Rob Peglar Network Systems Corporation
ro...@network.com 7600 Boone Ave N. Mpls. MN 55428
612.391.1028 612.391.1358 (fax)
Maybe they were planning to run Intercal.
-- Richard
--
This article was probably generated by a buggy newsreader.
Hey, I don't want to start a flame war (languages and OS's - that's
religious stuff to some people!) but whenever I think about APL
it reminds me of the old saying (Maybe Mark Twain) "don't take a
job that requires investment in new clothing". In this case,
"don't use a language that requires a new font" :-)
Maybe my problem with APL is that I very rarely find myself
needing to invert matrices at the spur of the moment.
Now I expect you APL fanatics to have at least one "iota" of
humour in you.
Paul
All opinions strictly my own.
--
Paul W. DeMone The 801 experiment SPARCed an ARMs race to put
Kanata, Ontario more PRECISION and POWER into architectures with
pde...@tundra.com MIPSed results but ALPHA's well that ends well.
Satan's CPU? The P666 of course - check out Appendix H(ell) under NDA
Better by far than the 5100 series was the MCM series. This was an
APL machine which booted in APL. It had 40K (out of 64K) dedicated to
the interpreter and associated O/S needs. It used the remaining 24K
for application code in an overlay type of virtual workspace. Objects
(functions, variables) of 4K or less were automatically swapped to
disk when not in use to make room for the rest of the code.
We wrote a system which supported a manufacturer doing over $20
million per year of packaging materials with this system. I did
finished goods inventory, raw materials, Accounts Receivable and
Accounts Payable, including a slew of sales analysis and operations
reports. The configuration that supported all this consisted of one
MCM 1024, one 10 megabyte hard disk and a thimble printer. Total
value, circa 1980, $40,000.
I do not think the CPU was specifically designed for APL, though the
APL was definitely in ROM. The machine could boot and do APL with no
disks attached.
Gaetan Godin
Hmm! Cross fertilize this thread with the APL CPU one, and out pops
the idea of an Intercal CPU, which raises a whole slew of questions.
What would the ISA be like? What fabrication process would be weird
enough to match the spirit of the language? And how twisted would the
bus design be?
--
"For they know they will sooner gain their end by appealing to men's pockets,
in which they have generally something of their own, than to their heads,
which contain for the most part little but borrowed or stolen property"
-- Samuel Butler, "Erewhon" (On political reformers)
> b. I have been selling the point that the APL character set was
> AND STILL IS a major stumbling block in the APL community for
> some years now. I make sporadic sales to language designers,
> but not as often to users. The usual exchange involves
> my claim that the ability to display arbitrary glyphs on your
> screen does not make the APL character set problem go away.
> In particular:
> - entering program text
> - exchanging programs among users [you are effectively forced
> into ascii mode for both network and Rosetta Stone reasons]
> - publishing programs
> remain metaproblems. They are all problems that can be
> circumvented by sufficient effort throughout the entire
> organization, but it is not obvious that such an effort
> will pay off. Hence, I claim, the APL character set remains
> a problem.
> The answer I get is usually: "No it isn't."
I'm a "believer" in the APL character set, but not even I would claim that
it's "not a problem." It is a problem, for the reasons Bob gives. But I,
and many others, believe that it's a problem worth the effort of dealing
with. To understand this viewpoint, one must think of APL not as a
language for writing computer programs, but rather as a "language" for
developing and defining algorithms, one for which we have interpreters
that make it "readable" by computers.
As a pencil-and-paper language for algorithm development, APL is
unsurpassed, far superior to any "conventional" algorithmic or
mathematical notation. Surely I'm not the only one with the experience,
while working in a non-APL programming shop, of regularly using APL to
develop algorithms on paper before subsequently "translating" them into
the language in which they will be implemented. Without APL, one is
forced to write these algorithms in "mathematics" (or something even less
well defined), which means using a notation that is not only inferior for
the purpose, but also idiosyncratic and non-standardized.
When one reads "algorithms" published in Pascal or C one is actually
reading computer programs, and must "sort out" the algorithm from the
"computer overhead management" that the author was forced to add in order
to publish the algorithm in "correct" Pascal or C. In APL, the "algorithm
portion" and the working program are effectively the same thing.
When one develops a new algorithm for implementation on a computer one
typically goes through a three-stage process: (1) develop the algorithm
on paper; (2) translate it into the language in which it will be
programmed; (3) add the "computer overhead management" needed to turn it
into an executable program. The best modern computer languages (including
J) free the developer from step 3, but only APL, with its superiority as a
pencil-and-paper notation which derives from its "odd" character set,
really frees the developer from both step 2 and step 3.
--
Eric Landau, APL Solutions, Inc. (ela...@cais.com)
"Sacred cows make the tastiest hamburger." -- Abbie Hoffman
I suspect it's not that hard, but recommend you talk with an APL
vendor such as Dyalog or APL2000, who have certainly solved that
particular problem.
Otherwise, pick up a copy of J from Iverson Software Inc, 33 Major St.,
Toronto, Canada (I am not affiliated with ISI, merely passing on word)
for about a hundred bucks or so (Don't recall exact price), or get
an older version from watserv1.uwaterloo.ca /pub/languages...
if you're feeling like trying it sans documentation.
J is ascii-based, so you don't need to wage the character set war.
Bob
Eric's comments, largely deleted here, are right on the mark.
APL is certainly the most valuable tool for thinking about algorithms
that I have ever encountered. Furthermore, it is executable, allowing
one to validate hypotheses and try things out.
Eric and I merely differ on getting the notation onto a computer so
that normal humans can use it.
Bob
Rather, the first personal computer made by IBM was either the 1620 or
the 1130. :-) Talk about small! Our 1130 had 8k words of memory,
and a whopping 512k removable disk drive. And still, it had a pretty
decent Fortran compiler!
Alberto.
Over the years, I have programmed in FORTRAN, C, BASIC and APL. The
efficiency (for me) of coding in APL far outweighs the problems of
dealing with the quirky character set. Even when that meant buying
weird typeballs and daisy-wheels for the printers.
Doug White
The IBM 1130 I used in 1971 also had a pretty decent APL interpreter,
with a 2741-style typesphere and the 3-way "crypto-shift" keyboard; you
could get a sprain reaching for "transpose". Are there any
1130s left?
/Andrew Shepp
http://www.digitravel.com
APL isn't the only language that people complain about having to use
something other than the absolute minimum character set. Both C and
Pascal have support for character sets that don't have such bizarre
characters as '[' or '~'. And think about all the languages that are
case insensitive because so many terminals/printers didn't (don't?)
support lowercase.
I don't think APL's problem is a fundamental difference, but rather a
matter of (a large) degree. The significance of the character set
problem is decreasing, but not very quickly. Trigraphs were added to
C in '89, but supporting upper and lower case was acceptable. Maybe
in another couple of decades, the APL glyphs will become ok. *sigh*.
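For instance, the ISO C trigraphs that stand in for the "bizarre"
characters:

/* ??( ??) are [ ], ??< ??> are { }, ??- is ~, ??= is # */
int a??(3??);   /* means: int a[3]; */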
-wayne
--
Wayne Schlitt can not assert the truth of all statements in this
article and still be consistent.