One unique feature of the Am29000 architecture is a special instruction.
This instruction is intended to be used to speed up string processing,
but my guess is that other uses will be discovered. The instruction is
called "compare-bytes" and works like this:
Compare bytes specifies two source register operands and one destination
register operand. The 4 pairs of corresponding bytes of the two 32-bit
source operands are compared for equality (i.e., the two most-significant
bytes are compared, the two next-most-significant bytes are compared, etc.).
If any of the four pairs are equal, then the destination register is set
to the value "TRUE" (which on the Am29000 is a one in the most-significant
bit with all other bits cleared to zero). If none of the four pairs are
equal, then the destination register is set to "FALSE" (all bits cleared).
(Am29000 conditional branch instructions test only the most significant bit of
a register; condition codes are not used, so we get a free "test for negative.")
So, if one of the source operands is all zeros (four null characters), which
can be specified in the instruction by choosing the second operand as the
zero-extended eight-bit constant zero, and the other operand is a word of the
character string being processed (say, for copying or comparing), the Am29000
can determine in one cycle (not counting the branch) whether the word contains
the end-of-string character (according to the C language definition of
string). If the word does not contain the end-of-string character, then
the four bytes in the word can be manipulated (e.g. loaded or stored) as a
unit. Word operations on the Am29000 are much more efficient than character
operations (this is true of most machines though).
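For anyone who wants to play with the semantics without an Am29000 handy, here
is a C model of compare-bytes (the function name is mine, not AMD's):

```c
#include <stdint.h>

/* A C model of the Am29000 compare-bytes semantics described above:
   compare the four corresponding byte pairs of two 32-bit words.
   Result is TRUE (one in the most-significant bit, all others clear)
   if any pair is equal, FALSE (all bits clear) otherwise. */
uint32_t cpbyte(uint32_t a, uint32_t b)
{
    int shift;
    for (shift = 0; shift < 32; shift += 8)
        if (((a >> shift) & 0xff) == ((b >> shift) & 0xff))
            return 0x80000000u;   /* TRUE on the Am29000 */
    return 0;                     /* FALSE */
}
```

With the second operand zero, cpbyte(word, 0) answers "does this word contain
the terminating NUL?" in one shot, which is exactly what the strcpy()/strcmp()
inner loops want.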
There are, of course, special circumstances to deal with (such as misaligned
strings, and we have a funnel shifter to help in those cases), but by using
the compare-bytes instruction in the library routines strcpy() and strcmp()
(and strlen() too, but we haven't bothered since it seems to never be used in
the programs we have encountered), significant improvements in the run-time
of many C programs can be realized. Another thing which really helps is to
have the compiler word-align literal strings (and I have implemented this),
but even with word-alignment, some substrings will begin on strange boundaries
and must be dealt with correctly.
My approach to using this instruction consisted of re-writing the library
routines in C with function calls wherever the compare-bytes instruction
should go. I compiled this C code with my compiler, changed the assembly code
to eliminate the function calls in favor of the compare-bytes instruction, and
assembled it into the library (actually a module of code that gets included in
all final links, but that is just a detail of our simple environment). Since
most C programs (especially utilities and other systems programs) do a lot of
string processing, this one instruction is really worth the small
implementation cost. It often improves run times by 15% to 20% (just goes to
show that the impact of processing C language strings has been long ignored).
It implements just the right semantics and probably has other applications for
specialized pattern matching.
I just thought some of you would be interested.
bcase
The HP Spectrum has three instructions that are very similar.
Unit XOR will Exclusive-OR two registers and let you skip if any byte or
halfword is equal or not equal.
Unit AddComplement[&Trap on Condition] will add the ones complement of
one register to another. This is sort of subtraction without the forced
carry. That way, all bytes, halfwords & nibbles essentially have the
same carry-in (although a carry out of one stage could force a carry-in
to another, it turns out this doesn't affect the major uses of the
instruction). Then, either a trap will occur or the next instruction will
be skipped if the conditions are met. The conditions are any halfword/
byte/nibble carry, any halfword/byte zero, and their complements.
Among other things, these instructions can be used for string searching a
word at a time, decimal/byte/halfword range checking/validation a byte at
a time, and decimal addition pre-biasing.
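The byte-equality half of Unit XOR is easy to model in portable C (the names
and shapes here are my guesses, not HP's): XOR the two operands, then ask
whether any byte of the result is zero, i.e. whether any corresponding byte
pair of the operands matched.

```c
#include <stdint.h>

/* Does any byte of x equal zero?  (Naive loop for clarity.) */
static int any_zero_byte(uint32_t x)
{
    int shift;
    for (shift = 0; shift < 32; shift += 8)
        if (((x >> shift) & 0xff) == 0)
            return 1;
    return 0;
}

/* Sketch of a Unit-XOR-style byte-equality test: equal bytes XOR
   to zero, so "any byte pair equal" is "any zero byte in a ^ b". */
int unit_xor_byte_eq(uint32_t a, uint32_t b)
{
    return any_zero_byte(a ^ b);
}
```

The halfword variant is the same idea with a 16-bit mask and stride.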
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
This is of course a good point, and it seems that this is going to be
one of those hard cases for compiler writers. But it opens the issue
of subroutine libraries as providing an environment which also needs
support in the instruction set. The Unix string routines are a good
example, because they are even now often recoded in assembly for efficiency.
Various pieces of Unix kernel frequently suffer the same fate
(e.g. tty buffers). But once one is into instructions which are good
for certain operating systems or environments, rather than certain languages,
one has narrowed one's market. For instance, the Am29000 special string
compare is useless if strings are encoded as a length followed by the chars.
-mark
Background: I happen to think Unix-style strings are a big market, and I love
things which make Unix faster (because there is a positive feedback cycle
which makes more Unix yet more available and desirable.).
--
Spoken: Mark Weiser ARPA: ma...@mimsy.umd.edu Phone: +1-301-454-7817
After May 1, 1987: wei...@xerox.com
>My question is this: How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it. One of the positive points to the RISC debate is
>that it brought out the point that useful instructions which are hard for a
>compiler to generate are not always a win. The old STAR-100 had a number of
>instructions, including string processing instructions, which were left out of
>the Cyber 205 because it turned out that the compilers never generated them.
>They were neat instructions though. So, my question is, is this instruction
>easily usable by a C, Fortran, Pascal, or ADA compiler and if so, under what
>conditions?
This is an excellent question, one that I intended to address in the original
posting. However, I forgot to include it when the damn article got kinda
long.
The fact of the matter is: A C compiler is going to have a real hard time
generating this instruction. A Pascal Compiler might have a slightly
easier time since the concept of "string" is a little more integrated with
the language. In general, a compiler for a language without built-in types
and functions for dealing with strings will have a hard time with our
compare-bytes instruction. We realized this when we added it to the
architecture (and it was truly a last-minute addition). However, it has
such a significant impact on C programs in general that we were not deterred
by the realization that this instruction will probably occur only a few
times in the strcpy(), strcmp(), and strlen() library functions. Out of
all the megabytes of code in a UNIX system, this instruction will account
for maybe 40 bytes out of 5 megabytes, or 0.0008% of all code (obviously a
real rough guess). But if it improves the running time of all the string-
oriented system utilities (i.e. almost all utilities!!) by 15% to 20%, it seems
worth it. And the implementation cost was so small. Also, there are
some instructions that must be present just to administer the system,
like return from interrupt, move-from-special-register, etc. These
are not generated by a compiler either. Just to reiterate a point: RISC
is not reducing the instruction set but is improving performance.
Ok, so you don't believe the above? How about "It improved dhrystone
a hell of a lot."
bcase
In most C implementations, "strcpy", "strcmp", etc. are library
routines; if the speedup from these instructions is sufficiently
large, it doesn't matter if the compiler can generate them because
the compiler can generate calls to those routines, and those routines
can be written in assembler language. You could also imagine a
compiler that would recognize names like "_builtin_strcpy" and
generate the assembler code for those routines in line, and a
<string.h> that contains
#define strcpy _builtin_strcpy
(see the footnote on page 86 of the October 1, 1986 ANSI C draft).
Another possibility might be to have something like the WE32000 C
compiler's "asm" directive, which permits you to define "functions"
that are expanded in line to assembler code.
Similar techniques can probably be used in other languages. If they
do string processing in routines, the routines can be written in
assembler language to use these instructions. If they generate
stereotyped sequences for these string operations, these stereotyped
sequences can use these instructions.
However, I don't know how difficult it would be to recognize various
special cases and generate non-stereotyped sequences that use these
instructions, or how much of a speedup this would give (if these
special cases weren't common, speeding them up might not buy you
enough to worry about it).
Unfortunately, Dhrystone is an artificial benchmark. I couldn't resist
doing a quick test. [We have some pretty neat profiling tools for taking
an already-compiled program & turning it into one that does procedure &
statement counts, then gives you instruction cycle counts, which would be,
typically, 60% of the total cycles, the other 40% being cache misses,
TLB-miss overhead, memory path interference, etc.]
OK: here's a quiz: I did some simple tests [ones we use in our standard
Performance Brief suite, which includes diff, grep, yacc, and nroff],
plus csh. I won't pretend this is definitive, since I just whipped
it up. How much time would you guess these programs spend doing
strlen, strcmp, strcpy [the high-runners]? [Just guess the instruction
cycle %]. Try 0-1%, 2-5%, 6-10%, 11-20%, 21%-up.
ANSWERS:
program        strlen           strcmp           strcpy
              % cycs  cy/call  % cycs  cy/call  % cycs  cy/call
diff            .03%     -        0       -        0       -
grep            none
yacc            .04%     -      .59%      -        0       -
nroff            0       -        0       -     <0.1%      -
csh            1.71%    20     1.27%      9     1.84%     21
 % of total
 func calls    3.76%           6.11%            3.75%
Dhrystone      <.01%    19    16.94%    103    22.36%    136
Bottom-line:
1) Dhrystone isn't remotely representative of some
common string-pushing programs in UNIX.
2) For most of these, the total is <1% of instruction cycles,
hence <0.6% of fully-degraded cycles. Maybe you can save 20%
of this, or about .1%.
3) For csh: what's going on is that these routines are called
very frequently, but for short strings: 3-6 characters;
strcmp's obviously fail quickly [2nd or 3rd character].
I think the implication is that maybe you can get rid of 20% of
the cycles, which would be a 1% instruction cycle saving,
or about <0.6% fully-degraded cycle saving for csh.
4) Given all of this, maybe what you get can be grossly estimated
as about .3%, maybe. [Again, this was something whipped up in
half an hour, so hardly definitive].
5) Note that Dhrystone spends a huge amount of its time copying and
comparing long strings. Hence, it's well worth a little extra
setup time for Dhrystone to lessen the cost per loop.
[In fact, we only got around to installing an assembler strcpy
when we noticed how much time was spent there in Dhrystone.]
Thus, I'd say it's still an open issue [for those of you who happen
to be designing computer architectures at this instant!]
Dhrystone: says it's important, but is totally unrepresentative.
csh: says it might get you .6%
others: surprise you by using str* so little
general evidence: says it might be useful, but if you follow our
1% rule [and HP used something similar, as I recall],
there is as yet insufficient evidence to include it.
If I were right now designing a system, I'd be tempted to do a lot
of data-gathering.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
Mike Shebanow
This I believe, I was wondering when someone would point that out. I
think it is time for some *real* optimizing compilers that can detect
when they are compiling a benchmark and optimize for the appropriate
instruction! :-) :-) :-)
--
--Chuck McManis
uucp: {anywhere}!sun!cmcmanis BIX: cmcmanis ARPAnet: cmcm...@sun.com
These opinions are my own and no one elses, but you knew that didn't you.
Here at Project Synthesis we found that out a while back. It is pretty
strange to think of it that way, but yes, printf and friends are
compute bound, not I/O bound. We have experimented with a number of
solutions, including a hardware printf coprocessor for the 68020.
Predictably, it improved performance quite a bit. We also have tried
using specially coded versions of printf in assembly language that we
passed through a (very time expensive) exhaustive search optimiser and
the results were also pretty good.
Perry Metzger
Re: (a) above. NO. Implementing a byte addressable (especially writable)
memory slows down all memory accesses for a slight improvement in byte
processing efficiency. For my own part, I can say that from the beginning
of the Am29000 project, I was firmly against anything but a word-oriented
interface. You have to realize that byte accessing requires an alignment
network somewhere: it will add some nanoseconds to all memory accesses; you
can put the alignment network in its own pipeline stage, but even then, it
will *always* slow down every memory access; there is nothing you can do about
it! (The same reasoning leads to our addressing-mode-less load and store
instructions: if addressing modes are there, then the instructions always
pay the cost even when the compiler knows better!) Thus, this is not strictly
a RISC issue, but a maximum through-put issue regardless of RISCness. (Note
that the Am29000 extract-byte and insert-byte instructions are essentially
putting the alignment network in a pipeline stage, but the compiler can
decide to pay or not pay the penalty for a given memory access (depending
upon whether it is a byte access or not)).
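To make that trade concrete, here is roughly what extract-byte and insert-byte
accomplish, written as C shift-and-mask sketches (my own formulation, not
AMD's definition; I number byte 0 as the most significant, so flip the order
if your convention differs):

```c
#include <stdint.h>

/* The alignment network in software: pull byte n (0 = most
   significant) out of a 32-bit word with a shift and a mask. */
unsigned extract_byte(uint32_t word, int n)
{
    return (word >> (8 * (3 - n))) & 0xff;
}

/* The inverse: deposit a byte back into position n of a word. */
uint32_t insert_byte(uint32_t word, int n, unsigned byte)
{
    int shift = 8 * (3 - n);
    return (word & ~((uint32_t)0xff << shift))
         | ((uint32_t)(byte & 0xff) << shift);
}
```

The compiler emits these only for actual byte accesses; word accesses never
pay for the shifter, which is the whole point of the argument above.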
Re: (b) above. NO. We do not do inlining of the strcmp(), strcpy(), or
strlen() routines (I wish we could, it would be even better!). The Am29000
has one of the fastest calling conventions around. The performance
improvement we saw with the special instruction is an order of magnitude
(ok, I'm taking a guess here, but it is probably pretty close) greater than
what would have been gained by in-lining in this case.
Re: Branch folding. I like all optimizations. Give me more MIPS! More.
MORE. MORE!
bcase
---------------
Back again by popular demand:
Sung to the tune of "Timewarp:"
It's just a shift the left,
And then a shift to the ri-i-ight!
Get a load of those MIPS
And the code's real ti-i-ght!
This lock-step pipeline
Really drives me insa-a-a-e-a-ane
Let's do the I-fetch again!
Yes, you are right: when the length is known, things are *much* easier.
But, the instruction is still a useful primitive: it can be used to find
out, in one cycle, whether or not a 32-bit word contains a byte of
interest. The compiler may not generate it, but it can be used in hand-
coded inner loops for this purpose.
bcase
[ Discussion of best history of string instructions]
> The fact of the matter is: A C compiler is going to have a real hard time
> generating this instruction. A Pascal Compiler might have a slightly
> easier time since the concept of "string" is a little more integrated with
> the language. In general, a compiler for a language without built-in types
> and functions for dealing with strings will have a hard time with our
> compare-bytes instruction. We realized this when we added it to the
> architecture (and it was truly a last-minute addition). However, it has
> such a significant impact on C programs in general that we were not deterred
> by the realization that this instruction will probably occur only a few
> times in the strcpy(), strcmp(), and strlen() library functions. Out of
> all the megabytes of code in a UNIX system, this instruction will account
> for maybe 40 bytes out of 5 megabytes, or 0.0008% of all code (obviously a
> real rough guess). But if it improves the running time of all the string-
> oriented system utilities (i.e. almost all utilities!!) by 15% to 20%, it seems
> worth it. And the implementation cost was so small. Also, there are
> some instructions that must be present just to administer the system,
> like return from interrupt, move-from-special-register, etc. These
> are not generated by a compiler either. Just to reiterate a point: RISC
> is not reducing the instruction set but is improving performance.
>
> Ok, so you don't believe the above? How about "It improved dhrystone
> a hell of a lot."
>
> bcase
I think that it makes sense to add instructions like this even if
they only get put into the standard libraries, presumably they will
prove to be a modest win for most applications written in C. I
vaguely remember reading that calls to string routines represented
about .5% of all C code.
As for Brian's second statement about Dhrystone being helped by
string instructions: He is certainly right about that.
--
Clif Purkiser, Intel, Santa Clara, Ca.
{pur-ee,hplabs,amd,scgvaxd,dual,idi,omsvax}!intelca!clif
These views are my own property. However anyone who wants them can have
them for a nominal fee.
When were nroff/grep/yacc written compared to when str* was standardized
in the library? After all, the str* libraries didn't spring full-blown on
day one with those exact semantics and calling sequences (I'm guessing).
I'll also guess (could easily be wrong) that there is a lot more string
stuff going on than you see in the str* statistics, and that it's hand
coded "in line" (in C, of course).
Counter point: What do similar NEW programs look like? Take some code
written by someone (or a team) that started with the "modern" libraries.
Run the statistics on that, and see...
Finally, don't forget str{,r}chr (a.k.a. {,r}index). Things like "ls -l"
beat on those really hard (at least in System-V), since way down at the
bottom a lot of time is spent (I'm guessing) tearing apart /etc/passwd
entries (go look at "getpwent" & friends).
I know that I sometimes avoid str* because I'm afraid it might be slow,
particularly when I know that keeping a count (a.k.a. length) can avoid
a lot of searching for the end. The solution is usually a high-level change
in the application's basic algorithms so that so much copying & searching
is simply avoided. A simple example is when you are building various
strings up out of pieces (as in "speeding up" shell scripts by recoding
in C). Even USING the libraries, sometimes instead of:
strcpy(buf, "piece");
strcat(buf, "another piece");
strcat(buf, "still another piece");
I find myself doing:
char *p = &buf[0];
int n;
strcpy(p, "piece"); p += strlen(p);
strcpy(p, "another piece"); p += strlen(p);
sprintf(p, "format%d%s", args...); p+= strlen(p);
strcpy(p, "still another piece"); p += strlen(p);
etc...
Then when you're done, you can assert that strlen(buf) == (p - buf);
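The pattern being hand-simulated there is just a copy that returns the end of
the string instead of the start. A sketch (the name strecpy is my own; the
same idea exists elsewhere under the name stpcpy):

```c
/* Copy src to dst, returning a pointer to the terminating '\0'
   written into dst -- ready to be the dst of the next piece. */
char *strecpy(char *dst, const char *src)
{
    while ((*dst = *src++) != '\0')
        dst++;
    return dst;
}
```

With this, the p += strlen(p) bookkeeping above disappears:
p = strecpy(p, "piece"); p = strecpy(p, "another piece"); and so on.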
Rob Warnock
Systems Architecture Consultant
UUCP: {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail: !rpw3
DDD: (415)572-2607
USPS: 627 26th Ave, San Mateo, CA 94403
The version of "_doprnt" that comes with 4BSD and comes in the VAX System III
source (although whether it's the one that's in the library or not,
I don't know), and may very well have come with UNIX 32V, is written
in assembler. The 4.3BSD version, at least, does use the string
*and* EDITPC instructions, although it doesn't use the string
instructions for handling things like "%s".
VAX System V doesn't use this version (and no, it's probably not
because the assembly version doesn't handle all the formatting
operators in the S5 version - I know it handles most of them, and I
think it handles all of them), so I'm curious whether they dropped it
because it didn't buy much.
>We have experimented with a number of solutions, including a hardware
>printf coprocessor for the 68020. Predictably, it improved performance
>quite a bit. We also have tried using specially coded versions of printf
>in assembly language that we passed through a (very time expensive)
>exhaustive search optimiser and the results were also pretty good.
How much did the coprocessor cost, and how much better was it than the
assembly-language version?
 cycles   %cycles   cum %   cycles   bytes   procedure (file)
                            /call    /line
 511480    44.87    44.87    6558      22    getname (ls.c)
 310413    27.23    72.11      16      40    fgetc (../fgetc.c)
  99006     8.69    80.79      43      28    _flsbuf (../flsbuf.c)
  79820     7.00    87.80     340      17    _doprnt (../doprnt.c)
  18830     1.65    89.45      54      23    select (ls.c)
  15073     1.32    90.77     321      19    gmtime (../ctime.c)
   ....
   1655     0.15    98.79      10       5    strcmp (../strcmp.s)
   ....
   1338     0.12    99.18      18       5    strlen (../strlen.s)
   ....
     63     0.01    99.96       8      10    strchr (../strchr.c)
[Surprise! str* stuff isn't on the map in this one. Just goes to show
that even experienced [and Rob is] folks' intuition needs checking.]
>Well, this surprised me also.
This whole string-processing discussion has been quite worthwhile,
if only to remind us that we often discuss pretty esoteric things
in this newsgroup, while at the same time, we sometimes don't have
the foggiest ideas on even basic program behavior!!
Wouldn't it have been nice if strcpy(), strcat(), sprintf(), etc.
had been designed to return a pointer to the END of the string,
(or a count of the number of characters in the string)?
They already have this information for free, but instead they return
their first argument, something that the caller already knows.
l1: lw t2,1(a1)
addu a1,4
addu t3,t2,t0
xor t3,t2
and t3,t1
beq t3,t1,l1
Quiz: what are the magic constants that need to be loaded into t0 and
t1 to make this work?
This is 7 cycles per word, or 1.75 cycles per byte.
-Earl
./ls /bin /usr/bin [listed 64 times]
%Time Seconds Cumsecs #Calls msec/call Name
22.1 10.72 10.72 80190 0.1336 strcmp
3.3 1.60 40.25 21577 0.0742 memcpy
0.7 0.33 47.65 10688 0.0312 strlen
0.3 0.13 47.97 strchr
0.1 0.05 48.40 memchr
0.1 0.03 48.53 strcpy
./ls -CF /bin /usr/bin [listed 64 times]
%Time Seconds Cumsecs #Calls msec/call Name
7.9 10.67 100.55 80191 0.1330 strcmp
1.2 1.58 128.38 32234 0.0491 memcpy
0.3 0.42 133.08 21440 0.0194 strlen
0.1 0.12 134.05 2 58. strchr
0.1 0.08 134.23 memchr
0.0 0.02 134.50 strcpy
./ls -l /bin /usr/bin [listed 8 times]
%Time Seconds Cumsecs #Calls msec/call Name
0.4 1.42 338.83 10023 0.1413 strcmp
0.3 1.18 341.32 18761 0.0631 memcpy
0.0 0.15 345.45 1336 0.112 strlen
0.0 0.10 345.92 memchr
0.0 0.08 346.18 strcpy
0.0 0.03 346.35 1 33. strchr
Note: on the Amdahl, strcmp and friends are in assembler.
Rick Richardson, PC Research, Inc: (201) 922-1134 ..!ihnp4!castor!pcrat!rick
when at AT&T-CPL: (201) 834-1378 ..!ihnp4!castor!pollux!rer
At one point there was a new-hire at Sun who was busily going through
the compiler replacing most of the printf's with puts's and such. Chris
Aoki and I looked at each other for a few minutes and then did some timings.
The new-hire was right, printf was slow, but we put in the right fix:
we sped it up, and left the compiler alone!
Turns out that it was doing an amazing amount of parsing to handle the
simplest of cases. Just adding a front-end to printf that handled %s,
%d, %c, %o, %x, and the "h" and "l" versions of these, sped the whole
thing up by a large factor (don't remember the factor). If you used
something beyond these simple capabilities, it punted to the old
general case code. But 99% of the printf strings ran in the trivial front
end.
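A minimal sketch of that kind of front end (my own structure, not Sun's code):
scan the format first, and punt anything beyond the trivial directives to the
stock vsprintf machinery.

```c
#include <stdarg.h>
#include <stdio.h>

/* Can the trivial front end handle this format?  Only %s, %d, %c
   and %% qualify; anything else goes to the general case. */
static int is_simple(const char *fmt)
{
    for (; *fmt; fmt++)
        if (*fmt == '%') {
            fmt++;
            if (*fmt != 's' && *fmt != 'd' && *fmt != 'c' && *fmt != '%')
                return 0;
        }
    return 1;
}

/* sprintf with a fast path for the common cases. */
int fast_sprintf(char *out, const char *fmt, ...)
{
    va_list ap;
    int r;
    char *p = out;

    va_start(ap, fmt);
    if (!is_simple(fmt)) {              /* punt to the old code */
        r = vsprintf(out, fmt, ap);
        va_end(ap);
        return r;
    }
    for (; *fmt; fmt++) {
        if (*fmt != '%') { *p++ = *fmt; continue; }
        switch (*++fmt) {
        case '%': *p++ = '%'; break;
        case 'c': *p++ = (char)va_arg(ap, int); break;
        case 's': { const char *s = va_arg(ap, const char *);
                    while (*s) *p++ = *s++; } break;
        case 'd': { long n = va_arg(ap, int);
                    char tmp[12]; int i = 0;
                    if (n < 0) { *p++ = '-'; n = -n; }
                    do { tmp[i++] = (char)('0' + n % 10); n /= 10; } while (n);
                    while (i) *p++ = tmp[--i]; } break;
        }
    }
    *p = '\0';
    va_end(ap);
    return (int)(p - out);
}
```

The two-pass scan keeps the punt clean: the varargs are untouched when the
general case takes over.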
When Geoff Collyer and I were speeding up "C news", we often found
that stupid algorithm design was responsible for slowness, particularly
calling string compares or string cat's in inner loops. Don't fix
the instruction set, bunky, fix the applications!
Similarly, I have a strong suspicion that compiling strcmp(a,b) to:
subtract *b from *a
if nonzero, you have the result
else call strcmp()
(say using the nice _inline_ stuff in the draft C standard) would speed
it up a lot more than all the instructions you could imagine. It certainly
had that effect in C news when we did just that with a macro.
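As a rough illustration of that macro (my own formulation, not the C news
source): settle the comparison from the first bytes when they differ, and only
pay for the call on a tie. Note that it evaluates its arguments more than
once, so don't hand it expressions with side effects.

```c
#include <string.h>

/* First-byte short-circuit for strcmp: most comparisons in inner
   loops fail on the first character, so decide those inline and
   fall back to the library only when the leading bytes match. */
#define STRCMP(a, b) \
    (*(const unsigned char *)(a) != *(const unsigned char *)(b) \
        ? *(const unsigned char *)(a) - *(const unsigned char *)(b) \
        : strcmp((a), (b)))
```

This is the whole trick: no new hardware, one compare and a branch in the
common case.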
I too am curious about the "printf coprocessor" for the 68020. Was this
a paper design or did it actually get built? Considering the 68881 can
format an integer or float into decimal (BCD, but the hard part is done),
I can't see it winning much at all. Also remember that for a coprocessor
to access main memory takes twice as long as for the 68020 to do it --
so you don't want the CP scanning out format strings and such.
--
Copyright 1987 John Gilmore; you can redistribute only if your recipients can.
(This is an effort to bend Stargate to work with Usenet, not against it.)
{sun,ptsfa,lll-crg,ihnp4,ucbvax}!hoptoad!gnu g...@ingres.berkeley.edu
> Wouldn't it have been nice if strcpy(), strcat(), sprintf(), etc.
> had been designed to return a pointer to the END of the string,
> (or a count of the number of characters in the string)?
In some C implementations, sprintf() DOES return a count of the number
of characters that it wrote. I find this behavior in a System V manual,
for instance, and on VMS.
Since both behaviors existed, the ANSI people were able to choose the more
useful one, and the October 1986 draft mandates that sprintf() returns
a count. Unfortunately this does not apply to the other string functions.
Also unfortunately, the new book "Portable C and UNIX System Programming"
by "J. E. Lapin" misses this one. (That's why I had to check manuals...)
Meanwhile, until ANSI C is universal, we should all be careful NOT to assume
that we know what type sprintf() returns...
Mark Brader "Perhaps their software was written by a Byzan-tine-ager"
utzoo!sq!msb -- Peter Neumann
This made me curious, and I started playing around with alternate versions of
strcpy. I did some timings on a microVAX II (cc -O, 4.3BSD) using the library
strcpy, a straightforward version of my own (strcpy2), and a version which
tried to minimize branches out of the instruction sequence (strcpy3). I ran
strcpy2 and strcpy3 both as procedure calls and as in-line expansions. The
timings, for 1,000,000 executions of copying "Test string 1", were
routine inline? time (sec)
strcpy no 134.7
strcpy2 no 65.2
strcpy3 no 51.6
strcpy2 yes 43.3
strcpy3 yes 30.8
The most shocking thing about the above is that the library strcpy is half the
speed of a very straightforward C implementation of the routine! Perhaps
we should spend less time arguing about hardware support for string-handling
routines, and more time making sure that the people who implement the
run-time library are somewhat competent. It also certainly appears that
in-line expansion of simple library routines is a big win.
I also ran some tests on a SUN-3/180 (SUN UNIX 4.2, cc -O). The results were:
routine inline? time (sec)
strcpy no 22.0
strcpy2 no 21.1
strcpy3 no 20.1
strcpy2 yes 15.1
strcpy3 yes 13.7
The SUN library routines seem much more competently coded, but even here,
perhaps a bit of tuning would help (note that my timing was pretty sloppy,
so small differences may not be significant), and in-line expansion would
certainly be a big win.
Routine codes:
strcpy2(ss1, ss2)
char *ss1, *ss2;
{
	register char *s1, *s2;

	s1 = ss1;
	s2 = ss2;
	while (*s1++ = *s2++)
		;
}

strcpy3(ss1, ss2)
char *ss1, *ss2;
{
	register char *s1, *s2;

	s1 = ss1;
	s2 = ss2;
	while (*s1++ = *s2++) {
		if (!(*s1++ = *s2++)) break;
		if (!(*s1++ = *s2++)) break;
		< the above statement is repeated 20 times >
	}
}
***************************************************************************
Ehud Reiter
reiter@harvard (ARPA, BITNET, UUCP)
Me too. I wish strcpy et al would return a pointer to the *end* of the copied
string (fgets too, as someone else brought up). In fact, one could argue that
the low-level routine should not even bother to write the final '\0'; then
your example would become
*_strcpy(_sprintf(_strcpy(_strcpy(p, "piece"), "another piece"),
"format%d%s", args...), "still another piece") = '\0';
(Which can still be written using temporaries if the code gets too opaque!)
Karl W. Z. Heuer (ima!haddock!karl or ka...@haddock.isc.com), The Walking Lint
Let's see... How about
t0 = 0x7efefeff
t1 = 0x81010100
Is that right? If so, it appears to work only for 7-bit characters (but
we admit that such a restriction is reasonable in real life)...
(This is a fun problem but probably a bit too hard for use in interviews.)
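For the curious, the loop condition translates to C as below (my translation
of the MIPS sequence; the final assertion shows the 7-bit restriction, where a
byte with the high bit set can masquerade as a zero byte):

```c
#include <stdint.h>

/* The MIPS loop keeps going while ((x + t0) ^ x) & t1 == t1,
   with t0 = 0x7efefeff and t1 = 0x81010100: true iff no byte
   of x is zero, provided every byte is a 7-bit character. */
static int no_zero_byte(uint32_t x)
{
    uint32_t t0 = 0x7efefeffu, t1 = 0x81010100u;
    return (((x + t0) ^ x) & t1) == t1;
}
```

The addition makes a carry propagate out of every nonzero byte; the XOR
recovers which carry bits flipped, and the mask checks them all at once.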
Mike Johnson
Brian Case
Tim Olson
Don't laugh! It's not unknown for an optimizing-compiler group to put extra
attention into making common benchmarks run fast. I'm told that there is at
least one compiler that has a specialized loop-unrolling optimization whose
major purpose is to quietly turn the "looping" version of Linpack into the
"unrolled" version.
--
"We must choose: the stars or Henry Spencer @ U of Toronto Zoology
the dust. Which shall it be?" {allegra,ihnp4,decvax,pyramid}!utzoo!henry
You've got to be careful when you run benchmarks. If you are indeed running
4.3, then you are most likely using the 4.3 libc, which has been carefully
optimized to use the fancy VAX instructions for the string routines.
Unfortunately, some of these instructions are not implemented by the
MicroVAX-II hardware. As it turns out, what is happening is that your tests
(including Dhrystone) are causing kernel traps to emulate those
instructions!
Avie
Hey, John, wanna know something really interesting? Go look at 4.2 BSD
"ls", at routine getname(), and you'll discover it's full of getpwent()'s
(plus {set,end}pwent(), etc.) and strncpy()'s (hidden in the "SCPYN" macro).
And of course getpwent() uses the str* routines, though it also has its
own pwskip() instead of index(). (Have you got a 4.{2,3} BSD "ls" that you
could run your profiler on?)
Then look at the Sys V "ls", and lo & behold, all of those neat library
routines have been unfolded into the getname() routine, and hand-coded
into a most glorious mess, which calls only (guess)... fgetc()!!!
(No wonder my jaw dropped when I saw your numbers!)
Now other parts of Sys V use {set,get,end}pwent(), so I guess somebody
decided that "ls" was running too slow (or had an "intuition"?) and ripped
it apart. Given the looks of the code, I'm surprised that even fgetc()
survived.
This reminds me of Gerry Weinberg's "Railroad Paradox":
"We don't need fast strlib routines, because no one ever uses them."
Yeah... 'cause they're not fast, everyone codes around them!
We (me, too, I just realized I got suckered in also) must be *VERY* careful
to look inside any major piece of code we are going to try and use to prove
something or other, in order to avoid simply confirming the {good,bad,weird}
practices of the past.
I am now completely unsure whether we can get ANY reasonable statistics on
what the proper usage (percent, etc.) of "str*" should/could be in NEW code,
if we base it on code that has been hacked on as long as the various Unices.
Anybody want to cough up a "Stryngstone" benchmark for us to throw rocks at?
In article <5...@wb1.cs.cmu.edu> av...@wb1.cs.cmu.edu (Avadis Tevanian) writes:
>... the 4.3 libc ... has been carefully optimized to use the fancy
>VAX instructions for the string routines. Unfortunately, some of
>these instructions are not implemented by the MicroVAX-II hardware.
>As it turns out, what is happening is that your tests (including
>Dhrystone) are causing kernel traps to emulate those instructions!
Exactly. Strcpy, strcat, and strlen were all modified to use the
Vax `locc' instruction to find the ends of strings. This instruction
is not implemented in hardware in the uVax II. The obvious solution
is to arrange the libraries so that on a uVax, programs use a
straightforward test-byte-and-branch loop (see sample code below).
There are two ways to do this. One could attempt to determine at
run-time whether `locc' is available; or one can simply assume that
anything compiled on a uVax will run on a uVax, and anything compiled
on a `big Vax' will run on a big Vax. The former would be hard,
requiring a system call, but would likely be worthwhile if this
could be done at most once per program run. The latter is easy:
just build libc.a differently on a uVax (and then watch rdist run,
and weep).
Both tricks, however, require some way for user programs to discover
which CPU is executing them. A `getcputype' call, anyone? (But
what about dynamic process relocation, where a program might move
from one CPU type to another? [ECAPISTRANO, process migrated])
Here is a sample replacement for strlen (untested!), assuming there
were a getcputype system call.
/* get CPU type numbers */
#include <sys/cputype.h>
/* lenroutine is the address of the proper routine, once known */
.lcomm lenroutine,4
ENTRY(strlen)
.word 0 # save no registers
movl lenroutine,r0 # know which routine to use?
beql 1f # no, go figure (and pipeline flush)
jmp (r0) # go do it
/*
* Someone should find out whether a branch to the jmp (r0) below
* would be slower (two pipeline flushes vs. one?). Need to test
* all architectures!
*/
/* figure out which routine to use */
1: calls $0,_getcputype
cmpl $UVAX2,r0 # is it a MicroVAX-II?
beql 2f
movl bigvax,r0 # use big vax code
brb 3f
2: movl chipvax,r0 # use chip vax code
3: movl r0,lenroutine # remember which to use
jmp (r0) # and go do it
/* locc version */
bigvax:
... # insert 4.3BSD code here
ret
/* byte-at-a-time version */
chipvax:
movl 4(ap),r0 # get string
movl r0,r1 # and avoid two mem refs
1: tstb (r0)+ # find the \0
bneq 1b # loop until just past the \0
decl r0 # point back at \0
subl2 r1,r0 # return r0 - r1
ret
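In portable C, the same resolve-once dispatch looks like the sketch below. The CPU type codes and getcputype() are hypothetical stand-ins for the proposed system call (stubbed out here so the sketch runs); only the dispatch structure is the point.

```c
#include <stddef.h>

/* Hypothetical CPU type codes and getcputype() -- stand-ins for the
 * system call proposed above, stubbed out for illustration. */
#define BIGVAX 1
#define UVAX2  2
static int getcputype(void) { return UVAX2; }

static size_t strlen_bigvax(const char *);
static size_t strlen_chipvax(const char *);

/* lenroutine is the address of the proper routine, once known */
static size_t (*lenroutine)(const char *);

size_t my_strlen(const char *s)
{
    if (lenroutine == NULL)              /* know which routine to use? */
        lenroutine = (getcputype() == UVAX2)
                   ? strlen_chipvax : strlen_bigvax;
    return (*lenroutine)(s);             /* go do it */
}

/* byte-at-a-time version */
static size_t strlen_chipvax(const char *s)
{
    const char *p = s;
    while (*p)
        p++;                             /* find the \0 */
    return (size_t)(p - s);
}

/* on a real big Vax this would be the locc version; placeholder here */
static size_t strlen_bigvax(const char *s)
{
    return strlen_chipvax(s);
}
```

As in the assembly, the resolver runs at most once per program run; every later call is one load and one indirect jump.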
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP: seismo!mimsy!chris ARPA/CSNet: ch...@mimsy.umd.edu
Note that the library "strcpy" uses "locc" to find the length of the
source string and then does a "movc3" to copy it. This requires two
passes over the source string. Whether the whizzo VAX string
twiddling instructions are a win or not depends on how long the
strings are. (Also, which of the whizzo string instructions does the
microVAX II implement in microcode and which are handled in
software?)
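The two-pass structure described above maps onto portable C like this (a sketch with my own name; the actual 4.3BSD routine is VAX assembly using locc and movc3):

```c
#include <string.h>

/* Two-pass copy in the style of the locc+movc3 library strcpy:
 * pass 1 finds the terminating NUL, pass 2 block-copies.  Whether
 * this beats a single test-byte-and-branch loop depends on string
 * length and on how fast the block copy is. */
char *strcpy_twopass(char *dst, const char *src)
{
    size_t n = strlen(src);   /* pass 1: find the NUL (the locc) */
    memcpy(dst, src, n + 1);  /* pass 2: block copy (the movc3) */
    return dst;
}
```

Note that the source string is walked twice, which is exactly the cost being debated.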
> I also ran some tests on a SUN-3/180 (SUN UNIX 4.2
Which release? The "UNIX 4.2" is a conventional phrase. The 4.2
refers to 4.2BSD; it's not the release number.
> routine inline? time (sec)
> strcpy no 22.0
> strcpy2 no 21.1
This is almost certainly not significant. The code to the SunOS 3.0
version of "strcpy":
char *
strcpy(s1, s2)
register char *s1, *s2;
{
register char *os1;
os1 = s1;
while (*s1++ = *s2++)
;
return (os1);
}
The only difference is that yours doesn't return a pointer to the
original string (which it has to if it's to be compatible), so the
differences are almost certainly insignificant. The SunOS 3.2
version is:
ENTRY(strcpy)
movl PARAM,a0 | s1
movl PARAM2,a1 | s2
movl a0,d0 | return s1 at the end
moveq #-1,d1 | count of 65535
| The following loop runs in loop mode on 68010
1$: movb a1@+,a0@+ | move byte from s2 to s1
dbeq d1,1$
bne 1$ | if zero byte seen, done
RET
which is more-or-less the same thing, only using the 68010's moral
equivalent of whizzo string instructions. In the cases I tested it
on, it was faster than the C version. (Thanks to John Gilmore and
Vaughan Pratt for the little "bne 1$" trick at the end there.)
Unrolling the loop, as you did, might be a bigger win, especially on
the 68020 where even the unrolled loop would probably fit in the
instruction cache.
I'm going to try these on some of the machines I use, and see if this
will speed up things.
--
bill davidsen sixhub \
ihnp4!seismo!rochester!steinmetz -> crdos1!davidsen
chinet /
ARPA: davidsen%crdos...@ge-crd.ARPA (or davi...@ge-crd.ARPA)
The Wang Labs VS processor also has some very useful assembler (not machine)
instructions for string processing. The manner in which this is handled is
by adding an attribute to functions called "builtin". This enables the C
(or other language) compiler to generate in-line code for functions that are
in the instruction set.
Example:
builtin char *memcpy();
main()
{
struct stbuf {
char buf[8];
int i;
} structure1;
struct stbuf structure2;
/** initialize structure1 with code here **/
/** This function generates a single VS assembler instruction **/
memcpy(&structure2,&structure1,sizeof(structure1));
}
The builtins are clearly listed in the reference manual and include all
the standard System 5 string functions. If one does not wish to use builtins,
then the function is declared explicitly without the builtin attribute.
I am not sure, but I think I recall that if the function is not declared
explicitly and the function is a builtin, then it is treated as a builtin.
Portability is not a problem with this scheme.
Your basic ricker.
1) As has been suggested previously, complex routines like printf should be
optimized to run fastest in the most common cases.
2) Simple routines like strcpy should be adjusted to perform well on a
particular architecture (if the microVAX doesn't have a hardware locc
instruction, then is it too much to ask that the run-time library supplied
for the microVAX be changed not to use locc, at least in small and frequently
used routines like strcpy?). A moderate degree of loop unrolling should be
considered.
3) Simple routines like strcpy should be recoded in assembler, at least to
the degree of having their procedure prologues simplified, and so that they
use registers which don't have to be restored.
4) In-line expansion of common (and simple) library routines should be
considered.
The above points are all fairly simple and obvious, but I suspect that
in many cases they have not been done. My impression has been that far more
care and thought go into hardware design and compiler design than go into
run-time library design, although it may be easier (perhaps because of this!)
to squeeze an extra 10-20% performance from the system by improving the library
than by improving hardware or the compiler.
Ehud Reiter
reiter@harvard (ARPA,BITNET,UUCP)
P.S. A few people have pointed out that saying a SUN-3 runs 4.2 isn't saying
much. Sorry about that! My SUN-3/180 runs 3.1.
The fix, of course, is:
o Have the startup module determine what version of the CPU is running
and set a global.
o Have the string functions use different instruction sequences,
depending on which CPU is being used.
This is similar to how the 8087 is used on the PC.
Careful, John, this code has a bug: on a signed-char machine, the result
when comparing "" to "\203" will be wrong. Correct lexical comparison
requires that the terminating NUL compare low against all characters, which
is not the way subtraction works on a signed-char machine. This will get
more important as ISO Latin becomes more common.
Admittedly, many existing implementations of strcmp botch this too!
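A sketch of a strcmp that gets this right, by comparing through unsigned char so that the terminating NUL sorts below every character even where plain char is signed (my own code, not any particular libc's):

```c
/* Lexically correct strcmp: compare bytes as unsigned char, so
 * '\0' compares low against all characters -- including '\203'
 * on a signed-char machine, where plain-char subtraction would
 * get the sign wrong. */
int strcmp_unsigned(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    while (*p1 && *p1 == *p2) {
        p1++;
        p2++;
    }
    return (int)*p1 - (int)*p2;
}
```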
Yes, you've got it. Congratulations. (Gee, do I owe you lunch or
something now?) Actually, it works for 8-bit characters in bytes 0,
1, and 2 and a 7-bit one in byte 3 (I'm using little-endian byte
numbers here). Thus it may accept '\200' as a zero in byte 3. But
this is easily handled in the follow-on code that has to detect which
byte is nonzero by actually testing byte 3 and branching back to the
loop if it is nonzero. Actually, here is that code. Note how after
the word loop the carries also conveniently let you detect the byte.
This part only works for little-endian byte ordering. (Note: my
personal convention is to indent instructions in branch delay slots
one space):
## search for word with zero byte
l1: lw t2,0(a1) # read next word
addu a1,4
addu t3,t2,t0 # cause carry out of any nonzero byte
xor t3,t2 # xor to find carry bits
and t3,t1 # look at carries out of bytes
beq t3,t1,l1 # all bytes carried => none zero
## found word with zero byte, now find the byte
sll t4,t3,23 # carry out of byte 0?
l3: beq t4,0,x0 # if not, go compute length
sll t4,t3,15 # carry out of byte 1?
bge t4,0,x1 # if not, go compute length
sll t4,t3,7 # carry out of byte 2?
bge t4,0,x2 # if not, go compute length
srl t4,t2,24 # byte 3 zero?
bne t4,0,l1 # no, re-enter loop
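For reference, the same has-a-zero-byte test can be written in portable C. This sketch uses the better-known subtract-and-mask form rather than the exact carry constants in the assembly above:

```c
#include <stdint.h>

/* Does the 32-bit word v contain a zero byte?  Same idea as the
 * carry trick above, in subtract-and-mask form: (v - 0x01010101)
 * borrows out of any zero byte, and the mask keeps only high bits
 * that v itself did not already have set. */
static int has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}
```

This form is exact: the result is nonzero if and only if some byte of v is zero, so (as with the carry bits) follow-on code still has to find which byte it was.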
Really sophisticated programs spend most of their time doing really
sophisticated things; simple grunt routines like str*() aren't universally
applicable.
> When were nroff/grep/yacc written compared to when str* was standardized
> in the library? After all, the str* libraries didn't spring full-blown on
> day one...
I don't know (offhand) about these, but I recall once being curious about
why lex was slow. It turned out to use subroutines for isalpha(), isspace(),
isdigit(), etc, instead of the (presumably newer) ctype.h macros. HOWEVER,
even after making that change to a version of lex, it didn't help all that
much, because lex has so much else to do.
> Even USING the libraries, sometimes instead of:
> strcpy(buf, "piece");
> strcat(buf, "another piece");
> strcat(buf, "still another piece");
> I find myself doing:
> char *p = &buf[0];
> int n;
> strcpy(p, "piece"); p += strlen(p);
> strcpy(p, "another piece"); p += strlen(p);
> sprintf(p, "format%d%s", args...); p+= strlen(p);
> strcpy(p, "still another piece"); p += strlen(p);
> etc...
>
WHOOPS! That's about as much work (first you copy the string, then
you step over it) as the use of strcat(). If you don't mind being
non-portable to oddball architectures, you can write a routine like
"strcatl()", which takes a destination buffer as a first argument, and
a NULL-terminated list of strings to concatenate into that buffer (or
you can be slightly portable and do it with <varargs.h>). The easiest
way might be, however, to use:
char *strcpya(p1, p2) /* copy and return end address */
char *p1, *p2;
{
while (*p1++ = *p2++) /**/;
return p1-1;
}
and then
strcpya(strcpya(strcpya(buf, "first piece"), " second piece"),
" and the last");
or if you prefer
char *p,buf[127 /* a prime number*/ ];
p = strcpya(buf, "first piece");
p = strcpya(p, " second piece");
p = strcpya(p, " third piece");
Either way, you only touch each byte once.
Sometimes, the question you need to ask yourself about library routines is
not "Could they be better coded with this screwball assembly instruction?"
but rather "Does it *really* do what I want?".
--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit...@MIT-XX.ARPA
"Happiness is the planet Earth
in your rear-view mirror."
- Sam Hurt, in "Eyebeam, Therefore I Am"
Concur somewhat at this point.
>There are two ways to do this. ...
There is a third and much more efficient way:
Shared resident libraries.
This way all you have to do is make sure you install the correct
library on a particular machine. Everyone except memory and
disk drive vendors benefits from shared libraries. Assuming a
vectored entry point interface to the library, you can move your
images from one type of Vax to another and your program will
run with the most efficient `str*' routines available for that
machine, i.e. the routines in that machine's resident library.
None of this re-link everything that uses `ctime' nonsense either.
Of course some people need resident libraries more than others,
a case in point are the customers of Sun Microsystems. Here
resident libraries, in addition to a host of other benn'ies
previously alluded to, will put a stop to the following:
"Gak!! That was a fifty line program. It took
forever to link and it eats up 700k of disk space???"
Since Sun is working on making their system SVID compatible
the wait shouldn't be too long now. If I remember correctly
Apollo has always had resident libraries, but then I've never
even as much as seen an Apollo product.
Bjorn R. Bjornsson
alberta!bjorn
To further speed this up you could write it as:
strcpy2(ss1,ss2)
register char *ss1, *ss2;
{
while (*s1++ = *s2++)
;
}
thus eliminating two assignments and two local variables.
--
---------------------------------------------------------
Glover (upon discovering that Gibson is his new partner):
"God hates me, that's what the problem is!"
Gibson (blowing cigarette smoke out his nose):
"Hate 'em back, it works for me!"
Gibson (later in the film, with handgun):
"PUNTA, PUNTA, PUNTA!!"
Robert Allen,
rob...@spam.istc.sri.com
---------------------------------------------------------
Note also that a two-pass strcpy() should not be used in any environment
where two processes may share data space... or in "ANSI C" jargon, where
the source string is volatile. If you did that, you might find that you
have scribbled on bytes after the terminating null, or produced a
destination string with no terminating null at all. (The latter effect
can of course be avoided by inserting the null separately rather than
copying it.)
When I had to do some work on VMS a couple of years ago, we found that
the VMS implementation of strcpy() was considerably slower than locc-movc3
for the string lengths we were working on, and we decided that this was
the reason.
Mark Brader
In some article, someone adds these arrows and notes:
>>$strcpy2(ss1,ss2)
>>$char *ss1,*ss2;<---------------------------- put "register" on this line
>>${ register char *s1,*s2;<-------------- as many compilers will use this
>>$ as local variables overriding
>>$ s1 = ss1; the arguments.
[etc]
In article <99...@sri-spam.istc.sri.com> rob...@sri-spam.istc.sri.com
(Robert Allen) writes:
>To further speed this up you could write it as:
> strcpy2(ss1,ss2)
> register char *ss1, *ss2;
> {
> while (*s1++ = *s2++)
> ;
> }
>thus eliminating two assignments and two local variables.
Now these people all seem to be talking about Vaxen; I feel I must
point out the various errors here. It is clear that neither the
original someone nor Robert Allen actually compiled these to
assembly. (Use `cc -O -S' to do this.)
First, the notes attached to the arrows are wrong. Compilers cannot
override the arguments when the names differ. The loop copies from
s2 to s1; the parameters were ss2 and ss1. Second, the code produced
for the second example (once corrected to `while (*ss1++ = *ss2++)')
is *identical* to the code for the first! (Vax running 32V, 3BSD,
4BSD, and, no doubt, Sys3, Sys5, V8, and V9.) The two assignments
the second version attempts to eliminate must still occur, for
`ss1' and `ss2' are passed on the stack, and must be copied into
the two registers---the same two registers allocated in the first
version.
The proper way to speed strcpy() on a MicroVAX-II is no doubt to
use the following assembly code:
_strcpy:.globl _strcpy
.word 0 # save no registers
movq 4(ap),r1 # get s1 and s2 into r1 and r2
movl r1,r0 # save s1
1: movb (r2)+,(r1)+ # *s1++ = *s2++
bneq 1b # loop until a zero is moved
ret # return original s1 in r0
Note that this is remarkably similar to the compiler's output
for the original code, modified to have the proper return value:
char *
strcpy(ss1, ss2)
char *ss1;
char *ss2;
{
register char *s1 = ss1, *s2 = ss2;
while ((*s1++ = *s2++) != 0)
/* void */;
return (ss1); /* must return the original value */
}
.globl _strcpy
_strcpy:
.word 0xc00 # save r11 and r10
movl 4(ap),r11 # here is s1
movl 8(ap),r10 # and s2
L16: movb (r10)+,(r11)+ # *s1++ = *s2++
jneq L16 # (this assembles to a `bneq')
movl 4(ap),r0 # return the original s1
ret
All one can improve on the locc-poor MicroVAX-II is the register
usage and the parameter grabbing. (c2, at least from 32V to 4.3BSD,
will never turn two `movl's into a `movq'. Ah well.)
This does not follow at all. The SVID does not describe shared
libraries; this is the correct thing for it to do, since there may be
machines out there for which shared libraries are difficult to
implement. If we do shared libraries, it won't be because of the
SVID.
>If I remember correctly Apollo has always had resident libraries, but then
>I've never even as much as seen an Apollo product.
They do.
Yes, I confess. I didn't compile the version I typed into
the message, although I did confirm that it worked by coding
and running an example with the *correct* variable names.
Woe is me. I also appear to have been wrong about saving
the two assignment statements.
This isn't the only, or necessarily the best, way to implement shared
libraries. Primos implements shared libraries via faulted links
(call by name the first time a routine is called, which replaces the
faulting pointer with one to the actual routine). I think this is yet
another idea they borrowed from Multics. (The hardware overhead is a
fault bit on pointers, but you could implement it on a Vax by
reserving half of your address space. (Hmmm... does a user process
ever need to reference the kernel area of memory?)) Of course, VMS
uses entry vectors, and we all know VMS does everything right. :-)
--
Bob Larson
Arpa: Bla...@Usc-Eclb.Arpa
Uucp: (several backbone sites)!sdcrdcf!usc-oberon!castor.usc.edu!blarson
seismo!cit-vax!usc-oberon!castor.usc.edu!blarson
The System V.3 (and ANSI) *{scanf printf} routines have a format specifier
%n which stores the number of bytes {read, written} so far into the int
pointed to by the current argument. Also in System V.xx and ANSI, sprintf
returns the total number of bytes written (not the pointer to the string
like 4.2 BSD does).
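Assuming the System V/ANSI return value, sprintf alone can replace the strcpy/strlen chains shown earlier in the thread; a sketch:

```c
#include <stdio.h>

/* Build a string left to right, using the ANSI/System V sprintf
 * return value (number of bytes written) to keep an end offset,
 * instead of re-scanning with strlen after every piece. */
int build_message(char *buf)
{
    int n = 0;
    n += sprintf(buf + n, "piece");
    n += sprintf(buf + n, " another piece");
    n += sprintf(buf + n, " format %d", 42);
    return n;  /* total length, excluding the NUL */
}
```

Each byte is touched once, and there is no hidden second pass over the pieces already written.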
--
Michael Meissner, Data General Uucp: ...mcnc!rti-sel!dg_rtp!meissner
It is 11pm, do you know what your sendmail and uucico are doing?
> The proper way to speed strcpy() on a MicroVAX-II is no doubt to
> use the following assembly code:
>
> _strcpy:.globl _strcpy
> .word 0 # save no registers
> movq 4(ap),r1 # get s1 and s2 into r1 and r2
> movl r1,r0 # save s1
> 1: movb (r2)+,(r1)+ # *s1++ = *s2++
> bneq 1b # loop until a zero is moved
> ret # return original s1 in r0
>
> Note that this is remarkably similar to the compiler's output
> for the original code, modified to have the proper return value:
> All one can improve on the locc-poor MicroVAX-II is the register
> usage and the parameter grabbing. (c2, at least from 32V to 4.3BSD,
> will never turn two `movl's into a `movq'. Ah well.)
Replacing your loop with:
1: movb (r2)+,(r1)+
beql 2f
movb (r2)+,(r1)+
bneq 1b
2:
will almost certainly speed things up (say 15%). I haven't actually
tried it, but I've tried entirely analogous cases. Loop unrolling can
produce speed-up even when the instruction count is unchanged, if
taken branches are replaced by untaken branches.
Radford Neal
I don't know what you mean by "doesn't follow at all" (B-).
SVID specifies shared memory. The point is that having shared
memory implies ease of shared library implementation but not
vice versa. For systems that have neither it is probably easier
to just add shared library support than it is to go whole hog
and provide shared memory.
If I end up working at a Sun shop when I finish my degree here (about
two months from now), and Sun does not deliver shared libraries at
the time they distribute an OS with shared memory, you can be sure
that I will implement same right away.
Remember:
+--------------------------------------------+
| shared memory => shared resident libraries |
| |
| The arrow does not go the other direction |
+--------------------------------------------+
Bjorn R. Bjornsson
alberta!bjorn
Mmmm, not necessarily. I had to study this sort of thing in another
context recently. When the number of iterations is short -- remembering
that strcpy is often used for quite short strings -- an unrolled loop can
be slower than a simple one. The extra instruction fetches on the first
iteration of the unrolled loop can cost more than the extra loop control
in the simple loop.
(When the number of iterations is long, there is an optimal degree of
unrolling. For simple cache designs, the curve of "time taken for a long
copy" against the log of the unrolling factor is beautifully symmetrical,
with a definite and specific minimum. I haven't analyzed the situation
for more complex caches, although I conjecture vaguely similar results.)
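To make the tradeoff concrete, here is a 4-way unrolled byte copy (my own illustration, not code from any of the systems discussed): four bytes move per loop-control test, but a short string pays for the larger body on its first, and possibly only, iteration.

```c
/* 4-way unrolled copy loop: one backward branch per four bytes
 * copied, at the cost of a bigger loop body.  For very short
 * strings the extra instruction fetches can outweigh the saved
 * loop control. */
char *strcpy_unroll4(char *dst, const char *src)
{
    char *d = dst;

    for (;;) {
        if ((*d++ = *src++) == '\0') break;
        if ((*d++ = *src++) == '\0') break;
        if ((*d++ = *src++) == '\0') break;
        if ((*d++ = *src++) == '\0') break;
    }
    return dst;
}
```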
Note that the shared libraries implemented in System V, Release 3 do
*!NOT!* use the SVID shared memory mechanism. Maybe AT&T did the
wrong thing here, maybe not, but they did, at one point, have an
implementation that had shared memory but not shared libraries, and
now have an implementation with shared libraries and shared memory
but do not use the shared memory system calls to provide shared
libraries. (Note that S5R3 shared memory can't be write-protected;
this may have had something to do with their decision not to use
it....)
>If I end up working at a Sun shop when I finish my degree here (about
>two months from now), and Sun does not deliver shared libraries at
>the time they distribute an OS with shared memory, you can be sure
>that I will implement same right away.
Sun *already* distributes an OS with SVID-compatible shared memory,
although shared memory segments are currently wired down. This may
change in a future release. Most of the work in implementing shared
libraries is, as somebody here put it, in the "libraries" part, not
the "shared" part. Shared memory may be necessary, but it's far from
sufficient.
>(Note that S5R3 shared memory can't be write-protected;
>this may have had something to do with their decision not to use
>it....)
Not true; it can be write-protected, so that probably wasn't what
made them decide not to use it. There may have been other problems;
e.g., using S5 shared memory would require the C startup code to find
and map in the library, and they may not have wanted to do that, or
they may have decided it was too much work to arrange that the
library be mapped in at a particular address (which their shared
library implementation requires).
In this case, these instructions would be used in the hand-coded
strcpy and strcmp C library routines, so compilers need not be concerned.
My guess is they are still a win, because these libraries are used a bunch.
Jan Stubbs ....sdcsvax!ncr-sd!stubbs
619 485-3052
NCR Corporation
Advanced Development
16550 W. Bernardo Drive MS4010
San Diego, CA. 92127
void strcpy(s1,s2) register char *s1,*s2; { while (*s1++ = *s2++); }
--
|------------dan levy------------| Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa,
| an engihacker @ | vax135}!ttrdc!ttrda!levy
| at&t computer systems division | Disclaimer: try datclaimer.
|--------skokie, illinois--------|
It only becomes reasonable to tailor a system for a particular piece of
hardware when there are only a small number of variants that run that
architecture. In other words, this might have been fine when there was
the 780 and the 750 (nobody counted the 730 or MV-1 anyway) but once
you have a bunch of models, you just have to make the code
straightforward and don't do anything that *really* breaks on some
machine. I presume in the Vax case this means mostly avoiding the
unimplemented instructions.
I worked on an APL system for the IBM 360/370 and just finding out the
timings for the 15 or 20 models that could run the code was too much work,
let alone figuring out which combination would be best until IBM's next
release. (No flames on 15..20, this was in 1973!)
(Of course, the same applies to an "architecture" like C/Unix -- write
code that's straightforward and doesn't do anything that really breaks
anywhere. Super optimizing your C source is kinda hard these days --
are you *sure* it's better to code it this way on the Cray? IBM? DG?
DEC? 8080?)
It's true that a tailored shared library could give some benefit, but
the general problem extends to what code to generate inline, not just
in library routines.
> 3) Simple routines like strcpy should be recoded in assembler, at least to
> the degree of having their procedure prologues simplified, and so that they
> use registers which don't have to be restored.
> 4) In-line expansion of common (and simple) library routines should be
> considered.
These should both be done automatically by a good compiler. Compilers
that put in large procedure prolog/epilogs and don't simplify them when
possible have no excuses. Those that won't use the scratch registers
for variables when possible have excuses but newer compilers are beating
them -- excuses don't benchmark very well.
--
Copyright 1987 John Gilmore; you can redistribute only if your recipients can.
(This is an effort to bend Stargate to work with Usenet, not against it.)
{sun,ptsfa,lll-crg,ihnp4,ucbvax}!hoptoad!gnu g...@ingres.berkeley.edu
I found that the book, while useful, does mis-characterize a lot of
features. I presume that this is mostly because they said they worked
from the documentation, rather than actually testing the systems! A
shame, for a book where half the page count (130 pages) is the tables
of comparisons, and explanatory notes on the tables.
So far the system comparison I trust best is Guy Harris's, done in the
Sun "System V Enhancements Overview" manual (I think I got that title right),
since he actually had to merge the code...
PS: If you think you know what options grep, fgrep, and egrep take,
and what pattern metacharacters they handle -- try them! Note also that
there are or were two grep's (/bin/grep and /usr/ucb/grep) on 4.2BSD
and maybe 4.3. James Woods may be the only person alive who knows
all this stuff, since he redid it for speed and PD-ness.
Since when?
The problem is, shared memory requires adequate support from the
underlying machine architecture and memory management. Recognizing
this, the SVID (last I looked, my copy is not at hand now) specified
what the shared memory facility must look like IF it is implemented,
but did not require its implementation.
Beyond shared memory, shared libraries require substantial support
from the link editor etc. Compatibility with existing facilities
may well constrain Sun's ability to implement shared libraries.
Also, are shared libraries important enough to claim a substantial
chunk of their limited development resources? There are other
important things that also need to be taken care of.
Note that I did not specify where the entry vector was to be located.
It can be in the library or in your process. Having the vector in
your process rather than the library may give you a little bit more
flexibility at the cost of space, perhaps a significant amount
(depending on the library size and/or the number of routines you need).
>(call by name the first time a routine is called, which replaces the
>faulting pointer with one to the actual routine.)
I was aware of this method and you can use it in conjunction with
the entry point vector scheme. You initialize the vector with
distinct values that are guaranteed to cause an access violation.
> I think this is yet
>another idea they borrowed from Multics. (The hardware overhead is a
>fault bit on pointers, but you could implement it on a vax by
>reserving half of your address space.
As I alluded to above, all that is needed is that such pointers
are guaranteed to fault and that they be distinguishable from each
other. There is no requirement to reserve half the address space.
I'm getting a little rusty in some of the small details of VMS but
if memory still serves me the VMS run time library lives in system
space (this makes a lot of sense by the way).
> (Hmmm... does a user process
>ever need to reference the kernal area of memory?))
Certainly there have been a lot of service programs written
by non-DEC folks that execute partially in kernel mode to obtain
whatever info they need from VMS. This is not for the faint of
heart though and you do need CMKRNL (change mode to kernel) privileges.
Now whether you call such programs "user programs" or not is entirely
up to you.
> Of course, VMS
>uses entry vectors, and we all know VMS does everything right. :-)
It costs enough so it had better do everything right B-).
Bjorn R. Bjornsson
alberta!bjorn
Right on! Things like a[i++] = *++p; are totally lost on a Cray.
Unfortunately, the current Cray C compiler is none too great ... we're
working on it!
-----------------------
Bron Nelson {ihnp4, lll-lcc}!ohlone!nelson
Not the opinions of Cray Research
Of course, these instructions are also used in the library routines.
Another use that comes to mind is when the loader zeros out a BSS area...
It isn't always a conscious effort to trick customers into
buying the compiler. The compiler group might only have a
handful of "test programs" for their compiler. If the compiler
does any peephole optimizations, the compiler group probably
compiles whatever test programs they have to see how their
compiler might generate better code. If any of their test
programs are benchmarks, they will be spending time making the
benchmarks run fast.
Paul Scherf, Tektronix, Box 1000, MS 61-033, Wilsonville, OR, USA
tektronix!orca!paulsc
A simple but common logic flaw in my opinion. Granted that it can require
up to 15 or 20 times the effort to support 15 or 20 models, but the issue is
whether any such model is worth added support. I can understand a statement
like "I'm not going to optimize for the Lemon III Model B since Lemon Computer
Corporation hasn't even sold one yet." But John Gilmore seems to be saying:
"IBM was selling thousands of machines a month so the only sensible thing
was to move my product to a company whose market was so small they wouldn't
confuse me with multiple models."
Apologies to John -- the problem is likely to be with budgeting misconceptions
rather than the technical staff.
>
> It's true that a tailored shared library could give some benefit, but
> the general problem extends to what code to generate inline, not just
> in library routines.
The user doesn't demand a general solution. He just doesn't like his
application running 20 times slower than necessary. The plain fact is that
major savings can result from optimizing a few routines (strcpy, ldiv being
good examples).
> > 3) Simple routines like strcpy should be recoded in assembler, at least to
> > the degree of having their procedure prologues simplified, and so that they
> > use registers which don't have to be restored.
>
> These should both be done automatically by a good compiler....
"should be" but not necessarily "is". There are some *really pathetic*
compilers out there. From a recent poison remedy pamphlet:
Induce vomiting. If necessary show the subject the output
of Whitesmiths 68k C compiler.
James D. Allen -- opinions not necessarily necessary.
The major problem with this is programs which throw function pointers
around, instead of just calling the routines. If you aren't careful, you
wind up faulting *every* call in some contexts.
Frank Adams ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate 52 Oakland Ave North E. Hartford, CT 06108
Do you know of any major vendor who *does* provide that kind of
support? Does DEC provide different versions of the VMS libraries
for different VAX models? Does IBM provide different versions of the
MVS libraries for different 370 models?
>But John Gilmore seems to be saying: "IBM was selling thousands of machines
>a month so the only sensible thing was to move my product to a company whose
>market was so small they wouldn't confuse me with multiple models."
John is obviously NOT saying anything even remotely resembling that.
Show me where he said *anything* about *moving* his product from an
IBM machine. He's merely saying that it wasn't worth the effort
producing N versions of the APL system for N different IBM 370
models.
>The user doesn't demand a general solution. He just doesn't like his
>application running 20 times slower than necessary. The plain fact is that
>major savings can result from optimizing a few routines (strcpy, ldiv being
>good examples).
OK, show me a plain fact that indicates that, for anybody supplying
large volumes of some software package, you can get a 20X speed up by
tweaking the code for different models of a line of machines (not
"tweaking the code for machines with or without a given hardware
option", but "tweaking the code for different models of a line of
machines" - e.g., a VAX-11/780, VAX 8600, and VAX 8200, or a 370/168,
3033, 4381, and 3090).
>> These should both be done automatically by a good compiler....
>
>"should be" but not necessarily "is". There are some *really pathetic*
>compilers out there.
OK, are the compilers in question "pathetic" or "good"? They can't
be both....
This is part of the point of my research: to get everyone writing
compilers to have a large consistent set of programs to check
optimizations just like we have programs (validation suites) to check
conformance to language standards. I hope I can get the various companies
interested in a set of open performance-measurement tools.
From the Rock of Ages Home for Retired Hackers:
--eugene miya
NASA Ames Research Center
eug...@ames-aurora.ARPA
"You trust the `reply' command with all those different mailers out there?"
"Send mail, avoid follow-ups. If enough, I'll summarize."
{hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene
On Acorn's ARM RISC machine, using this feature as Brian Case suggests sped
up dhrystone by 20-25%, which may say something about dhrystone or about
RISC. This gives some 5-6K dhrystones on the slowly clocked pre-production
3 micron chips.
(Wonder what state of the art fabrication can give....)
Please, do the shared libraries ANYWAY! Pretty Please?
I'm TIRED of Megabyte sized SUN executables. And linking several tasks
together into a single executable (as recommended in the manuals) isn't
a solution, it's a hack.
.
.
.
strcmp2(s1, s2)
register short int *s1, *s2;
{
	while (*s1 == *s2++)
		if (!((*s1++ & 255) && (*(char *)s1)))
			return(0);
	return(*s1 - *--s2);
}
This relies on the fact that it doesn't matter if you compare the NULLs at the
end of two equal strings or not. We go along the string twice as fast because
of the 16 bit operations, although there is a slight penalty due to the extra
test. The alignment problem only comes in if either of
the strings begins on an odd boundary, which is fairly unusual anyway.
Also watch the last part of the 'if' condition - this won't work on
some machines; you may have to add 1 to the pointer.
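If portability matters more than the 16-bit loads, roughly half the
loop-overhead savings can be kept by plain unrolling instead; a minimal
sketch (strcmp2p is a name made up here, not a library routine):

```c
/* Byte-order-independent sketch of the same speedup idea: unroll the
 * compare loop by two, so about half the loop overhead remains, with
 * none of the alignment or endianness hazards of 16-bit loads. */
int strcmp2p(const char *s1, const char *s2)
{
    for (;;) {
        if (*s1 != *s2 || *s1 == '\0')
            break;
        s1++; s2++;
        if (*s1 != *s2 || *s1 == '\0')
            break;
        s1++; s2++;
    }
    return *(const unsigned char *)s1 - *(const unsigned char *)s2;
}
```

It trades a second test per iteration for one loop branch, which may or
may not pay off depending on the machine.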
--
-Adam.
/* If at first it don't compile, kludge, kludge again.*/
Or, to better mimic the original semantics, p = cat(buf, p1, ...) with no
malloc involved. (Yeah, the malloc version is useful too; I've got one of my
own.)
However, this is only useful if each operation is a strcpy or strcat; it
doesn't work if there are intermixed one-char stores (*p++ = c) or more
complex stuff (the embedded sprintf in the original example).
Of course, one could always write the whole thing with sprintf, using "%s%s%s"
format instead of strcat(strcat(strcpy(...))), but sometimes it's better to
use the low-level stuff.
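For what it's worth, a small sketch of the two styles side by side
(build_both is just an illustration; buffer sizes are the caller's
problem, as usual):

```c
#include <stdio.h>
#include <string.h>

/* Sketch: build "one two three" both ways.  Each strcat() rescans the
 * destination from the start, while one sprintf() with "%s%s%s" walks
 * each source once (though sprintf carries format-parsing overhead). */
void build_both(char *a, char *b)
{
    strcpy(a, "one ");
    strcat(strcat(a, "two "), "three");

    sprintf(b, "%s%s%s", "one ", "two ", "three");
}
```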
Karl W. Z. Heuer (ima!haddock!karl or ka...@haddock.isc.com), The Walking Lint
Another possible solution would be:
cat(p, n, p1, p2, ..., (char*)0);
Where p == destination string, n == amount of space available,
p1 == the first pointer, etc...
You might have two versions, one with n and one without. The debate is now
what it should return:
1. pointer to the initial string.
2. the number of characters cat()'ed.
3. a pointer to the terminating null.
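A minimal sketch of one possible cat(), assuming option 3 (return a
pointer to the terminating null, which makes chaining further appends
cheap); nothing here is an existing library routine:

```c
#include <stdarg.h>
#include <stddef.h>

/* A sketch of the proposed cat() (not an existing library routine):
 * copy each source string into dst, writing at most n bytes total
 * including the terminating null (assumes n >= 1).  This version
 * returns a pointer to the terminating null, option 3 above. */
char *cat(char *dst, size_t n, ...)
{
    va_list ap;
    const char *s;
    char *p = dst;
    char *end = dst + n - 1;       /* last byte reserved for the null */

    va_start(ap, n);
    while ((s = va_arg(ap, const char *)) != NULL) {
        while (*s != '\0' && p < end)
            *p++ = *s++;
    }
    va_end(ap);
    *p = '\0';
    return p;
}
```

Options 1 and 2 are trivial variations on the same loop; option 3 wins
when the caller wants to keep appending at the returned pointer.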
For Sun, yes, because their graphics libraries go into almost every
interactive program, and said libraries are *enormous*. Sun customers
are generally of the opinion that they have better uses for their disk
space (especially on small systems) than storing those libraries once
for every a.out.
mov200words: mov (a0)+,(a1)+
mov (a0)+,(a1)+
..... 200 mov's in all
mov (a0)+,(a1)+
rts
Then, if, say, a 64-word struct needed to be copied, the compiler would get
the pointers and then call mov200words+(200-64)*2 [ or whatever ] to do the
copy. This would provide unrolled-loop speed with only one loop unrolled in
the whole executable. [ Call it more than once for >200 words ]. Presumably
this would be faster than a loop on a PDP-11 or a 68000, but might lose on a
machine with an instruction cache, that could run a copy loop on-chip. A wizzo
block copy instruction may or may not run faster than the unrolled loop.
The only advantage I am claiming over other unrolled-loop techniques is the
almost complete lack of anything but payoff move operations in the above,
whilst avoiding large amounts of code whenever a copy is done.
Of course, this must have been done before :-)
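Indeed: the portable C analogue of calling into the middle of
mov200words is Duff's device, which jumps into an unrolled loop via a
switch. A sketch:

```c
#include <stddef.h>

/* The portable C analogue of jumping into mov200words: Duff's device.
 * The switch jumps into the middle of an unrolled loop, so the
 * count % 8 leftover elements share code with the full rounds. */
void copy_ints(int *to, const int *from, size_t count)
{
    size_t rounds = (count + 7) / 8;

    if (count == 0)
        return;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--rounds > 0);
    }
}
```

The difference from the scheme above is that the unrolled body is only
8 moves with a loop around it, rather than 200 moves and no loop.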
--
----------------------------------------------------------------------
Greg Smith University of Toronto UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...
This might work on some machines, but don't even think about it being
portable.. It isn't... The most obvious problem will occur on any
machine where the order of bytes in an int variable is [low, high].
This will cause the last line to fail to return the correct greater/
less-than result....
There's also the problem you mentioned about the odd int boundary. It may
be "fairly unusual anyway" because your coding style tends to place
strings on even boundaries without even thinking about it. I admit that
it hasn't happened in most of my code either, but if you just once tried
porting something that assumed even string boundaries to a compiler that
packs the static strings end to end, you'd realize this isn't a good idea.
---
John Stanley (jo...@viper.UUCP)
Software Consultant - DynaSoft Systems
UUCP: ...{amdahl,ihnp4,rutgers}!{meccts,dayton}!viper!john
This is extremely usual! Suppose you have a string where the first character
indicates what kind of string it is; I have this in a big program I'm working
on now. I therefore often say "dosomething(&string[1])".
In other words, the problem with these kinds of assumptions is that you can
add odd numbers to char pointers to get final substrings.
--
Alan J Rosenthal
fl...@csri.toronto.edu, {seismo!utai or utzoo}!utcsri!flaps,
flaps@toronto on csnet, flaps at utorgpu on bitnet.
"Probably the best operating system in the world is the [operating system]
made for the PDP-11 by Bell Laboratories." - Ted Nelson, October 1977
I was so impressed by this new trick (well, to *me* it is new:-)
that I immediately decided to try it on my Whitechapel MG-1,
a 32016 based machine; the results were impressive.
I coded strcpy() using this method, and the results were
great. Break-even with the normal strcpy() at 4-char strings, performance
slightly worse with 5/6/7-char strings, and getting better and
better from there on. For strings with length 4N (N>=4) performance
was twice that of the old strcpy(). This is the routine:
#define hasnull(x) (((x)-0x01010101) & ~(x) & 0x80808080)
strcpy(at,f)
long *at;
register long *f;
{
	register long d;
	register long *t = at;
	register char *fc, *tc;

	do {
		d = *f++;
		if( !hasnull(d) ) {
			*t++ = d;
			continue;
		}
		/* this word holds the null: finish a byte at a time */
		tc = (char *)t;
		fc = (char *)(f-1);
		while( *tc++ = *fc++ )
			;
		return(at);
	} while(1);
}
Coding in assembler caused a 30% decrease in time for small (10-char)
strings (fewer registers to save, t/tc and f/fc in the same reg, etc).
Something I haven't explained yet is that unaligned strings give the
*same* performance. Maybe the extra fetches are noise wrt the
number of instruction fetches?
Note that the 32016 is a 32 bit machine with a 16 bit bus,
so that is probably why I found twice the speed, instead of four
times.
Anyway, the next thing I thought of is "Wow! This is *great* for
strcmp() on big-endians. Comparing 4 bytes in one go through
the loop!". But, of course, I don't have a big-endian handy.
Anyone care to try this?
--
Jack Jansen, ja...@cwi.nl (or ja...@mcvax.uucp)
The shell is my oyster.
If one of the bytes contains 0x80, then has_nullbyte() will
say 'yes'. This can be circumvented by a more thorough test
after this one to see if there really is a null there.
Jack Jansen's subroutine does this apparently by accident;
there is a 'while( *tc++ = *fc++ );' after the test finds a 'null'
so the only effect of having a 0x80 byte will be to revert to
the normal strcpy for the rest of the string.
Someone posted a similar but usually better test on comp.arch ( I think )
a little while ago:
#define has_nullbyte(e) (((e+0x7efefeff)^e)&0x81010100) != 0x81010100
This one is only 'fooled' by an 0x80 in the most significant byte.
which makes the following test much simpler ( a sign test ).
In either case ( especially in this one with two identical
constants ) it helps a lot if the constants are in registers while
the loop is running. When using C, you can do this explicitly:
register k1=0x7efefeff, k2=0x81010100;
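As a concrete example of keeping the constants in registers, here is a
hedged word-at-a-time strlen() sketch using the subtract-style test from
Jack Jansen's routine; it assumes a 32-bit unsigned int and a source
buffer that is longword aligned and safe to read a word at a time
(strlen32 is a made-up name):

```c
typedef unsigned word32;        /* assumption: unsigned int is 32 bits */

unsigned strlen32(const char *s)
{
    const word32 *w = (const word32 *)s;
    /* the magic constants are held in register variables across the loop */
    register word32 k1 = 0x01010101U, k2 = 0x80808080U;
    word32 d;
    const char *p;

    do
        d = *w++;
    while (((d - k1) & ~d & k2) == 0);   /* loop while no byte is zero */

    /* the word just read holds the null: finish with a byte scan */
    for (p = (const char *)(w - 1); *p != '\0'; p++)
        ;
    return (unsigned)(p - s);
}
```

Whether the register declarations matter depends entirely on how good
the compiler's allocator is, of course.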
Virtual Memory Architecture in SunOS
Rob Gingell, Joe Moran, and Bill Shannon, Sun Microsystems
Shared Libraries in SunOS
Rob Gingell, Meng Lee, Xuong Dang, and Mary Weeks, Sun Microsystems
Sun has sold a LOT of 4MB Sun-3/50's that can't be upgraded with more
memory, so they can't just tell you to buy more memory if the 4.N and 5.N
software releases continue the balloon tradition. Probably shared libraries
is part of what will put more tomatoes in that same itty bitty can.
For those who haven't heard it, the half-serious rule of thumb is that
Sunnix release X requires 2^(17+X) bytes of memory for the kernel. I hear
that Sun claims this tradition will be broken with release 4...
> I was so impressed by this new trick (well, to *me* it is new:-)
> that I immediately decided to try it on my Whitechapel MG-1,
> a 32016 based machine; the results were impressive.
Be careful if you use this. It does not work correctly for strings near
the end of memory. Consider an (unaligned) string which is 21 bytes long
and is right at the end of data space (or any data segment on machines
which have such nonsense). The posted code will fault on the last read.
On machines where data space is always a multiple of 4 bytes long, this
method will work if you first copy enough bytes to get the source pointer
to a longword boundary and then start copying 4 bytes at a time.
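A sketch of that fix, assuming a 32-bit unsigned int: a byte prologue
runs until the source is longword aligned, after which aligned word
reads cannot cross into inaccessible memory (the name strcpy_aligned is
made up here; unaligned destination stores go through memcpy):

```c
#include <string.h>

/* Sketch of the fix described above: copy bytes until the source
 * pointer is longword aligned, then read a word at a time.  Aligned
 * word reads stay inside accessible memory if data space is always a
 * multiple of 4 bytes long. */
char *strcpy_aligned(char *to, const char *from)
{
    char *start = to;
    const unsigned *w;
    unsigned d;

    /* byte prologue: also catches strings that end before alignment */
    while (((unsigned long)from & 3) != 0) {
        if ((*to++ = *from++) == '\0')
            return start;
    }
    w = (const unsigned *)from;
    for (;;) {
        d = *w++;
        if ((d - 0x01010101U) & ~d & 0x80808080U)
            break;                 /* this word contains the null */
        memcpy(to, &d, sizeof d);  /* destination may be unaligned */
        to += sizeof d;
    }
    from = (const char *)(w - 1);  /* finish the last word by bytes */
    while ((*to++ = *from++) != '\0')
        ;
    return start;
}
```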
Someone else claimed that the has_nullbyte macro above does not work
correctly if the word contains a byte equal to 0x80. I do not see this.
Consider x==0x17801717.
x - 0x01010101 == 0x167f1616
~x == 0xe87fe8e8
(x - ONES) & ~x == 0x007f0000
has_nullbyte(x) == 0x00000000
The claim that has_nullbyte(x) will be TRUE appears to be incorrect.
(I was able to prove to myself that the macro always gives the correct
answer, but I won't bother to write it up and post it.)
Richard M. Mathews
Locus Computing Corporation lcc.r...@LOCUS.UCLA.EDU
lcc.richard@UCLA-CS
{ihnp4,trwrb}!lcc!richard
{randvax,sdcrdcf,ucbvax,trwspp}!ucla-cs!lcc!richard
Great idea, Gregory! I saw this implemented in a Pascal/PDP-11 compiler,
and it fascinated me then. Whether it's faster or slower than a block
move or a loop depends on the machine. For instance, on a RISCy machine
with a separate I-bus, the limiting factor is data accesses anyway, so
everything takes about the same time. On the Vax-11/780, the MOVC3 seems
to take almost the same time as the equivalent number of MOVLs, and rather
more time than MOVQs. Since it also destroys 6 registers, it should be
avoided.
Of course, for small structures you generate the sequence inline; on a Vax
maybe 7 or 8 MOVQs is OK, after that better go to the subroutine (called
by JSR of course). Has anyone published statistics on the size distribution
of Pascal arrays & records?
> Well, not to pick on you, but, have you ever done ANY programming using shared
> memory, succesfully, and bug free?
Just enough to be aware of the problems involved.
> Tell me, how would you do something like a
> strcpy in such an environment?
char *strcpy (to, from)
register char *to, *from;
{
	char *start = to;

	while ((*to++ = *from++) != 0)
		;
	return start;
}
But this would be with the understanding that a copy of a string in the
process of being changed might produce garbage; any process using strcpy()
to lengthen a string would be expected to plant a suitably located NUL first.
(Not necessarily at strlen(from)+to; at to+sizeof(tobuffer)-1 would do.)
If this isn't sufficient protection, then semaphores or the like become
appropriate.
> You should just say: ``No one should use shared memory, unless
> he/she knows what they are doing''
Well, I can certainly agree with that.
Mark Brader
You're mistaken; the "has_nullbyte(x)" defined above works
as advertised for all 32 bit x. That is to say, it returns
a nonzero result if and only if x contains a null byte.
>Someone posted a similar but usually better test on comp.arch ( I think )
>a little while ago:
>
>#define has_nullbyte(e) (((e+0x7efefeff)^e)&0x81010100) != 0x81010100
>
>This one is only 'fooled' by an 0x80 in the most significant byte.
>which makes the following test much simpler ( a sign test ).
You are right; this one does not always tell the truth. Besides,
it's a lot more effort.
However for tricks like these to work reliably (especially on
systems with memory protection) you had best get help from
your local linker and 'malloc'. Otherwise one day a program
is going to read off the edge of an accessible hunk of memory
and flush your program straight down the drain. Making sure
that no data item ends closer than n-1 bytes (if you're reading
n bytes at a time) from a boundary (between accessible and
inaccessible memory) fixes this.
Bjorn R. Bjornsson
{mnetor,ihnp4}!alberta!bjorn
Right. I apologize for any inconvenience I may have caused anybody.
I also apologize for posting bunk in a sleepy haze and then vanishing
for a week.
>
>>Someone posted a similar but usually better test on comp.arch ( I think )
>>a little while ago:
>>
>>#define has_nullbyte(e) (((e+0x7efefeff)^e)&0x81010100) != 0x81010100
>>
>>This one is only 'fooled' by an 0x80 in the most significant byte.
>>which makes the following test much simpler ( a sign test ).
>
>You are right; this one does not always tell the truth. Besides,
>it's a lot more effort.
It seems to be about the same, in the loop. Below is the subtract version,
then the add version, on an NS32k:
	movd	0(r0),r1	; subtract version
	subd	r2,r1
	bicd	0(r0),r1
	andd	r3,r1		;r2 & r3 contain constants
	cmpqd	$0,r1		;need this on ns32k

	movd	0(r0),r1	; add version
	addd	r2,r1
	xord	0(r0),r1
	andd	r3,r1
	cmpd	r3,r1
I can't remember why I thought it was more efficient. I guess it
is if you haven't got 'bic'. ( and if 'and' doesn't set the Z flag :-)
The original poster of this method was describing a RISC which
had a 'compare registers and branch if equal' op.
>
>However for tricks like these to work reliably (especially on
>systems with memory protection) you had best get help from
>your local linker and 'malloc'. Otherwise one day a program
>is going to read off the edge of an accessible hunk of memory
>and flush your program straight down the drain. Making sure
>that no data item ends closer than n-1 bytes (if you're reading
>n bytes at a time) from a boundary (between accessible and
>inaccessible memory) fixes this.
Not generally a problem. The whole point of this is that the 32-bit
accesses are constrained to 32-bit boundaries to make them fast.
Furthermore, in most systems, your access-controlled segments or
whatever are rounded up to at least a multiple of the machine
word. (The MMU does not usually deal with address lines a0,a1 on a
32-bit machine). Thus if it is legal to read the first byte in a word,
it is legal to read the whole word, and thus the last byte.
Untrue. It would be true if I had mistakenly used ^ instead of &~.