Explanation, please!

Douglas C. Schmidt

unread,

Aug 24, 1988, 10:16:14 PM8/24/88

to

The following piece of wonderful obscurity comes from Stroustup's
C++ Programming Language book, page 100:

void send(int *to,int *from, int count) {
int n = (count + 7) / 8;

switch(count % 8) {
case 0: do { *to++ = *from++;
case 7: *to++ = *from++;
case 6: *to++ = *from++;
case 5: *to++ = *from++;
case 4: *to++ = *from++;
case 3: *to++ = *from++;
case 2: *to++ = *from++;
case 1: *to++ = *from++;
} while (--n > 0);
}

}

Now, much to my surprise, this is not only valid C++, it is also valid C!
Could some one please explain to me why this is so? It seems like
the case 7-1 labels are actually nested inside the do {} while loop,
and thus not in the scope of the switch (should a break statement exit
both the switch and the loop, or just one?!?!).

Finally, Stroustrup asks the rhetorical question ``why would anyone
want to write something like this.'' Any guesses?!

thanks,

Doug Schmidt
--
Douglas Schmidt
"If our behavior is strict, we do not need fun." -Zippy th' Pinhead
"If our behavior is struct, we do not need defun." -Anon

Chris Torek

unread,

Aug 25, 1988, 3:16:15 AM8/25/88

to

In article <6...@paris.ICS.UCI.EDU> sch...@bonnie.ics.uci.edu (Douglas

C. Schmidt) writes:
>The following piece of wonderful obscurity comes from Stroustup's
>C++ Programming Language book, page 100:

> switch(count % 8) {

> case 0: do { *to++ = *from++;
> case 7: *to++ = *from++;

...

> case 1: *to++ = *from++;
> } while (--n > 0);
> }

>Now, much to my surprise, this is not only valid C++, it is also valid C!
>Could some one please explain to me why this is so? It seems like
>the case 7-1 labels are actually nested inside the do {} while loop,
>and thus not in the scope of the switch (should a break statement exit
>both the switch and the loop, or just one?!?!).

`break' exits the innermost switch or loop, hence a `break' in cases
7 through 1 exits the do-while, not the switch.

>Finally, Stroustrup asks the rhetorical question ``why would anyone
>want to write something like this.'' Any guesses?!

This has been called `Duff's device' (after Tom Duff, who probably did
not invent it first), and it looks exactly like what an optimising
compiler generates when it does loop unrolling. It works because case
labels are just that---labels. As long as there is a switch in
scope, a case label is legal; the case applies to the closest switch.
(And if C had used a separate keyword for `break switch', the whole
thing could be consistent :-) .)

Why? To quote a certain infamous C language hack :-) , `It looks exactly
like what an optimising compiler generates when it does loop unrolling.'

Incidentally, there are some compilers that choke on that form. I
would have to look hard at the dpANS to decide whether it is Officially
Legal.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: ch...@mimsy.umd.edu Path: uunet!mimsy!chris

Doug Gwyn

unread,

Aug 25, 1988, 12:16:29 PM8/25/88

to

In article <6...@paris.ICS.UCI.EDU> sch...@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:

- switch(count % 8) {
- case 0: do { *to++ = *from++;
- case 7: *to++ = *from++;
- case 6: *to++ = *from++;
- case 5: *to++ = *from++;
- case 4: *to++ = *from++;
- case 3: *to++ = *from++;
- case 2: *to++ = *from++;
- case 1: *to++ = *from++;
- } while (--n > 0);
- }

-Now, much to my surprise, this is not only valid C++, it is also valid C!
-Could some one please explain to me why this is so?

Could you explain why it shouldn't be? "case" labels are just a
special form of label. You can stick a label on most statements.

-It seems like the case 7-1 labels are actually nested inside the
-do {} while loop, and thus not in the scope of the switch

?? "Thus"? Why would the labels go out of scope? They're
definitely within the body of the switch.

-(should a break statement exit both the switch and the loop, or just one?!?!)

A "break" applies to whatever the innermost container is. A "break"
between two of the "*to++ = *from++;" statements would exit the
do-while loop.

-Finally, Stroustrup asks the rhetorical question ``why would anyone
-want to write something like this.'' Any guesses?!

It's about the fastest way to move arbitrarily-aligned data in
portable C with a guarantee as to what happens in the case that the
data overlaps. memcpy() doesn't guarantee anything for overlaps.
memmove() does, but that's a recent X3J11 invention that probably
doesn't exist on your system yet.

Daniel R. Levy

unread,

Aug 25, 1988, 7:39:20 PM8/25/88

to

< void send(int *to,int *from, int count) {
< int n = (count + 7) / 8;
<
< switch(count % 8) {
< case 0: do { *to++ = *from++;
< case 7: *to++ = *from++;
< case 6: *to++ = *from++;
< case 5: *to++ = *from++;
< case 4: *to++ = *from++;
< case 3: *to++ = *from++;
< case 2: *to++ = *from++;
< case 1: *to++ = *from++;
< } while (--n > 0);
< }
< }

Question: what if count==0?
--
|------------Dan Levy------------| THE OPINIONS EXPRESSED HEREIN ARE MINE ONLY
| Bell Labs Area 61 (R.I.P., TTY)| AND ARE NOT TO BE IMPUTED TO AT&T.
| Skokie, Illinois |
|-----Path: att!ttbcad!levy-----|

Jeff Siegal

unread,

Aug 26, 1988, 8:53:49 AM8/26/88

to

In article <28...@ttrdc.UUCP> le...@ttrdc.UUCP (Daniel R. Levy) writes:
>>[Duff's device loop example from C++ book]
>
>Question: what if count==0?

The program breaks. I prefer (and also use, in highly-speed-sensitive
code):

#define duff16(counter, block) \
switch (counter & 0x0f) { \
do \
{ \
counter -= 16; \
{ block; } \
case 15: { block; } \
case 14: { block; } \
case 13: { block; } \
case 12: { block; } \
case 11: { block; } \
case 10: { block; } \
case 9: { block; } \
case 8: { block; } \
case 7: { block; } \
case 6: { block; } \
case 5: { block; } \
case 4: { block; } \
case 3: { block; } \
case 2: { block; } \
case 1: { block; } \
case 0: /* null statement */; \
} while (counter >= 16); \
}

duff16(n, *to++ = *from++)

Jeff Siegal

Eric S. Raymond

unread,

Aug 26, 1988, 1:30:20 PM8/26/88

to

(Code below reproduced so that comp.arch people seeing this followup only
won't get terminally frustrated. This is *really neat*, gang...)

In article <6...@paris.ics.uci.edu> Douglas C. Schmidt writes:
>
> void send(int *to,int *from, int count) {
> int n = (count + 7) / 8;
>
> switch(count % 8) {
> case 0: do { *to++ = *from++;
> case 7: *to++ = *from++;
> case 6: *to++ = *from++;
> case 5: *to++ = *from++;
> case 4: *to++ = *from++;
> case 3: *to++ = *from++;
> case 2: *to++ = *from++;
> case 1: *to++ = *from++;
> } while (--n > 0);
> }
>
> }
>

> Finally, Stroustrup asks the rhetorical question ``why would anyone

> want to write something like this.'' Any guesses?!

Yeah. That's the most hackish way of trying to write a portable optimized
copy routine I've ever seen. I gather the whole point of the shenanigans
is to get all the *from++ -> *to++ instructions in the generated code to be
adjacent.

This only makes if the author knows he's got a hardware instruction pipeline
or cache that's no less than 8 and no more than 9 byte-copy instruction widths
long, and stuff executing out of the pipeline is a lot faster than if the
copies are interleaved with control transfers. Dollars to doughnuts this code
was written on a RISC machine.

(Gawrsh. That sounded just like one of the big boys on comp.arch tawkin'. I
think I'll cross-post over there just to see if I get shot down in flames...)

--
Eric S. Raymond (the mad mastermind of TMN-Netnews)
UUCP: ...!{uunet,att,rutgers}!snark!eric = er...@snark.UUCP
Post: 22 S. Warren Avenue, Malvern, PA 19355 Phone: (215)-296-5718

Henry Spencer

unread,

Aug 26, 1988, 3:01:36 PM8/26/88

to

In article <6...@paris.ICS.UCI.EDU> sch...@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:

>The following piece of wonderful obscurity comes from Stroustup's

>C++ Programming Language book, page 100...

Tom Duff, what hast thou wrought? :-)

>...

>Now, much to my surprise, this is not only valid C++, it is also valid C!
>Could some one please explain to me why this is so? It seems like
>the case 7-1 labels are actually nested inside the do {} while loop,
>and thus not in the scope of the switch (should a break statement exit
>both the switch and the loop, or just one?!?!).

You're thinking of switch as if it were Pascal's case; the correct model
is closer to Fortran's computed goto. The labels must appear within the
case body, but they can be (essentially) *anywhere* within the body,
including within nested blocks. The sensible programmer will not exploit
this freedom except in truly unusual situations, but it is available.

Break breaks out of the innermost construct that it could apply to, i.e.
"just one".

>Finally, Stroustrup asks the rhetorical question ``why would anyone
>want to write something like this.'' Any guesses?!

This construct is called "Duff's Device", after its discoverer. One can
often make a loop run faster by "unrolling" it, duplicating its body
N times and running one-Nth as many iterations -- this reduces the loop-
control overhead by a factor of N. The annoying part is that the number
of times one needs to do the body is usually not an even multiple of N,
so one has to do a partial iteration at beginning or end. The usual
assembly-language trick is to compute a start address in the middle
of the loop and start by branching there, not to the beginning, thus
doing a partial iteration first. Most high-level languages have no way
to express this. Tom Duff (of Bell Labs) realized that it could be done
in C. Ugh.
--
Intel CPUs are not defective, | Henry Spencer at U of Toronto Zoology
they just act that way. | uunet!attcan!utzoo!henry he...@zoo.toronto.edu

William E. Davidsen Jr

unread,

Aug 26, 1988, 4:22:24 PM8/26/88

to

In article <6...@paris.ICS.UCI.EDU> sch...@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:

| The following piece of wonderful obscurity comes from Stroustup's
| C++ Programming Language book, page 100:

[ a do loop within a switch, with cases all through the do loop ]

This does not seem to be clearly covered in the latest dpANS (section
3.6). The issue is if it is legal *and defined* to jump into a loop.
While a case label behaves just like any other label, it's not clear
what a jump into a loop implies, and I don't see that the existing
standard is any clearer on the topic than the discussion was three years
ago.

Consider:
i = 15;
if (j < 10) goto BadMove;

/* more code here */

do {
int i = 7;
int myvect[400];
BadMove: /* entry into block */
myvect[300] = i;
if (j < 5) return i;
} while (j-- > 302);

There was a lot of discussion of this and I don't see a clarification
in the standard as it stands. I think there are several possibilities:
1) it's not allowed
2) it's allowed, but the initializations at the start of the
block don't take place. What does this imply about the
block local variables?
3) it's allowed and the block local variables are allocated and
initialized as you would expect.
4) it's allowed but the results are implementation dependent.

In all other cases entry into a block causes allocation of the block
local (auto) variables. Initialization is performed at the time of
allocation. By allowing entry at any point the expected behavior is
unspecified.

Suggestion:
Forget (1), it's there for completeness. Pick any of 2-4 and add it to
the standard NOW, as an editorial change. We talked about this three
years ago, it can hardly be considered a recent issue. I would not
suggest (1) because it breaks a lot of existing code, but I would be
happy to see a (5) "you can't jump into a block with local variables."

This is one of the rare times when I would like to be wrong, to be
told "oh, it's there in a footnote, you can, or you can't, and this is
what it does." If the action is implementation dependent then I
(personally) wouldn't use the feature, since I value portability
highly. Please tell me X3J11 didn't let this go unresolved for three
years.
--
bill davidsen (we...@ge-crd.arpa)
{uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

Doug Gwyn

unread,

Aug 26, 1988, 5:18:14 PM8/26/88

to

In article <99...@eddie.MIT.EDU> j...@fenchurch.MIT.EDU (Jeff Siegal) writes:
>... I prefer (and also use, in highly-speed-sensitive code): ...

It is worth noting that byte-at-a-time is not optimal on most architectures.
Duff's device is useful for arbitrarily-aligned data, but if you can
arrange for (most of) the source and destination buffers to be relatively
word-aligned, then along with processing the head and tail pieces via
something like Duff's device, the bulk of the data can be transferred in
bigger chunks, gaining considerable increase in speed. One should really
use memcpy() in all cases where the source and destination do not overlap,
because all the work done to optimize block-move for a specific architecture
are supposed to be centralized there.

A couple of people have complained about the tone of my previous response
on this topic. I meant it to be informative and minimal, i.e. requiring
some thought on the reader's part. Since some have missed the point, let
me summarize the C style issue:

Since the constructs are not forbidden by the C language
rules, they are perforce permitted. One should have an
explicit rule to point at as being violated in order to
think that the code is not valid. "Intuition" is a lousy
guide, particularly when it was formed by experience in
other contexts (e.g. Pascal).

Now, because of the demands made on one's understanding
of the language, the cited code would be considered bad
programming style IF what it did could be effectively
accomplished in some other way. As I pointed out in my
previous posting on this, Duff's device does indeed have
use in cases where other approaches fail, namely moving
overlapping unaligned data. Eventually we can all use
memmove() for this, but not today.

Incidentally, Duff's device is used to manipulate text strings in the
"sam" text editor. I recently changed it to use the device for both
directions, only in cases of overlap, using memcpy() for all other
cases. (Originally the device was used for all transfers in one
direction and memcpy() for all in the other, which assumed more about
the behavior of memcpy() than is officially guaranteed.) So this is
not just an academic issue..

Douglas C. Schmidt

unread,

Aug 27, 1988, 3:57:11 AM8/27/88

to

Hi,

Since I posted my original question there has been a great deal of
abstract discussion about the relative merits of the loop unrolling
scheme. The topic has piqued my curiousity, so I when ahead and
implemented a short test program, included below, to test Duff's
device against the ``ordinary for loop w/index variable'' technique.
See for yourself....

After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
Sequent Balance was pretty heavily in favor of the regular (non-Duff)
loop. Your mileage may vary. I realize that there may be other
tests, and if anyone has a better version, I'd like to see it!

Doug Schmidt
----------------------------------------
#include <sys/time.h>
#include <sys/resource.h>

double Start_Timer();
double Return_Elapsed_Time();

#define MAX_NUM 100000
int array1[MAX_NUM ];
int array2[MAX_NUM ];
int *A = array1, *B = array2;

main(int argc, char *argv[]) {
double Elapsed_Time;
int Count = argc > 1 ? atoi(argv[1]) : MAX_NUM;
int i;

for (i = 0;i < Count ;i++) {
array1[i] = i + 1;
array2[i] = i;
}

printf("Starting Duff's device timing...\n");
Start_Timer();

{
int n = (Count + 7) / 8;

switch(Count % 8) {
case 0: do { *A++ = *B++;
case 7: *A++ = *B++;
case 6: *A++ = *B++;
case 5: *A++ = *B++;
case 4: *A++ = *B++;
case 3: *A++ = *B++;
case 2: *A++ = *B++;
case 1: *A++ = *B++;
} while (--n > 0);
}
}

Elapsed_Time = Return_Elapsed_Time(0.0 );
printf("Elapsed time = %.3f\n",Elapsed_Time);

for (i = 0;i < Count ;i++) {
if (array1[i] != array2[i]) {
printf("Yow, problems at location %d!\n",i);
break;
}
}

for (i = 0;i < Count ;i++) {
array1[i] = i + 1;
array2[i] = i;
}

printf("Starting ordinary copy timing...\n");
Start_Timer();

for (i = 0;i < Count ;i++) {
array1[i] = array2[i];
}

Elapsed_Time = Return_Elapsed_Time(0.0 );
printf("Elapsed time = %.3f\n",Elapsed_Time);

for (i = 0;i < Count ;i++) {
if (array1[i] != array2[i]) {
printf("Yow, problems at location %d!\n",i);
break;
}
}

}

/* no such thing as "negative time"! */
#define ERROR_VALUE -1.0

static struct rusage Old_Time;
static struct rusage New_Time;
static int Timer_Set = 0;

#ifdef __STDC__
double Start_Timer(void)
#else
double Start_Timer()
#endif
{
Timer_Set = 1;
getrusage(RUSAGE_SELF,&Old_Time); /* set starting process time */
return(Old_Time.ru_utime.tv_sec + (Old_Time.ru_utime.tv_usec / 1000000.0));
}

/* returns process time since Last_Time (if parameter is not DEFAULT_TIME, */
/* i.e., (double) 0.0 ),otherwise, if parameter == DEFAULT_TIME then */
/* the time since the Old_Time was set is returned. Returns ERROR_VALUE */
/* if Start_Timer() is not called first */
#ifdef __STDC__
double Return_Elapsed_Time(double Last_Time)
#else
double Return_Elapsed_Time(Last_Time)
double Last_Time;
#endif
{
if (!Timer_Set) {
return(ERROR_VALUE);
}
else {
/* get process time */
getrusage(RUSAGE_SELF,&New_Time);
if (Last_Time == 0.0) {
return((New_Time.ru_utime.tv_sec - Old_Time.ru_utime.tv_sec) +
((New_Time.ru_utime.tv_usec - Old_Time.ru_utime.tv_usec)
/ 1000000.0));
}
else {
return((New_Time.ru_utime.tv_sec +
(New_Time.ru_utime.tv_usec / 1000000.0)) - Last_Time);
}
}
}

--
sch...@bonnie.ics.uci.edu (ARPA)

T. William Wells

unread,

Aug 27, 1988, 10:55:24 AM8/27/88

to

In article <6...@paris.ICS.UCI.EDU> sch...@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:

: The following piece of wonderful obscurity comes from Stroustup's

: C++ Programming Language book, page 100:
:
: void send(int *to,int *from, int count) {
: int n = (count + 7) / 8;
:
: switch(count % 8) {
: case 0: do { *to++ = *from++;
: case 7: *to++ = *from++;
: case 6: *to++ = *from++;
: case 5: *to++ = *from++;
: case 4: *to++ = *from++;
: case 3: *to++ = *from++;
: case 2: *to++ = *from++;
: case 1: *to++ = *from++;
: } while (--n > 0);
: }
:
: }
:
: Now, much to my surprise, this is not only valid C++, it is also valid C!
: Could some one please explain to me why this is so? It seems like
: the case 7-1 labels are actually nested inside the do {} while loop,
: and thus not in the scope of the switch (should a break statement exit
: both the switch and the loop, or just one?!?!).

A switch statement has the syntax

switch (<expression>) <statement>

While it is conventional to use a compound statement as the
embedded statement, there is nothing requiring it. What a switch
does is permit the definition of a set of labels within the
statement, to which control is passed by the head of the switch.

The exclusive use of compound statements is not only conventional
coding practice, it is also good coding practice.

: Finally, Stroustrup asks the rhetorical question ``why would anyone

: want to write something like this.'' Any guesses?!

The false god of efficiency has reared it ugly head. This
routine would be imagined to be more efficient than the almost
equivalent:

void
send(
int *to,
int *from,
int count)
{
if (count <= 0) {
create progam bug
}
while (--count >= 0) {
*to++ = *from++;
}
}

However, it often (always?) is not. Consider a machine which has
a fast means of moving small blocks of memory. The following
routine should do the same thing (without the bug, of course),
might be more efficient, and is more understandable as well. Of
course, it might generate a smidgin more code, but I think that,
in this case, this is a small price to pay.

void
send(
int *to,
int *from,
int count)
{
if (count <= 0) {
return;
}
switch (count % 7) {

case 7: *to++ = *from++;
case 6: *to++ = *from++;
case 5: *to++ = *from++;
case 4: *to++ = *from++;
case 3: *to++ = *from++;
case 2: *to++ = *from++;
case 1: *to++ = *from++;
}

count >>= 3;
while (--count >= 0) {
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
}
}

However, at least two IBM-PC compilers ought to generate better
code from the "unoptimized" version. (Actually, they do it if
the copied items are characters; I presume that they'll do it for
integers.) I wouldn't be surprised if other compilers did as
well.

void
send(
int *to,
int *from,
int count)
{
/* If you are really paranoid, add the following code, it
catches the case where count is minus full scale on a
two's complement machine. I tend to not do this,
since I consider minus full scale to be an illegal
value and take the minimal pains needed to never
generate it.

if (count <= 0) {
return;
} */

while (--count >= 0) {
*to++ = *from++;
}
}

---
Bill
novavax!proxftl!bill

T. William Wells

unread,

Aug 27, 1988, 11:00:53 AM8/27/88

to

In article <13...@mimsy.UUCP> ch...@mimsy.UUCP (Chris Torek) writes:
: Incidentally, there are some compilers that choke on that form. I

: would have to look hard at the dpANS to decide whether it is Officially
: Legal.

It is.

---
Bill
novavax!proxftl!bill

Henry Spencer

unread,

Aug 27, 1988, 9:00:19 PM8/27/88

to

In article <6...@paris.ICS.UCI.EDU> sch...@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:

>After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
>Sequent Balance was pretty heavily in favor of the regular (non-Duff)

>loop...

The odds are excellent that the compilers are applying optimizations to
the regular loop that they can't do for the Duff loop as you've written
it. In particular, your pointers aren't even register variables (and they
are externs, so the compiler can't safely promote them quietly), whereas
a good optimizing compiler will certainly be using register pointers in
the regular loop.

My partner in crime, Geoff Collyer, applied Duff's Device in a number of
places in the C News rnews; the performance improvements were substantial.

Henry Spencer

unread,

Aug 27, 1988, 11:19:26 PM8/27/88

to

In article <6...@proxftl.UUCP> bi...@proxftl.UUCP (T. William Wells) writes:
>[Duff's Device]

>The false god of efficiency has reared it ugly head. This
>routine would be imagined to be more efficient than the almost

>equivalent: ...

> while (--count >= 0)
> *to++ = *from++;
>

>However, it often (always?) is not...

On the contrary, it often (usually?) is. This is from experience, not
theory. (Specifically, experience in C News and related code.) The
major exceptions are (a) the 68010, on which unrolling of simple loops
is a mistake because of "loop mode" [the hardware's very limited loop
cache], and (b) machines with fairly smart compilers and specialized
bulk-transfer instructions, as the compilers may be able to recognize
and optimize the simple loop but not the complex one.

In general, however, in the long run the correct way to implement bulk
data copying is to call "memcpy", which (in the long run) is likely
to be recognized and given special attention by most compilers.

Daniel R. Levy

unread,

Aug 27, 1988, 11:24:29 PM8/27/88

to

In article <6...@paris.ICS.UCI.EDU>, sch...@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
> Hi,
>
> Since I posted my original question there has been a great deal of
> abstract discussion about the relative merits of the loop unrolling
> scheme. The topic has piqued my curiousity, so I when ahead and
> implemented a short test program, included below, to test Duff's
> device against the ``ordinary for loop w/index variable'' technique.
> See for yourself....
>
> After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
> Sequent Balance was pretty heavily in favor of the regular (non-Duff)
> loop. Your mileage may vary. I realize that there may be other
> tests, and if anyone has a better version, I'd like to see it!

[schmidt's program deleted, please refer to parent article if you want it]

I modified this program to run under System V, changed the arrays to be dynam-
ically allocated, and changed both the Duff and ordinary copies to use register
pointers instead of global pointers (for the Duff copy) and array indexing (for
the ordinary copy). I then tried it on a SVR2 3B20, a SVR3 3B2, a Sun-3, and a
Sun-4 both with and without -O optimization (using the standard pcc-type C
compiler on each system). The result? Duff wins by about 10%-20% on all
machines tested.

Here is the code thus modified:

#include <sys/types.h>
#include <sys/times.h>
#include <sys/param.h>

double Start_Timer();
double Return_Elapsed_Time();

#define DEFAULT_NUM 100000

main(argc,argv) char **argv; {
double Elapsed_Time_Duff;
double Elapsed_Time_Ordinary;
int Count = argc > 1 ? atoi(argv[1]) : DEFAULT_NUM;
int i;
register int *A, *B;
register int n;
register int *A_end;
int *array1, *array2;
char *malloc();

if (Count <= 0) {
printf("Bad Count\n");
return 1;
} else {
printf("Count=%d\n", Count);
}

if (!(array1=(int *)malloc(sizeof(int) * Count))) {
printf("Can't malloc %u bytes for array1\n", sizeof(int) * Count);
return 1;
} else if (!(array2=(int *)malloc(sizeof(int) * Count))) {
printf ("Can't malloc %u bytes for array2\n", sizeof(int) * Count);
return 1;

}

for (i = 0;i < Count ;i++) {
array1[i] = i + 1;
array2[i] = i;
}

printf("Starting Duff's device timing...\n");
Start_Timer();

{

n = (Count + 7) / 8;

A = array1;
B = array2;

switch(Count & 0x7) {

case 0: do { *A++ = *B++;
case 7: *A++ = *B++;
case 6: *A++ = *B++;
case 5: *A++ = *B++;
case 4: *A++ = *B++;
case 3: *A++ = *B++;
case 2: *A++ = *B++;
case 1: *A++ = *B++;
} while (--n > 0);
}
}

Elapsed_Time_Duff = Return_Elapsed_Time(0.0 );
printf("Elapsed time = %.3lf\n",Elapsed_Time_Duff);

for (i = 0;i < Count ;i++) {
if (array1[i] != array2[i]) {
printf("Yow, problems at location %d!\n",i);
break;
}
}

for (i = 0;i < Count ;i++) {
array1[i] = i + 1;
array2[i] = i;
}

printf("Starting ordinary copy timing...\n");
Start_Timer();

{
A = array1;
B = array2;
A_end = array1 + Count - 1;

while (A <= A_end)
*A++ = *B++;
}

Elapsed_Time_Ordinary = Return_Elapsed_Time(0.0 );
printf("Elapsed time = %.3lf\n",Elapsed_Time_Ordinary);

for (i = 0;i < Count ;i++) {
if (array1[i] != array2[i]) {
printf("Yow, problems at location %d!\n",i);
break;
}
}

printf("Duff/Ordinary copy time ratio = %.3lf\n",
Elapsed_Time_Duff/Elapsed_Time_Ordinary);

}

/* no such thing as "negative time"! */
#define ERROR_VALUE -1.0

static struct tms Old_Time;
static struct tms New_Time;

static int Timer_Set = 0;

#ifdef __STDC__
double Start_Timer(void)
#else
double Start_Timer()
#endif
{
Timer_Set = 1;

times(&Old_Time); /* set starting process time */
return ((double)Old_Time.tms_utime)/(double)HZ;
}

/* returns process time since Last_Time (if parameter is not DEFAULT_TIME, */
/* i.e., (double) 0.0 ),otherwise, if parameter == DEFAULT_TIME then */
/* the time since the Old_Time was set is returned. Returns ERROR_VALUE */
/* if Start_Timer() is not called first */
#ifdef __STDC__
double Return_Elapsed_Time(double Last_Time)
#else
double Return_Elapsed_Time(Last_Time)
double Last_Time;
#endif
{
if (!Timer_Set) {
return(ERROR_VALUE);
}
else {
/* get process time */

times(&New_Time);
if (Last_Time == 0.0) {
return ((double)(New_Time.tms_utime - Old_Time.tms_utime))/(double) HZ;
}
else {
return (((double)New_Time.tms_utime)/(double)HZ) - Last_Time;

Doug Gwyn

unread,

Aug 28, 1988, 1:19:01 AM8/28/88

to

In article <dpmuY#2EBC4R=er...@snark.UUCP> er...@snark.UUCP (Eric S. Raymond) writes:
>Dollars to doughnuts this code was written on a RISC machine.

Send me my doughnuts! (Or dollars, if you prefer.) RISC machines
were practically non-existent when this C code was first written down
(whether or not by Tom Duff, I'm not sure). It probably appeared on
a PDP-11, perhaps a VAX.

I suspect the first use of Duff's device was as an "efficiency hack",
but as you observe it is not universally the most efficient method.

The modern VALID reason for such code is that it is a relatively
efficient way to do something portably that CANNOT BE DONE by
using the "preferred block copy routine" supplied by an implementation
(be it bcopy() or memcpy()). Whenever the preferred routine will
work correctly (in the portable case, that means whenever the source
and destination fields do not overlap), then of course one should use
it, especially since if the C vendors are doing their job, the C
library routine (which could also be expanded in-line by the compiler)
will be near-optimally tuned for each specific architecture.

P.S. Some implementations of bcopy() and memcpy() do guarantee
proper handling of overlapping fields. But not all do, so one cannot
count on it if one's trying to write truly portable code. ANSI C will
specify a new function memmove() with this property guaranteed.

Henry Spencer

unread,

Aug 28, 1988, 8:21:01 PM8/28/88

to

In article <dpmuY#2EBC4R=er...@snark.UUCP> er...@snark.UUCP (Eric S. Raymond) writes:

>This only makes if the author knows he's got a hardware instruction pipeline
>or cache that's no less than 8 and no more than 9 byte-copy instruction widths
>long, and stuff executing out of the pipeline is a lot faster than if the
>copies are interleaved with control transfers. Dollars to doughnuts this code
>was written on a RISC machine.

Nope. Bell Labs Research uses VAXen and 68Ks, mostly.

The key point is not pipelining, but loop-control overhead. There is in
fact a tradeoff here: unrolling the loop further will reduce control
overhead further, but will increase code size. That last is of some
significance when caching gets into the act: cache-loading overhead
favors short loops, and small cache sizes very strongly favor short ones.
In general there is an optimal point in there somewhere, and an unrolling
factor of 8 or 16 is a pretty good guess at it on the machines I've looked
at closely.

T. William Wells

unread,

Aug 29, 1988, 1:36:24 AM8/29/88

to

In article <11...@steinmetz.ge.com> davi...@crdos1.UUCP (bill davidsen) writes:
: [ a do loop within a switch, with cases all through the do loop ]

:
: This does not seem to be clearly covered in the latest dpANS (section
: 3.6). The issue is if it is legal *and defined* to jump into a loop.

It is defined, and the same as it always has been. To quote from
section 3.1.2.4:

"Storage is guaranteed to be reserved for a new instance of such
an object [objects with *automatic storage duration*] on each
normal entry into the block in which it is declared, or on a jump
from outside the block to a label in the block or in an enclosed
block. If an initialization is specified for the value stored in
the object, it is performed on each normal entry, but not if the
block is entered by a jump to a label."

: While a case label behaves just like any other label, it's not clear

: what a jump into a loop implies, and I don't see that the existing
: standard is any clearer on the topic than the discussion was three years
: ago.

:
: i = 15;

: if (j < 10) goto BadMove;
:
: /* more code here */
:
: do {
: int i = 7;
: int myvect[400];
: BadMove: /* entry into block */
: myvect[300] = i;
: if (j < 5) return i;
: } while (j-- > 302);

The above tells us that your example 1) allocates space for i and
myvect before or as the block is entered, and 2) does not
initialize `i' when the loop is entered via the goto, but does on
each subsequent iteration of the loop.

: This is one of the rare times when I would like to be wrong, to be

: told "oh, it's there in a footnote, you can, or you can't, and this is
: what it does." If the action is implementation dependent then I
: (personally) wouldn't use the feature, since I value portability
: highly. Please tell me X3J11 didn't let this go unresolved for three
: years.

Nope. All they did was move it from 3.6.2 (the equivalent of K&R
9.2) to 3.1.2.4. I too have been bitten by not having the
standard say things in the expected place. (Remember my
NULL!=NULL posting? I got corrected by no less than dmr
himself! Flaming red :-)

From K&R section 9.2:

"Any initializations of auto or register variables are performed
each time the block is entered at the top. It is currently
possible (but a bad practice) to transfer into a block; in that
case the initializations are not performed."

Nothing could be clearer, eh?

---
Bill
novavax!proxftl!bill

T. William Wells

unread,

Aug 29, 1988, 3:58:01 AM8/29/88

to

In article <1988Aug28.0...@utzoo.uucp> he...@utzoo.uucp (Henry Spencer) writes:

: In article <6...@proxftl.UUCP> bi...@proxftl.UUCP (T. William Wells) writes:
: >[Duff's Device]
: >The false god of efficiency has reared it ugly head. This
: >routine would be imagined to be more efficient than the almost
: >equivalent: ...
: > while (--count >= 0)
: > *to++ = *from++;
: >
: >However, it often (always?) is not...
:
: On the contrary, it often (usually?) is. This is from experience, not
: theory. (Specifically, experience in C News and related code.)

Well, I'll bow to experimental evidence. However, that does make
me wish that more compilers went to the effort to deal with this
kind (the "unoptimized" version) of loop, seeing as how it is so
commonly important.

: In general, however, in the long run the correct way to implement bulk

: data copying is to call "memcpy", which (in the long run) is likely
: to be recognized and given special attention by most compilers.

Agreed. That is how I would wish it to be done.

---
Bill
novavax!proxftl!bill

Dave Martindale

unread,

Aug 29, 1988, 2:01:20 PM8/29/88

to

In article <1988Aug28.0...@utzoo.uucp> he...@utzoo.uucp (Henry Spencer) writes:
>

>In general, however, in the long run the correct way to implement bulk
>data copying is to call "memcpy", which (in the long run) is likely
>to be recognized and given special attention by most compilers.

For the case of copying bytes, yes. But there are other operations
done in tight loops that do a bit more work than a simple byte move,
so memcpy is not usable, yet little enough that unrolling the loop
is still worthwhile. Thus Duff's Device will remain useful, until
all compilers are smart enough to do loop unrolling on their own.

For example: move while packing or unpacking data, table lookup, perhaps
CRC generation.

Dave Martindale

Tom Duff

unread,

Aug 29, 1988, 4:33:51 PM8/29/88

to

I normally do not read comp.lang.c, but Jim McKie told me
that ``Duff's device'' had come up in comp.lang.c again. I
have lost the version that was sent to netnews in May 1984,
but I have reproduced below the note in which I originally
proposed the device. (If anybody has a copy of the netnews
version, I would gratefully receive a copy at research!td or
t...@research.att.com.)

To clear up a few points:
1) The point of the device is to express general
loop unrolling directly in C. People who have
posted saying `just use memcpy' have missed the
point, as have those who have criticized it using
various machine-dependent memcpy implementations
as support. In fact, the example in the message is
not implementable as memcpy, nor is any computer
likely to have an memcpy-like idiom that implements
it.

2) Somebody claimed that while the device was named
for me, I probably didn't invent it. I almost
certainly did invent it. I had definitely not
seen or heard of it when I came upon it, and nobody
has ever even claimed prior knowledge, let alone
provided dates and times. Note the headers on the
message below: apparently I invented the device
on November 9, 1983, and was proud (or disgusted)
enough to send mail to dmr. Please note that I
do not claim to have invented loop unrolling, merely
this particular expression of it in C.

3) The device is legal dpANS C. I cannot quote chapter
and verse, but Larry Rosler, who was chairman of the
language subcommittee (I think), has assured me that X3J11
considered it carefully and decided that it was legal.
Somewhere I have a note from dmr certifying that all
the compilers that he believes in accept it. Of course,
the device is also legal C++, since Bjarne uses it in
his book.

4) Somebody invoked (or more properly, banished) the
`false god of efficiency.' Careful reading of my
original note will put this slur to rest. The
alternative to genuflecting before the god of
code-bumming is finding a better algorithm. It
should be clear that none such was available. If
your code is too slow, you must make it faster. If no
better algorithm is available, you must trim cycles.

5) The same person claimed that the device wouldn't exhibit
the desired speed-up. The argument was flawed in two
regards: first, it didn't address the performance of
the device, but rather the performance of one of its
few uses (implementing memcpy) for which many machines
have a high-performance idiom. Second, the poster
made his claims in the absence of timing data, which
renders his assertion suspect. A second poster tried
the test, but botched the implementation, proving
only that with diligence it is possible to make anything
run slowly.

6) Even Henry Spencer, who hit every other nail square on
the end with the flat round thing stuck to it, made a
mistake (albeit a trivial one). Here is Henry replying
to bi...@proxftl.UUCP (T. William Wells):
>>... Dollars to doughnuts this code

>>was written on a RISC machine.

>Nope. Bell Labs Research uses VAXen and 68Ks, mostly.

I was at Lucasfilm when I invented the device.

7) Transformations like this can only be justified by measuring the
resulting code. Be careful when you use this thing that you don't
unwind the loop so much that you overflow your machine's instruction
cache. Don't try to be smarter than an over-clever C compiler that
recognizes loops that implement block move or block clear and compiles
them into machine idioms.

Here then, is the original document describing Duff's device:

From research!ucbvax!dagobah!td Sun Nov 13 07:35:46 1983
Received: by ucbvax.ARPA (4.16/4.13)
id AA18997; Sun, 13 Nov 83 07:35:46 pst
Received: by dagobah.LFL (4.6/4.6b)
id AA01034; Thu, 10 Nov 83 17:57:56 PST
Date: Thu, 10 Nov 83 17:57:56 PST
From: ucbvax!dagobah!td (Tom Duff)
Message-Id: <831111015...@dagobah.LFL>
To: ucbvax!decvax!hcr!rrg, ucbvax!ihnp4!hcr!rrg, ucbvax!research!dmr,
ucbvax!research!rob

Consider the following routine, abstracted from code which copies an
array of shorts into the Programmed IO data register of an Evans &
Sutherland Picture System II:

send(to, from, count)
register short *to, *from;
register count;
{
do
*to = *from++;
while(--count>0);
}

(Obviously, this fails if the count is zero.)
The VAX C compiler compiles the loop into 2 instructions (a movw and
a sobleq, I think.) As it turns out, this loop was the bottleneck in
a real-time animation playback program which ran too slowly by about 50%.
The standard way to get more speed out of something like this is to unwind
the loop a few times, decreasing the number of sobleqs. When you do that,
you wind up with a leftover partial loop. I usually handle this in C with
a switch that indexes a list of copies of the original loop body. Of
course, if I were writing assembly language code, I'd just jump into the
middle of the unwound loop to deal with the leftovers. Thinking about this
yesterday, the following implementation occurred to me:

send(to, from, count)
register short *to, *from;
register count;
{
register n=(count+7)/8;
switch(count%8){
case 0: do{ *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
}while(--n>0);
}
}

Disgusting, no? But it compiles and runs just fine. I feel a combination
of pride and revulsion at this discovery. If no one's thought of it before,
I think I'll name it after myself.

It amazes me that after 10 years of writing C there are still little corners
that I haven't explored fully. (Actually, I have another revolting way to
use switches to implement interrupt driven state machines but it's too
horrid to go into.)

Many people (even bwk?) have said that the worst feature of C is that
switches don't break automatically before each case label. This code forms
some sort of argument in that debate, but I'm not sure whether it's for or
against.

yrs trly
Tom

Charles Simmons

unread,

Aug 30, 1988, 12:31:27 AM8/30/88

to

In article <28...@ttrdc.UUCP> le...@ttrdc.UUCP (Daniel R. Levy) writes:

>In article <6...@paris.ICS.UCI.EDU>, sch...@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
>> Since I posted my original question there has been a great deal of
>> abstract discussion about the relative merits of the loop unrolling
>> scheme. The topic has piqued my curiousity, so I when ahead and
>> implemented a short test program, included below, to test Duff's
>> device against the ``ordinary for loop w/index variable'' technique.
>> See for yourself....
>>
>> After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
>> Sequent Balance was pretty heavily in favor of the regular (non-Duff)
>> loop. Your mileage may vary. I realize that there may be other
>> tests, and if anyone has a better version, I'd like to see it!
>

>I modified this program to run under System V, changed the arrays to be dynam-
>ically allocated, and changed both the Duff and ordinary copies to use register
>pointers instead of global pointers (for the Duff copy) and array indexing (for
>the ordinary copy). I then tried it on a SVR2 3B20, a SVR3 3B2, a Sun-3, and a
>Sun-4 both with and without -O optimization (using the standard pcc-type C
>compiler on each system). The result? Duff wins by about 10%-20% on all
>machines tested.

I then added a piece to the program to use 'memcpy'. The results?
Duff beats a simple loop by 10%. 'memcpy' is 9 times faster than
Duff. So why do people spend so much time avoiding standard subroutines?

-- Chuck

Lawrence V. Cipriani

unread,

Aug 30, 1988, 8:56:52 AM8/30/88

to

In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> ch...@amdahl.uts.amdahl.com (Charles Simmons) writes:
[discussion of Duff copy deleted]

>I then added a piece to the program to use 'memcpy'. The results?
>Duff beats a simple loop by 10%. 'memcpy' is 9 times faster than
>Duff. So why do people spend so much time avoiding standard subroutines?

Sometimes the standard subroutines are implemented horribly. I was
horrified when I saw that the machine dependent version of memcpy
on the AT&T 3Bs is nothing but a byte by byte transfer written in
assembly language. It is tricky, but doable, to speed this up by a
roughly a factor of sizeof(long). In fact it already is done in the
3B implementation of the UNIX(tm) operating system in the copyin (?)
routine. Why wasn't it done in memcpy too? Sigh.

--
Larry Cipriani, AT&T Network Systems, Columbus OH, cbnews!lvc l...@cbnews.ATT.COM

Nathaniel Stitt

unread,

Aug 30, 1988, 5:23:24 PM8/30/88

to

In article <dpmuY#2EBC4R=er...@snark.UUCP> er...@snark.UUCP (Eric S. Raymond) writes:

>(Code below reproduced so that comp.arch people seeing this followup only
>won't get terminally frustrated. This is *really neat*, gang...)
>
>In article <6...@paris.ics.uci.edu> Douglas C. Schmidt writes:
>>
>> void send(int *to,int *from, int count) {
>> int n = (count + 7) / 8;
>>
>> switch(count % 8) {
>> case 0: do { *to++ = *from++;
>> case 7: *to++ = *from++;
>> case 6: *to++ = *from++;
>> case 5: *to++ = *from++;
>> case 4: *to++ = *from++;
>> case 3: *to++ = *from++;
>> case 2: *to++ = *from++;
>> case 1: *to++ = *from++;
>> } while (--n > 0);
>> }
>>
>> }
>>
>> Finally, Stroustrup asks the rhetorical question ``why would anyone
>> want to write something like this.'' Any guesses?!
>
>Yeah. That's the most hackish way of trying to write a portable optimized
>copy routine I've ever seen. I gather the whole point of the shenanigans
>is to get all the *from++ -> *to++ instructions in the generated code to be
>adjacent.
>

Here is my own personal version of the "Portable Optimized Copy" routine.
It certainly seems more clear than the above example, and I would expect
it to be at least as fast on virtually any machine. (Note the use of division
and mod in the above routine.)

Code does NOT have to be ugly to be fast.

/* Copy 'count' ints from *from to *to. */
void nats_send(to, from, count)
register int *to;
register int *from;
register int count;
{
/* Copy the bulk of the data */
while(count >= 8)

{
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;

*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;

count -= 8;
}

/* Copy the rest. */

if(count >= 4)

{
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;
*to++ = *from++;

count -= 4;
}

if(count >= 2)

{
*to++ = *from++;
*to++ = *from++;

count -= 2;
}

if(count)
*to++ = *from++;
}

Hope this helps.

--
Nathaniel Stitt | This life is a test. It is only a test. Had
Guidelines Software, Inc. | this been an actual life, you would have received
ucbvax!cogsci!z!nat | further instructions as to what to do and where
(415) 376-1395 | to go.

Henry Spencer

unread,

Aug 31, 1988, 1:07:09 AM8/31/88

to

In article <81...@alice.UUCP> t...@alice.UUCP (Tom Duff) writes:
> [me] >Nope. Bell Labs Research uses VAXen and 68Ks, mostly.
>
> [td] I was at Lucasfilm when I invented the device.

Mea culpa. I remembered that the article came from Bell Labs, but I
forgot the little note on the end saying that the invention happened
at Lucasfilm. Hokay, ve correct it:

Nope. Lucasfilm used VAXen and 68Ks, mostly.

T. William Wells

unread,

Aug 31, 1988, 4:03:58 AM8/31/88

to

In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> ch...@amdahl.uts.amdahl.com (Charles Simmons) writes:

: I then added a piece to the program to use 'memcpy'. The results?

: Duff beats a simple loop by 10%. 'memcpy' is 9 times faster than
: Duff. So why do people spend so much time avoiding standard subroutines?

Try some history, bud; it's good for what ails you.

I doubt that memcpy even existed then; and it is *not* standard
now. Perhaps it will be several years after the ANSI standard is
adopted, but not till then.

---
Bill
novavax!proxftl!bill

Earl Killian

unread,

Aug 31, 1988, 2:30:46 PM8/31/88

to

Perhaps a better reason:

According to the author, that code was used for copying to a 16-bit IO
device. It would have be illegal to use memcpy or bcopy because they
would make word references to the device.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

Lawrence V. Cipriani

unread,

Aug 31, 1988, 5:02:32 PM8/31/88

to

In article <10...@cbnews.ATT.COM>, l...@cbnews.ATT.COM (that's me) wrote:
> Sometimes the standard subroutines are implemented horribly. I was
> horrified when I saw that the machine dependent version of memcpy
> on the AT&T 3Bs is nothing but a byte by byte transfer written in

> assembly language. ...

Oops. I should have said this was on an early version (~1984) of
a 3B5. New 3Bs don't have this deficiency.

Hank Dietz

unread,

Aug 31, 1988, 5:52:14 PM8/31/88

to

In article <1...@bales.UUCP>, n...@bales.UUCP (Nathaniel Stitt) writes:
> Here is my own personal version of the "Portable Optimized Copy" routine.

.... then he gives a rather verbose, but structured, encoding....

As long as we're getting into structured, portable, hacks, let me suggest
the following two ways of doing block copy:

1. If the number of items/bytes is known at compile time, then you can
define a struct type of the appropriate size and use struct assign.
with type casts to make it fly. For example, suppose p and q are
pointers to ints and I want to copy 601 ints from p to q. Then I
can write the fast and surprizingly portable:

struct t601 { int t[601]; };
*((struct t601 *) q) = *((struct t601 *) p);

Of course, you do have to watch-out for alignment problems, but
if your compiler doesn't generate very fast code for this....

2. If the number of items/bytes is not known, then build a binary tree of
such structs and copy half, then half of what remains, etc. This is
funny looking, but very fast also. Suppose the number of ints (n) is
not known at compile time, but can't be more than 601. You can write:

struct t512 { int t[512]; };
struct t256 { int t[256]; };
struct t128 { int t[128]; };
struct t64 { int t[64]; };
struct t32 { int t[32]; };
struct t16 { int t[16]; };
struct t8 { int t[8]; };
struct t4 { int t[4]; };
struct t2 { int t[2]; };
if (n & 512) {
*((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
}
if (n & 256) {
*((struct t256 *) q) = *((struct t256 *) p); q+=256; p+=256;
}
if (n & 128) {
*((struct t128 *) q) = *((struct t128 *) p); q+=128; p+=128;
}
if (n & 64) {
*((struct t64 *) q) = *((struct t64 *) p); q+=64; p+=64;
}
if (n & 32) {
*((struct t32 *) q) = *((struct t32 *) p); q+=32; p+=32;
}
if (n & 16) {
*((struct t16 *) q) = *((struct t16 *) p); q+=16; p+=16;
}
if (n & 8) {
*((struct t8 *) q) = *((struct t8 *) p); q+=8; p+=8;
}
if (n & 4) {
*((struct t4 *) q) = *((struct t4 *) p); q+=4; p+=4;
}
if (n & 2) {
*((struct t2 *) q) = *((struct t2 *) p); q+=2; p+=2;
}
if (n & 1) *q = *p;

Notice that, in this case, n, p, and q should be declared as being
register variables and that p and q are altered by this routine. Of
course, you can copy larger things by making larger power-of-2 sized
structs.

Incidentally, this ran about 8x faster (on a VAX 11/780) than using
the usual copy loop. Unfortunately, the above code should have been
written as:

if (n & 512) {
*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
}
...

but, for some unknown reason, the VAX C compiler didn't like that.

Enjoy.
ha...@ee.ecn.purdue.edu

Doug Gwyn

unread,

Aug 31, 1988, 9:34:20 PM8/31/88

to

In article <29...@wright.mips.COM> ea...@mips.COM (Earl Killian) writes:
>According to the author, that code was used for copying to a 16-bit IO
>device. It would have be illegal to use memcpy or bcopy because they
>would make word references to the device.

Wrong reason. memcpy() etc. don't have the right behavior for stashing
a sequence of values into the same destination location.

Loren Carpenter

unread,

Aug 31, 1988, 10:13:17 PM8/31/88

to

The Duff Loop (as far as I know) was first cast into C by Tom Duff when
he was at Lucasfilm in the early 1980's. We used it at Lucasfilm wherever
we needed reasonable speed without resorting to assembly language. It
obviously generalizes to more than memory copy & clear.

I and others have used this control construct for many years, but always
in assembly language. I learned it from Howard Schmeising at Boeing in 1969,
where we were writing optimal stack code for CDC 6600's (a $7M RISC machine).
We could get 5+ 60-bit Mflops if we worked at it.

Loren Carpenter
...{ucbvax,sun}!pixar!loren

der Mouse

unread,

Sep 1, 1988, 5:46:18 AM9/1/88

to

In article <1988Aug28.0...@utzoo.uucp>, he...@utzoo.uucp (Henry Spencer) writes:
> In article <6...@proxftl.UUCP> bi...@proxftl.UUCP (T. William Wells) writes:
>> [Duff's Device]
>> The false god of efficiency has reared it ugly head.

> In general, however, in the long run the correct way to implement
> bulk data copying is to call "memcpy", which (in the long run) is
> likely to be recognized and given special attention by most
> compilers.

Ultimately, Henry is correct (as usual). However, for people who want
their code to run fast today, when memcpy() is a real routine call, it
is worthwhile to actually time things. In particular, if you have a
small amount of data to move, the routine call overhead may well be
enough to completely swamp the speedup obtained by using a highly-tuned
copy routine, particularly on machines like the VAX with high call
overhead.

Moral: If you care, take the time to find out how it *really* is.

der Mouse

old: mcgill-vision!mouse
new: mo...@larry.mcrcim.mcgill.edu

jim nutt

unread,

Sep 1, 1988, 10:49:21 AM9/1/88

to

-> In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com>
-> ch...@amdahl.uts.amdahl.com (Charles Simmons) writes:
-> : I then added a piece to the program to use 'memcpy'. The results?
-> : Duff beats a simple loop by 10%. 'memcpy' is 9 times faster than
-> : Duff. So why do people spend so much time avoiding standard
-> subroutines?
->
-> Try some history, bud; it's good for what ails you.
->
-> I doubt that memcpy even existed then; and it is *not* standard
-> now. Perhaps it will be several years after the ANSI standard is
-> adopted, but not till then.

memcpy() IS in the ansi standard library, along with a similar function called memmove(). admittedly, the standard is not finalized yet but it is scheduled to be so soon. i would think that it would behoove compiler vendors to begin tracking the standard now if they aren't already...

jim nutt
'the computer handyman'

--
St. Joseph's Hospital/Medical Center - Usenet <=> FidoNet Gateway
Uucp: ...ncar!noao!asuvax!stjhmc!15.11!jim.nutt
Internet: jim....@p11.f15.n114.z1.fidonet.org

Rod Creason

unread,

Sep 1, 1988, 10:51:32 AM9/1/88

to

Newsgroups: comp.lang.c,comp.arch
Subject: Re: Explanation, please!
Summary:
Expires:
References: <6...@paris.ics.uci.edu> <dpmuY#2EBC4R=er...@snark.UUCP> <1...@bales.UUCP> <90...@pur-ee.UUCP>
Sender:
Reply-To: ro...@unet.PacBell.COM (Rod Creason)
Followup-To:
Distribution:
Organization: Network Equipment Technologies, Redwood City, CA
Keywords: byte copy

In article <90...@pur-ee.UUCP> ha...@pur-ee.UUCP (Hank Dietz) writes:

>2. If the number of items/bytes is not known, then build a binary tree of
> such structs and copy half, then half of what remains, etc. This is
> funny looking, but very fast also. Suppose the number of ints (n) is
> not known at compile time, but can't be more than 601. You can write:
>
> struct t512 { int t[512]; };
> struct t256 { int t[256]; };
> struct t128 { int t[128]; };
> struct t64 { int t[64]; };
> struct t32 { int t[32]; };
> struct t16 { int t[16]; };
> struct t8 { int t[8]; };
> struct t4 { int t[4]; };
> struct t2 { int t[2]; };
> if (n & 512) {
> *((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
> }

> [ rest of the example ]

A key point is that this code is *not* portable unless you can *guarantee*
that q and p are ALWAYS aligned at a four-byte boundary. Any 3b2 (WE32000),
for example, will core dump if this is not the case. Although 68000 based
computers only require even alignment, they would still fail in most cases
where this routine would be used.

The key to a fast memory-to-memory copy is getting to these boundaries.
(and in the case where q or p is an odd address and the other is at an
even address, none of this will work)

> ha...@ee.ecn.purdue.edu

Rod Creason
...!{ames,amdahl,oliveb,pacbell}!unet!rodc

Paul Fuqua

unread,

Sep 1, 1988, 12:31:53 PM9/1/88

to

Date: Wednesday, August 31, 1988 9:13pm (CDT)
From: loren at pixar.UUCP (Loren Carpenter)
Subject: Re: Explanation, please!
Newsgroups: comp.lang.c,comp.arch

The Duff Loop (as far as I know) was first cast into C by Tom Duff when
he was at Lucasfilm in the early 1980's. We used it at Lucasfilm wherever
we needed reasonable speed without resorting to assembly language. It
obviously generalizes to more than memory copy & clear.

There was a discussion of the device in comp.protocols.tcp-ip not too long
ago, as a way to speed up the calculation of checksums. The one point that I
haven't seen brought out here yet is that if the unrolling factor is a power
of two, the divide/remainder operations are simply a shift and a mask.

pf

Paul Fuqua
Texas Instruments Computer Science Center, Dallas, Texas
CSNet: p...@csc.ti.com (ARPA too, sometimes)
UUCP: {smu, texsun, cs.utexas.edu, im4u, rice}!ti-csl!pf

Kenneth Goodwin

unread,

Sep 1, 1988, 1:19:39 PM9/1/88

to

In article <90...@pur-ee.UUCP>, ha...@pur-ee.UUCP (Hank Dietz) writes:
> In article <1...@bales.UUCP>, n...@bales.UUCP (Nathaniel Stitt) writes:
> > Here is my own personal version of the "Portable Optimized Copy" routine.

> 2. If the number of items/bytes is not known, then build a binary tree of
> such structs and copy half, then half of what remains, etc. This is

> struct t512 { int t[512]; };
> struct t256 { int t[256]; };
> struct t128 { int t[128]; };

.... etc .....

> if (n & 512) {
> *((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
> }
> if (n & 256) {
> *((struct t256 *) q) = *((struct t256 *) p); q+=256; p+=256;
> }

... etc ...

> Incidentally, this ran about 8x faster (on a VAX 11/780) than using
> the usual copy loop. Unfortunately, the above code should have been
> written as:
>
> if (n & 512) {
> *(((struct t512 *) q)++) = *(((struct t512 *) p)++);
> }
> ...

BUT This is where UNIONS come in handy, I used a similar although
more brief technique for a faster version of a bmov() (byte move)
subroutine on our PDP11-70 a while ago, and subsequently ported
it to memcpy when we updated from V6 to System V.
The basic idea that was used is to create a union of long, int,
(short), and char pointers, use the character pointer to achieve
the needed alignments and then use the largest available pointer
to do the copy. There is no reason why a stucture copy could not be
used, although I suspect on NON-VAX systems it may actually
be detremental (sp?) in some cases.
The PDP11 C compiler used to stuff registers onto the stack
and create a 16 bit word copy loop to do structure copies
using the freed registers, restoring them when it was done.
So a structure copy would be the same as a word copy on that style
of a system (ie, ones without block move instructions)

So In the case of your example, a modified brief version of it
would be:

union ptr_types {
struct t512 { int t512[512] } *t512;
....
struct t32 { int t32[32] } *t32;
long *t_long;
int *t_int;
short *t_short;
char *t_char;
} ;

(probably could dispense with long and short pointers
and related tests)

memcpy(a, b, len)
char *a; *b;
{
register union ptr_types a_ptr, b_ptr;

a_ptr.t_char = a;
b_ptr.t_char = b;

while(NOT ON A WORD BOUNDARY AND CHARS LEFT) {
*a_ptr.t_char++ = *b_ptr.t_char++;
len--;
}
if(len >= sizeof(int) * 512) {
/* if we can use a 512 int structure copy */
*a_ptr.t512++ = *b_ptr.t512++;
len -= (512 * sizeof(int));
}
/*M the biggest win is that the pointers increment correctly
len -= (sizeof(*element pointer)) is the correct form over
N INTS * sizeof int */

.......
I guess the rest is obvious, some GLUE may be needed
that has not be shown.... :-)
Boundaries should be checked on source and destination addresses
to avoid memory faults....
As you may be given incompatible source and destination address
that may require a full char by char copy. The first
test loop sort of does this, but all the other copies
should also check for proper address alignments before
proceeding.
Ken Goodwin
NJSMU.

Andrew Klossner

unread,

Sep 1, 1988, 2:55:47 PM9/1/88

to

Nathaniel Stitt writes:

"Here is my own personal version of the "Portable Optimized
Copy" routine. It certainly seems more clear than the above
example, and I would expect it to be at least as fast on
virtually any machine."

then goes on to present a routine which uses follow-on code to handle
the last few bytes after all octets have been copied. It's cleaner
code, but it won't be quite as fast on many systems with instruction
caches because it has fifteen byte-move instructions, replacing eight
in the original, so more time is spent loading the loop into the
i-cache. On systems with very small i-caches (my favorite example is
the IBM 360/91 with 16 bytes), the bigger loop may not all fit into
cache, and would be considerably slower.

Several contributors have suggested that unrolling a byte-copy loop is
a win. On some architectures it is, but on a good pipelined system it
may not be. As an example, the program fragment

while (count--) {
to[i] = from[i];
++i;
}

can be compiled to code on the M88k which copies memory as fast as a
DMA controller could; the instructions to decrement, increment, and
branch overlap with the data load/store requests.

[If everything's in registers, indexing in this case is actually faster
than keeping separate "to" and "from" pointers and incrementing both.]

This assumes that "to" and "from" are pointers-to-ints or
pointers-to-doubles. Copying less than a word at a time is slower.

-=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP]
(andrew%tekecs....@relay.cs.net) [ARPA]

Jabir Hussain

unread,

Sep 1, 1988, 4:00:35 PM9/1/88

to

In article <90...@pur-ee.UUCP> ha...@pur-ee.UUCP (Hank Dietz) writes:

> struct t512 { int t[512]; };
> struct t256 { int t[256]; };

[...]

> if (n & 512) {
> *((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
> }

[...]

> Unfortunately, the above code should have been
> written as:
>
> if (n & 512) {
> *(((struct t512 *) q)++) = *(((struct t512 *) p)++);
> }
> ...
>
> but, for some unknown reason, the VAX C compiler didn't like that.

one portable way around that is to do something like:

typedef char *caddr_t;
typedef union {
struct { int t[512]; } t512;
struct { int t[256]; } t256;
caddr_t caddr;
} ustruct_t;

{
register ustruct_t src,dst;

src.caddr = (caddr_t) p;
dst.caddr = (caddr_t) q;

if (n & 512) *dst.t512++ = *src.t512++;
...
}

jh

Rick Richardson

unread,

Sep 1, 1988, 6:37:23 PM9/1/88

to

I did a quick comparison of the times for "memcpy(,,n*4)", Duff's
device, Nathaniel's "nat", and Hank's versions. Here's the results:

TEST MACHINE ATT cc VENIXcc GNU1.26
memcpy 8Mhz 286, Venix SVR2 3.35 11.15 -
duff 8Mhz 286, Venix SVR2 16.58 4.56 -
nat 8Mhz 286, Venix SVR2 16.73 5.11 -
hank 8Mhz 286, Venix SVR2 1.83 ccbarf -
memcpy 16Mhz 386, 386/ix 1.0.4 .82 - .82
duff 16Mhz 386, 386/ix 1.0.4 2.01 - 2.26
nat 16Mhz 386, 386/ix 1.0.4 1.95 - 2.21
hank 16Mhz 386, 386/ix 1.0.4 .86 - .89

* Everything was compiled with -O
* test moved 2000,1999,...,2,1 int's

I learned this: Venix obviously doesn't have a good memcpy, and
Duff, or nat make a good replacement. AT&T uses the block move
in memcpy, and it is a definite win on both machines. The AT&T
286 C compiler is terrible for these program on the 286. AT&T
still has a slight edge over gcc *for these programs*. Hanks
version, while not portable, is a clear win on the 286 (disassemble
memcpy and hank's to see why), and virtually as good as memcpy on 386.

Conclusion: you can't make any. It all depends on your computer
and compiler.
--
Rick Richardson, PC Research, Inc.
rick%pcrat...@uunet.uu.net (INTERNET)
uunet!pcrat!rick (UUCP, Personal Mail)
..!pcrat!jetroff (JetRoff Info) ..!pcrat!dry2 (Dhrystone Submissions)

der Mouse

unread,

Sep 2, 1988, 2:58:42 AM9/2/88

to

In article <90...@pur-ee.UUCP>, ha...@pur-ee.UUCP (Hank Dietz) writes:

> In article <1...@bales.UUCP>, n...@bales.UUCP (Nathaniel Stitt) writes:

>> ["Portable Optimized Copy" routine]

> As long as we're getting into structured, portable, hacks, let me
> suggest the following two ways of doing block copy:

> 1. If the number of items/bytes is known at compile time, then you
> can define a struct type of the appropriate size and use struct

> assignment with type casts to make it fly. For example,

[ int *p; int *q; /* want to copy 601 ints from p to q */ ]

> struct t601 { int t[601]; };
> *((struct t601 *) q) = *((struct t601 *) p);

Surprise! Your compiler has sizeof(int)=2, sizeof(long)=4, and aligns
structs on long boundaries. You just copied 602 ints!

> Unfortunately, the above code should have been written as:

> if (n & 512) {
> *(((struct t512 *) q)++) = *(((struct t512 *) p)++);
> }

> but, for some unknown reason, the VAX C compiler didn't like
> that.

Perhaps the "unknown reason" is that it's illegal! An expression
formed by casting another expression is not an lvalue and never has
been. Any compiler that accepts that is broken. (Yes, there exist
such broken compilers.)

What would you expect from

double d;

((int)d) ++;

Perhaps you'd want d+=1? Perhaps d=1+(int)d? Perhaps
*(int*)&d=1+(int)d even?

Chris Torek

unread,

Sep 2, 1988, 6:47:09 AM9/2/88

to

-In article <90...@pur-ee.UUCP> ha...@pur-ee.UUCP (Hank Dietz) writes:
-> *(((struct t512 *) q)++) = *(((struct t512 *) p)++);
-> but, for some unknown reason, the VAX C compiler didn't like that.

In article <3...@quintus.UUCP> ja...@quintus.uucp (Jabir Hussain) writes:
-one portable way around that is to do something like:
-
- typedef char *caddr_t;
- typedef union {
- struct { int t[512]; } t512;
- struct { int t[256]; } t256;
- caddr_t caddr;
- } ustruct_t;
...
- if (n & 512) *dst.t512++ = *src.t512++;

`Portable'? I suggest you try it on a Data General MV/10000.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: ch...@mimsy.umd.edu Path: uunet!mimsy!chris

Robert Alverson

unread,

Sep 2, 1988, 1:27:48 PM9/2/88

to

In article <90...@pur-ee.UUCP> ha...@pur-ee.UUCP (Hank Dietz) writes:

> Incidentally, this ran about 8x faster (on a VAX 11/780) than using
> the usual copy loop. Unfortunately, the above code should have been
> written as:
>
> if (n & 512) {
> *(((struct t512 *) q)++) = *(((struct t512 *) p)++);
> }
> ...
>
> but, for some unknown reason, the VAX C compiler didn't like that.
>
>
>Enjoy.
> ha...@ee.ecn.purdue.edu

Actually, the VAX C compiler was completely correct. Attempting
to post-increment the result of a cast is not legal C. One way
around this is to use a little extra indirection:

char* p;

/* was: ((int *)p)++; */
(*((int **) &p))++; /* phew! */

The reasoning is that the result of a cast is an expression, not a
lvalue. Quite often, compilers relax this rule for coercions which
produce no code. In this case, it looks like the VAX C doesn't

Bob

Hank Dietz

unread,

Sep 5, 1988, 12:57:18 PM9/5/88

to

Relative to my article <90...@pur-ee.UUCP> giving a struct-based hack for
doing block copy and to all followups thereof....

Correction:

I hereby appologize for having ommitted a (:-) and hence having made it
sound like I didn't know why the VAX compiler didn't want to use a cast
pointer as the operand of ++. I do know (and I knew even before a dozen of
you reminded me :-), I just think it should be allowed in such cases; i.e.,
be implementation dependent rather than illegal. Anyway, the union-based fix
given by Jabir is certainly a good way to do it (once you fix his typedef to
make t512 and t256 pointers to structs, rather than structs).

Comments:

As I noted in the original posting, the use of a structure-hack block copy is
subject to alignment problems -- if the target machine has alignment
constraints. There are three additional notes on this:

1. Both the source AND destination pointers may need to be aligned,
which implies that the difference between the pointers MUST BE A
MULTIPLE OF THE ALIGNMENT FACTOR. Arbitrary copies will not have
this property, hence, on such machines one must always have a
byte-oriented copy routine to fall back on.

For a 4-byte alignment machine, the test is something like:

if ((p - q) & 3) *byte copy* else *struct copy*

2. If the difference between the pointers is a multiple of the machine
alignment factor, then copying up until one is aligned will insure
alignment of both. Hence, we simply need to insert a few moves to
achieve alignment BEFORE the struct copy code I gave earlier.

For a 4-byte alignment machine, the additional code is:

if ((n >= 1) && (p & 1)) {
*q = *p; ++q; ++p; --n;
}
if ((n >= 2) && (p & 2)) {
*((struct t2 *) q) = *((struct t2 *) p); q+=2; p+=2; n-=2;
}

Notice that, with the above code as a prefix, trailing alignment is
also satisfied (proof left as an exercise to the reader).

3. Although I know of no such C compiler, it is legal for a C compiler
to make the size of a struct anything big enough to hold the things
inside it. Hence, for example, a struct containing an array of 4
things might be the size of 5 things.... If this were done, the
struct copy would nearly work; the only problem is that it could
overshoot the end of the block. There is no general fix for this.

ha...@ee.ecn.purdue.edu

David Keppel

unread,

Sep 6, 1988, 1:25:50 PM9/6/88

to

ha...@pur-ee.UUCP (Hank Dietz) writes:
> if ((p - q) & 3) *byte copy* else *struct copy*

I believe that the VAX "movc" command takes arbitrary pointers and
does the following:

* If both are word-aligned, do a word copy (I mean a 4-byte word).
* If both are non-aligned and could be aligned with 1, 2, or 3 bytes
of byte-copy at either end, then do a byte copy at either end and do
a word copy down the middle.
* If niether aligned then ??

Unfortunately, my VAX hardware reference is out of town for a couple
of weeks, so I can't ask him about neither aligned. Anybody know?

I can immagine that on some machines it is faster to copy words into
register and repack the words in the registers rather than do a byte
copy, since you could be taking advantage of some hardware gak.

Simple example:

machine X has register W1 divided into B4, B5, B6, B7. To do a
copy, align the source pointer (doing byte copies) then read a
wrod-at-a-time into the W1 register, write it back out by writing
B4, B5, B6, B7 (little-endian).

This is beginning to look suspiciously like the kinds of optimizations
that get done for bit BLTs. Anybody know if this ever really gets
done?

;-D on ( Ahh. Architecture at its finest ) Pardo
--
pa...@cs.washington.edu
{rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

Rik Littlefield

unread,

Sep 6, 1988, 3:03:38 PM9/6/88

to

In article <56...@june.cs.washington.edu>, pa...@june.cs.washington.edu (David Keppel) writes:
>
> I can immagine that on some machines it is faster to copy words into
> register and repack the words in the registers rather than do a byte
> copy, since you could be taking advantage of some hardware gak.
>

On the old CDC 6000-series machines (early RISCs...) that was the *only*
practical way to do it, as well as being blazingly fast. We had copies
that would handle arbitrary *bit* alignments at a cost of around 6 instructions
and 2 memory references per 60-bit word, in the middle of the string.
The sequence was basically fetch, shift, mask, mask, OR, and store,
appropriately rearranged to minimize memory delay and functional unit
conflicts, of course. I vaguely remember that this thing could even
be unrolled a couple of times and still fit in the instruction cache
("stack", in those days) for machines expensive enough to have one.

VAXen I don't know about for sure, but I'd be real surprised if their
microcode didn't do the same thing.

--Rik

Network News

unread,

Sep 6, 1988, 9:10:03 PM9/6/88

to

In article <56...@june.cs.washington.edu> pa...@uw-june.UUCP (David Keppel) writes:
| ha...@pur-ee.UUCP (Hank Dietz) writes:
| > if ((p - q) & 3) *byte copy* else *struct copy*
|
| I believe that the VAX "movc" command takes arbitrary pointers and
| does the following:
|
| * If both are word-aligned, do a word copy (I mean a 4-byte word).
| * If both are non-aligned and could be aligned with 1, 2, or 3 bytes
| of byte-copy at either end, then do a byte copy at either end and do
| a word copy down the middle.
| * If niether aligned then ??
|
| Unfortunately, my VAX hardware reference is out of town for a couple
| of weeks, so I can't ask him about neither aligned. Anybody know?

I assume that "both are non-aligned" in the above list means that the
source and the destination have the same alignment, but are not aligned
with respect to a 4-byte boundary, and that "neither aligned" means that
the source and the destination are misaligned.

We use an interesting trick in the Am29000 memcpy routine for
source/destination misalignment. In this case, we set up the alignment
difference in the funnel-count register, read in two source words, and
"extract" a destination word using the funnel-shifter's ability to
extract any 32-bit word from a 64-bit double-word in a single cycle.
The inner loop then consists of shifting the low source word to the high
source word, reading a new low source word, extracting a destination and
storing it (well, there is also the overhead of counting down the
correct number of "word" moves, but you get the idea.)

-- Tim Olson
Advanced Micro Devices
(t...@delirun.amd.com)

Chris Torek

unread,

Sep 7, 1988, 1:30:44 AM9/7/88

to

In article <56...@june.cs.washington.edu> pa...@june.cs.washington.edu

(David Keppel) writes:
>I believe that the VAX "movc" command takes arbitrary pointers and
>does the following:
>
>* If both are word-aligned, do a word copy (I mean a 4-byte word).
>* If both are non-aligned and could be aligned with 1, 2, or 3 bytes
> of byte-copy at either end, then do a byte copy at either end and do
> a word copy down the middle.
>* If niether aligned then ??
>
>Unfortunately, my VAX hardware reference is out of town for a couple
>of weeks, so I can't ask him about neither aligned. Anybody know?

I do not *know*, but I predict that the answer is machine-dependent:
that BI machines use octaword transfers, while SBI machines use
quadword transfers and CMI machines use longword transfers. I believe
that at least the 780 and faster VAXen have an alignment network, and
that the microcode can use this directly, so that even if the two
addresses cannot become simultaneously aligned, the copy can proceed as
if they were, with intermediate 64-bit results accumulated in a series
of latches behind the alignment network.

Incidentally, the microcode has a harder job than simply aligning:
The formats of the two instructions are

movc3 count.rw,src.ab,dst.ab
and
movc5 srclen.rw,src.ab,fill.rb,dstlen.rw,dst.ab

(r = read-reference, a = address-reference; b = byte, w = word; these
tell how the argument is used and what increments and shifts are
applied to postincrement, predecrement, and indexed addressing modes).
In both cases, if the source and destination overlap, the copy is done
in whichever direction is nondestructive. Alas, since the count
(movc3) and length (movc5) fields are only read as words, one
instruction can move at most 65535 bytes. To make these work as a
general copy routine one must surround these with loops which also must
determine the appropriate direction; moreover, since the results are
left in specific registers (r0..r5) the loops must be carefully written
so as to hold the source and destination fields in the appropriate
result-registers to avoid unnecessary moves.

(Fortunately, a C compiler has the sizes of structures available
directly, and can generate the proper series of movc3 instructions for
a structure assignment, but in fact the 4BSD PCC cheats and assumes
that no structure contains more than 65535 bytes. You get a compiler
error if you try to assign one larger than this. Well, at least it
does not generate bad code....)

T. William Wells

unread,

Sep 7, 1988, 8:55:06 PM9/7/88

to

In article <485.23...@stjhmc.fidonet.org> jim....@p11.f15.n114.z1.fidonet.org (jim nutt) writes:
: -> I doubt that memcpy even existed then; and it is *not* standard

: -> now. Perhaps it will be several years after the ANSI standard is
: -> adopted, but not till then.
:
: memcpy() IS in the ansi standard library, along with a similar
: function called memmove().

Yes, I know. I said "it isn't standard", not "it isn't in the
standard". The difference is that "it isn't standard" means that
you can't write portable code that assumes that it exists; the
other should be obvious.

---
Bill
novavax!proxftl!bill

der Mouse

unread,

Sep 11, 1988, 7:08:50 AM9/11/88

to

In article <3...@quintus.UUCP>, ja...@quintus.uucp (Jabir Hussain) writes:
> In article <90...@pur-ee.UUCP> ha...@pur-ee.UUCP (Hank Dietz) writes:
>> *(((struct t512 *) q)++) = *(((struct t512 *) p)++);

>> but, for some unknown reason, the VAX C compiler didn't like that.

> one portable way around that is to do something like:

[summarized]
> union { struct { int t[512]; } *t512; ...; char *caddr; }
[then use ++ on .t512]

This falls hopelessly flat if char * and struct {...} * have
significantly different representation.

Steve Summit

unread,

Sep 12, 1988, 9:55:03 PM9/12/88

to

In article <83...@smoke.ARPA> gw...@brl.arpa (Doug Gwyn) writes:
>Some implementations of bcopy() and memcpy() do guarantee
>proper handling of overlapping fields. But not all do, so one cannot
>count on it if one's trying to write truly portable code. ANSI C will
>specify a new function memmove() with this property guaranteed.

I'll probably be disgusted by the answer, but can someone explain
why two functions are needed instead of one? Why not just add
the guarantee to memcpy? In my experience, it's not hard to
implement the correct, bidirectional copy behavior. Did someone
actually manage to convince the committee that there is an
architecture and/or implementer out there that is so badly broken
that the overlap test would significantly slow down programs
which did not need the guarantee? Or is it a code size issue for
potential in-line implementations? (I am aware that a correct
overlap test is not trivial for badly broken architectures such
as the segmented 80x86, but the correct test on those machines is
not really hard or slow, either.)

Steve Summit
s...@adam.pika.mit.edu

Doug Gwyn

unread,

Sep 13, 1988, 9:40:47 AM9/13/88

to

In article <70...@bloom-beacon.MIT.EDU> s...@adam.pika.mit.edu (Steve Summit) writes:
>I'll probably be disgusted by the answer, but can someone explain
>why two functions are needed instead of one? Why not just add
>the guarantee to memcpy?

(1) many people want a fast block-move, many want one that does the
"correct" thing on overlaps. Two routines allows the application to
choose the trade-off. A vendor could choose to make memcpy just
another entry point for memmove.

(2) it's possible that some existing (nonportable) applications depend
on the specific implementation of their current memcpy().

~XT4103000~Marc Mengel~C25~G25~6184~

unread,

Sep 13, 1988, 5:20:09 PM9/13/88

to

In article <70...@bloom-beacon.MIT.EDU> s...@adam.pika.mit.edu (Steve Summit) writes:

>In article <83...@smoke.ARPA> gw...@brl.arpa (Doug Gwyn) writes:
>>Some implementations of bcopy() and memcpy() do guarantee
>>proper handling of overlapping fields. But not all do, so one cannot
>>count on it if one's trying to write truly portable code. ANSI C will
>>specify a new function memmove() with this property guaranteed.
>
>I'll probably be disgusted by the answer, but can someone explain
>why two functions are needed instead of one? Why not just add
>the guarantee to memcpy?

> Steve Summit

To check whether the source and destination overlap, you have to compare
two pointer values. You can only do this legally ($3.3.8) if the two pointers
point into the same array/aggregate. Hence you need memmove() for
shifting items in an array, and memcpy() for moving items between arrays.
Using memmove() on items not in the same array is therefore not guaranteed
to work, as the comparison of pointers not in the same array/aggregate is
undefined.

The issue is that to make memcpy() work the way you want it to, you
need to define pointer semantics for comparison on pointers that don't
point into the same array/aggregate. Feel free to post a definition
for these semantics, I am sure the various net.hackers out there can
name you a machine that your semantics don't make sense on; or force
pointers to be unusually large objects.
--
Marc Mengel
mme...@cuuxb.att.com
attmail!mmengel
{lll-crg|mtune|att}!cuuxb!mmengel

Larry Jones

unread,

Sep 13, 1988, 6:36:46 PM9/13/88

to

In article <70...@bloom-beacon.MIT.EDU>, s...@athena.mit.edu (Steve Summit) writes:
> I'll probably be disgusted by the answer, but can someone explain
> why two functions are needed instead of one? Why not just add
> the guarantee to memcpy? In my experience, it's not hard to
> implement the correct, bidirectional copy behavior. Did someone
> actually manage to convince the committee that there is an
> architecture and/or implementer out there that is so badly broken
> that the overlap test would significantly slow down programs
> which did not need the guarantee? Or is it a code size issue for
> potential in-line implementations? (I am aware that a correct
> overlap test is not trivial for badly broken architectures such
> as the segmented 80x86, but the correct test on those machines is
> not really hard or slow, either.)

All of the above. Many machines have only a single block move instruction
which means that the "backwards" move has to be very inefficient and you
need to do a true overlap test (rather than a simple direction test).
That could add up to a lot of code to inline, but it would really hurt
to not inline if the overlap case never happens. The easiest way for
the compiler to know for sure whether the overlap case ever happens is
to have different functions for the different cases.

Also, the committee received a number of letters noting that there are
some fairly common operations where memcpy is the inner loop and so
ANY slowdown would be unacceptable (e.g. block moves on bitmap displays).

----
Larry Jones UUCP: uunet!sdrc!scjones
SDRC scj...@sdrc.uucp
2000 Eastman Dr. BIX: ltl
Milford, OH 45150 AT&T: (513) 576-2070
"Save the Quayles" - Mark Russell

Henry Spencer

unread,

Sep 13, 1988, 7:44:34 PM9/13/88

to

In article <70...@bloom-beacon.MIT.EDU> s...@adam.pika.mit.edu (Steve Summit) writes:

>I'll probably be disgusted by the answer, but can someone explain
>why two functions are needed instead of one? Why not just add

>the guarantee to memcpy? ...

The key observation is that in many cases the programmer knows that the
source and destination do not overlap, and it is difficult for the compiler
to figure this out except by checking it at run time. This is particularly
significant if the compiler is doing inlining for efficiency and relatively
small amounts of data are being moved -- in such a case, the overlap check
may significantly hurt performance. Since the older definitions of memcpy
explicitly said "if the operands overlap, no guarantees", it seemed sound
to make memcpy the fast-but-dangerous case and add a new function for the
slow-but-safe case.
--
NASA is into artificial | Henry Spencer at U of Toronto Zoology
stupidity. - Jerry Pournelle | uunet!attcan!utzoo!henry he...@zoo.toronto.edu

Karl Heuer

unread,

Sep 14, 1988, 12:21:12 PM9/14/88

to

In article <20...@cuuxb.ATT.COM> mme...@cuuxb.ATT.COM (Marc Mengel) writes:
>To check whether the source and destination overlap, you have to compare two
>pointer values. You can only do this legally ($3.3.8) if the two pointers
>point into the same array/aggregate.

Watch it -- the first `you' refers to the implementation; the second refers to
the user. Certainly the implementation is free to generate any code that
happens to work.

>Hence you need memmove() for shifting items in an array, and memcpy() for
>moving items between arrays. Using memmove() on items not in the same array
>is therefore not guaranteed to work, as the comparison of pointers not in the
>same array/aggregate is undefined.

No, memmove() is guaranteed to work on *any* valid pointers. What you've
really proved is that memmove() cannot be portably implemented in C.

Karl W. Z. Heuer (ima!haddock!karl or ka...@haddock.isc.com), The Walking Lint
Followups to comp.std.c.

Henry Spencer

unread,

Sep 14, 1988, 7:04:32 PM9/14/88

to

In article <20...@cuuxb.ATT.COM> mme...@cuuxb.UUCP (Marc W. Mengel) writes:
>The issue is that to make memcpy() work the way you want it to, you
>need to define pointer semantics for comparison on pointers that don't

>point into the same array/aggregate...

This is largely irrelevant, since the insides of memcpy will usually be
highly machine-specific anyway.

Steve Summit

unread,

Sep 15, 1988, 8:07:51 PM9/15/88

to

A while back, somebody wrote (explaining why they hadn't and/or
wouldn't use a newly standardized function):

I doubt that memcpy even existed then; and it is *not* standard

now. Perhaps it will be several years after the ANSI standard is

adopted, but not till then.

The reason we keep pointing out the existence of standard
functions to those who would rather (or have to) roll their own
is to encourage people to start using the standard interface, by
writing their own function with the same name and parameters,
rather than sticking special-purpose code in-line, or inventing
some new name. This way the program can take advantage of
standardized, optimized implementations when they become
available (for instance when the program is ported to a different
system which has them).

Steve Summit
s...@adam.pika.mit.edu

Richard Harter

unread,

Sep 16, 1988, 3:22:52 AM9/16/88

to

In article <70...@bloom-beacon.MIT.EDU> s...@adam.pika.mit.edu (Steve Summit) writes:

>The reason we keep pointing out the existence of standard
>functions to those who would rather (or have to) roll their own
>is to encourage people to start using the standard interface, by
>writing their own function with the same name and parameters,
>rather than sticking special-purpose code in-line, or inventing
>some new name. This way the program can take advantage of
>standardized, optimized implementations when they become
>available (for instance when the program is ported to a different
>system which has them).

This sounds good, but there is a problem to take into account. If
you have your own routine with the same name as the "standard"
routine there is the potential for problems when you port into
a system that has the "standard" routine in its library. The
problem is that the library implementers may use said routine
internally in the library in a way that conflicts with your routine.
For example, it is quite unsafe to roll your own storage allocator
and call it malloc.
--

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
Richard Harter, SMDS Inc.

Henry Spencer

unread,

Sep 17, 1988, 5:08:16 PM9/17/88

to

In article <70...@bloom-beacon.MIT.EDU> s...@adam.pika.mit.edu (Steve Summit) writes:

> I doubt that memcpy even existed then; and it is *not* standard
> now. Perhaps it will be several years after the ANSI standard is
> adopted, but not till then.
>
>The reason we keep pointing out the existence of standard
>functions to those who would rather (or have to) roll their own

>is to encourage people to start using the standard interface...

Note also that there is a public-domain version of memcpy in the strings
library I wrote (which went out in comp.sources.unix quite some time ago,
and is available from their archives). It's not fast, but it works.

Henry Spencer

unread,

Sep 17, 1988, 5:10:53 PM9/17/88

to

In article <33...@cca.CCA.COM> g-...@CCA.CCA.COM (Richard Harter) writes:
>This sounds good, but there is a problem to take into account. If
>you have your own routine with the same name as the "standard"
>routine there is the potential for problems when you port into

>a system that has the "standard" routine in its library...

This just means that inclusion of the routine in question must be optional,
so it can be disabled on such systems. No big deal. Possibly the best way
to do this is to #ifdef the user-supplied routine, so that an explicit -D
option (or whatever) is required to use it.

T. William Wells

unread,

Sep 18, 1988, 5:30:45 AM9/18/88

to

In article <70...@bloom-beacon.MIT.EDU> s...@adam.pika.mit.edu (Steve Summit) writes:

: A while back, somebody wrote (explaining why they hadn't and/or

: wouldn't use a newly standardized function):
:
: I doubt that memcpy even existed then; and it is *not* standard
: now. Perhaps it will be several years after the ANSI standard is
: adopted, but not till then.

That was me talking. And you have the story wrong. Duff wrote
an unrolled loop some time in the past to do approximately what
memcpy does, inline, before memcpy was generally available;
moreover, the circumstances were such that even if these routines
had been available, it wouldn't have been good programming for
him to use them.

Please avoid taking things out of context.

---
Bill
novavax!proxftl!bill

Richard Harter

unread,

Sep 18, 1988, 4:35:44 PM9/18/88

to

In article <1988Sep17....@utzoo.uucp> he...@utzoo.uucp (Henry Spencer) writes:
>In article <33...@cca.CCA.COM> g-...@CCA.CCA.COM (Richard Harter) writes:
>>This sounds good, but there is a problem to take into account. If
>>you have your own routine with the same name as the "standard"
>>routine there is the potential for problems when you port into
>>a system that has the "standard" routine in its library...

>This just means that inclusion of the routine in question must be optional,
>so it can be disabled on such systems. No big deal. Possibly the best way
>to do this is to #ifdef the user-supplied routine, so that an explicit -D
>option (or whatever) is required to use it.

No big deal for me -- I roll my own and use different names, so I don't
have any problems. I don't need surprises like finding that my version
of a routine breaks the system because there is a system routine with the
same name (and nominal effect) that has side effects that the system relies
on. It may be that there are no such routines in any of multitude of systems
that I maintain code on. Given the eccentricities of some of the
implementations out there I have my doubts. Given the choice between not
having the problem in the first place, and discovering the hard way during
porting that I have to remove my copy of the routine in question, my
choice is clear -- life is too short to solve unnecessary mysteries.
(I also don't want to spend the time and effort of keeping track of
which routines are available in which releases of different operating
systems, except when absolutely necessary.) However I grant that my
views are dictated by my particular requirements.

Henry Spencer

unread,

Sep 19, 1988, 5:35:08 PM9/19/88

to

In article <33...@cca.CCA.COM> g-...@XAIT.Xerox.COM (Richard Harter) writes:
>... I don't need surprises like finding that my version

>of a routine breaks the system because there is a system routine with the
>same name (and nominal effect) that has side effects that the system relies

>on...

If you read the fine print in the X3J11 drafts, this is explicitly forbidden
unless the name is one of those reserved by the standard. Yes, this will
require revisions to libraries in many implementations. X3J11 thought it
well worth the trouble, given the increasingly severe problems with name-
space pollution in C.

Richard Harter

unread,

Sep 20, 1988, 2:07:05 PM9/20/88

to

In article <1988Sep19.2...@utzoo.uucp> he...@utzoo.uucp (Henry Spencer) writes:
>In article <33...@cca.CCA.COM> g-...@XAIT.Xerox.COM (Richard Harter) writes:
>>... I don't need surprises like finding that my version
>>of a routine breaks the system because there is a system routine with the
>>same name (and nominal effect) that has side effects that the system relies
>>on...

>If you read the fine print in the X3J11 drafts, this is explicitly forbidden
>unless the name is one of those reserved by the standard. Yes, this will
>require revisions to libraries in many implementations. X3J11 thought it
>well worth the trouble, given the increasingly severe problems with name-
>space pollution in C.

I seem to recall this, yes. Would you care to hazard a guess as to the date
on which I can rely on this being true in all major implementations? Today?
1992? 1995? 2001? 2061? :-)

This question is not entirely facetious. In practice you have to deal with
operating systems and languages as they are today, and you have to plan for
the future. Perhaps today I avoid potential library routine name conflicts
because of the aforementioned problems. Perhaps tomorrow I rely on being able
to pick up a superior implementation from the system. But today and tomorrow
my software has to run on a suite of architectures. And so it goes.