Actually I want my code to be
for (i=0;i<N;i++)
{
array2[i*30]=array1[i];
array2[i*30+1]=array1[i];
array2[i*30+2]=array1[i];
array2[i*30+3]=array1[i];
.................
.................
.................
array2[i*30+28]=array1[i];
array2[i*30+29]=array1[i];
}
Is there any way to use macros to do it instead of listing it out
manually in the code like above?
(Listing it out as in the example above would be very bad in the
case where I need, say, array2[i*1000+0]=array1[i]; up to
array2[i*1000+999]=array1[i];)
And please note that for some reasons of my device, I do not want to
use another loop such as
for (i=0;i<N;i++)
for (j=0;j<30;j++)
array2[i*30+j]=array1[i];
If I write a macro as follows:
#define myMacro(array1, array2) \
for (j = 0; j < 30; j++) \
    array2[i*30+j] = array1[i];
Does it help? I doubt it because it will be the same as the case with
two loops that I want to avoid.
Thanks a lot.
Forget macros, they won't help here.
How about memset?
(untested)
memset( array2+i*30, array1[i], 30*sizeof(array2[0]) );
--
Ian Collins
I'm curious as to why an inner loop would be that bad.
Frankly, you're stuck; if you absolutely cannot use an inner loop,
then you'll have to write out each initialization manually. You could
write another program to automate the process for you, such as:
printf("for (i = 0; i < N; i++)\n{\n");
for (j = 0; j < K; j++)  /* K is the number of copies per element */
{
    printf("    array2[i*%d + %d] = array1[i];\n", K, j);
}
printf("}\n");
and then paste that output into your code.
Another alternative would be to look into a variation of Duff's device
(see http://en.wikipedia.org/wiki/Duff's_device), where you use an
inner loop but partially unroll it.
for (i = 0; i < N; i++)
{
    /* assuming int elements and K copies per element (K >= 1) */
    int *p = array2 + i*K;
    int v = array1[i];
    size_t n = (K + 7) / 8;
    switch (K % 8)
    {
    case 0 : do { *p++ = v;
    case 7 :      *p++ = v;
    case 6 :      *p++ = v;
    case 5 :      *p++ = v;
    case 4 :      *p++ = v;
    case 3 :      *p++ = v;
    case 2 :      *p++ = v;
    case 1 :      *p++ = v;
             } while (--n > 0);
    }
}
It certainly looks like a job for a nested for loop (but see below).
> If there anyway to use macros to do it instead for listing out
> manually in the code like above?
> (the listing out such as the above example will be very bad in the
> case I need, say, array2[i*1000+0]=array1[i]; to ...
> array2[i*1000+999]=array1[i];)
There's not really any way for the preprocessor, given a number N, to
generate N statements.
> And please note that for some reasons of my device, I do not want to
> use another loop such as
>
> for (i=0;i<N;i++)
> for (j=0;j<30;j++)
> array2[i*30+j]=array1[i];
So you insist on having, say, 30 distinct assignment statements; an
equivalent loop that iterates 30 times and performs exactly the same
assignments isn't good enough.
It might help if we know *how* your device imposes this requirement.
If you just write the nested for loop, how exactly does that not work?
> If I write a macro as follows:
>
> #define myMacro (array1, array2) \
> for ( j=0;j<30;j++) \
> array2[i*30+j]=array1[i]; \
>
>
>
> Does it help? I doubt it because it will be the same as the case with
> two loops that I want to avoid.
Right, the generated code will almost certainly be the same whether
the loop is written out or generated by a macro.
What you're doing is loop unrolling. It's an optimization technique
that can result in faster but larger code. (In some cases, the
code can be slower just because it's larger, due to cache effects.)
Doing loop unrolling at the source level would usually be considered
premature optimization; a good optimizing compiler should be able
to do this for you.
But if you really need to do this, you might consider writing another
tool that generates C code from an input specification. (m4 is a
popular preprocessing tool; I don't know whether it's powerful enough
for this, but it's worth looking into.)
--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
No, memset sets each byte of the target to the same value. Unless
array1 and array2 are byte arrays, it won't help here.
Something similar to memset that works on elements bigger than bytes
would do the trick. But the obvious way to implement such a function
is to use a for loop, which the OP wants to avoid for some unknown
reason.
Let me explain why I do not want to use an inner loop.
for (i=0;i<N;i++)
for (j=0;j<M;j++)
array2[i*M+j]=array1[i];
For each iteration of a for loop, we need a comparison and one
addition. In my system, each comparison takes 3 cycles and one
addition takes 1 cycle, so each iteration takes 4 cycles.
If I use an inner loop, then for each i I need 4*M additional cycles.
In my case M and N are both large numbers, and the system's time
budget is limited.
That is why I do not want to use this approach.
On Mar 9, 3:53 pm, Keith Thompson <ks...@mib.org> wrote:
Bugger, I forgot about that... I agree with your other posting.
--
Ian Collins
[please don't top post]
> Thanks a lot for responses.
>
> Let me explain why I do not want to use a inner loop.
>
> for (i=0;i<N;i++)
> for (j=0;j<M;j++)
> array2[i*M+j]=array1[i];
>
> for each for iteration of a for loop, we need a comparison and one
> addition. In my system, each comparison takes 3 cycles and one
> addition takes 1 cycle.
> So each iteration takes 4 cycles.
>
> If I use a inner loop, for each i, I need 4*M additional cycles. In
> my case M , N both are large numbers and the system's time budget is
> limited.
>
> That is why I do not want to use this way.
Have you looked at the generated code with full optimisation on your
compiler? All those I've used will perform some degree of loop
unrolling, so you will only suffer an extra
(4*M)/L
where L is the number of iterations inlined by the compiler.
--
Ian Collins
How about a partially unrolled inner loop?
So in the inner loop it does 10 assignments at a time, and the loop
executes M/10 times (with any odd assignments added at the end). Then the
overhead might only be 4*M/10 cycles, and you only ever have to write out
in full 10 assignments (or however many you want to do in one go).
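Something like this (untested; int elements and the name spread10 are made up, and M need not be a multiple of 10):

```c
#include <stddef.h>

/* Copy array1[i] into array2[i*M .. i*M+M-1], doing the inner loop
   10 assignments at a time; the unrolled loop runs floor(M/10) times
   and a short cleanup loop picks up any leftover elements. */
void spread10(int *array2, const int *array1, size_t N, size_t M)
{
    size_t i, j;
    for (i = 0; i < N; i++) {
        int v = array1[i];
        int *p = array2 + i * M;
        for (j = 0; j + 10 <= M; j += 10) {
            p[j]   = v; p[j+1] = v; p[j+2] = v; p[j+3] = v; p[j+4] = v;
            p[j+5] = v; p[j+6] = v; p[j+7] = v; p[j+8] = v; p[j+9] = v;
        }
        for (; j < M; j++)   /* the odd assignments at the end */
            p[j] = v;
    }
}
```

The loop-control overhead per element drops to roughly M/10 iterations plus the short cleanup.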
--
Bartc
Use a second loop nested inside the first.
> (the listing out such as the above example will be very bad in the
> case I need, say, array2[i*1000+0]=array1[i]; to ...
> array2[i*1000+999]=array1[i];)
>
> And please note that for some reasons of my device, I do not want to
> use another loop such as
>
> for (i=0;i<N;i++)
> for (j=0;j<30;j++)
> array2[i*30+j]=array1[i];
Use a second loop nested inside-- oh, wait, sorry. It's
the right solution, but if you don't want to use it ... What
are these mysterious "reasons" of yours? Tell us about them,
and maybe somebody will have a better idea than writing line
after line after repetitive line of code.
> If I write a macro as follows:
>
> #define myMacro (array1, array2) \
> for ( j=0;j<30;j++) \
> array2[i*30+j]=array1[i]; \
>
> Does it help? I doubt it because it will be the same as the case with
> two loops that I want to avoid.
It does the right thing, which is to use a second loop-- oh,
wait, we've been through that already.
The preprocessor has no iterative constructs.
--
Eric Sosman
eso...@ieee-dot-org.invalid
Just a question. Is the performance gain worth the maintenance and
code size (if embedded) costs that you're adding? I haven't
encountered a scenario where doing something like this is worth the
imposed maintenance cost, so I'm curious what the scenario is.
Maybe, but optimizing compilers are already pretty good
at unrolling loops -- especially loops whose iteration counts
are compile-time constants.
> addition. In my system, each comparison takes 3 cycles and one
> addition takes 1 cycle.
> So each iteration takes 4 cycles.
It was reasonable to count cycles as recently as, oh, twenty
years ago. Since then, CPU's have become far more intricate and
cycle counts have become extremely difficult to compute. Unless
you're using an unusually simple processor (single-issue, little
or no pipelining, at most one level of memory cache, maybe only
a one- or two-slot store buffer, ...), these cycle counts are
only loosely related to elapsed time.
Your example showed thirty assignments, and you indicated you
might need as many as a thousand. How many cycles will you spend
in wait states with the CPU twiddling its thumbs while RAM ever so
slowly retrieves a fresh batch of instructions?
> If I use a inner loop, for each i, I need 4*M additional cycles. In
> my case M , N both are large numbers and the system's time budget is
> limited.
>
> That is why I do not want to use this way.
Write it both ways (John Bode's suggestion of a "helper"
program will save some time), and MEASURE the difference. Cycle
counting is nearly impossible unless you happen to know that most
of the data is already in CPU registers; a couple of cache misses
(even just in L1) can disrupt the whole calculation. Measure!
--
Eric Sosman
eso...@ieee-dot-org.invalid
If you *have* to use macros, how about:
#define A(x) array2[i*30+(x)] = array1[i];
#define A_5(x) A(x) A((x)+1) A((x)+2) A((x)+3) A((x)+4)
#define A_15(x) A_5(x) A_5((x)+5) A_5((x)+10)
#define A_30(x) A_15(x) A_15((x)+15)
for (i=0; i<N; i++)
{
A_30(0);
}
>
> And please note that for some reasons of my device, I do not want to
> use another loop such as
>
> for (i=0;i<N;i++)
> for (j=0;j<30;j++)
> array2[i*30+j]=array1[i];
>
Have you tried this? Depending on your processor, it's possible that the
read and the write will be overlapped with the loop overhead, so the
loop might cost you no time at all (and hence give you more
readable/maintainable code for free).
A wacky alternative if you have a C++ compiler for your target is to use
template meta-programming to generate the code.
--
Ian Collins
Why are you unrolling the loop manually?
> (the listing out such as the above example will be
> very bad in the case I need, say, array2[i*1000+0]=array1
> [i]; to ... array2[i*1000+999]=array1[i];)
Write another program that does...
for (i = 0; i < 1000; i++)
printf("array2[i * 1000 + %d] = array1[i];\n", i);
...and include the output of that program.
But, if you _really_ want a macro, then powers, e.g. 10^3,
are easy...
#define M000(M, p) \
M00(M, (p)*10 + 0) \
M00(M, (p)*10 + 1) \
M00(M, (p)*10 + 2) \
M00(M, (p)*10 + 3) \
M00(M, (p)*10 + 4) \
M00(M, (p)*10 + 5) \
M00(M, (p)*10 + 6) \
M00(M, (p)*10 + 7) \
M00(M, (p)*10 + 8) \
M00(M, (p)*10 + 9)
#define M00(M, p) \
M0(M, (p)*10 + 0) \
M0(M, (p)*10 + 1) \
M0(M, (p)*10 + 2) \
M0(M, (p)*10 + 3) \
M0(M, (p)*10 + 4) \
M0(M, (p)*10 + 5) \
M0(M, (p)*10 + 6) \
M0(M, (p)*10 + 7) \
M0(M, (p)*10 + 8) \
M0(M, (p)*10 + 9)
#define M0(M, p) \
M((p)*10 + 0) \
M((p)*10 + 1) \
M((p)*10 + 2) \
M((p)*10 + 3) \
M((p)*10 + 4) \
M((p)*10 + 5) \
M((p)*10 + 6) \
M((p)*10 + 7) \
M((p)*10 + 8) \
M((p)*10 + 9)
#define M(j) \
array2[i * 1000 + (j)] = array1[i];
M000(M, 0)
This can be modified for non-powers by taking a higher power
and changing the base macro...
#define M(j) \
if ((j) < 1000) array2[i * 1000 + (j)] = array1[i]; else ;
...although there is a limit to how much a given preprocessor
can handle.
> And please note that for some reasons of my device, I do
> not want to use another loop such as
>
> for (i=0;i<N;i++)
> for (j=0;j<30;j++)
> array2[i*30+j]=array1[i];
'some reasons' sounds dubious.
--
Peter
As Keith has pointed out, memset won't do. memcpy, on the other hand,
can do it in 5 calls:
array2[i+30] = array1[i];
memcpy(array2+i*30 + 1, array2[i+30], sizeof array2[i]);
memcpy(array2+i*30 + 2, array2[i+30], 2 * sizeof array2[i]);
memcpy(array2+i*30 + 4, array2[i+30], 4 * sizeof array2[i]);
memcpy(array2+i*30 + 8, array2[i+30], 8 * sizeof array2[i]);
memcpy(array2+i*30 + 16, array2[i+30], 14 * sizeof array2[i]);
Note the final multiplier (if I've got it right). The initial
assignment can be skipped if array2 and array1 have the same type (by
copying from array1 rather than array2), but it is likely to be
faster to do the first few by assignment anyway before switching to
memcpy. The only way to be sure is to measure. I'd also measure the
plain loop because, as so many people have pointed out, it is often
hard to predict the costs of loops these days.
The 1000-element copy takes 10 calls. Of course, even these calls can
be put in a loop where the overhead is likely to be small
compared to the work done. The result would be far more maintainable
than a huge pile of assignments.
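For reference, a compilable take on the doubling idea for the M = 30 case (untested on the OP's device; int elements and the name spread30 are assumptions):

```c
#include <string.h>

/* Seed one element, then memcpy 1, 2, 4, 8 and finally the remaining
   14 elements of each 30-slot block. None of the copies overlap their
   source, so memcpy (rather than memmove) is safe here. */
void spread30(int *array2, const int *array1, size_t N)
{
    size_t i;
    for (i = 0; i < N; i++) {
        int *dst = array2 + i * 30;
        dst[0] = array1[i];
        memcpy(dst + 1,  dst, 1  * sizeof *dst);
        memcpy(dst + 2,  dst, 2  * sizeof *dst);
        memcpy(dst + 4,  dst, 4  * sizeof *dst);
        memcpy(dst + 8,  dst, 8  * sizeof *dst);
        memcpy(dst + 16, dst, 14 * sizeof *dst);  /* 16 done, 14 left */
    }
}
```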
--
Ben.
Okay.
> for each for iteration of a for loop, we need a comparison and one
> addition. In my system, each comparison takes 3 cycles and one
> addition takes 1 cycle.
Stop right there.
Write it the simple, clear, and correct way.
Now test: IS IT ACTUALLY TOO SLOW?
If the answer is "no", you're done.
DO NOT TRY TO OPTIMIZE WITHOUT TESTING FIRST. EVER.
You may actually find out that the for loop is faster. Why? Any of
several reasons. Optimizers are pretty good about inner loops; they
get a lot of practice. Furthermore, it is not uncommon for a processor
to be such that an inner loop will be faster than a fully-unrolled
loop, because the fully-unrolled loop breaks the cache. You think
branches are slow, wait until you see the cost of waiting for a single
byte of memory that wasn't in cache...
Seriously, though, you should NOT be trying to optimize stuff like this
at your level of experience. When you're regularly advising the experts
in the group on performance, then you could possibly benefit from talking
about cycle counts. At your level of experience, you're just wasting
your time.
-s
--
Copyright 2010, all wrongs reversed. Peter Seebach / usenet...@seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
You could do this:
for (j = 0; j < 30 * N; j++)
{
array2[j] = array1[j/30];
}
Yes, but possibly not enough. The OP needs to check this.
Personally, if the OP needs to do the full unroll, I would be inclined
to write a small C program to run on the host system to generate the code
to be compiled for the target.
>> addition. In my system, each comparison takes 3 cycles and one
>> addition takes 1 cycle.
>> So each iteration takes 4 cycles.
>
> It was reasonable to count cycles as recently as, oh, twenty
> years ago. Since then, CPU's have become far more intricate and
> cycle counts have become extremely difficult to compute. Unless
> you're using an unusually simple processor (single-issue, little
> or no pipelining, at most one level of memory cache, maybe only
> a one- or two-slot store buffer, ...), these cycle counts are
> only loosely related to elapsed time.
It's not *that* long ago that I was programming an embedded processor
that had a trick three-stage pipeline (all instructions went through all
three stages even if they only did something in one of them), no
out-of-order execution and no cache. So, excluding wait states for memory,
all instructions took three cycles. However, in practice, due to the
pipelining, all instructions effectively took one cycle except in the
presence of a branch. The processor, and its derivatives, are still
available today.
> Your example showed thirty assignments, and you indicated you
> might need as many as a thousand. How many cycles will you spend
> in wait states with the CPU twiddling its thumbs while RAM ever so
> slowly retrieves a fresh batch of instructions?
On the embedded system I was using, the memory and clock speeds
were carefully selected so there were no wait states.
>> If I use a inner loop, for each i, I need 4*M additional cycles. In
>> my case M , N both are large numbers and the system's time budget is
>> limited.
>>
>> That is why I do not want to use this way.
>
> Write it both ways (John Bode's suggestion of a "helper"
> program will save some time), and MEASURE the difference. Cycle
> counting is nearly impossible unless you happen to know that most
> of the data is already in CPU registers; a couple of cache misses
> (even just in L1) can disrupt the whole calculation. Measure!
I agree with measure, although *if* it is a simple enough processor (and
they are still out there) you *can* count.
--
Flash Gordon
What processor? Just curious...
On many installations, you'll find that's a macro! ;-)
>> (untested)
>>
>> memset( array2+i*30, array1[i], 30*sizeof(array2[0]) );
>
> As Keith has pointed out memset won't do. memcpy, on the other hand,
> can do it in 5 calls:
>
> array2[i+30] = array1[i];
> memcpy(array2+i*30 + 1, array2[i+30], sizeof array2[i]);
> memcpy(array2+i*30 + 2, array2[i+30], 2 * sizeof array2[i]);
> memcpy(array2+i*30 + 4, array2[i+30], 4 * sizeof array2[i]);
> memcpy(array2+i*30 + 8, array2[i+30], 8 * sizeof array2[i]);
> memcpy(array2+i*30 + 16, array2[i+30], 14 * sizeof array2[i]);
>
> Note the final multiplier (if I've go it right).
14's right. The i+30's should be i*30's, and the memcpy source
arguments need to be pointers: array2 + i*30, not array2[i*30].
I timed this against a naive loop for int, long long, and
double types. Conclusions were inconclusive. For double,
the naive loop was faster when compiled for plain i386, but
slower when 'optimised' for the actual athlon-xp architecture.
For int and long long, the memcpys were faster than the loop.
The memcpys were in fact simply unrolled completely.
Given that there was a case where the memcpys were measurably
slower, it's not worth blindly going for the clever technique
until you've both found that the naive loop is slowing you down
and measured the two to make sure that the optimisation is in
fact an improvement. Writing a clever version for speed without
measuring whether it's faster or not is dumb. And frequently
done, alas.
Phil
--
I find the easiest thing to do is to k/f myself and just troll away
-- David Melville on r.a.s.f1
Of course. Thanks.
> I timed this against a naive loop for int, long long, and
> double types. Conclusions were inconclusive. For double,
> the naive loop was faster when compiled for plain i386, but
> slower when 'optimised' for the actual athlon-xp architecture.
> For int and long long, the memcpys were faster than the loop.
> The memcpys were in fact simply unrolled completely.
Was this for the *30 example or the latter *1000 example?
> Given that there was a case where the memcpys were measurably
> slower, it's not worth blindly going for the clever technique
> until you've both found that the naive loop is slowing you down
> and measured the two to make sure that the optimisation is in
> fact an improvement. Writing a clever version for speed without
> measuring whether it's faster or not is dumb. And frequently
> done, alas.
The measuring is important but you can't find out that it is not
always faster without writing it! Having written it, I'd be tempted
to keep it around in some sort of conditional compilation so that it
can be switched in when/if measurements show that it is better on some
architecture/implementation.
One does have to weigh the maintenance cost, so there must be at least
the possibility of a benefit.
--
Ben.
<snip>
>> It's not *that* long ago that I was programming an embedded processor
>> that had a trick three stage pipeline (all instructions when through all
>> three stages even if they only did something in one of them), no
>> out-of-order execution and no cache. So excluding wait states for memory
>> all instructions took in three cycles. However, in practice, due to the
>> pipelining, all instructions effectively took one cycle except in the
>> presence of a branch. The processor, and it's derivatives, are still
>> available today.
>
> What processor? Just curious...
Well, it seems the exact range I used is out of production, but the
family continues with similar attributes... the manual I just looked at
gives a 6-stage pipeline, but still no cache, out-of-order execution,
or anything like that.
http://focus.ti.com/lit/ug/spru131g/spru131g.pdf
One advantage of such processors is that sometimes you have hard
real-time requirements where something has to take an *exact* amount of
time, and doing the output early is just as bad as doing it late. Or
sometimes, the exact time does not matter as long as every run is a
consistent amount of time. In such situations *not* having a cache or
anything else introducing variability is *very* useful, and you can
count clock cycles.
--
Flash Gordon
*30
>> Given that there was a case where the memcpys were measurably
>> slower, it's not worth blindly going for the clever technique
>> until you've both found that the naive loop is slowing you down
>> and measured the two to make sure that the optimisation is in
>> fact an improvement. Writing a clever version for speed without
>> measuring whether it's faster or not is dumb. And frequently
>> done, alas.
>
> The measuring is important but you can't find out that it is not
> always faster without writing it! Having written it, I'd be tempted
> to keep it around in some sort of conditional compilation so that it
> can be switched in when/if measurements show that it is better on some
> architecture/implementation.
Indeed. I hope that my comment wasn't taken as a "don't even
suggest it" one. Of course, you have to suggest possibilities
before they are demonstrated to be preferable (likewise inferior).
> One does ave to weight the maintenance cost, so there must be at least
> the possibility of a benefit.
A simple comment like "/* we now have 16, only need 14 more */" would
be more than enough in your memcpy version. As someone who's maintained
a lot of old code (not admitting to how much of it was originally mine;
having said that, having about a million lines of crap dumped on me once
probably dwarfs all the crap that I've written, I hope), I quite
often see the maintainer's view very clearly.
In fact, one mantra I picked up recently was something along the lines of
"you are always writing code to be read by someone else - yourself in the
future".
The first time that I saw C code that I wrote,
but hadn't seen in six months,
I was surprised by how hard it was for me to understand it.
--
pete
I recently had occasion to dive into some code I had written roughly a
year and a half ago, and let me tell you, I am glad that there were
comments.
My favorite:
/* Want to mess with this? You will need a young priest and an old priest. */
That code is gone now. :)
Which is why all new code should have comprehensive unit tests. Want to
see examples of how to use the code? Read the tests. Want to see if your
changes break the code? Run the tests. Simple.
My tests certainly save me a lot of time and pain when going back to
update my old code.
--
Ian Collins