I think it's an x86 thing and I've also posted in that group, but someone
here might have ideas, and ultimately fixing it might mean messing about
with malloc():
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void) {
char *p, *q;
int i;
#define n 8192
//#define n 8000
p=malloc(n);
q=malloc(n);
for (i=0; i<1000000; ++i)
memcpy(p,q,5000);
}
When n is 8192, the memcpy takes nearly twice as long to execute as when n
is 8000 (or 9000, or just about anything else except 8192/16384 and so on).
(And in assembler, a byte-oriented copy takes four times as long with n=8192
as the same byte-oriented copy with n=8000.)
Any ideas on what I can do with malloc to avoid this slowdown?
--
Bartc
Because malloc allocates n+few bytes for presumably some control
structure. So 8192 actually takes 3 4KB pages whereas 8000 fits in
two.
Tom
OK, but why would that cause a slowdown when copying just 5000 bytes between
two blocks? It shouldn't matter if the blocks are 8KB or 12KB. And anyway
n=9000 would also take 12KB and yet that is fast. This is a weird problem.
--
Bartc
My guest is you are hitting the boundaries of the L0 cache.
I cannot find actual specs, but the L0 cache could be 32K (= 4* 8K,
or whatever layout it has).
This will cause one in a few memory accesses
to actually access the bus (which typically is a few times slower than the CPU)
You can suffer similar (but smaller) ripples once you exceed the l1 or L2 cache.
(and a much larger effect when the VM system needs to use disk as a L666 cache)
(sorry for the long line; google for "matrix transposition l0 cache" )
HTH,
AvK
it may actually be a compiler specific issue, such as memory alignment...
for example, if you are using GCC, and are either developing on 64-bit
Linux, or have SSE turned on, ... then GCC will try to copy memory via SSE
registers (movups, generally).
the cost is, however, that there is a bit of a performance hit if these
accesses are not aligned on a 16-byte boundary (since reading/writing each
value takes 2 memory accesses rather than 1).
by default, I think malloc only really gives an 8 byte alignment (AFAIK,
this is the case both with MSVCRT as used in MSVC/MinGW, Newlib as used in
Cygwin, and maybe also glibc, but I am not certain).
note, all this should not be an issue with MSVC, since it does memcpy via
'rep movsb'...
now, assuming all this, then possibly for whatever reason you are getting a
16 byte alignment with non-power-of-2 mallocs, and an 8 byte one with
power-of-2 mallocs.
or such...
> --
> Bartc
Not weird at all; you may be hitting some L1 cache degenerate behavior.
You forgot to mention what specific x86 processor you are using.
Look up your processor's specs: how big is the L1 cache size?
How wide are the entries? How many ways set associative?
Huh.
> I think it's an x86 thing and I've also posted in that group, but someone
> here might have ideas, and ultimately fixing it might mean messing about
> with malloc():
No, it doesn't really.
> When n is 8192, the memcpy takes nearly twice as long to execute as when n
> is 8000 (or 9000, or just about anything else except 8192/16384 and so on).
> (And in assembler, a byte-oriented copy takes four times as long with n=8192
> as the same byte-oriented copy with n=8000.)
>
> Any ideas on what I can do with malloc to avoid this slowdown?
Experiment: Allocate 16,384 bytes, then convert the pointer to some kind
of int, and find the next multiple of exactly 8,192 above it, convert that
value back to a pointer, and then work with the 8,192 bytes following that
pointer. Totally nonportable, but I'd bet it'll work. In which case, what
you're seeing is that malloc is probably giving you a hunk of space which
starts just after a page boundary, so that when you hit a page size boundary
with your actual region that you're using, you end up crossing page boundaries
and making the cache work harder.
It's unlikely, though, that anything you do with a test program like this
will map very well to real-world workloads, where you're likely to be
accessing more than two hunks, at which point, the problem becomes a lot
less noticeable.
-s
--
Copyright 2009, all wrongs reversed. Peter Seebach / usenet...@seebs.net
http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) <-- get educated!
Don't copy five billion indeterminately-valued bytes. ;-)
Most likely you're running into cache-line issues, where
the two (n+epsilon)-byte allocations are an unfortunate distance
apart. The specifics are highly dependent on the malloc() and
memcpy() implementations and on the underlying hardware (read:
"Anything you do to improve the situation on system X may be
irrelevant or even a total washout on system Y"), but it might
be instructive to print the p,q values on fast and slow test
runs. Suggestive patterns may emerge.
If your hardware has counters for things like cache hits and
misses, and if your system has a way to read those counters,
studying them might also be helpful.
--
Eric Sosman
eso...@ieee-dot-org.invalid
> char *p, *q;
> int i;
> #define n 8192
> //#define n 8000
>
> p=malloc(n);
> q=malloc(n);
>
> for (i=0; i<1000000; ++i)
> memcpy(p,q,5000);
> When n is 8192, the memcpy takes nearly twice as long to execute as when n
> is 8000 (or 9000, or just about anything else except 8192/16384 and so
> on). (And in assembler, a byte-oriented copy takes four times as long with
> n=8192 as the same byte-oriented copy with n=8000.)
>
> Any ideas on what I can do with malloc to avoid this slowdown?
Thanks for all the replies. I will have to delve further into the
intricacies of level one caches and such when I have more time.
I've done a quick fix by just adding a few extra bytes to malloc() calls to
avoid power-of-two sizes. At least, it fixed the program where the problem
came up.
(I was benchmarking a new, smarter program with an older one. The old
program was outperforming the new one! The only difference was the old one
allocated only the bytes it needed, or just over. The new program allocated
only power-of-two sizes (to allow objects to grow efficiently).)
--
Bartc
This intrigued me, so I repeated the experiment.
No conclusions, I'll just report the results.
(1) The anomaly doesn't depend on some
memcpy() weirdness; it arises just with
int *s, *d;
...
d[0] = s[0];
d[1] = s[1];
d[2] = s[2];
...
(2) The anomaly occurs near 4k-alignment;
that is, memcpy() performs poorly when
the (hex) difference between source and dest
is 0ff8 or 1ff8 or 2ff8 etc.
(3) I experimented on my laptop:
> Pentium(R) Dual CPU T2390 1.86 GHz
> CPU: L1 I cache: 32K, L1 D cache: 32K
> CPU: L2 cache: 1024K
(4) Here are the timings for different
differences between source and dest.
Note that 1ff8 is almost ten times as
slow as "normal."
> other 0.422
> 1fb0 0.422
> 1fb8 0.577
> 1fc0 0.616
> 1fc8 0.731
> 1fd8 1.056
> 1fe0 1.254
> 1fe8 1.385
> 1ff0 2.063
> 1ff8 4.103
> 2000 0.420
> 2008 0.732
> 2010 0.733
> 2018 0.732
> 2020 0.729
> 2028 0.729
> 2030 0.731
> 2038 0.730
> 2040 0.421
Sorry can't be of further help; I know
nothing of x86 cache, etc. and failure might
depend on LRU breadth ....
I've encountered such anomalies over
the years (reporting some on Usenet).
Since, as seen, the speed degradation
can be unexpectedly severe, this *might*
be a potential major way towards speed
optimization.
(On pursuing such anomalies, I've "been
there, done that". For example, I recall
two bugs in two different drivers on Solaris
21 years ago that *when combined* led to a
wastage of VME-bus mapping slots, but which
manifested itself as slowdown at particular
addresses. As another example, for
World's Fastest Jpeg(tm) on Sparc, I needed
to keep inner-loop functions near each other
and do malloc() specially to get data work
area at appropriate relationship to code.)
James Dow Allen
This suggests cache conflicts. However, it may well depend on length of
cache lines, memory organisation and quite a few other things in the
hardware (like: is memory interleaved?).
--
dik t. winter, cwi, science park 123, 1098 xg amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
Nothing portable, since the size(s) that cause this problem will vary by
CPU.
I suspect that the particular CPU you are testing this on has an n-way
set associative cache (n is often 4). If the difference between p and q
is a certain amount, the locations you are reading from and then writing
to (or is it writing to and then reading from in the next cycle?) will
end up in the same way. With such cache contention, that level of cache
can become nearly useless, limiting you to the performance of the next
level.
S
--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking
Well I first saw the problem with rep: movsb, but found I could replicate
the problem with memcpy() giving me the excuse to post here...
> ...
>
> (2) The anomaly occurs near 4k-alignment;
> that is, memcpy() performs poorly when
> the (hex) difference between source and dest
> is 0ff8 or 1ff8 or 2ff8 etc.
> (4) Here are the timings for different
> differences between source and dest.
> Note that 1ff8 is almost ten times as
> slow as "normal."
>
>> other 0.422
>> 1fb0 0.422
>> 1fb8 0.577
>> 1fc0 0.616
>> 1fc8 0.731
>> 1fd8 1.056
>> 1fe0 1.254
>> 1fe8 1.385
>> 1ff0 2.063
>> 1ff8 4.103
>> 2000 0.420
>> 2008 0.732
Yes, 2008 was my figure (malloc(8K)). With 1FF8 (malloc(8K-16)), the timing
was slower still.
>> 2010 0.733
>> 2018 0.732
I'm currently using malloc(8K+16), ie. 2018, but that seems to work properly
(ie. at expected speed) for me.
This is of course for a particular pattern of malloc() calls, in my case
originally in a loop freeing then reallocating, mostly, the same size block
(2K, then 4K then 8K, but making use of an increasing chunk of each).
Avoiding this difference is going to be difficult in general, but hopefully
it won't have a significant effect on the runtime of a real program.
--
Bartc
IMHO it has nothing to do with malloc.
The same phenomenon occurs when you have a statically or globally allocated
chunk of memory. Things get tight if both source and destination span two
cache slots. You will need four cache slots in that case. If you consider
the stack as well, you'll need five, and if only four L1 slots are available
, (at least) one of these will have to flip/flop between two memory areas.
HTH,
AvK
In summary, the
Table shows 0.1% chance of 3-fold slowdown, as well
as further chance of lesser slowdown, and possible
10-fold slowdown! This should seem like an important
slowdown indeed, especially for such a simple and
oft-used code.
It seems not at all far-fetched that a program will
have 1 or 2 inner loops which dominate run times
and emulate, more-or-less, memcpy() or something
similar. 3-fold or even 2-fold slowdown might be
a big bummer, even if 0.1% seems a small risk.
> Avoiding this difference is going to be difficult in general, but hopefully
> it won't have a significant effect on the runtime of a real program.
I agree it would be difficult in general, but *it* can
be done, albeit in a tedious non-portable way.
Judging by your example and my experience, this might be
appropriate when speed really is of great importance.
James
These probabilities are based on a random spread of block sizes in a
program?
Unfortunately my applications always uses, for allocations above 1K,
power-of-two sizes. So it would slow down 100% of the time for blocks over
2K, when copying between one object and another (and maybe 50% of the time
for 2K blocks).
Blocks larger than 1K are less frequent than smaller ones, but would have a
bigger impact.
It seems my choices are to use slightly bigger than power-of-two sizes, or
to do things like random small malloc() calls to break up the pattern.
--
Bartc